Abstract
We introduce an enhanced Learning from Demonstration (LfD) algorithm
that integrates a video-language model (VLM) with an actor-critic architecture to leverage natural language (NL) feedback during robot training.
We compare our method against traditional TAMER, RoboCLIP, and modified RoboCLIP baselines in a simulated Metaworld environment,
focusing on tasks such as opening and closing drawers and pressing buttons.
Our experiments demonstrate that while our method achieves better sample efficiency and performance than RoboCLIP,
it faces challenges from the variability of the VLM's feedback. In addition, a human-subjects study assesses user comfort with NL feedback versus scalar feedback,
revealing a preference for TAMER-style feedback despite the intuitive appeal of NL.
The findings suggest that while NL feedback enriches interaction,
its implementation requires careful design to minimize variability
and optimize learning outcomes.