Vision-Language Models for Dense Feedback Reward

Georgia Institute of Technology
*Equal Contribution

Abstract

We introduce an enhanced Learning from Demonstration (LfD) algorithm that integrates a video-language model (VLM) with an actor-critic architecture to leverage natural language (NL) feedback in robotic training. We compare our method against traditional TAMER, RoboCLIP, and modified RoboCLIP baselines in a simulated Meta-World environment, focusing on tasks such as opening and closing drawers and pressing buttons. Our experiments demonstrate that while our method improves sample efficiency and performance over RoboCLIP, it faces challenges with feedback variability from the VLM. Additionally, a human-subject study assessing user comfort with NL feedback versus scalar feedback reveals a preference for TAMER-style feedback despite the intuitive appeal of NL. These findings suggest that while NL feedback enriches interaction, its implementation needs careful consideration to minimize variability and optimize learning outcomes.

Motivation

  • Situated learning interaction: a tight feedback loop for faster, more effective learning
  • Use feedback from an observing expert (as in TAMER)
  • How do we make the best use of the feedback range [-s, s]?
  • Approach: natural language as the feedback medium, for nuance and expressivity

Can we use natural language for reward modeling?

  • Circumvent the need for designing extrinsic reward functions
  • Added expressivity that comes with communicating via language

Motivation figure (credit: TAMER: Training an Agent Manually via Evaluative Reinforcement)

Prior Works

  • LLMs for reward modeling: lack visual grounding
  • VLMs for reward modeling: visually grounded, but natural language as feedback for reward has not been explored

Our Approach

Dense RL with Natural Language Feedback
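The pipeline above can be sketched end to end: a natural-language utterance is mapped to a scalar reward, which then stands in for the environment reward in the actor-critic update. The keyword scorer below is a toy stand-in for the VLM (the actual system queries a video-language model); the word lists and clamping range are illustrative assumptions, not the project's implementation.

```python
def score_feedback(utterance: str) -> float:
    """Toy stand-in for the VLM: map sentiment-bearing keywords in a
    natural-language utterance to a scalar reward in [-1, 1]."""
    positive = {"good", "great", "yes", "correct", "better"}
    negative = {"bad", "no", "wrong", "worse", "stop"}
    # Strip punctuation so "Good," and "wrong!" still match.
    words = [w.strip(".,!?").lower() for w in utterance.split()]
    raw = sum(w in positive for w in words) - sum(w in negative for w in words)
    return max(-1.0, min(1.0, float(raw)))
```

A VLM replaces this keyword lookup with grounded judgment of the rollout, but the interface is the same: utterance in, scalar reward out.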

Task Setup

  • Meta-World manipulation tasks
    • Task 1: Pick and Place
    • Task 2: Button Press
    • Task 3: Door Opening
  • Feedback is given, and credit assigned, at 32-step intervals
  • Metrics: average environment reward, success rate
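The 32-step feedback interval implies a credit-assignment step: the scalar obtained from the feedback must be spread over the transitions collected since the last query. A minimal sketch, assuming a uniform split over the window (the project's exact credit-assignment rule may differ):

```python
def assign_credit(transitions, feedback, interval=32):
    """Spread a scalar feedback value uniformly over the last `interval`
    (obs, action) transitions, yielding (obs, action, reward) tuples
    ready for a replay buffer. A uniform split is one simple choice."""
    window = transitions[-interval:]
    per_step = feedback / len(window)
    return [(obs, act, per_step) for (obs, act) in window]
```

Denser schemes (e.g., weighting recent steps more heavily) fit the same interface.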

Results

Qualitative

Rollout GIFs: Ours · RoboCLIP (32 time steps) · TAMER · RoboCLIP

Quantitative


  • Comparable performance to TAMER
  • ~1300× more sample-efficient than RoboCLIP
  • Although our approach failed the Close Door task, it achieved a slightly higher reward than RoboCLIP


Figure: Avg. evaluation reward on (a) Close Drawer, (b) Button Press, and (c) Close Door. Primary scale at the bottom, secondary scale at the top.

Failure Scenarios

Failure scenario: reward scores LLaVA = 0.6, S3D = 15.7
Success scenario: reward scores LLaVA = 40, S3D = -4.5
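The disagreement between the two scores is easier to interpret given how a RoboCLIP-style reward is computed: it is a scaled similarity between an embedding of the rollout video and an embedding of the task description, so its sign and range depend entirely on the encoder and the scale factor. A minimal sketch over precomputed embeddings (the original RoboCLIP uses S3D as the encoder; the `scale` value here is an illustrative assumption):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def roboclip_style_reward(video_emb, text_emb, scale=100.0):
    """RoboCLIP-style sparse reward: scaled similarity between a rollout-video
    embedding and a task-description embedding (encoders not shown)."""
    return scale * cosine_similarity(video_emb, text_emb)
```

Because an S3D similarity and a LLaVA score live on different, encoder-dependent scales, they cannot be compared directly without normalization, which is one source of the variability seen above.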

  • Better scene perspective? Multi-view could help
  • Larger VLMs can help

User Study

    Do humans prefer a natural language form of communication compared to a scalar value feedback?

    We conducted a user study with two parts:

    • Pre-study questionnaire
      • Demographic
      • Personality (mini-IPIP)
      • Attitude towards robots (NARS)
    • Post-study questionnaire
      • Workload Assessment (NASA-TLX)
      • System Usability (System Usability Scale)
      • Perceived Intelligence (Godspeed Scale)
    System Usability
    Perceived Intelligence
    Workload Assessment

Conclusion

  • It is plausible to use a VLM as a proxy for human feedback, but more powerful models may perform better.
  • The VLM's point of view can affect its interpretation of the scene.
  • Humans prefer to see the model learn "instantaneously" from their feedback.

Future Work

  • Query the user automatically based on a reward heuristic
  • Explore larger VLMs (LLaVA, etc.)
  • Multi-view perspectives for better scene understanding
  • Allow a speech-based interface