Vision-Language Models for Dense Feedback Reward

Georgia Institute of Technology
*Equal Contribution

Abstract

We introduce an enhanced Learning from Demonstration (LfD) algorithm that integrates a video-language model (VLM) with an actor-critic architecture to leverage natural language (NL) feedback in robotic training. We compare our method against traditional TAMER, RoboCLIP, and modified RoboCLIP baselines in a simulated Meta-World environment, focusing on tasks such as opening and closing drawers and pressing buttons. Our experiments demonstrate that while our method improves sample efficiency and performance over RoboCLIP, it faces challenges with feedback variability from the VLM. Additionally, a human-subject study assessing user comfort with NL feedback versus scalar feedback reveals a preference for TAMER-style feedback despite the intuitive appeal of NL. These findings suggest that while NL feedback enriches interaction, its implementation needs careful consideration to minimize variability and optimize learning outcomes.

Motivation

  • Situated learning interaction: a tight feedback loop for faster, more effective learning
  • Use feedback from an observing expert (as in TAMER)
  • How do we make the best use of the feedback range [-s, s]?
  • Approach: natural language as the feedback medium, for nuance and expressivity

Can we use natural language for reward modeling?

  • Circumvent the need for designing extrinsic reward functions
  • Added expressivity that comes with communicating via language

Motivation figure (credit: TAMER: Training an Agent Manually via Evaluative Reinforcement)

Prior Works

  • LLMs for reward modeling: lack visual grounding
  • VLMs for reward modeling: visually grounded, but natural language as feedback for reward has not been explored

Our Approach

Dense RL with Natural Language Feedback
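The pipeline above can be sketched end to end: a natural-language utterance is mapped to a scalar reward, which then stands in for the environment reward in the actor-critic update. The keyword scorer below is a toy stand-in for the VLM (the actual system queries a video-language model); the word lists and clamping range are illustrative assumptions, not the project's implementation.

```python
def score_feedback(utterance: str) -> float:
    """Toy stand-in for the VLM: map sentiment-bearing keywords in a
    natural-language utterance to a scalar reward in [-1, 1]."""
    positive = {"good", "great", "yes", "correct", "better"}
    negative = {"bad", "no", "wrong", "worse", "stop"}
    # Strip punctuation so "Good," and "wrong!" still match.
    words = [w.strip(".,!?").lower() for w in utterance.split()]
    raw = sum(w in positive for w in words) - sum(w in negative for w in words)
    return max(-1.0, min(1.0, float(raw)))
```

A VLM replaces this keyword lookup with grounded judgment of the rollout, but the interface is the same: utterance in, scalar reward out.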

Task Setup

  • Meta-World manipulation tasks
    • Task 1: Pick and Place
    • Task 2: Button Press
    • Task 3: Door Opening
  • Feedback is given, and credit assigned, at 32-step intervals
  • Metrics: average environment reward, success rate
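The 32-step feedback interval implies a credit-assignment step: the scalar obtained from the feedback must be spread over the transitions collected since the last query. A minimal sketch, assuming a uniform split over the window (the project's exact credit-assignment rule may differ):

```python
def assign_credit(transitions, feedback, interval=32):
    """Spread a scalar feedback value uniformly over the last `interval`
    (obs, action) transitions, yielding (obs, action, reward) tuples
    ready for a replay buffer. A uniform split is one simple choice."""
    window = transitions[-interval:]
    per_step = feedback / len(window)
    return [(obs, act, per_step) for (obs, act) in window]
```

Denser schemes (e.g., weighting recent steps more heavily) fit the same interface.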

Results

Qualitative

Rollout GIFs: Ours · RoboCLIP (32 time steps) · TAMER · RoboCLIP

Quantitative


  • Comparable performance to TAMER
  • ~1300× more sample-efficient than RoboCLIP
  • Although our approach failed the Close Door task, it achieved a slightly higher reward than RoboCLIP


Figure: Avg. evaluation reward on (a) Close Drawer, (b) Button Press, and (c) Close Door. Primary scale at the bottom, secondary scale at the top.

Failure Scenarios

Failure scenario: reward scores LLaVA = 0.6, S3D = 15.7
Success scenario: reward scores LLaVA = 40, S3D = -4.5
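The disagreement between the two scores is easier to interpret given how a RoboCLIP-style reward is computed: it is a scaled similarity between an embedding of the rollout video and an embedding of the task description, so its sign and range depend entirely on the encoder and the scale factor. A minimal sketch over precomputed embeddings (the original RoboCLIP uses S3D as the encoder; the `scale` value here is an illustrative assumption):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def roboclip_style_reward(video_emb, text_emb, scale=100.0):
    """RoboCLIP-style sparse reward: scaled similarity between a rollout-video
    embedding and a task-description embedding (encoders not shown)."""
    return scale * cosine_similarity(video_emb, text_emb)
```

Because an S3D similarity and a LLaVA score live on different, encoder-dependent scales, they cannot be compared directly without normalization, which is one source of the variability seen above.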

  • Better scene perspective? Multi-view could help
  • Larger VLMs can help

User Study

    Do humans prefer a natural language form of communication compared to a scalar value feedback?

    We conducted a user study with two parts:

    • Pre-study questionnaire
      • Demographic
      • Personality (mini-IPIP)
      • Attitude towards robots (NARS)
    • Post-study questionnaire
      • Workload Assessment (NASA-TLX)
      • System Usability (System Usability Scale)
      • Perceived Intelligence (Godspeed Scale)
    System Usability
    Perceived Intelligence
    Workload Assessment

Conclusion

  • It is plausible to use a VLM as a proxy for human feedback, but more powerful models may perform better.
  • The VLM's point of view can affect its interpretation of the scene.
  • Humans prefer to see the model learn "instantaneously" from their feedback.

Future Work

  • Query the user automatically based on a reward heuristic
  • Explore larger VLMs (LLaVA, etc.)
  • Multi-view perspectives for better scene understanding
  • Allow a speech-based interface