Affordance-based Robot Manipulation
with Flow Matching
We present a framework for assistive robot manipulation that addresses two fundamental challenges: efficient adaptation of large-scale models for scene affordance understanding, and effective learning of robot actions grounded in visual affordances. To tackle the first challenge, we adopt a parameter-efficient prompt tuning method that prepends learnable text prompts to a frozen vision model to predict affordances, while accounting for spatial and semantic relationships in multi-task scenarios. For the second challenge, we propose a flow matching method that represents a robot visuomotor policy as a conditional process flowing random waypoints to desired robot actions. We introduce a real-world dataset with 10 tasks to evaluate our approach. Experiments show our prompt tuning method achieves competitive or superior performance compared to other finetuning protocols across data scales, while remaining parameter-efficient. Flow matching yields more stable training and faster inference, while maintaining generalization performance comparable to diffusion policy. Our framework seamlessly unifies parameter-efficient affordance learning and robot action generation with flow matching.
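To make the flow matching formulation concrete, below is a minimal sketch (not the paper's implementation) of the common linear-interpolation variant: during training, a network regresses the constant velocity between a random waypoint and the desired action; at inference, the action is recovered by Euler-integrating the learned velocity field. All function names here are illustrative.

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    """Interpolant x_t and the flow matching regression target.
    x0: random waypoint, x1: desired action, t in [0, 1]."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # velocity the network is trained to predict
    return x_t, v_target

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Toy check: with the exact (constant) velocity field v = x1 - x0,
# Euler integration flows the waypoint precisely onto the target action.
x0 = np.zeros(2)            # random start waypoint (zeros for simplicity)
x1 = np.array([0.5, -0.3])  # desired robot action
x = euler_sample(lambda x, t: x1 - x0, x0, steps=5)
print(np.allclose(x, x1))   # True
```

In practice the velocity field would be a neural network conditioned on the affordance map and observations, and fewer integration steps than a diffusion sampler are typically needed, which is where the faster inference comes from.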
Highlights
Real-world Experiments (affordance-based VLA with flow matching)
Our affordance-based VLA with flow matching has been tested on tasks across Activities of Daily Living and consistently outperforms alternative behavior cloning methods. (Videos at 4x speed.)
Closed-loop long-horizon manipulation with flow matching
Paper
arXiv:2409.01083 [cs.RO]
Affordance-based Robot Manipulation with Flow Matching
Fan Zhang, Michael Gienger
Code: https://github.com/HRI-EU/flow_matching