


CCDP: Composition of Conditional Diffusion Policies
with Guided Sampling
Key Points
- Built on the Diffusion Policy framework (Chi et al., 2024)
- Provides a low-level controller that exploits the demonstration set in a more controlled way
- Achieves failure recovery by avoiding previously failed attempts.
- Requires only successful demonstrations.
- Does not necessarily require data annotation.
Abstract
Imitation Learning offers a promising approach in robotics by enabling systems to learn directly from data without requiring explicit models, simulations, or detailed task definitions. During inference, actions are sampled from the learned distribution and executed on the robot. However, sampled actions may fail for various reasons, and simply repeating the sampling step until a successful action is obtained can be inefficient. In this work, we propose an enhanced sampling strategy that refines the sampling distribution to avoid previously unsuccessful actions. We demonstrate that by solely utilizing data from successful demonstrations, our method can infer recovery actions without the need for additional exploratory behavior or a high-level controller. Furthermore, we leverage the concept of diffusion model decomposition to break down the primary problem—which may require long-horizon history to manage failures—into multiple smaller, more manageable sub-problems in learning, data collection, and inference, thereby enabling the system to adapt to variable failure counts. Our approach yields a low-level controller that dynamically adjusts its sampling space to improve efficiency when prior samples fall short. We validate our method across several tasks, including door opening with unknown directions, object manipulation, and button-searching scenarios, demonstrating that our approach outperforms traditional baselines.
Motivation
The demonstration set typically includes various ways to perform a task. When a task fails, we want the robot to avoid getting stuck and instead try alternative variations that have not yet failed. Unlike other failure recovery policies, our approach does not require a separate demonstration set with recovery policies or any exploratory behavior that usually demands access to a simulated environment. Instead, it offers a simplified recovery strategy that makes no assumptions about the underlying cause of failure—only that previous attempts were unsuccessful.
After training, the model can be integrated with other models through the composition of diffusion models. Moreover, when multiple failure cases occur, our method can combine them by selecting samples that avoid all failed actions. Composing models to learn the recovery policy enables us to develop a single, versatile model capable of handling arbitrary sequences of failures while reducing dimensionality and facilitating learning.
Offline Phase

- Action Samplers: We first train multiple samplers to generate actions from various distributions: an unconditional action sampler, a state-conditioned action sampler, and a history-conditioned action sampler.
- Sampling Actions: We traverse the observed states in the demonstration set and sample a set of actions by combining the unconditional and state-dependent samplers. Excluding history at this stage enables exploration of a broader range of possibilities.
- Identifying Recovery Candidates: For states that are sufficiently similar, we compute the pairwise distances between their sampled actions in a predefined space. If the distance between two actions exceeds a specified threshold, those actions are considered potential recoveries for each other.
- Learning the Avoidance-Conditioned Sampler: Finally, we use the new dataset to train an avoidance-conditioned action sampler.
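The recovery-candidate mining step above can be sketched as follows. This is a minimal illustration assuming Euclidean distances over raw state and action vectors; the function name, thresholds, and output format are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mine_recovery_candidates(states, actions, state_eps=0.1, action_thresh=0.5):
    """Illustrative sketch: for pairs of sufficiently similar states,
    actions that differ by more than `action_thresh` are labeled as
    recovery alternatives for each other. Returns (state, action, avoided
    action) triplets usable to train an avoidance-conditioned sampler."""
    pairs = []
    n = len(states)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # only compare actions taken in nearly identical situations
            if np.linalg.norm(states[i] - states[j]) > state_eps:
                continue
            # sufficiently different actions are mutual recovery candidates
            if np.linalg.norm(actions[i] - actions[j]) > action_thresh:
                pairs.append((states[i], actions[i], actions[j]))
    return pairs
```

Each triplet says: in this state, if the third element has already failed, the second element is a valid alternative to sample.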


Online Phase
- We unify all models using the approach described by Liu et al. (2022).
- The combination is managed by adjusting the weights of each model based on their specific purpose.
- Depending on the number of failed attempts, the combined model adjusts how strongly it leverages the avoidance-conditioned model.
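The composition step can be sketched in the style of composable diffusion models (Liu et al., 2022): the denoiser predictions are combined as the unconditional prediction plus weighted differences, one term per condition. The function name and weight values below are illustrative assumptions:

```python
import numpy as np

def composed_noise(eps_uncond, eps_state, eps_avoid_list,
                   w_state=1.5, w_avoid=1.0):
    """Illustrative sketch of composing denoiser outputs: start from the
    unconditional noise prediction and add a weighted difference for the
    state condition and for each avoidance condition."""
    eps = eps_uncond + w_state * (eps_state - eps_uncond)
    for eps_avoid in eps_avoid_list:
        # one term per previously failed attempt, so the same composed
        # model handles a variable number of failures
        eps = eps + w_avoid * (eps_avoid - eps_uncond)
    return eps
```

The composed prediction is then plugged into a standard reverse-diffusion step; as failures accumulate, more avoidance terms are added without retraining any individual model.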
Experiments
Door Opening
- The robot is unaware of the door's opening direction.
- It tests various approaches, avoiding previously failed attempts until success.
- A single, unified policy, without a higher-level controller, enables the door to be opened, whether by pulling, moving up, or sliding to the side.
Button Searching
- The robot lacks knowledge of the button's location.
Object Manipulation
- The object's mass is hidden from the robot.
- Heavier objects render some manipulation primitives ineffective.
- The robot compensates by executing less optimal actions to move the object.
Object Packing (OP) [The video is 6x speed]
- A single policy controls both robots and selects actions without a bi-level planner.
Results


Paper
https://arxiv.org/abs/2503.15386
CCDP: Composition of Conditional Diffusion Policies with Guided Sampling
Amirreza Razmjoo, Sylvain Calinon, Michael Gienger, Fan Zhang