Key Points

1. The paper discusses the development of a new dataset, DROID, which comprises robotic manipulation demonstrations across settings such as industrial offices, kitchens, dining rooms, home offices, bedrooms, bathrooms, living rooms, hallways, closets, and others. The dataset includes trajectories from 2,080 unique scenes; each trajectory contains stereo RGB camera streams, robot joint positions and velocities, end-effector pose, gripper position, and metadata including natural language instructions, camera calibration matrices, building names, and scene classifications produced with the GPT-4V API.

2. The labeling approach for unique scenes generates a new scene ID each time the data collector indicates that the robot or an external camera has been moved. Candidate duplicates are then identified within groups defined by robot serial number, the lab collecting the data, and building name, and scenes that did not change sufficiently are removed, yielding a conservative estimate of the number of unique scenes (a minimal sketch of this procedure appears after this list).

3. The dataset evaluation tests policies on six tasks, each with out-of-distribution variants: placing chips on a plate, putting an apple in a pot, toasting, closing a waffle maker, cleaning up a desk, and cooking lentils. The evaluation tasks range from short- to long-horizon, and the out-of-distribution variants add distractor objects.

4. The diffusion policy architecture and hyperparameters used in the policy training pipeline are described: camera observations are encoded with a ResNet-50 visual encoder, and the resulting features, together with language instructions and robot state information, condition a UNet diffusion head that generates action trajectories. Observation pre-processing and hyperparameters such as the observation horizon, predicted action sequence length, number of training steps, and co-training methods are also outlined (a simplified, illustrative sketch appears after this list).

5. The training hyperparameters for diffusion policy training include data mixing between in-domain demonstrations and DROID or OXE trajectories under different co-training methods. The distribution of skills and interacted objects in DROID is also visualized, showing the diverse range of verb classes and interacted objects in the dataset.

6. The dataset viewer included in the supplementary material allows browsing of the dataset videos, and the joint distribution of verbs and interacted objects in DROID is illustrated to demonstrate the diverse range of interactions performed on different objects in the dataset.

7. The paper discusses the limitations of the labeling approach: grouping scenes by robot serial number may duplicate a scene if different robots are placed in the same scene, and the authors emphasize that their scene counts are conservative estimates.

8. The paper outlines the ways training batches are constructed for policy learning (No Co-training, DROID (Ours), and OXE) and highlights the differences in data usage and architecture when training policies for the various evaluation tasks.

9. Policy training gives specific attention to the Cook Lentils task, which uses 50,000 training steps due to its increased complexity, and policies are evaluated on their success in completing the evaluation tasks under the different data-mixing methods.
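
As a concrete illustration of the unique-scene counting in point 2, the minimal Python sketch below groups episodes by robot serial number, lab, and building and keeps only scenes that differ sufficiently from those already kept. The field names and the scenes_differ predicate are hypothetical placeholders, not part of DROID's actual tooling.

    from collections import defaultdict

    def count_unique_scenes(episodes, scenes_differ):
        """Conservatively count unique scenes: group episodes by
        (robot serial, lab, building), then within each group keep a
        scene only if it differs sufficiently from every scene kept so far."""
        groups = defaultdict(list)
        for ep in episodes:
            groups[(ep["robot_serial"], ep["lab"], ep["building"])].append(ep)

        unique = 0
        for group in groups.values():
            kept = []  # representative episodes of scenes kept so far
            for ep in group:
                if all(scenes_differ(ep, other) for other in kept):
                    kept.append(ep)
            unique += len(kept)
        return unique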

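The following is a heavily simplified sketch of the diffusion-policy setup summarized in point 4, assuming a standard DDPM-style noise-prediction objective: a ResNet-50 encodes camera observations, and the features, together with robot state and a language embedding, condition a denoising head trained to predict the noise added to an action chunk. An MLP stands in for the UNet head, and all dimensions, the noise schedule, and the language-embedding size are illustrative rather than the paper's exact configuration.

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import resnet50

    class TinyDiffusionPolicy(nn.Module):
        def __init__(self, state_dim=8, lang_dim=384, act_dim=7, horizon=16):
            super().__init__()
            self.backbone = resnet50(weights=None)
            self.backbone.fc = nn.Identity()       # yields 2048-d image features
            cond_dim = 2048 + state_dim + lang_dim
            self.act_dim, self.horizon = act_dim, horizon
            # Stand-in for the UNet diffusion head: predicts the noise added
            # to a flattened action chunk, given conditioning and a timestep.
            self.denoiser = nn.Sequential(
                nn.Linear(act_dim * horizon + cond_dim + 1, 512),
                nn.ReLU(),
                nn.Linear(512, act_dim * horizon),
            )

        def condition(self, image, state, lang_emb):
            # Concatenate image features, robot state, and language embedding.
            return torch.cat([self.backbone(image), state, lang_emb], dim=-1)

        def loss(self, image, state, lang_emb, actions, n_timesteps=100):
            """DDPM-style objective: noise the ground-truth action chunk at a
            random timestep and train the head to predict that noise."""
            b = actions.shape[0]
            cond = self.condition(image, state, lang_emb)
            flat = actions.reshape(b, -1)
            t = torch.randint(0, n_timesteps, (b,))
            alpha_bar = torch.cos(0.5 * math.pi * t / n_timesteps) ** 2  # toy schedule
            noise = torch.randn_like(flat)
            noisy = alpha_bar.sqrt()[:, None] * flat + (1 - alpha_bar).sqrt()[:, None] * noise
            pred = self.denoiser(torch.cat([noisy, cond, (t / n_timesteps)[:, None]], dim=-1))
            return F.mse_loss(pred, noise)
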
Summary

The paper introduces the DROID (Distributed Robot Interaction Dataset), a large-scale dataset for robot manipulation with 76k trajectories or 350 hours of interaction data collected across 564 scenes, 86 tasks, and 52 buildings over the course of 12 months. Each DROID episode contains three synchronized RGB camera streams, camera calibration, depth information, and natural language instructions. The paper demonstrates that training with DROID leads to policies with higher performance, greater robustness, and improved generalization ability compared to previous large-scale robot manipulation datasets.

The paper shows that training with the DROID dataset can significantly improve policy performance and robustness across a wide spectrum of robot manipulation tasks and environments. The authors provide an open-source release of the full dataset, pre-trained model checkpoints, and a detailed guide for reproducing the robot hardware setup. The research also underscores the importance of large, diverse, high-quality robot manipulation datasets in the development of more capable and robust robotic manipulation policies.

Furthermore, the paper investigates how the scene diversity in DROID impacts policy robustness. The experiment demonstrates that co-training on a subset of DROID with diverse scenes results in better performance in the out-of-distribution evaluation setting, indicating the importance of scene diversity in the dataset.

Overall, the paper presents DROID as a valuable resource for research on general-purpose robot manipulation policies and highlights its potential to accelerate research in the field. The detailed analysis and experimentation conducted with the DROID dataset demonstrate its effectiveness in boosting policy performance and robustness, underscoring its significance for the advancement of robot learning research.

Description of DROID Dataset
The paper presents the Distributed Robot Interaction Dataset (DROID), a comprehensive dataset designed for robot learning and interaction tasks. Trajectories are recorded at 15 Hz and encompass three stereo RGB camera streams at 1280x720 resolution, robot joint positions and velocities, end-effector pose and velocity, and gripper position and velocity. Furthermore, each trajectory incorporates natural language instructions, extrinsic camera calibration matrices, building names, data collector user IDs, and scene types labeled using the GPT-4V API. The unique aspect of DROID lies in its inclusion of scenes in varied settings such as industrial offices, kitchens, dining rooms, home offices, bedrooms, bathrooms, living rooms, hallways/closets, and others. Moreover, 2,080 unique scenes are identified, with the labeling approach treating changes to the robot's workspace as defining a unique scene.
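
To make the per-trajectory contents concrete, the sketch below defines a hypothetical Python container mirroring the fields listed above; the field names, shapes, and types are illustrative and do not reflect the released dataset's actual schema.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class DroidStep:
        """One timestep of a trajectory recorded at 15 Hz (illustrative shapes)."""
        stereo_rgb: np.ndarray        # (3, 720, 1280, 3): one frame per stereo camera stream
        joint_positions: np.ndarray   # (7,) arm joint angles
        joint_velocities: np.ndarray  # (7,)
        ee_pose: np.ndarray           # (6,) end-effector position and orientation
        ee_velocity: np.ndarray       # (6,)
        gripper_position: float
        gripper_velocity: float

    @dataclass
    class DroidEpisode:
        steps: list                   # list of DroidStep
        language_instruction: str
        camera_extrinsics: np.ndarray # extrinsic calibration matrices for the cameras
        building: str
        collector_id: str
        scene_type: str               # scene classification labeled via the GPT-4V API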

Evaluation Tasks in DROID
The dataset is complemented with six evaluation tasks, each featuring out-of-distribution variants to enable robust policy evaluation. These tasks include placing chips on a plate, putting an apple in a pot, toasting, closing a waffle maker, cleaning up a desk, and cooking lentils, reflecting a diverse range of real-world interaction scenarios.

Data Labeling and Policy Architecture
The research team carefully managed the data labeling process to categorize scenes accurately and remove duplicates, aiming for a conservative estimate of the number of unique scenes. The paper also discusses the policy architecture and hyperparameters employed in training diffusion policies, which build on the Robomimic codebase. Policies are trained using several methods of constructing training batches, including no co-training, DROID, and OXE, and are evaluated on the tasks described earlier.
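
As a rough illustration of the three batch-construction settings just mentioned, the sketch below mixes in-domain demonstrations with DROID or OXE trajectories; the 50/50 mixing ratio and function names are assumptions for illustration, not the paper's exact configuration.

    import random

    def sample_batch(in_domain, cotrain_data, batch_size, setting="DROID"):
        """Assemble one training batch under a given co-training setting:
        'none' uses only in-domain demos; 'DROID' or 'OXE' mixes in
        trajectories sampled from the corresponding external dataset."""
        if setting == "none" or cotrain_data is None:
            return random.choices(in_domain, k=batch_size)

        n_cotrain = batch_size // 2            # assumed 50/50 data mixing
        batch = random.choices(in_domain, k=batch_size - n_cotrain)
        batch += random.choices(cotrain_data, k=n_cotrain)
        random.shuffle(batch)
        return batch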

Importance of DROID Dataset
The DROID dataset proves invaluable for advancing robot learning and interaction capabilities, providing a rich resource for training and evaluating policy learning algorithms in diverse real-world scenarios. The availability of the dataset, pre-trained model checkpoints, and a reproduction guide further ensure its broader impact on the research community and the advancement of robotic technology.

Reference: https://arxiv.org/abs/2403.12945