ArticuBot: Learning Universal Articulated Object Manipulation Policy via Large Scale Simulation

1Robotics Institute, Carnegie Mellon University 2IIIS, Tsinghua University 3Department of Computer Science, National Yang Ming Chiao Tung University
*Equal Contribution, †Equal Advising

Abstract

This paper presents ArticuBot, in which a single learned policy enables a robot to open diverse categories of unseen articulated objects in the real world. This task has long been challenging for robotics due to the large variations in the geometry, size, and articulation types of such objects. Our system, ArticuBot, consists of three parts: generating a large number of demonstrations in physics-based simulation, distilling all generated demonstrations into a point cloud-based neural policy via imitation learning, and performing zero-shot sim2real transfer to real robot systems. Utilizing sampling-based grasping and motion planning, our demonstration generation pipeline is fast and effective, generating a total of 42.3k demonstrations over 322 training articulated objects. For policy learning, we propose a novel hierarchical policy representation, in which the high-level policy learns the sub-goal for the end-effector, and the low-level policy learns how to move the end-effector conditioned on the predicted goal. We demonstrate that this hierarchical approach achieves much better object-level generalization than the non-hierarchical version. We further propose a novel weighted displacement model for the high-level policy that grounds the prediction in the existing 3D structure of the scene, outperforming alternative policy representations. We show that our learned policy zero-shot transfers to three different real robot settings: a fixed table-top Franka arm in two different labs, and an X-Arm on a mobile base, opening multiple unseen articulated objects across the two labs and in real lounges and kitchens.

ArticuBot System Overview


Top: Overview of ArticuBot. We combine sampling-based grasping, motion planning, and opening actions to efficiently generate thousands of demonstrations in simulation. These demonstrations are distilled into a hierarchical policy via imitation learning, which is then zero-shot transferred to the real world.
Middle: We propose a weighted displacement model for the high-level policy, which predicts the sub-goal end-effector pose. The model predicts, for each point in the point cloud observation, a displacement to the sub-goal end-effector pose as well as a per-point weight; the final prediction is the weighted average of all per-point predictions (see the sketch after this caption).
Bottom: We propose a goal-conditioned 3D diffusion policy for the low-level policy. It first applies attention between the current end-effector points, the scene points, and the goal end-effector points to obtain a latent embedding, and then performs diffusion on the latent embedding to generate the action, which is the delta transformation of the robot end-effector.
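To make the weighted displacement model concrete, below is a minimal sketch in PyTorch. It assumes the sub-goal end-effector pose is represented as a small set of K gripper points (consistent with the figure description above), and it replaces the paper's point cloud backbone with a simple per-point MLP; all names and layer sizes are illustrative, not the paper's actual implementation.

import torch
import torch.nn as nn

class WeightedDisplacementModel(nn.Module):
    """Sketch of the high-level policy: every scene point predicts its
    displacement to each of K goal end-effector points plus a scalar
    weight; the final goal is the weighted average of all votes."""

    def __init__(self, num_eef_points=4, feat_dim=128):
        super().__init__()
        self.num_eef_points = num_eef_points
        # Placeholder per-point encoder; the paper uses a point cloud
        # backbone that mixes scene context across points.
        self.encoder = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )
        # Per point: K displacement vectors (3 values each) + 1 weight logit.
        self.head = nn.Linear(feat_dim, num_eef_points * 3 + 1)

    def forward(self, points):                     # points: (N, 3)
        feats = self.encoder(points)                # (N, feat_dim)
        out = self.head(feats)                      # (N, K*3 + 1)
        disp = out[:, :-1].view(-1, self.num_eef_points, 3)   # (N, K, 3)
        weights = torch.softmax(out[:, -1], dim=0)            # (N,), sums to 1
        candidates = points[:, None, :] + disp                # per-point votes
        goal = (weights[:, None, None] * candidates).sum(dim=0)
        return goal                                 # (K, 3) goal eef points

# Usage: goal = WeightedDisplacementModel()(torch.randn(1024, 3))  # -> (4, 3)

Because every scene point casts a vote that is an offset from its own 3D position, the prediction stays grounded in the observed geometry rather than being regressed in free space.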

Real Lounges/Kitchens/Offices


We test ArticuBot in lounges, kitchens, and offices on 7 unseen real-world articulated objects (with a single policy), using an X-Arm on a mobile base and two Azure Kinect cameras. We test with both the original X-Arm gripper and the UMI gripper. Videos are played at 5x speed.



Drawer

Fridge

Cabinet

Cabinet with thin handle

Drawer 2

Drawer 3

Dishwasher (early stop due to excessive force)

Real-World Lab A Test Objects


We also test ArticuBot (the same policy) on 9 unseen real-world articulated objects with a table-top Franka arm and two Azure Kinect cameras in Lab A. Videos are played at 5x speed.



3-layer cabinet

Knob cabinet

2-layer drawer

Red microwave

White cabinet

Black microwave

Toy fridge

Toy oven

Knob drawer

Real-World Lab B Test Objects


We also test ArticuBot (the same policy as in the experiments above) on 4 unseen real-world articulated objects with a table-top Franka arm and two RealSense D435i cameras in Lab B. Videos are played at 5x speed.



Green cabinet

Green cabinet (flipped)

Brown storage box

Red storage box

Drawer (bottom)

Drawer (top)

Visualizations of the learned policy outputs

Please click an image below to view the visualizations of the policy outputs.



Comparison to prior methods in the real world

We compare to FlowBot3D and AO-Grasp on 9 objects in Lab A. Since AO-Grasp only performs grasping, we compare against it in terms of grasping success rate. Since FlowBot3D originally uses a suction gripper for grasping, when comparing to it we manually move the parallel-jaw gripper to first grasp the object's handle, then apply FlowBot3D to open it; here we compare in terms of normalized opening performance. We also compare to OpenVLA, which fails to grasp or open any of the test objects.
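For reference, one plausible formalization of the normalized opening metric, assuming it measures the fraction of the joint's range that the policy opens (the paper's exact definition may differ):

def normalized_opening(initial_angle, final_angle, max_angle):
    """Fraction of the joint's range opened by the policy, clipped to [0, 1].

    Works for both revolute joints (angles in radians) and prismatic
    joints (translations in meters).
    """
    joint_range = max_angle - initial_angle
    if joint_range <= 0:
        return 0.0
    return max(0.0, min(1.0, (final_angle - initial_angle) / joint_range))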




Comparison to prior articulated object manipulation methods.


Comparison to non-hierarchical policy

We compare ArticuBot, which uses a hierarchical policy, to a non-hierarchical policy that directly outputs end-effector delta transformations without conditioning on a predicted end-effector goal. We compare the normalized opening performance when training with different numbers of objects in simulation. As shown, the hierarchical policy significantly outperforms the non-hierarchical one, and increasing the number of training objects brings little improvement for the non-hierarchical policy.
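To make the distinction concrete, here is a minimal sketch of the two interfaces being compared; high_level, low_level, and policy are hypothetical callables, with shapes following the figure description above:

def non_hierarchical_step(policy, pcd, eef_points):
    """One network maps the observation directly to a delta eef transform."""
    return policy(pcd, eef_points)                      # e.g. a (4, 4) delta pose

def hierarchical_step(high_level, low_level, pcd, eef_points):
    """High level predicts goal eef points; low level conditions on them."""
    goal_eef_points = high_level(pcd)                   # e.g. (K, 3) goal points
    return low_level(pcd, eef_points, goal_eef_points)  # e.g. a (4, 4) delta pose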




Comparison of hierarchical and non-hierarchical policies.


Comparison of different high-level policy architectures

We compare the weighted displacement model, which is the high-level policy used by ArticuBot, to several other methods for predicting the goal end-effector points: DP3 (with the default UNet for diffusion), DP3 (using a transformer for diffusion), and 3D Diffuser Actor (3DDA). These baselines directly diffuse the goal end-effector points instead of predicting displacements from the scene points to them. The proposed weighted displacement model outperforms these baselines, especially when tested with unseen camera randomizations, which is crucial for sim2real transfer.




Comparison of different high-level policies. Left: trained and tested without camera randomizations. Right: trained with camera randomizations, and tested with no camera randomization, with camera randomizations from the training distribution, and with camera randomizations from an unseen test distribution.


BibTeX

@article{wang2025articubot,
      title={ArticuBot: Learning Universal Articulated Object Manipulation Policy via Large Scale Simulation},
      author={Wang, Yufei and Wang, Ziyu and Nakura, Mino and Bhowal, Pratik and Kuo, Chia-liang and Chen, Yi-Ting and Erickson, Zackory and Held, David},
      journal={arXiv preprint arXiv:2503.03045},
      year={2025}
}