Robot learning of real-world manipulation tasks remains challenging and time consuming, even though actions are often simplified by single-step manipulation primitives. To compensate for the removed time dependency, we additionally learn an image-to-image transition model that is able to predict the next state including its uncertainty. We apply this approach to bin picking, the task of emptying a bin as fast as possible using grasping as well as pre-grasping manipulation. The transition model is trained with up to 42000 pairs of real-world images before and after a manipulation action. Our approach enables two important skills: First, for applications with flange-mounted cameras, picks per hour (PPH) can be increased by around 15% by skipping image measurements. Second, we use the model to plan action sequences ahead of time and optimize time-dependent rewards, e.g. to minimize the number of actions required to empty the bin. We evaluate both improvements with real-robot experiments and set a new state-of-the-art result in the Box and Blocks Test with 702 ± 14 PPH.
Below, we present supplementary material for our publication submitted to ICRA 2021. This includes more visual examples of the transition model (as predicted depth images with corresponding uncertainties), more videos and more detailed experimental results. We follow the structure of the experimental evaluation in the paper and the conference video above.
To recap the paper's introduction, we focus on the task of bin picking using a flange-mounted depth camera. The robot should empty the bin as fast as possible using grasping and pre-grasping manipulation (like shifting and pushing). We proposed and implemented a transition model, which is able to predict the depth image after a given action as well as the corresponding reward. It was trained on around 42000 pairs of real-world images before and after a manipulation action, using a dataset with three different object types.
We compare different neural network architectures for the transition model.
The transition model is also able to predict complete RGBD images. However, we only use the depth image for downstream manipulation.
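As a rough illustration of such an action-conditioned predictive model, the sketch below shows a minimal encoder-decoder that maps a depth image and a discrete action type to a predicted next depth image plus a per-pixel uncertainty map. It is not the architecture from the paper; the `TransitionModel` class, all layer sizes, and the log-variance output head are assumptions for illustration only.

```python
# Minimal sketch (NOT the paper's exact architecture): an action-conditioned
# encoder-decoder predicting the next depth image and a per-pixel variance.
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    def __init__(self, num_action_types: int = 5, action_dim: int = 16):
        super().__init__()
        self.action_embed = nn.Embedding(num_action_types, action_dim)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64 + action_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # channels: [depth, log-variance]
        )

    def forward(self, depth: torch.Tensor, action_type: torch.Tensor):
        z = self.encoder(depth)                              # (B, 64, H/4, W/4)
        a = self.action_embed(action_type)                   # (B, action_dim)
        a = a[:, :, None, None].expand(-1, -1, z.shape[2], z.shape[3])
        out = self.decoder(torch.cat([z, a], dim=1))         # (B, 2, H, W)
        depth_pred, log_var = out[:, :1], out[:, 1:]
        return depth_pred, log_var.exp()                     # predicted depth, per-pixel variance

# Example: predict the next depth image for a single 64x64 crop.
model = TransitionModel()
depth = torch.rand(1, 1, 64, 64)
action = torch.tensor([3])                                   # e.g. the shifting action
next_depth, variance = model(depth, action)
```

Predicting a log-variance alongside the depth is one common way to obtain per-pixel uncertainties; the architectures compared in the paper may parameterize this differently.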
We evaluate our transition model using objects that were never seen during training.
By applying the transition model iteratively, the robot is able to plan multiple actions based on a single image. Below, the predicted images of three examples are shown. We denote step 0 as the measured image with zero uncertainty; all following images are predictions. For each step, we give the measured reward after execution, the estimated reward with its propagated uncertainty before execution, and the action type. We use three grasping actions with differently pre-shaped gripper widths (types 0, 1, and 2), one shifting action (type 3) and a bin-empty action (type 4). Within the depth images (top) and the uncertainty images (bottom), the gripper is sketched (white) with a bounding rectangle (red). The uncertainty is shown from low (blue) to high (red).
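The iterative lookahead itself can be written down in a few lines. The sketch below assumes the hypothetical `TransitionModel` from above and a hypothetical `estimate_reward` helper; in particular, simply summing per-pixel variances is only a crude stand-in for the uncertainty propagation used in the paper.

```python
# Sketch of multi-step lookahead by feeding predictions back into the model.
import torch

def rollout(model, measured_depth, action_sequence, estimate_reward):
    """Apply the transition model iteratively: step 0 is the measured image
    with zero uncertainty, all following states are predictions."""
    depth = measured_depth
    variance = torch.zeros_like(depth)
    steps = [{"depth": depth, "variance": variance, "action": None, "reward": None}]
    for action in action_sequence:
        reward = estimate_reward(depth, action)              # estimated reward before execution
        depth, step_variance = model(depth, torch.tensor([action]))
        variance = variance + step_variance                  # crude accumulation of uncertainty
        steps.append({"depth": depth, "variance": variance,
                      "action": action, "reward": reward})
    return steps
```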
Let the grasp rate be the average percentage of successful grasps. We investigated how the grasp rate depends on the prediction step in a bin picking scenario. Across 1080 grasp attempts, we find that even after the third prediction step the grasp rate is still over 90%. A graphical evaluation is shown in the paper.
Detailed experimental results of each grasp (N = 1080) can be found here.
In the YCB Box and Blocks Benchmark, the robot has to grasp and place as many objects as possible into another bin within 2 minutes. The wooden cubes have a side length of 2.5 cm; the gripper needs to be above the other bin before dropping. The robotic benchmark - in contrast to the original medical test - allows grasping multiple objects at once. We evaluate different settings, including one where the robot has learned to grasp multiple objects using semantic grasping. Combining this with the transition model, we achieved up to 26 objects in 2 minutes.
Fig. 1: Our setup for the YCB Box and Blocks Benchmark. Note that the Franka Panda robotic arm has a max. velocity of 1.7 m/s and a max. acceleration of 13 m/s², leaving room for improvement by using a faster robot.
Tab. 1: The summarized results of the YCB Box and Blocks Benchmark. We evaluated two different settings: First, single or multiple denotes whether the robot was trained to grasp multiple objects at once via semantic grasping. Second, prediction denotes whether the transition model was used to skip some image measurements. The random baseline uses grasps sampled uniformly within the bounding box.
Method | Objects [# in 2 min] | Grasp Rate [%] | Picks Per Hour (PPH) | Video |
---|---|---|---|---|
Random | 2.2 ± 0.7 | 13 ± 4 | 66 ± 20 | 1 Object 🔗 |
Single | 12.8 ± 0.3 | 97 ± 2 | 384 ± 10 | 13 Objects 🔗 |
Single + Prediction | 16.4 ± 0.5 | 94 ± 2 | 492 ± 14 | 15 Objects 🔗 |
Multiple | 20.4 ± 1.0 | 96 ± 2 | 612 ± 29 | 20 Objects 🔗 |
Multiple + Prediction | 23.4 ± 0.3 | 94 ± 2 | 702 ± 14 | 26 Objects 🔗 |
Detailed experimental results of each try can be found here.
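For reference, the PPH column follows directly from the object count, since the benchmark window is fixed at 2 minutes: PPH = objects × 30. The snippet below simply spells out this conversion for the last row of Tab. 1.

```python
# Convert objects transported in the 2-minute benchmark window to picks per hour.
objects_per_run = 23.4            # Multiple + Prediction, see Tab. 1
pph = objects_per_run * (60 / 2)  # 30 two-minute windows per hour
print(pph)                        # 702.0, matching the 702 ± 14 PPH reported above
```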
We've implemented a breadth-first planning tree for optimizing multi-step cost functions. Below, we demonstrate the planning capability in an experiment where the robot minimizes the number of actions required to empty the bin. Videos and image predictions are shown for two typical approaches - using different numbers of steps - to empty the bin. By planning ahead, the robot more often starts by shifting the middle cube, thus reducing the average number of actions from 4.8 to 4.1 steps.
Fig. 2: For this example configuration of three cubes in a row, the robot is able to reduce the average number of actions by 0.7 steps by planning a few steps ahead.
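A breadth-first planning tree of this kind can be sketched as follows. The `ACTIONS` list, the `is_bin_empty` predicate and the maximum search depth are hypothetical placeholders; the real planner additionally weighs predicted rewards and propagated uncertainties rather than relying on a binary empty check.

```python
# Sketch of a breadth-first planning tree over the (hypothetical) transition model.
import torch

ACTIONS = [0, 1, 2, 3]                 # three grasping primitives and one shift

def plan_fewest_actions(model, depth, is_bin_empty, max_depth=4):
    """Expand all action sequences level by level and return the shortest
    sequence whose predicted final state empties the bin."""
    frontier = [([], depth)]
    for _ in range(max_depth):
        next_frontier = []
        for sequence, state in frontier:
            for action in ACTIONS:
                next_state, _ = model(state, torch.tensor([action]))
                candidate = sequence + [action]
                if is_bin_empty(next_state):
                    return candidate           # first hit in BFS = fewest actions
                next_frontier.append((candidate, next_state))
        frontier = next_frontier
    return None                                # no emptying sequence within max_depth
```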