Abstract

Robot learning of real-world manipulation tasks remains challenging and time consuming, even though actions are often simplified by single-step manipulation primitives. To compensate for the removed time dependency, we additionally learn an image-to-image transition model that predicts the next state including its uncertainty. We apply this approach to bin picking, the task of emptying a bin as fast as possible using grasping as well as pre-grasping manipulation. The transition model is trained with up to 42000 pairs of real-world images taken before and after a manipulation action. Our approach enables two important skills: First, for applications with flange-mounted cameras, picks per hour (PPH) can be increased by around 15% by skipping image measurements. Second, we use the model to plan action sequences ahead of time and to optimize time-dependent rewards, e.g. to minimize the number of actions required to empty the bin. We evaluate both improvements in real-robot experiments and set a new state-of-the-art result in the Box and Blocks Test with 702 ± 14 PPH.

Conference Video


Supplementary Material

Below, we present supplementary material for our publication submitted to ICRA 2021. This includes more visual examples of the transition model (as predicted depth images with corresponding uncertainties), more videos and more detailed experimental results. We follow the structure of the experimental evaluation in the paper and the conference video above.

1. Example Predictions

To recap the paper's introduction, we focus on the task of bin picking with a flange-mounted depth camera. The robot should empty the bin as fast as possible using grasping and pre-grasping manipulation (like shifting and pushing). We proposed and implemented a transition model that predicts the depth image after a given action together with its corresponding reward. It was trained on around 42000 pairs of real-world images taken before and after a manipulation action, using a dataset with three different object types.
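To make the prediction task concrete, the sketch below shows a minimal transition model in PyTorch that outputs a per-pixel depth estimate together with a log-variance and is trained with a Gaussian negative log-likelihood. The architecture, the action encoding and the loss are illustrative assumptions, not the exact networks compared below.

```python
# Minimal sketch of a depth-image transition model with per-pixel
# uncertainty. Illustrative assumption; not the exact architectures
# (BicycleGAN, Pix2Pix, VAE) compared in the paper.
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    def __init__(self, action_dim: int = 5):
        super().__init__()
        # Encoder over the depth image before the action (1 channel).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        # The action vector (e.g. type + pose) is broadcast over the feature map.
        self.action_fc = nn.Linear(action_dim, 64)
        # Decoder outputs two channels: predicted depth and its log-variance.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),
        )

    def forward(self, depth, action):
        h = self.encoder(depth)
        h = h + self.action_fc(action)[:, :, None, None]
        out = self.decoder(h)
        mean, log_var = out[:, :1], out[:, 1:]
        return mean, log_var

def heteroscedastic_loss(mean, log_var, target):
    # Gaussian negative log-likelihood: pixels the model is unsure about
    # may take a larger variance at the cost of the log-variance term.
    return (0.5 * torch.exp(-log_var) * (target - mean) ** 2
            + 0.5 * log_var).mean()
```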

1.1 Comparison of BicycleGAN, Pix2Pix and VAE

We compare different neural network architectures for use as the transition model.

Before
After
BicycleGAN
Pix2Pix
VAE

1.2 RGBD Predictions

The transition model is also able to predict complete RGBD images. However, we only use the depth image for downstream manipulation.

Before
After
BicycleGAN

1.3 Generalization Predictions

We evaluate our transition model using objects that were never seen during training.

Before
After
BicycleGAN

2. Iterative Prediction

By applying the transition model iteratively, the robot is able to plan multiple actions based on a single image. Below, the predicted images of three examples are shown. We denote step 0 as the measured image with zero uncertainty; all following images are predictions. For each step, we give the measured reward after execution, the estimated reward with its propagated uncertainty before execution, and the action type. We use three grasping actions with different pre-shaped gripper widths (types 0, 1, and 2), one shifting action (type 3) and a bin-empty action (type 4). Within the depth images (top) and the uncertainty images (bottom), the gripper is sketched in white with a red bounding rectangle. Uncertainty is shown from low (blue) to high (red).
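The following sketch illustrates how such an iterative rollout can propagate uncertainty by Monte Carlo sampling: step 0 is the measured image with zero uncertainty, and each later step feeds sampled next states back into the model. The model interface follows the sketch in Section 1 and is an assumption about the actual implementation.

```python
import torch

@torch.no_grad()
def rollout(model, depth0, actions, num_samples: int = 16):
    """Predict a sequence of depth images from one measured image.

    depth0: measured depth image of shape (1, 1, H, W).
    actions: list of action tensors, each of shape (1, action_dim).
    Returns a list of (mean, variance) pairs, one per prediction step.
    """
    # Step 0 is a measurement, so all rollout samples start identical.
    samples = depth0.repeat(num_samples, 1, 1, 1)
    steps = []
    for action in actions:
        mean, log_var = model(samples, action.repeat(num_samples, 1))
        # Sample one next state per rollout and feed it back into the model.
        samples = mean + torch.randn_like(mean) * torch.exp(0.5 * log_var)
        # The spread across rollouts combines the model's per-step noise
        # with the uncertainty inherited from earlier steps.
        steps.append((samples.mean(0, keepdim=True),
                      samples.var(0, keepdim=True)))
    return steps
```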

2.1 Simple Example

2.2 Complex Manipulation

2.3 Bin Picking

3. Grasp Rate

Let the grasp rate be the average percentage of successful grasps. We investigated how the grasp rate depends on the prediction step in a bin picking scenario. Over 1080 grasp attempts, we find that the grasp rate remains above 90% even after the third prediction step. A graphical evaluation is shown in the paper.

Detailed experimental results of each grasp (N = 1080) can be found here.
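As a hypothetical post-processing example, the snippet below computes the grasp rate per prediction step from a list of (step, success) trials such as the N = 1080 attempts above. The data layout is illustrative.

```python
from collections import defaultdict

def grasp_rates(trials):
    """Return the grasp rate in percent for each prediction step."""
    counts = defaultdict(lambda: [0, 0])  # step -> [successes, attempts]
    for step, success in trials:
        counts[step][0] += int(success)
        counts[step][1] += 1
    return {step: 100.0 * s / n for step, (s, n) in sorted(counts.items())}

# Example: grasp_rates([(0, True), (0, True), (3, False)])
# returns {0: 100.0, 3: 0.0}
```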

4. YCB Box and Blocks Benchmark

In the YCB Box and Blocks Benchmark, the robot grasps and places as many objects as possible into a second bin within 2 minutes. The wooden cubes have a side length of 2.5 cm, and the gripper needs to be above the second bin before dropping. In contrast to the original medical test, the robotic benchmark allows grasping multiple objects at once. We evaluate different settings, in some of which the robot has learned to grasp multiple objects using semantic grasping. Combining this with the transition model, we achieved up to 26 objects in 2 minutes.

Fig. 1: Our setup for the YCB Box and Blocks Benchmark. Note that the Franka Panda robotic arm has a maximum velocity of 1.7 m/s and a maximum acceleration of 13 m/s², leaving room for improvement with a faster robot.

Tab. 1: Summarized results of the YCB Box and Blocks Benchmark. We evaluated two different settings: First, single or multiple denotes whether the robot was trained to grasp multiple objects at once via semantic grasping. Second, prediction denotes that the transition model was used to skip some image measurements. The random grasp rate corresponds to uniformly sampled grasps within the bounding box.

Method                | Objects    | Grasp Rate [%] | Picks Per Hour (PPH) | Video
Random                | 2.2 ± 0.7  | 13 ± 4         | 66 ± 20              | 1 Object 🔗
Single                | 12.8 ± 0.3 | 97 ± 2         | 384 ± 10             | 13 Objects 🔗
Single + Prediction   | 16.4 ± 0.5 | 94 ± 2         | 492 ± 14             | 15 Objects 🔗
Multiple              | 20.4 ± 1.0 | 96 ± 2         | 612 ± 29             | 20 Objects 🔗
Multiple + Prediction | 23.4 ± 0.3 | 94 ± 2         | 702 ± 14             | 26 Objects 🔗
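Since each benchmark run lasts 2 minutes, the PPH column follows directly from the object count: objects per run times 30 two-minute intervals per hour. The small helper below reproduces the table values.

```python
def picks_per_hour(objects_per_run: float, run_minutes: float = 2.0) -> float:
    # Scale the objects moved in one timed run to a full hour.
    return objects_per_run * 60.0 / run_minutes

for label, objects in [("Random", 2.2), ("Single", 12.8),
                       ("Single + Prediction", 16.4), ("Multiple", 20.4),
                       ("Multiple + Prediction", 23.4)]:
    print(f"{label}: {picks_per_hour(objects):.0f} PPH")
# Prints 66, 384, 492, 612 and 702 PPH, matching Tab. 1.
```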

Detailed experimental results of each run can be found here.

5. Planning Ahead

We implemented a breadth-first planning tree for optimizing multi-step cost functions. Below, we demonstrate the planning capability in an experiment where the robot minimizes the number of actions required to empty the bin. Videos and image predictions are shown for two typical approaches, using different numbers of steps, for emptying the bin. By planning ahead, the robot more often starts by shifting a middle cube, reducing the average number of actions from 4.8 to 4.1 steps.

Fig. 2: For this example configuration of three cubes in a row, the robot is able to reduce the average number of actions by 0.7 steps by planning a few steps ahead.
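For illustration, the sketch below shows a breadth-first search over predicted states that returns a shortest action sequence ending in an empty bin. The callables `predict` (one transition-model step) and `bin_empty` (a termination check on the predicted state) are assumed interfaces, not the exact code from the paper.

```python
from collections import deque

ACTIONS = [0, 1, 2, 3]  # grasp types 0-2 and the shift action (type 3)

def plan(depth0, predict, bin_empty, max_depth: int = 5):
    """Breadth-first search for a shortest action sequence that empties the bin."""
    queue = deque([(depth0, [])])
    while queue:
        state, actions = queue.popleft()
        if bin_empty(state):
            return actions  # BFS pops shorter sequences first
        if len(actions) < max_depth:
            for a in ACTIONS:
                # One step of the transition model expands the planning tree.
                queue.append((predict(state, a), actions + [a]))
    return None  # no plan found within the horizon
```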

5.1 Example for 4 actions

5.2 Example for 5 actions