Flexible pick-and-place is a fundamental yet challenging task in robotics, in particular because defining even a simple target pose usually requires an object model. In this work, the robot instead learns to pick-and-place objects using planar manipulation according to a single, demonstrated goal state. Our primary contribution lies in combining robot learning of manipulation primitives, commonly estimated by fully-convolutional neural networks, with one-shot imitation learning. To this end, we define the place reward as a contrastive loss between real-world measurements and a task-specific noise distribution. Furthermore, we design our system to learn in a self-supervised manner, enabling real-world experiments with up to 25000 pick-and-place actions. The robot then places trained objects with an average placement error of 2.7±0.2 mm and 2.6±0.8°. As our approach does not require an object model, the robot generalizes to unknown objects, maintaining a precision of 5.9±1.1 mm and 4.1±1.2°. We further show a range of emerging behaviors: the robot naturally learns to select the correct object in the presence of multiple object types, precisely inserts objects within a peg game, picks screws out of dense clutter, and infers multiple pick-and-place actions from a single goal state.
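To illustrate the idea of a contrastive place reward, the following is a minimal sketch: the post-place observation is scored against the demonstrated goal relative to samples drawn from a task-specific noise distribution, in the style of an InfoNCE objective. The function name, the use of cosine similarity, the temperature value, and the embedding inputs are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def contrastive_place_reward(goal_emb, after_emb, noise_embs, temperature=0.1):
    """Hypothetical contrastive reward: how well the observation after
    placing (after_emb) matches the demonstrated goal (goal_emb),
    relative to embeddings sampled from a noise distribution.
    Cosine similarity and the softmax temperature are assumptions."""
    def cos_sim(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Positive score: similarity between goal and actual measurement.
    pos = np.exp(cos_sim(goal_emb, after_emb) / temperature)
    # Negative scores: similarity between goal and noise samples.
    neg = sum(np.exp(cos_sim(goal_emb, n) / temperature) for n in noise_embs)
    # Reward in (0, 1): probability that the measurement is the true match.
    return pos / (pos + neg)
```

A perfectly matching measurement yields a reward close to 1, while a measurement indistinguishable from noise yields a reward near 1 / (1 + number of noise samples).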
Below, we show supplementary videos of our pick-and-place system. As our approach places objects according to a demonstrated goal state, it does not require an object model. We trained two models: first, a model using RGBD images, trained on around 3500 pick-and-place actions with screws; second, a general model using depth images, trained on around 25000 pick-and-place actions while manipulating wooden objects. The latter is used for all further experiments not involving screws.
As no object model is needed, our system is able to pick-and-place even unknown objects with high precision.
To further demonstrate the precision of our system, we evaluate insertion tasks with small tolerances. Depending on the object type, the robot achieves success rates of up to 90% despite grasping out of clutter.
Our robot is able to infer multiple pick-and-place actions from a single goal state.
Here we show samples of successful pick-and-place actions. Our approach uses four images, each cropped to a window around the robot's tool center point (TCP), as its visual state space.
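As a rough sketch of how such a TCP-centered window could be extracted, the helper below crops a square patch of a depth image around the TCP's pixel position, padding at the borders so the crop is always full-size. The function name, the window size, and the edge-padding strategy are assumptions for illustration; which four views are stacked into the state is likewise not shown here.

```python
import numpy as np

def crop_tcp_window(depth_image, tcp_px, size=64):
    """Crop a (size x size) window of a depth image centered on the
    TCP's pixel coordinates (u, v). Hypothetical helper; the window
    size and edge padding are illustrative assumptions."""
    u, v = tcp_px
    half = size // 2
    # Pad with edge values so windows near the border stay full-size.
    padded = np.pad(depth_image, half, mode="edge")
    # In padded coordinates, the window's top-left corner is (v, u).
    return padded[v:v + size, u:u + size]
```

Cropping the state to a local window around the TCP makes the learned policies translation-invariant by construction, which fits the fully-convolutional estimation of manipulation primitives.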