DeepMind, a unit of Google, has developed a new robotics transformer model called RT-2 that combines language and vision to control robots. The model is trained on images, text, and coordinates describing a robot's movement in space. It can then generate both a plan of action and the coordinates needed to carry out a given command. The key insight of the research is representing robot actions as another language: by writing actions out as token strings, the same transformer that generates text can generate meaningful actions from the input. Encoding coordinates this way is a significant milestone because it bridges low-level robot control and the language-and-image capabilities of neural nets.
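The "actions as another language" idea can be illustrated with a small sketch. This is not DeepMind's code; the 7-dimensional action layout and the 256-bin discretization are assumptions made here for illustration. Each continuous action value is mapped to one of 256 bins and written out as an integer, so a text-generating model can emit actions as ordinary token strings:

```python
# Illustrative sketch of representing robot actions as token strings:
# each continuous action dimension is discretized into 256 bins and
# rendered as an integer, so a transformer that outputs text can output
# actions in the same format. Bin count and action layout are assumed.

NUM_BINS = 256

def action_to_tokens(action, low=-1.0, high=1.0):
    """Map each continuous value in [low, high] to a bin index 0..255."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        bin_index = int((clipped - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(bin_index)
    return " ".join(str(t) for t in tokens)

def tokens_to_action(token_string, low=-1.0, high=1.0):
    """Invert the mapping, recovering approximate continuous values."""
    return [
        low + int(t) / (NUM_BINS - 1) * (high - low)
        for t in token_string.split()
    ]

# A hypothetical 7-dim action: dx, dy, dz, droll, dpitch, dyaw, gripper.
action = [0.1, -0.2, 0.0, 0.0, 0.0, 0.5, 1.0]
encoded = action_to_tokens(action)
decoded = tokens_to_action(encoded)
```

The round trip loses at most one bin's width of precision, which is the usual trade-off when discretizing continuous control for a token-based model.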
RT-2 builds upon previous efforts by Google, including PaLI-X and PaLM-E, which are vision-language models. These models mix text and image data to develop the ability to relate the two, such as assigning captions to images or answering questions about them. RT-2 goes beyond its predecessors by not only generating a plan of action but also producing the coordinates of movement in space. Because it is built on vision-language models with billions of parameters, it inherits their broad knowledge and is more proficient at performing tasks. The training of RT-2 mixes image-text combinations with actions extracted from recorded robot data.
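One way to picture that mixed training regime is that every example, whether it comes from the web or from a robot, is cast into the same (image, prompt, target-text) shape. The sketch below is a hypothetical data-preparation step, not RT-2's actual pipeline; field names and the sample action string are invented for illustration:

```python
# Minimal sketch of mixing vision-language data with robot data by
# giving both the same (image, prompt, target) format, so one model can
# be trained on captions/answers and on action-token strings alike.

def vqa_example(image, question, answer):
    # Web-style vision-language data: the target is ordinary text.
    return {"image": image, "prompt": question, "target": answer}

def robot_example(image, instruction, action_tokens):
    # Robot demonstration data: the target is an action written out as
    # a string of discretized tokens (format assumed for illustration).
    return {"image": image, "prompt": instruction, "target": action_tokens}

mixed_batch = [
    vqa_example("img_001.png", "What is on the table?", "a green apple"),
    robot_example("img_002.png", "pick up the apple",
                  "1 128 91 130 127 127 128 255"),
]
```

Because both kinds of example share one format and one next-token training objective, nothing in the model architecture has to distinguish "answering a question" from "emitting an action".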
Once trained, RT-2 is tested by instructing the robot to perform various tasks using natural-language commands and images. The model generates a plan of action accompanied by the coordinates needed to carry it out. It demonstrates the ability to generalize to real-world situations and to interpret relations between objects, even ones that never appeared in the robot demonstrations. Compared with previous models, RT-2 built on either PaLI-X or PaLM-E performs significantly better, particularly on these unseen objects and scenarios.
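At inference time, the description above amounts to decoding one text string that contains both a plan and an action suffix, then splitting it apart for the robot's control stack. The output syntax below is an illustrative assumption, not RT-2's exact format:

```python
# Hedged sketch of the inference step: the model emits a single string
# containing a human-readable plan and a trailing action-token sequence;
# the robot stack separates the two and converts the tokens back into
# motor commands. "Plan:"/"Action:" markers are assumed for illustration.

def parse_model_output(output: str):
    """Split 'Plan: ... Action: t0 t1 ...' into a plan string and ints."""
    plan_part, action_part = output.split("Action:")
    plan = plan_part.replace("Plan:", "").strip()
    tokens = [int(t) for t in action_part.split()]
    return plan, tokens

plan, tokens = parse_model_output(
    "Plan: pick up the bag about to fall off the table "
    "Action: 1 128 91 241 5 101 127 217"
)
```

Keeping the plan human-readable while the action stays machine-decodable is what lets the same generated string serve both as an explanation and as a command.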
The development of RT-2 represents a significant step towards real-time instruction of robots. By combining language, vision, and robot data, the model can understand and execute commands in a meaningful way, opening up new possibilities for human-robot interaction.