Researchers from UC Berkeley, the University of Warsaw, and Stanford University have developed a new technique called Embodied Chain-of-Thought (ECoT) reasoning to enhance the decision-making capabilities of vision-language-action (VLA) models used in robotic control systems.
Key Takeaways: ECoT enables robots to reason about their actions in a way that is grounded in their perception of the environment, combining semantic reasoning about tasks with “embodied” reasoning about the robot’s state and surroundings:
- By generating intermediate reasoning steps before predicting an action, ECoT lets VLAs work through the relationships between the parts of a problem and arrive at more accurate solutions, much as Chain-of-Thought (CoT) prompting has improved the performance of large language models (LLMs).
- ECoT goes beyond breaking a task into sub-tasks, as is common in CoT for LLMs: the model must also reason about the environment, spatial relationships, and how the robot’s available actions can advance the goal (see the sketch after this list).
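To make this concrete, here is a minimal sketch of what a single embodied reasoning chain might look like, rendered as a Python structure. The field names and values are illustrative, not taken from the paper’s released data format; they loosely follow the step types the researchers describe (task, plan, current sub-task, low-level move, gripper position, visible objects, and the final action):

```python
# Illustrative sketch of one embodied reasoning chain, emitted as text by the
# VLA before its action tokens. All field names and values are hypothetical.
ecot_chain = {
    "task": "Put the eggplant in the pot.",
    "plan": [
        "move to the eggplant",
        "grasp the eggplant",
        "move above the pot",
        "release the eggplant",
    ],
    "subtask": "grasp the eggplant",               # current focus
    "move": "move the gripper down and close it",  # low-level movement reasoning
    "gripper_position": [142, 97],                 # predicted pixel location
    "visible_objects": {
        "eggplant": [120, 80, 180, 140],           # bounding box (x1, y1, x2, y2)
        "pot": [240, 60, 330, 170],
    },
    "action": [0.01, -0.02, -0.05, 0.0, 0.0, 0.0, 1.0],  # illustrative 7-DoF delta action
}
```

Everything before the `"action"` entry is reasoning the model generates in natural language and pixel coordinates; the action itself is then predicted conditioned on that chain.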
Overcoming challenges in applying CoT to robotics: Directly applying CoT techniques used in LLMs to VLAs posed several challenges that the researchers had to address:
- Current VLAs are built on relatively small, open-source vision-language models (VLMs) that are not as proficient at reasoning as the much larger LLMs used in language applications.
- Robotic tasks require the model to reason not only about the task itself but also about the environment and the robot’s own state, necessitating a more “embodied” form of reasoning.
Generating synthetic training data for ECoT: To enable VLA models to perform ECoT reasoning, the researchers created a pipeline to generate annotated training data:
- Pre-trained object detectors, LLMs, and VLMs are used to annotate existing robot datasets with information that can be used for reasoning, such as object bounding boxes and spatial relationships.
- Google’s Gemini model is then employed to generate the final reasoning chain, which includes rephrasing the instruction, outlining sub-tasks, identifying the current focus based on the environment and robot state, and predicting the pixel locations of key elements such as the gripper (a sketch of this pipeline follows this list).
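A rough sketch of how such a pipeline could be wired together is shown below. The helper functions are hypothetical stand-ins for the pre-trained detector, the VLM, and a Gemini-style LLM call; none of these names come from the researchers’ code:

```python
# Minimal sketch of the data-generation pipeline, assuming three hypothetical
# helpers: an open-vocabulary object detector, a VLM scene describer, and an
# LLM client. The stubs below return fixed values so the sketch runs as-is.

def detect_objects(image):
    """Hypothetical detector call: returns {object_name: bounding_box}."""
    return {"eggplant": (120, 80, 180, 140), "pot": (240, 60, 330, 170)}

def describe_scene(image):
    """Hypothetical VLM call: returns a natural-language scene description."""
    return "An eggplant and a pot sit on a kitchen counter."

def generate_reasoning(prompt):
    """Hypothetical LLM (e.g. Gemini) call: returns the reasoning chain text."""
    return "TASK: ... PLAN: ... SUBTASK: ... MOVE: ... GRIPPER: ..."

def annotate_episode(instruction, frames, actions):
    """Annotate one recorded robot episode with a reasoning chain per step."""
    annotated = []
    for image, action in zip(frames, actions):
        objects = detect_objects(image)   # bounding boxes ground the reasoning
        scene = describe_scene(image)     # semantic scene context
        prompt = (
            f"Instruction: {instruction}\n"
            f"Scene: {scene}\n"
            f"Objects: {objects}\n"
            f"Next robot action: {action}\n"
            "Rephrase the instruction, outline sub-tasks, name the current "
            "sub-task, and explain the low-level move that achieves it."
        )
        annotated.append({"reasoning": generate_reasoning(prompt),
                          "objects": objects,
                          "action": action})
    return annotated
```

In the actual pipeline, such annotations are generated offline for existing robot datasets, and the VLA is then fine-tuned to emit the reasoning chain before its action tokens.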
Impressive performance gains and improved interpretability: Evaluating ECoT with the OpenVLA model in a robotic manipulation setup yielded significant improvements:
- ECoT increased the task success rate by 28% compared to the baseline model, demonstrating strong generalization to new objects, scenes, viewpoints, and instructions not present in the training data.
- Expressing the reasoning steps in natural language made it much easier to understand why the model failed in certain situations, enabling humans to provide feedback and correct the policy’s behavior more effectively (sketched below).
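The sketch below illustrates why this matters in practice: because the chain is plain text, a human can overwrite a faulty sub-task before the action is decoded. The two-stage interface (`generate_chain` / `decode_action`) is a hypothetical simplification of an ECoT-style VLA, not an API from the OpenVLA codebase:

```python
# Hypothetical sketch of human-in-the-loop correction via the reasoning chain.
# generate_chain and decode_action stand in for the VLA's two decoding stages;
# the stubs return fixed values so the sketch runs as-is.

def generate_chain(image, instruction):
    """Stub for the VLA's reasoning pass: returns the chain as editable text."""
    return {"subtask": "grasp the cup", "move": "move down and close gripper"}

def decode_action(image, chain):
    """Stub for the VLA's action pass, conditioned on the (possibly edited) chain."""
    return [0.0, 0.0, -0.05, 0.0, 0.0, 0.0, 1.0]

def control_step(image, instruction, human_fix=None):
    chain = generate_chain(image, instruction)
    if human_fix:                     # e.g. the model picked the wrong object
        chain["subtask"] = human_fix  # overwrite the faulty sub-task in place
    return decode_action(image, chain)

# Example: an operator notices the robot reaching for the wrong object and
# redirects it simply by editing the sub-task text.
action = control_step(image=None, instruction="put the cup in the sink",
                      human_fix="grasp the blue cup")
```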
Broader implications for foundation models in robotics: ECoT is part of a growing trend of integrating foundation models, such as LLMs and VLMs, into robotic control systems to fill in gaps and enhance capabilities:
- Because foundation models can ingest large amounts of unlabeled data from the internet, they can contribute to various parts of the robotics stack, from designing reward functions to reasoning about the environment and planning actions.
- As the industry moves toward foundation models optimized for robotics, techniques like ECoT will play a crucial role in enabling more robust, interpretable, and generalizable robot control policies.