Breakthrough Technique Enables Smarter, More Interpretable Robot Decision-Making

Researchers from UC Berkeley, the University of Warsaw and Stanford have developed a new technique called Embodied Chain-of-Thought (ECoT) reasoning to enhance the decision-making capabilities of vision-language-action (VLA) models used in robotic control systems.

Key Takeaways: ECoT enables robots to reason about their actions in a way that is grounded in their perception of the environment, combining semantic reasoning about tasks with “embodied” reasoning about the robot’s state and surroundings:

  • By generating intermediate reasoning steps, ECoT allows VLAs to better map the relationships between different parts of a problem and come up with more accurate solutions, similar to how Chain-of-Thought (CoT) prompting has improved the performance of large language models (LLMs).
  • ECoT goes beyond just breaking down tasks into sub-tasks, as is common in CoT for LLMs, by also requiring the model to reason about the environment, spatial relationships, and how the robot’s available actions can help achieve the goal.

Overcoming challenges in applying CoT to robotics: Directly applying CoT techniques used in LLMs to VLAs posed several challenges that the researchers had to address:

  • Current VLAs rely on relatively small, open-source vision-language models (VLMs) that are less proficient at reasoning than the larger LLMs used in language applications.
  • Robotic tasks require the model to reason not only about the task itself but also about the environment and the robot’s own state, necessitating a more “embodied” form of reasoning.

Generating synthetic training data for ECoT: To enable VLA models to perform ECoT reasoning, the researchers created a pipeline to generate annotated training data:

  • Pre-trained object detectors, LLMs, and VLMs are used to annotate existing robot datasets with information that can be used for reasoning, such as object bounding boxes and spatial relationships.
  • Google’s Gemini model is then employed to generate the final reasoning chain, which includes rephrasing the instruction, outlining sub-tasks, identifying the current focus based on the environment and robot state, and predicting pixel locations of key elements.
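To make the shape of such a reasoning chain concrete, here is a minimal Python sketch of one annotated timestep. The class name, field names, text tags, and example values are all hypothetical and chosen for illustration; the researchers' actual data format may differ. It only shows how the pieces described above (rephrased instruction, sub-task plan, current focus, pixel locations, and bounding boxes) could be flattened into a single text target for training.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ReasoningStep:
    """One annotated timestep (hypothetical schema, not the authors' format)."""
    instruction: str                 # rephrased task instruction
    subtasks: List[str]              # high-level plan as ordered sub-tasks
    current_subtask: str             # focus given the scene and robot state
    gripper_px: Tuple[int, int]      # predicted (x, y) pixel of the gripper
    # object name -> bounding box (x_min, y_min, x_max, y_max) in pixels
    objects: Dict[str, Tuple[int, int, int, int]] = field(default_factory=dict)

    def to_text(self) -> str:
        """Flatten the annotation into one string, one tagged line per field."""
        boxes = ", ".join(
            f"{name} {list(box)}" for name, box in sorted(self.objects.items())
        )
        return "\n".join([
            f"TASK: {self.instruction}",
            "PLAN: " + "; ".join(self.subtasks),
            f"SUBTASK: {self.current_subtask}",
            f"GRIPPER: {list(self.gripper_px)}",
            f"OBJECTS: {boxes}",
        ])


# Made-up example values, for illustration only.
example = ReasoningStep(
    instruction="Put the carrot on the plate",
    subtasks=["reach the carrot", "grasp the carrot", "move to the plate", "release"],
    current_subtask="reach the carrot",
    gripper_px=(128, 75),
    objects={"carrot": (40, 60, 90, 110), "plate": (150, 80, 230, 160)},
)
text = example.to_text()
print(text)
```

In a setup like this, the flattened text would come before the action in the training target, so that the model produces its reasoning first and then the action conditioned on it.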

Impressive performance gains and improved interpretability: Evaluating ECoT on a robotic manipulation setup using OpenVLA yielded significant improvements:

  • ECoT raised the absolute task success rate by 28 percentage points over the baseline OpenVLA model, demonstrating strong generalization to new objects, scenes, viewpoints and instructions not present in the training data.
  • Expressing the reasoning steps in natural language made it much easier to understand why the model failed in certain situations, enabling humans to provide feedback and correct the policy’s behavior more effectively.

Broader implications for foundation models in robotics: ECoT is part of a growing trend of integrating foundation models, such as LLMs and VLMs, into robotic control systems to fill in gaps and enhance capabilities:

  • Foundation models’ ability to ingest large amounts of unlabeled data from the internet allows them to contribute to various parts of the robotics stack, from designing reward functions to reasoning about the environment and planning actions.
  • As the industry moves toward foundation models optimized for robotics, techniques like ECoT will play a crucial role in enabling more robust, interpretable, and generalizable robot control policies.
