GPT-4 Powered Robot Showcases Realistic Behavior, Hinting at Future of Robotics

Alter3, a GPT-4 powered humanoid robot, showcases the potential of combining advanced language models with robotics to create more realistic and adaptable robot behaviors.

Harnessing the power of large language models: Alter3 leverages GPT-4’s vast knowledge to directly map natural language commands to robot actions, simplifying the process of controlling the robot’s 43 axes:

  • Researchers at the University of Tokyo and Alternative Machine have designed Alter3 to take advantage of GPT-4’s capabilities, enabling it to perform complex tasks like taking a selfie or mimicking a ghost.
  • GPT-4 acts as a planner, determining the steps required to perform the desired action, then uses its in-context learning ability to generate the commands the robot executes at each step (see the sketch after this list).
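To make this two-step prompting pattern concrete, here is a minimal sketch in Python of how a planner call could feed a second call that emits axis-level commands. It assumes the OpenAI chat completions API; the prompts and the set_axis_angle() helper are hypothetical stand-ins for Alter3's actual control interface, not the researchers' code.

```python
# Hedged sketch: GPT-4 first plans the movement, then translates the plan
# into per-axis commands. Prompts and set_axis_angle() are illustrative.
from openai import OpenAI

client = OpenAI()

PLANNER_PROMPT = (
    "You control a humanoid robot with 43 motorized axes. "
    "Break the following instruction into a short, ordered list of body movements."
)

CODER_PROMPT = (
    "Translate each movement step into Python calls of the form "
    "set_axis_angle(axis: int, angle: float), one call per line."
)

def plan_and_generate(instruction: str) -> str:
    # Step 1: GPT-4 acts as a planner and decomposes the instruction.
    plan = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PLANNER_PROMPT},
            {"role": "user", "content": instruction},
        ],
    ).choices[0].message.content

    # Step 2: GPT-4 turns the plan into low-level axis commands in context.
    code = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": CODER_PROMPT},
            {"role": "user", "content": plan},
        ],
    ).choices[0].message.content
    return code

print(plan_and_generate("Take a selfie with your right hand."))
```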

Refining actions through human feedback: Since language may not always precisely describe physical poses, Alter3 incorporates a feedback loop that allows humans to provide corrections, further improving the robot’s performance:

  • Users can provide feedback such as “Raise your arm a bit more,” which is sent to another GPT-4 agent that reasons over the code, makes necessary corrections, and returns the updated action sequence to the robot.
  • The refined action recipe and code are stored in a database for future use, enabling Alter3 to learn and adapt its behaviors over time (a sketch of this loop follows below).
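As a rough illustration of the correction loop, the sketch below sends a verbal fix and the previously generated axis code to a second GPT-4 call, then caches the revised version for reuse. The refine_action() function and the JSON file standing in for the database are assumptions made for illustration, not the project's actual implementation.

```python
# Hedged sketch of the human-feedback loop: a second GPT-4 agent revises the
# generated axis commands, and the refined action is stored for later replay.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()
MEMORY_FILE = Path("action_memory.json")  # simple stand-in for a database

REVISER_PROMPT = (
    "You are given Python code that drives a humanoid robot's 43 axes, plus a "
    "human correction in natural language. Return the corrected code only."
)

def refine_action(action_name: str, code: str, feedback: str) -> str:
    # Ask a second GPT-4 agent to reason over the code and apply the correction.
    revised = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": REVISER_PROMPT},
            {"role": "user", "content": f"Code:\n{code}\n\nFeedback: {feedback}"},
        ],
    ).choices[0].message.content

    # Persist the refined action so it can be replayed without re-prompting.
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    memory[action_name] = revised
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))
    return revised

# Example: nudging a previously generated "wave" action.
# refine_action("wave", wave_code, "Raise your arm a bit more.")
```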

Demonstrating emotional expression and realistic behaviors: GPT-4’s extensive knowledge about human behaviors and actions enables Alter3 to create more realistic behavior plans and even mimic emotions:

  • Experiments show that Alter3 can mimic emotions such as embarrassment and joy, even when emotional expressions are not explicitly stated in the text instructions.
  • GPT-4’s linguistic representations of movements can be accurately mapped onto Alter3’s body, resulting in more natural and human-like behaviors.

The growing trend of foundation models in robotics: Alter3 is part of a growing body of research that combines the power of foundation models with robotics systems:

  • Other efforts, such as Figure's humanoid robots and the RT-2-X and OpenVLA vision-language-action models, also use foundation models as reasoning and planning modules in robot control systems, showing the broader potential of this approach.
  • As multi-modality becomes the norm in foundation models, robotics systems will become better equipped to reason about their environment and choose their actions.

Analyzing deeper: While the integration of advanced language models like GPT-4 with robotics systems is a significant step forward, there are still challenges to be addressed:

  • Projects like Alter3 often sidestep the fundamental challenges of building robots that can handle primitive tasks such as grasping objects, maintaining balance, and moving around.
  • Foundation models fine-tuned specifically for robotics commands, such as RT-2-X and OpenVLA, may produce more stable results and generalize better across tasks and environments, but they demand more engineering expertise and are costlier to create.
  • The lack of data for low-level robot tasks remains a significant hurdle in the development of more advanced and adaptable robotics systems.
