Researchers at the University of Tokyo and Alternative Machine have developed a humanoid robotic system called Alter3 that maps natural-language commands directly to robot actions. By leveraging large language models (LLMs) such as GPT-4, Alter3 can perform complex tasks such as taking a selfie or imitating a ghost.
- Before the LLM was integrated, all 43 of the robot's axes had to be controlled manually to imitate a motion, which typically required extensive hand-tuning. With the LLM, complex motion control is achieved through natural-language instructions alone, with no iterative learning process (see the sketch after this list).
- Through verbal feedback, Alter3 can adjust its movement code according to the user's spoken instructions and store the improved movements in a database, building up an effective memory of body patterns.
- 107 participants, recruited through the online platform Prolific, evaluated nine different action videos. Actions generated by GPT-4 scored significantly higher than the randomly generated actions of the control group, indicating that GPT-4 can accurately map linguistic expressions onto Alter3's body.
- Alter3 can perform a variety of actions without additional training, which suggests that the LLM's training data already contains action descriptions. It can also imitate ghosts and animals, and reflect the emotional content of a conversation through facial expressions and gestures. The system can be adapted to other humanoid robots with only minor modifications.
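To make the contrast concrete, below is a minimal Python sketch of per-axis control versus language-driven control. The function name `set_axis`, the 0-255 value range, and the axis meanings are all illustrative assumptions; the article does not publish Alter3's actual low-level API.

```python
# Hypothetical sketch of Alter3's low-level interface: 43 axes, each set
# individually. The function name, value range, and axis meanings are
# assumptions for illustration only.

NUM_AXES = 43  # the article states Alter3 has 43 controllable axes

robot_state = [0] * NUM_AXES  # stand-in for the physical robot

def set_axis(axis: int, value: int) -> None:
    """Set one axis to a position value (hypothetical signature)."""
    assert 0 <= axis < NUM_AXES and 0 <= value <= 255
    robot_state[axis] = value

# Pre-LLM workflow: every pose hand-tuned, axis by axis.
set_axis(0, 128)   # e.g. head pitch (axis meanings are illustrative)
set_axis(17, 200)  # e.g. right shoulder

# Post-LLM workflow (detailed in the sections below): an instruction such
# as "take a selfie" is turned into a sequence of set_axis calls by GPT-4,
# with no manual tuning.
```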
Key Features of Alter3
Natural Language to Action Mapping
Alter3 directly converts natural-language commands into robot actions, letting users drive a wide range of tasks with simple spoken instructions.
Action Planning Based on the “Agent Framework”
Alter3 uses GPT-4 as its backend and plans task execution through an “agent framework”: the model first acts as a planner to determine the required action steps, and a coding agent then generates the specific robot commands.
Contextual Learning and API Adaptation
GPT-4 uses its in-context learning capability to adapt to and map the robot's API commands. Given a list of commands and usage examples, the model converts task steps into API commands and sends them to the robot for execution.
Human Feedback Support
Alter3 can receive human feedback and adjust its actions accordingly. For example, when the user instructs the robot to “raise your arm”, the feedback is processed by another GPT-4 agent, which adjusts the corresponding code and updates the action sequence.
Multi-tasking
Alter3 can perform a variety of complex tasks, such as taking a selfie, drinking tea, and imitating a ghost or a snake, and it excels in scenarios that require fine motor planning.
Emotional Expression and Imitation
GPT-4's extensive knowledge base enables it to infer and express emotions. Alter3 can reflect emotions such as embarrassment or joy through facial expressions and body movements, enhancing the realism of human-robot interaction. Even when the text contains no explicit emotional expression, GPT-4 infers an appropriate emotion and reflects it in the robot's actions, for example showing surprise and delight when hearing a funny story.
Zero-shot Learning
- No pre-training required: Alter3 generates new actions from language instructions alone; no per-action programming or training is needed.
- Use of existing data: GPT-4's extensive training data already contains a large number of action descriptions, enabling the robot to generate a wide variety of actions.
Wide Application Potential
Beyond everyday tasks, Alter3 has broad application potential in fields that require advanced motion planning and emotional expression, such as entertainment and customer service.
Technical Details of Alter3
Natural Language Processing and Mapping
- Integrate a large language model: Use GPT-4 as the core language model and integrate it into the Alter3 robot.
- Language to action mapping: Alter3 uses the GPT-4 model to process natural language commands and map them to specific robot actions. Through a large language model, the robot is able to understand and execute complex language instructions.
Agent Framework
- Description: Adopts an “agent framework” for action planning, divided into two stages: a planning stage, in which GPT-4 determines the steps required to perform the task, followed by a coding stage, in which the specific API commands are generated.
- Planner: In the first stage, GPT-4 acts as a planner, analyzing the natural-language instruction and producing a detailed action plan.
- Coding Agent: In the second stage, the coding agent converts the action plan into API commands for the robot, as sketched below.
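The following is a minimal sketch of this two-stage pipeline, assuming the OpenAI Python SDK (openai>=1.0) and the hypothetical `set_axis` API from the earlier sketch; the paper's exact prompts are not reproduced here, so the prompt wording is a paraphrase.

```python
# Minimal sketch of the planner -> coding-agent pipeline described above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def plan_motion(instruction: str) -> str:
    """Stage 1: GPT-4 as planner -- break the task into movement steps."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Describe, step by step, how a humanoid robot's "
                        "head, arms, and torso should move to perform the task."},
            {"role": "user", "content": instruction},
        ],
    )
    return resp.choices[0].message.content

def generate_commands(plan: str, api_doc: str) -> str:
    """Stage 2: coding agent -- turn the plan into robot API calls."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Convert the movement plan into Python calls using "
                        "only this robot API:\n" + api_doc},
            {"role": "user", "content": plan},
        ],
    )
    return resp.choices[0].message.content

api_doc = "set_axis(axis: int, value: int)  # axis 0-42, value 0-255"
plan = plan_motion("Take a selfie with your right hand.")
print(generate_commands(plan, api_doc))
```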
Action Generation Protocol
- Natural language protocol: Uses natural-language prompting techniques such as Chain of Thought (CoT) to generate Python code that controls the robot, thereby achieving action generation.
- Diversity generation: Because GPT-4 is non-deterministic, the same input can produce different action patterns, increasing the diversity of generated actions (see the sketch below).
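A small sketch of how this diversity arises in practice: sampling the same instruction several times at a non-zero temperature yields different motion code each time. The prompt wording and the `set_axis` API are illustrative assumptions.

```python
# Sketch of the diversity property: repeated sampling at temperature > 0
# produces varied motion code for one instruction.
from openai import OpenAI

client = OpenAI()

COT_PROMPT = (
    "Think step by step about how the robot should move, "
    "then output Python code using set_axis(axis, value)."
)

def sample_motions(instruction: str, n: int = 3) -> list[str]:
    variants = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4",
            temperature=0.8,  # non-zero temperature -> varied outputs
            messages=[
                {"role": "system", "content": COT_PROMPT},
                {"role": "user", "content": instruction},
            ],
        )
        variants.append(resp.choices[0].message.content)
    return variants  # n distinct motion patterns for one instruction

for code in sample_motions("Pretend to be a ghost."):
    print(code, "\n---")
```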
Language Feedback System
- Instant adjustment: Users can make instant adjustments to actions through verbal instructions (such as “raise your hands higher”), and the robot will modify the action code based on the feedback.
- Action Storage: Improved actions are stored in a JSON database with descriptive tags (e.g. “holding guitar”) for easy future retrieval and use.
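Below is a minimal sketch of such a JSON motion memory, assuming a flat tag-to-code schema and a file name of our own choosing (the article does not specify the database layout):

```python
# Sketch of the JSON motion memory: refined motions are stored under a
# descriptive tag and looked up later. File name and schema are assumptions.
import json
from pathlib import Path

MEMORY_FILE = Path("motion_memory.json")

def save_motion(tag: str, code: str) -> None:
    """Store an improved motion under a descriptive tag."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    memory[tag] = code
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))

def load_motion(tag: str) -> str | None:
    """Retrieve a previously refined motion by its tag."""
    if not MEMORY_FILE.exists():
        return None
    return json.loads(MEMORY_FILE.read_text()).get(tag)

save_motion("holding guitar", "set_axis(20, 180)\nset_axis(21, 90)")
print(load_motion("holding guitar"))
```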
Contextual Learning
- Description: GPT-4 uses its in-context learning capability to adapt to the robot's API. By providing a list of commands and examples, the model is able to map action steps to API commands.
- Examples and command lists: Example commands and explanations are included in the context to help the model generate accurate API commands, as in the sketch below.
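The sketch below shows what such an in-context prompt might look like; the command list and the worked example are illustrative, not Alter3's actual API documentation.

```python
# Sketch of the in-context prompt: the robot's command list plus a worked
# example are placed in the context so GPT-4 maps steps to API calls.
ICL_PROMPT = """You control a humanoid robot. Available commands:
  set_axis(axis: int, value: int)  # axis 0-42, value 0-255

Example:
  Step: "nod the head"
  Code:
    set_axis(0, 90)   # head pitch down
    set_axis(0, 128)  # head pitch back to neutral

Now convert the following step into code, using only the commands above.
Step: "{step}"
Code:"""

def build_prompt(step: str) -> str:
    return ICL_PROMPT.format(step=step)

print(build_prompt("raise the right arm"))
```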
Human Feedback and Adjustments
- Description: Supports human feedback, allowing users to fine-tune the robot's actions. The feedback is processed by another GPT-4 agent, which adjusts the action code and updates the execution sequence.
- Feedback processing: User feedback such as “raise your arms” is processed into code adjustments and stored in the database for future use (see the sketch below).
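A hedged sketch of this feedback agent, again assuming the OpenAI Python SDK; the prompt wording is our own paraphrase of the mechanism described above.

```python
# Sketch of the feedback loop: a second GPT-4 agent receives the current
# motion code and the user's verbal correction, and returns revised code.
from openai import OpenAI

client = OpenAI()

def refine_motion(code: str, feedback: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Revise the robot motion code according to the "
                        "user's feedback. Output only the updated code."},
            {"role": "user",
             "content": f"Current code:\n{code}\n\nFeedback: {feedback}"},
        ],
    )
    return resp.choices[0].message.content

updated = refine_motion("set_axis(20, 120)", "Raise your arm a bit higher.")
print(updated)  # in the full loop this would be persisted via save_motion
                # from the JSON motion-memory sketch above
```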
Emotional Expression
- Description: GPT-4's knowledge base supports emotion inference and expression. Alter3 can reflect emotions such as embarrassment and joy through body movements.
- Emotion inference: Even for text without explicit emotional content, the model can infer an appropriate emotion and reflect it in the robot's physical response.
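One plausible way to wire this up, sketched below: a GPT-4 call names the dominant emotion in an utterance, and a lookup table selects a matching gesture. The emotion-to-gesture mapping is purely illustrative and not taken from the paper.

```python
# Sketch of emotion inference: GPT-4 names the emotion in a dialogue turn,
# which then selects a gesture. Gesture mappings are illustrative only.
from openai import OpenAI

client = OpenAI()

GESTURES = {
    "joy": "set_axis(5, 200)   # arms up",
    "embarrassment": "set_axis(0, 60)   # head down",
    "surprise": "set_axis(3, 255)  # head back, eyes wide",
}

def infer_emotion(utterance: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Name the single dominant emotion in the user's "
                        "message, in one lowercase word."},
            {"role": "user", "content": utterance},
        ],
    )
    return resp.choices[0].message.content.strip().lower()

emotion = infer_emotion("I tripped on stage in front of everyone...")
print(GESTURES.get(emotion, "set_axis(0, 128)  # neutral pose"))
```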
Multitasking and Applications
- Description: Alter3 can perform a variety of tasks, such as taking selfies, drinking tea, and imitating the movements of ghosts or snakes, demonstrating its potential in both daily tasks and complex scenarios.
- Example tasks: The system was experimentally tested on a range of tasks, including taking selfies and imitating actions.
External Storage and Memory
- External Memory Integration: Through the verbal feedback system, Alter3 stores action-improvement information in an external memory, which it consults when generating actions in the future.
- Body Schema: This external memory effectively acts as Alter3's body schema, allowing it to continually learn and improve its performance.
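A naive sketch of how the memory could serve as a body schema: stored motions whose tags overlap the new instruction are retrieved and injected into the prompt as in-context examples. The keyword-overlap matching is an assumption; the article does not specify a retrieval method.

```python
# Sketch of memory-as-body-schema: retrieve stored motions whose tags
# share words with the new instruction, to reuse as in-context examples.
import json
from pathlib import Path

MEMORY_FILE = Path("motion_memory.json")  # same file as the earlier sketch

def retrieve_similar(instruction: str, k: int = 2) -> list[tuple[str, str]]:
    if not MEMORY_FILE.exists():
        return []
    memory = json.loads(MEMORY_FILE.read_text())
    words = set(instruction.lower().split())
    scored = [(len(words & set(tag.lower().split())), tag, code)
              for tag, code in memory.items()]
    scored.sort(reverse=True)  # highest keyword overlap first
    return [(tag, code) for score, tag, code in scored[:k] if score > 0]

for tag, code in retrieve_similar("play the guitar while standing"):
    print(f"# remembered motion: {tag}\n{code}")
```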
Data and Model Training
- Description: GPT-4's training includes extensive language representations and action descriptions, supporting its application to robot control. The underlying model's knowledge base provides rich background knowledge, improving the robot's task-execution ability.
- Remaining challenges: Although the model performs well at high-level planning, the robot still faces challenges with basic physical tasks such as grasping objects, maintaining balance, and locomotion.
Evaluation and Results of Alter3
Action Generation Evaluation
- Evaluation method: Nine different generated actions were shown as videos, and participants rated the expressiveness of the robot's movements after watching them.
- Scoring criteria: A 5-point scale was used, with 1 the worst and 5 the best.
Participant Recruitment
- Recruitment Platform: 107 participants were recruited through the Prolific platform.
- Participant task: Participants watched the video and rated the expressiveness of the movements.
Video Categories
- Instant gestures: everyday and imitation actions such as taking a selfie, drinking tea, pretending to be a ghost, and pretending to be a snake.
- Sustained action scenarios: complex situations such as eating someone else's popcorn at the cinema, or reliving the emotions of an old survival story while jogging in the park.
Control Group Setting
- Random actions: Randomly generated actions served as the control group; their action labels were generated by GPT-4.
- Control videos: Three random-action control videos were inserted among the videos that participants watched.
Statistical Analysis
- Friedman test: Used to test for significant differences among the video ratings; the results showed that ratings differed significantly across videos.
- Nemenyi test: Post-hoc analysis showed that the control-group videos differed significantly in ratings from the other videos (p ≤ 0.001).
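For readers who want to reproduce this style of analysis, the sketch below runs a Friedman test with SciPy and a Nemenyi post-hoc test with the scikit-posthocs package on synthetic ratings; the study's raw data is not included in the article, so the numbers here are made up.

```python
# Friedman + Nemenyi analysis on synthetic 1-5 ratings, mirroring the
# study's design: 9 GPT-4-generated videos, 3 random-control videos,
# 107 raters. All data below is synthetic.
import numpy as np
import pandas as pd
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
n_raters = 107  # matches the study's participant count

gpt4 = rng.integers(3, 6, size=(n_raters, 9))     # skewed high
control = rng.integers(1, 4, size=(n_raters, 3))  # skewed low
ratings = pd.DataFrame(
    np.hstack([gpt4, control]),
    columns=[f"gpt4_{i}" for i in range(9)] + [f"rand_{i}" for i in range(3)],
)

# Friedman test: do the 12 videos differ in their rating distributions?
stat, p = friedmanchisquare(*[ratings[c] for c in ratings.columns])
print(f"Friedman chi2={stat:.2f}, p={p:.2e}")

# Nemenyi post-hoc: pairwise p-values between videos.
print(sp.posthoc_nemenyi_friedman(ratings))
```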
Summary of Results
- Scoring results: Actions generated by GPT-4 scored significantly higher than those of the random-action control group, indicating that GPT-4-generated actions are more expressive.
- Action diversity: GPT-4 can generate diverse actions, from everyday gestures to complex scenarios, and can express emotions such as embarrassment and happiness.
- Emotional expression: Through GPT-4, Alter3 can understand and reflect the emotions in conversational content; even when emotions are not stated explicitly, they can be inferred and reflected in its actions.
Summary of Main Assessment Findings of Alter3
Integrating GPT-4 into the Alter3 robot enabled zero-shot learning, spontaneous action generation, and language-feedback optimization, significantly improving the robot's expressiveness and naturalness. Alter3 can express emotions and generate a wide variety of actions, demonstrating the great potential of large language models in robotics. This research not only provides new ideas for the development of robotics but also lays a foundation for more natural and human-like human-robot interaction.
Conclusion
Zero-shot learning capability
No pre-training required: Alter3 can generate natural and diverse actions through GPT-4 without task-specific programming or training. This indicates that GPT-4's training data already contains rich action descriptions, which supports the direct generation of complex actions.
Language feedback optimization
Instant adjustment: Through the language feedback system, users can instantly adjust and improve Alter3's movements. This feedback mechanism allows the robot to continuously learn and optimize its motion performance, enhancing its interaction with humans.
Emotional expression
Rich emotional expression: Alter3 can not only imitate everyday human actions but also express emotions through facial expressions and body movements. Whether an emotion is stated directly or must be inferred from context, Alter3 can reflect the emotional state accurately through GPT-4.
Wide application potential
Universality: The system can be applied to any humanoid robot with only minor modifications. This universality gives it broad potential in various robotic applications.
Study Findings
- Significant advantages: Evaluation results show that the actions generated by GPT-4 are significantly better than randomly generated actions in terms of expressiveness and naturalness, demonstrating the powerful capabilities of large language models in robot action generation.
- Complex scenario simulation: Alter3 is able to generate a variety of actions from simple daily actions to complex scenario simulations, demonstrating its broad application prospects in robot control and human-computer interaction.