MultiOn has introduced Agent Q, a new type of autonomous AI agent: a self-supervised agent reasoning and search framework that autonomously improves in real environments through self-play and reinforcement learning.
Agent Q is a new self-supervised agent reasoning and search framework, officially released after six months of development. The framework focuses on autonomous improvement on real tasks and Internet environments through self-play and reinforcement learning (RL). It uses current state-of-the-art large language models (LLMs) to process web content, create task plans, and reason in natural language, with a particular focus on long-horizon task execution.
Agent Q has advanced planning and self-healing capabilities. It combines cutting-edge techniques such as Monte Carlo Tree Search (MCTS), AI self-criticism, and reinforcement learning from human feedback (RLHF), enabling AI to perform complex multi-step reasoning and decision-making in dynamic environments.
Imagine you are looking for the exit of a very large maze with many forks, where each path may take you closer to or further from the exit. Agent Q is like a very smart assistant: it not only helps you analyze how promising each path is, but also reflects on wrong turns so that it avoids making the same mistake next time.
What problem does Agent Q solve?
- Limitations of static datasets
Traditional LLMs trained with supervised learning on static datasets perform well on natural-language tasks, but they generalize poorly to dynamic, interactive environments that require multi-step decision-making (such as web navigation). As a result, such models cannot make autonomous, effective decisions on complex tasks and are prone to compounding errors, leading to suboptimal outcomes.
- Multi-step reasoning and compound errors
In multi-step decision-making tasks, models often struggle over long time horizons because reward signals in the environment are sparse, and an error made at one step is hard to correct, which makes the overall decision process less robust. Agent Q reduces the impact of such compounding errors and improves the task success rate by introducing self-criticism and guided search.
- Adaptability in complex environments
Traditional AI agents often perform poorly in complex and dynamic environments and are unable to flexibly respond to changes. Agent Q enhances AI’s adaptability and decision-making capabilities in complex environments by integrating multiple cutting-edge algorithms.
To cope with complex multi-step environments, Agent Q introduces a tree-search capability, enabling agents to conduct deep, multi-level searches over web pages and correct potential errors. This search process is driven by process supervision: at each step the agent proposes possible actions, and a critic model ranks them. The critic acts as a generative value function that steers the search toward promising branches.
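The following is a minimal sketch of that idea in Python: an LLM proposes candidate web actions and the same LLM, prompted as a critic, scores them so the best-ranked options can be explored first. The `llm`, `propose_actions`, and `critic_rank` names and the prompt wording are assumptions made for illustration, not MultiOn's actual implementation.

```python
# Illustrative sketch: an LLM-based critic ranks candidate web actions.
# `llm` is a hypothetical text-completion callable; all names here are
# placeholders, not MultiOn's real API.

def propose_actions(llm, page_text, goal, n=5):
    """Ask the agent LLM for n candidate next actions on the current page."""
    prompt = (
        f"Goal: {goal}\nPage: {page_text}\n"
        f"List {n} possible next actions, one per line."
    )
    return llm(prompt).strip().splitlines()[:n]

def critic_rank(llm, page_text, goal, actions):
    """Use the LLM as a generative value function: score each action 0-10."""
    scored = []
    for action in actions:
        prompt = (
            f"Goal: {goal}\nPage: {page_text}\nProposed action: {action}\n"
            "On a scale of 0-10, how likely is this action to make progress? "
            "Answer with a single number."
        )
        try:
            score = float(llm(prompt).strip())
        except ValueError:
            score = 0.0  # fall back if the critic's answer is not numeric
        scored.append((score, action))
    return sorted(scored, reverse=True)  # best-ranked actions first
```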
In addition, Agent Q uses a zero-shot vision-language model (VLM) for outcome supervision, reasoning about and verifying whether the task succeeded. In the experiments the research team used GPT-4V, and they hope to integrate a multimodal LLaMA in the future, since visual signals are crucial to task accuracy.
In practical applications, Agent Q uses Monte Carlo Tree Search (MCTS) to search over web pages, which greatly improves the agent's performance. The team also deployed a reinforcement learning loop that feeds the supervision from tree search back into the base agent, further improving zero-shot performance. In simulation, Agent Q outperformed GPT-4 in zero-shot evaluation and, with search at inference time, performed on par with humans. More importantly, Agent Q was successfully applied to a real booking task on a live website, demonstrating strong autonomous-improvement capabilities in this complex environment.
However, despite significant progress, many research questions remain. For example, the team currently uses a frozen zero-shot generative critic model, and future work is needed on how to fine-tune or otherwise improve it through training. In addition, despite extensive reinforcement learning training, there is still a clear gap between zero-shot performance and search performance, and understanding the reason for this gap is one of the focuses of future research.
The research on Agent Q also explored which search method is most appropriate. Existing work has used Monte Carlo tree search and A* search; in the future, new and better search methods may be discovered through reinforcement learning to enable agent self-correction and exploration. Finally, the team faces the challenge of ensuring safety when deploying at scale on the Internet: although the model can self-correct and eventually complete the task, it may not be able to undo the consequences of its errors. How to use user feedback, human-in-the-loop participation, and safety critics to train the model will therefore be an important direction for future research.
- Framework features: Agent Q uses the understanding capabilities of the current generation of large language models (LLMs) to process web content, create task plans, and reason in natural language. This framework focuses on task execution over long time spans.
- Tree search capability: To cope with complex multi-step environments, Agent Q adopts a tree search capability that enables the agent to conduct in-depth multi-level searches on web pages to help correct potential errors.
- Process Supervision and Generative Criticism: Agent Q's search process is driven by process supervision. At each step, the agent proposes possible actions, which are then ranked by a critic model to assess how promising they are. This critic acts as a generative value function that steers the search toward promising directions.
- Result supervision and visual model: Agent Q also uses a zero-shot vision-language model (VLM) for outcome supervision, reasoning about and verifying whether the task succeeded. GPT-4V was used in the experiments, and the team hopes to integrate a multimodal LLaMA in the future. Visual signals are crucial to accuracy.
- Monte Carlo Tree Search (MCTS) and Reinforcement Learning: Agent Q uses Monte Carlo Tree Search (MCTS) to search the web, significantly improving agent performance. At the same time, a reinforcement learning loop is deployed to feed the supervision of the tree search back to the base agent, further improving zero-shot performance.
- Simulation and Real Application: In simulation, Agent Q outperforms GPT-4 in zero-shot evaluation and, with search at inference time, performs on par with humans. In addition, Agent Q was successfully applied to a real-world reservation task on a live website and performed well.
- Model Improvements and Research Questions: The fine-tuned LLaMa 70B model significantly outperforms GPT-4 in zero-shot performance. After a day of autonomous play, performance jumped from 18.6% to 81.7%, and the success rate with online search reached 95.4%. However, many research questions remain, including how to improve the generative critic model, explore better search methods, and address safety challenges in large-scale deployment.
The main capabilities of Agent Q:
Advanced planning capabilities
- Multi-step reasoning: Agent Q is able to perform multi-step reasoning in complex tasks, planning and executing multiple steps to complete the target task. It is able to effectively make decisions in dynamic environments and flexibly adjust strategies to adapt to changing situations.
- Agent Q is able to excel in complex tasks that require multi-step decision making. This capability enables the model to not only generate text, but also perform a series of actions in an interactive environment, such as completing tasks step by step in scenarios such as website navigation and reservations. By combining MCTS search and DPO algorithms, Agent Q is able to better plan and execute complex tasks.
Self-healing ability
- AI Self-Criticism: Agent Q self-evaluates at every step of the way and adjusts its behavior based on feedback. This self-healing ability allows Agent Q to self-correct when it encounters errors or obstacles, avoiding falling into unfavorable decision paths.
- Through the self-criticism mechanism, the model can generate intermediate feedback at each decision step, which helps the model to make more accurate exploration and decisions when faced with tasks with long time spans and complex paths. This capability ensures that the model can continuously improve its decision-making strategy.
Guided Search
- Monte Carlo Tree Search (MCTS):
Agent Q uses MCTS technology to make decisions. It can find the best action sequence by exploring different action paths and balancing exploration and exploitation. This enables Agent Q to generate diverse and optimal solutions in unknown or complex web navigation tasks.
- MCTS helps the model to balance exploring new possible paths with using known high-reward paths when faced with complex decision trees, thereby finding the optimal solution. This enables the model to improve its success rate when faced with tasks with high complexity and uncertainty.
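For readers who want to see the exploration-exploitation trade-off concretely, the selection rule that generic MCTS implementations most commonly use is the UCB1/UCT formula below. This is the standard textbook form; the exact weighting Agent Q uses may differ, so treat it as an illustration rather than the paper's formula.

```latex
% Generic UCB1/UCT selection rule (textbook form, not necessarily Agent Q's exact variant):
% at state s, pick the action a that maximizes
\[
\mathrm{UCT}(s, a) \;=\;
\underbrace{Q(s, a)}_{\text{exploitation: average value observed so far}}
\;+\; c \,\underbrace{\sqrt{\frac{\ln N(s)}{N(s, a)}}}_{\text{exploration bonus for rarely tried actions}}
\]
```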
Reinforcement Learning from Human Feedback (RLHF)
- Direct Preference Optimization (DPO): Agent Q learns the best decision from human feedback through the DPO algorithm. This algorithm enables Agent Q to effectively utilize a variety of data, including suboptimal paths, for training, thereby improving the success rate in complex environments.
- This learning mechanism allows the model to optimize its strategy, not only relying on successful cases, but also extracting useful information from failures to avoid similar mistakes in the future. This ability makes the model more robust and adaptable.
Automatic improvement and self-optimization capabilities
- Agent Q can continuously improve its decision-making strategy through autonomous exploration and online learning. Through the combination of MCTS and DPO algorithms, the model can improve autonomously under limited supervision and gradually improve its task execution capabilities. This automated improvement mechanism enables Agent Q to adapt to new tasks and environments.
Online search and dynamic environment adaptability
Agent Q has the ability to perform online search in real-time environments, which greatly improves the performance of the model in dynamic environments. When faced with changing information or environment, the model can adjust its decision-making strategy in real time to ensure efficient completion of tasks.
Autonomous Data Collection and Learning
- Zero-Shot Learning:
Agent Q can quickly improve its performance in new tasks through autonomous data collection and learning without explicit training data. For example, in the web booking experiment, Agent Q was able to significantly improve the task success rate within one day.
Technical Methods of Agent Q
1. Monte Carlo Tree Search (MCTS)
- Function: MCTS is a search algorithm for decision making that helps Agent Q effectively explore possible action paths when faced with complex, multi-step tasks. MCTS considers already explored paths (exploitation) and continuously tries new paths (exploration) when selecting the best path in the decision tree, thereby increasing the chance of success.
- How it works: At each decision node, Agent Q simulates multiple possible actions, calculates the potential benefits of each action, and then selects the best path for further exploration. This approach enables Agent Q to make smarter choices during task execution.
- Example: Agent Q is like an explorer looking for an exit in a maze. It will not walk around blindly, but will simulate several possible paths in its mind to see which path is most likely to lead to the exit, and then actually walk. This helps it find the right direction faster.
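To make the maze analogy concrete, here is a minimal, generic MCTS loop in Python covering the four classic phases (selection, expansion, simulation, backpropagation). The node structure and the environment hooks `get_actions`, `step`, and `rollout_reward` are placeholder assumptions; Agent Q's actual implementation operates over LLM-proposed web actions and uses its critic and an outcome VLM rather than this toy simulator.

```python
import math
import random

# Toy MCTS skeleton: selection -> expansion -> simulation -> backpropagation.
# All environment hooks (get_actions, step, rollout_reward) are hypothetical.

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    if node.visits == 0:
        return float("inf")  # always try unvisited children once
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root_state, get_actions, step, rollout_reward, iters=100):
    root = Node(root_state)
    for _ in range(iters):
        # 1. Selection: descend using UCT until a leaf is reached.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # 2. Expansion: add a child node for each candidate action.
        for a in get_actions(node.state):
            node.children.append(Node(step(node.state, a), parent=node, action=a))
        if node.children:
            node = random.choice(node.children)
        # 3. Simulation: estimate the value of the newly reached state.
        reward = rollout_reward(node.state)
        # 4. Backpropagation: update visit counts and values up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited action at the root as the best next step.
    return max(root.children, key=lambda n: n.visits).action
```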
2. Self-Critique Mechanism
- Function: The self-criticism mechanism allows Agent Q to reflect and evaluate its own decisions while performing tasks, thus getting immediate feedback at each step. This mechanism helps the model adjust its strategy in time to avoid small mistakes accumulating into big problems.
- How it works: After each action is taken, Agent Q generates a self-evaluation to determine whether the action was effective. If the judgment result is not good, the model will record it so that it can make better choices in similar situations in the future.
- Example: Every time Agent Q makes a decision, it will stop and think: "Is my choice a good one?" If it thinks it is not a good one, it will remember this lesson and will not make the same mistake again next time it encounters a similar situation.
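A minimal sketch of this step-level self-evaluation might look like the following: after acting, the agent asks itself whether the action helped and stores the lesson for later. The prompt wording and the `llm` callable are assumptions made for illustration, not Agent Q's actual prompts.

```python
# Illustrative self-critique step: after acting, the agent judges whether the
# action helped and records the lesson. The `llm` callable and prompt wording
# are placeholders, not Agent Q's real prompts.

def self_critique(llm, goal, action_taken, page_before, page_after, memory):
    prompt = (
        f"Goal: {goal}\n"
        f"Action taken: {action_taken}\n"
        f"Page before: {page_before}\nPage after: {page_after}\n"
        "Did this action move the task forward? Answer 'yes' or 'no', "
        "then give a one-sentence lesson."
    )
    reply = llm(prompt).strip()
    helped = reply.lower().startswith("yes")
    if not helped:
        # Store the lesson so future decisions in similar states can avoid it.
        memory.append({"action": action_taken, "lesson": reply})
    return helped
```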
3. Direct Preference Optimization (DPO)
- Function: DPO is a learning algorithm for optimizing the decision-making ability of models in multi-step tasks. It helps Agent Q learn from successful and failed task trajectories and optimize its decision-making strategy.
- How it works: DPO compares the pros and cons of different action paths and learns the best decision-making method. This method not only relies on successful experience, but also learns from failures, making the model perform better when facing similar tasks.
- Example: Agent Q will review all the decisions it has made, not only the successful ones, but also the failed ones. By comparing which decisions worked well and which did not, it will learn to choose the best option more intelligently.
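For readers who want to see what "comparing the pros and cons of different action paths" looks like in practice, below is a minimal PyTorch-style sketch of the standard DPO loss over a preferred ("chosen") and a dispreferred ("rejected") trajectory. The log-probability inputs and the beta value are placeholders, and Agent Q's step-level variant may differ in detail.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective on one (chosen, rejected) pair of trajectories.

    Each argument is the summed log-probability of a trajectory under the
    policy being trained or under the frozen reference policy (placeholders
    here; Agent Q's step-level variant may differ in detail).
    """
    # How much more the policy prefers the chosen trajectory than the reference does...
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    # ...and likewise for the rejected trajectory.
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Widen the margin between the two; failed trajectories act as negative examples.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Example with dummy log-probabilities (real values come from the LLM policy):
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
```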
4. Reinforcement Learning (RL)
- Function: In Agent Q, reinforcement learning is used to continuously improve the decision-making ability of the model. The model continuously acquires experience through interaction with the environment and optimizes its strategy based on these experiences.
- How it works: Agent Q receives rewards or penalties during the execution of tasks (e.g., successfully booking a restaurant seat or failing), and then adjusts its action strategy based on this feedback, gradually learning to make better decisions in complex environments.
- Example: Just like training a pet, every time Agent Q does something right, it will be rewarded, which makes it more inclined to continue doing it in the future; if it makes a mistake, it will be reminded to avoid making the same mistake next time. In this way, it slowly learns how to complete the task better.
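The "reward and adjust" idea can be sketched as a simple loop in which the outcome of each episode (for example, whether a booking succeeded) is turned into a reward and used to update the policy. Everything here, including the `env`, `policy`, and `update_policy` hooks, is a generic illustration, not MultiOn's training code.

```python
# Generic outcome-reward loop: run an episode, score the result, update.
# `env`, `policy`, and `update_policy` are hypothetical stand-ins.

def run_episode(env, policy):
    state, trajectory, done, info = env.reset(), [], False, {}
    while not done:
        action = policy(state)                 # e.g. an LLM choosing a web action
        state, done, info = env.step(action)   # hypothetical environment interface
        trajectory.append((state, action))
    # Sparse outcome reward: 1.0 if the task (e.g. a booking) succeeded, else 0.0.
    reward = 1.0 if info.get("task_success") else 0.0
    return trajectory, reward

def train(env, policy, update_policy, episodes=100):
    for _ in range(episodes):
        trajectory, reward = run_episode(env, policy)
        # Successful trajectories are reinforced; failures discourage the
        # actions that led to them.
        update_policy(policy, trajectory, reward)
```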
5. Online search and dynamic learning
- Function: Agent Q is able to perform real-time online searches while performing tasks and dynamically adjust its decisions based on new information. This capability enables it to work effectively in a constantly changing environment.
- How it works: When performing a task, Agent Q will use tools such as search engines to obtain the latest information, and then adjust its action plan based on this information to improve the success rate of task completion.
- Example: If the environment suddenly changes, such as the restaurant you want to book is full, Agent Q will not panic, but will immediately look for new information, adjust the plan, and find another suitable solution.
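A small sketch of this "look up fresh information and replan" behavior for the restaurant example might look like the following. The `check_availability` and `book` tool calls are hypothetical; the point is only the control flow of re-checking live data and falling back to the next candidate plan.

```python
# Illustrative replanning loop for a booking task. `check_availability` and
# `book` are hypothetical tool calls, not Agent Q's real tool interface.

def book_with_replanning(restaurant, date, party_size,
                         check_availability, book, candidate_times):
    """Try candidate times in order, re-checking live availability each time."""
    for time in candidate_times:                                     # e.g. ["19:00", "19:30", "20:00"]
        if check_availability(restaurant, date, time, party_size):   # fresh lookup every attempt
            return book(restaurant, date, time, party_size)          # plan succeeded
        # That slot is full: adjust the plan and try the next candidate time.
    return None  # nothing available; a real agent would widen the search further
```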
6. Process supervision and feedback
- Function: Agent Q uses a process supervision method to obtain intermediate feedback at each step of the decision. This feedback comes not only from the final result, but also includes reflection and evaluation of each step in the decision-making process.
- How it works: This supervision mechanism helps the model to perform more detailed exploration and optimization over long time spans by allowing it to evaluate each decision step.
- Example: While doing something, Agent Q will constantly ask itself: "Am I doing it right now?" This real-time check helps it ensure that every step is done as well as possible.
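To show how step-level (process) feedback differs from a single end-of-task (outcome) signal, here is a small sketch that scores every step with a critic and combines the per-step scores into a trajectory value. The `step_critic` and `outcome_judge` scoring functions and the simple averaging are assumptions made for illustration.

```python
# Process supervision vs. outcome supervision, illustrated.
# `step_critic` and `outcome_judge` are hypothetical scoring functions.

def process_supervised_value(trajectory, step_critic):
    """Score each (state, action) step, then average: feedback at every step."""
    scores = [step_critic(state, action) for state, action in trajectory]
    return sum(scores) / len(scores) if scores else 0.0

def outcome_supervised_value(final_state, outcome_judge):
    """A single score for the whole episode: feedback only at the end."""
    return outcome_judge(final_state)
```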
Experimental Results of Agent Q
In real-world experiments, Agent Q demonstrated its excellent performance, especially in the web booking task. The following are the specific experimental results:
Initial performance
In the zero-shot setting using the LLaMa-3 model, Agent Q achieved an initial success rate of 18.6%.
Improvements after autonomous data collection:
After one day of autonomous data collection, Agent Q's success rate jumped to 81.7%, roughly a 340% relative improvement (81.7 / 18.6 ≈ 4.4×). This result shows that Agent Q is able to learn quickly from its own exploration and significantly improve its task success rate.
Further optimization
Through online search and further optimization, Agent Q's success rate was further improved to 95.4%. This result demonstrates Agent Q's ability to continuously optimize and self-improve in complex tasks.
Specific Examples of Agent Q
In experiments with Agent Q, the researchers validated the system’s performance in two main scenarios: a simulated e-commerce platform WebShop environment and a real reservation site OpenTable. These experimental results demonstrate Agent Q’s powerful capabilities and improvements.
1. WebShop simulation environment
- Experimental background: WebShop is a simulated e-commerce platform, and Agent Q needs to complete a series of predetermined tasks in this virtual environment, such as finding specific products and purchasing them. This is a multi-step complex task that requires navigating between multiple pages and making correct decisions.
- Experimental results:
- The success rate of the base model (without using Agent Q's method) in WebShop is 28.6%.
- The success rate of the model after using Agent Q increased to 81.7%.
- After further combining with online search, the success rate was further improved to 95.4%.
- Summary: Agent Q dramatically improves task completion rates in the WebShop environment, significantly outperforming traditional behavior cloning and reinforcement learning baselines.
2. OpenTable real booking environment
- Experimental background: OpenTable is a real online reservation platform, where Agent Q needs to help users complete restaurant reservation tasks, including selecting the date, time, number of people, and so on. This task is more complicated because it involves real-world variables and possible unexpected situations (such as a fully booked time slot).
- Experimental results:
- The base model without Agent Q achieved a success rate of 18.6% on the OpenTable task.
- After preliminary training (reinforcement learning and DPO optimization), the success rate increased to 71.8%.
- The full Agent Q model achieved a success rate of 81.7% without online search support.
- After combining online search capabilities, Agent Q's success rate further improved to 95.4%.
- Summary: In an actual booking scenario, Agent Q significantly improves the success rate of task completion, even surpassing the average human performance.
Official website: https://www.multion.ai/
Technical report: https://multion-research.s3.us-east-2.amazonaws.com/AgentQ.pdf