Scientists propose a new type of agent, one step closer to achieving full-process autonomy
Last year, large language models represented by ChatGPT became the "game-changers" in the entire field of AI.
Particularly striking is their general ability in text understanding, text generation, and code generation. At the same time, researchers have found that these base models can be used to interact with the external world, allowing them to independently complete concrete tasks close to everyday human life.
For example, helping people shop online, or finding and moving specified items in an indoor environment described in text. A system that can complete tasks autonomously in this way is called an agent.
At present, to train agents for better performance, researchers rely on multi-step reasoning and action trajectories as training data. However, collecting such trajectories requires heavy manual effort, whether through human annotation or through implementing different prompting frameworks.
For this reason, a research team from Tsinghua University recently proposed an agent called ActRe, which helps agents autonomously complete the whole process of data collection and self-evolution. Unlike the widely known ReAct agent, which follows a "reason-then-act" pattern, ActRe belongs to the category of "act-then-reason."
That is, ActRe reverses the causal order of textual reasoning and action execution in ReAct, producing a textual reason for any given action.
"In the execution process of the ReAct intelligent agent, one can first sample the next action to be taken. After obtaining the new action, it can then be sent to ActRe to obtain a textual reason description for this action.
Then, by placing this textual reason description at the front and the sampled action at the back, the format of ReAct's reason-then-act is formed," explained Yang Zonghan, a doctoral student at Tsinghua University.
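To make the mechanism concrete, here is a minimal Python sketch of one act-then-reason step. It is an illustration of the idea as described above, not the team's code; `sample_action`, `actre_explain`, and the `Step` record are hypothetical stand-ins for an action-sampling policy, the prompted ActRe model, and the trajectory format.

```python
from dataclasses import dataclass

@dataclass
class Step:
    reason: str  # ReAct-style "thought" that justifies the action
    action: str  # environment action, e.g. "click[buy now]"

# Hypothetical stand-ins for a real sampling policy and a prompted ActRe model.
def sample_action(observation: str, history: list) -> str:
    return "search[wireless mouse]"  # placeholder: in practice, sampled from a policy

def actre_explain(observation: str, history: list, action: str) -> str:
    # In practice: prompt a language model to describe why `action` makes sense
    # given the observation and history, i.e. the reason is produced after the act.
    return f"To satisfy the shopping goal, I should take the action '{action}'."

def act_then_reason_step(observation: str, history: list) -> Step:
    """Sample the action first, then ask ActRe for its textual reason.
    Storing (reason, action) in this order recovers ReAct's reason-then-act format."""
    action = sample_action(observation, history)
    reason = actre_explain(observation, history, action)
    return Step(reason=reason, action=action)
```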
Through the cooperative exploration of ReAct and ActRe, the system can gather large-scale, diverse trajectories in the environment, each carrying its own reasoning annotations. When a trajectory finishes executing, the simulated environment returns a final score, which naturally serves as the criterion for judging the trajectory's quality. Experiments showed that data collected in this way is of exceptionally high quality.
Yang Zonghan stated: "Even where ReAct fails on its own, incorporating ActRe's exploration often yields trajectories with high scores."
By exploiting both the winning and the losing trajectories, the agent can perform contrastive self-training, significantly enhancing its capabilities.
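One plausible reading of this contrastive self-training is that high- and low-scoring trajectories for the same task are paired into (chosen, rejected) examples, as in preference-based fine-tuning. The sketch below is an assumption about that data-preparation step, not the paper's exact recipe; the `Trajectory` record and the 0-to-1 reward convention are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    task: str      # e.g. a WebShop shopping instruction
    steps: list    # (reason, action) pairs with reasoning annotations
    reward: float  # final environment score in [0, 1]; 1.0 means full success

def build_contrastive_pairs(trajs: list, threshold: float = 1.0) -> list:
    """Pair successful against failed trajectories of the same task.
    Such (chosen, rejected) pairs are the usual input to contrastive or
    preference-style fine-tuning objectives; illustrative only."""
    by_task: dict = {}
    for t in trajs:
        by_task.setdefault(t.task, []).append(t)

    pairs = []
    for group in by_task.values():
        wins = [t for t in group if t.reward >= threshold]
        losses = [t for t in group if t.reward < threshold]
        pairs.extend((w, l) for w in wins for l in losses)
    return pairs
```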
Ultimately, the research team achieved exceptionally strong results in the WebShop and ALFWorld environments used for the experiments.
Specifically, using an open-source language model with 7 billion parameters fine-tuned with the parameter-efficient QLoRA method, their agent surpassed numerous agent frameworks built on GPT-4, as well as existing agents obtained by full-parameter fine-tuning of large language models with 47 billion and even 70 billion parameters.
Agents driven by large language models currently have high application value, and this research has moved agents a significant step closer to full-process autonomy.
Building on this, agents are expected to serve as assistants to humans in many areas, freeing people from many repetitive tasks.
Recently, the relevant paper was published on the preprint platform arXiv under the title "ReAct Meets ActRe: When Language Agents Enjoy Training Data Autonomy" [1]. Yang Zonghan is the first author, while Professor Liu Yang and Research Associate Professor Li Peng of Tsinghua University serve as corresponding authors.
Training open-source large language models to become better agents

It is understood that as early as 2022, Dr. Yao Shunyu of Princeton University and his collaborators proposed using the simulated online-shopping environment WebShop to test agents' capabilities. Each time a shopping session ends, the simulated environment returns a score indicating how well the session went and whether the initial shopping requirements were met.
However, before the emergence of large models like ChatGPT, AI specially trained on simulated environments like WebShop could reach a success rate of at most 29%. By comparison, the average human success rate was 50%, and human experts reached 60%.
After general-purpose large language models appeared, simply writing a simple prompt, with no additional training, allowed the model to understand the textual environment and generate actions by roughly imitating the prompt's examples, ultimately reaching a 40% success rate through continuous iterative interaction.
"Compared with the previous success rate, this has already made a leap. The most critical thing is that everyone found that large models are really versatile, regardless of the scenario, a simple prompt can be written, and without training, the large language model can be directly asked to try to perform the task," said Yang Zonghan.
Therefore, since March 2023, a large number of open-source tools, research projects, and entrepreneurial ventures related to agents have emerged one after another. "Among them, what impressed me most was NVIDIA researchers' proposal to use GPT-4 to play the open-world game 'Minecraft'. In addition, researchers from Stanford University in the United States proposed using many large models to simulate different characters and placing them in a shared environment, similar to the science-fiction series 'Westworld', to simulate an interactive human society," Yang Zonghan said.
As a researcher working in natural language processing since 2017, he was amazed by the ability to have language models interact with an environment, and began exploring research in this direction.
In practice, however, he found that if an agent is "empowered" only by writing prompts, it often blindly follows the prompt's instructions.
Yang Zonghan said: "Although it seems to understand the textual description of the environment and can give some responses, judging from actual execution, it does not truly understand."
In fact, after an agent performs several tasks in simulated environments such as WebShop, it accumulates many successful and failed trajectories. So, can these past trajectories be studied further to make the agent stronger when facing new tasks?
It is worth mentioning that although many prompt-based agent frameworks already exist, they often need to call a base model's application programming interface (API), which brings considerable financial cost.
Moreover, the capabilities behind a base model's API (such as ChatGPT or GPT-4) change over time, which is unfriendly to developers of prompt-based agent frameworks.
In other words, developers may build a well-performing agent framework on a particular API, but if that API is later taken offline by the provider, they will have to start over on an alternative, causing large fluctuations in the agent's performance.
How, then, can one control the underlying base model that drives the agent? Yang Zonghan pointed out: "We believe agents can be trained using open-source language models."
That is, if all the model weights are in your own hands, everything becomes more controllable. Moreover, the success of deep learning itself comes from training neural networks; therefore, to make an agent smarter, training is the approach to try.
This was the origin of the research: to train open-source language models into better agents, and thereby achieve better performance on previously unseen tasks in the environment.
A new type of agent that surpasses human-level performance on all unseen test scenarios

Once the research objectives were set, the challenges followed.
First, could open-source large language models support the team in training a good agent at all?
Today's open-source models, especially those that can be experimented with easily in a lab, are generally at the 7-billion-parameter (7B) scale, and their general capabilities still lag far behind proprietary models like GPT-4.
Second, prompt-based methods cannot turn a base model into a specialized model. If one fine-tunes a 7B open-source model instead, can it be turned into a specialized model?
Moreover, even for a 7B-parameter model, fine-tuning all of its parameters requires considerable computing power. Thus, as a preliminary experiment, Yang Zonghan adopted a parameter-efficient fine-tuning scheme: rather than training the 7B base model itself, he adapted it by training lightweight, pluggable parameter modules.
"By using the QLoRA method, I can conduct experiments on one or two 24GB VRAM graphics cards at the laboratory level. Moreover, due to its pluggable nature, when the base model with 7B parameters does not add the QLoRA parameter module, it remains the untrained general model itself," said Yang Zonghan.
However, even so, the main challenge was only just beginning.
Obviously, training requires data, and the data comes from the agent's interaction trajectories with the environment.
Existing methods rely mainly on two ways of obtaining data: trajectory data fully annotated by human experts, or trajectory data gathered by running different prompt-based agent frameworks as they interact with the environment. Unfortunately, neither yields large-scale, diverse trajectory data. The former depends entirely on human annotators, at high labor cost; the latter depends on prompt-based agents, which can only produce trajectories by roughly imitating the prompt examples.
Yet large-scale, diverse trajectory data is precisely the key to the success of language models.
How can this problem be solved?
The researchers drew inspiration from agent frameworks themselves. Re-examining existing frameworks, they found that these generally first produce textual reasoning, as the rationale for the next action, and then generate the action to be taken.
ReAct follows exactly this thought process. The clever part of this agent is that, during execution, people can change its actions by modifying the reasoning text generated by the language model itself. In that case, humans only need to edit the reasoning at key points, and the agent can complete the rest of the trajectory on its own.
Even so, having humans directly modify ReAct trajectories at scale still costs too much.
If the agent can annotate trajectories autonomously, it can in effect collect data on its own and use the collected data to train itself. After self-training, it can be deployed back into the environment to make decisions and complete tasks automatically, truly achieving full-process autonomy.
Based on this, to let the agent annotate reasoning content autonomously, the research team proposed the ActRe agent, which can not only automatically collect trajectory data with reasoning annotations, but also use that data for self-training, forming a closed loop.
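The closed loop reads naturally as pseudocode. The following Python sketch is an interpretation of the loop described above, with placeholder stubs standing in for trajectory collection and contrastive self-training.

```python
import random
from dataclasses import dataclass

@dataclass
class Traj:
    reward: float  # final environment score in [0, 1]

# Placeholder stubs for the real components described in the text.
def collect_annotated_trajectories(agent, n: int = 8) -> list:
    # Would run ReAct acting plus ActRe post-hoc reasoning annotation.
    return [Traj(reward=random.random()) for _ in range(n)]

def fine_tune(agent, wins: list, losses: list):
    # Would perform contrastive self-training on successes vs. failures.
    return agent

def self_evolution_loop(agent, n_rounds: int = 4):
    """Closed loop: explore, score, self-train, redeploy, repeat."""
    for _ in range(n_rounds):
        trajs = collect_annotated_trajectories(agent)
        wins = [t for t in trajs if t.reward >= 1.0]
        losses = [t for t in trajs if t.reward < 1.0]
        agent = fine_tune(agent, wins, losses)
    return agent
```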
Since the two fundamental directions of the agent field are more complex, realistic environments and more efficient learning mechanisms, this study chose to start from the latter. After four rounds of iteration in the WebShop environment, the resulting agent achieved a 55% success rate across all unseen test scenarios, compared with an average human success rate of 50%. After four rounds of iteration in the ALFWorld environment, the agent ultimately achieved a 100% success rate across all previously unseen test scenarios.
"This not only proves the effectiveness of our method but also indicates the need for experiments in more complex real-world environments," said Yang Zonghan.
It is worth mentioning that, in this connection, the research team has also carried out work on the "unified alignment principle of agents" [2].
When asked about the most memorable part of the research process, Yang Zonghan said it was the endless stream of excellent papers, arriving at a pace measured in weeks at most.
"The paper was submitted to arXiv on March 21, 2024, and cited a total of 39 papers, of which 13 were submitted to arXiv this year," he further stated, "Facing such a fast-paced research rhythm, it is inevitable that anxiety will arise, amplify, and spread in the heart."However, when Yang Zonghan realized the self-identification at the level of natural language processing, he felt very fortunate.
After all, the success of language models has not only turned things that were unimaginable a few years ago into reality, but has also made many more once-unimaginable things imaginable. For him, behind the anxiety lies his pursuit of self-realization.
In his view, the completion of this research owes much to discussions with fellow students in the research group, including Liu An, Liu Zijun, and Liu Kaiming, as well as the support of Professor Liu Yang and Professor Li Peng.
"I am very fortunate to be able to participate in this wave of intelligent agent development," said Yang Zonghan.