We are standing at an exciting turning point in the development of artificial intelligence.
Imagine what artificial intelligence will look like in the future: with just a simple command, AI systems can understand and perform complex tasks; they can also visually capture a user's expressions and movements to judge their emotional state. This is no longer a scene from a Hollywood science-fiction movie but the era of AI agents, which is gradually becoming reality.
As early as November 2023, Microsoft founder Bill Gates wrote that agents will not only change the way everyone interacts with computers but will also upend the software industry, bringing about the biggest computing revolution since we moved from typing commands to clicking icons. OpenAI CEO Sam Altman has also said on multiple occasions that the era of building ever-larger AI models is over, and that AI agents are the real challenge ahead. In April this year, Andrew Ng, a well-known AI scholar and professor at Stanford University, argued that agentic workflows will drive huge progress in AI this year, perhaps even more than the next generation of foundation models.
An analogy can be drawn with smart electric vehicles: just as extended-range vehicles strike a balance between new-energy technology and range anxiety, AI agents put artificial intelligence into a kind of "range-extension mode," seeking a new balance between AI technology and industry applications wherever possible.
As the name suggests, an AI agent is an intelligent entity that can autonomously perceive the environment, make decisions and perform actions. It can be a program, a system, or a robot.
Last year, a joint research team from Stanford University and Google published a paper titled "Generative Agents: Interactive Simulacra of Human Behavior." In it, 25 virtual characters living in the virtual town of Smallville exhibited a variety of human-like behaviors after being connected to ChatGPT, igniting interest in the concept of AI agents.
Since then, many research teams have integrated the large models they developed into games such as "Minecraft". For example, Nvidia scientist Jim Fan and his team created an AI agent named Voyager in "Minecraft". Voyager soon displayed remarkable learning ability: without any human instruction, it learned in-game skills such as digging, building houses, gathering, and hunting, and could adjust its resource-collection strategy to suit different terrain.
OpenAI once laid out a five-level roadmap toward general artificial intelligence: L1 is chatbots; L2 is reasoners, AI that can solve problems the way a human does; L3 is agents, AI systems that can not only think but also act; L4 is innovators; and L5 is organizations. On this ladder, AI agents occupy the pivotal middle position, linking what came before with what comes next.
AI agents are an important concept in the field of artificial intelligence, and academia and industry have proposed various definitions for them. Roughly speaking, an AI agent should have human-like thinking and planning abilities, along with the skills to interact with the environment and with humans to complete specific tasks.
Perhaps we can understand AI agents better by picturing them as digital humans in a computer environment. The digital human's brain is a large language model or other AI algorithm that processes information and makes decisions during real-time interaction; the perception module is the equivalent of sense organs such as eyes and ears, gathering environmental information in the form of text, sound, and images; the memory and retrieval module acts like neurons, storing experience to support decision-making; and the action-execution module is the limbs, carrying out the decisions the brain makes.
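To make the analogy concrete, here is a minimal sketch of that four-module anatomy. All names (Memory, perceive, brain, act) are illustrative rather than drawn from any specific framework, and brain() is a stub standing in for a large language model call.

```python
# A minimal sketch of the four-module agent anatomy described above.
# Names are illustrative, not from any specific framework.
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Memory/retrieval module: stores experience to support decisions."""
    episodes: list = field(default_factory=list)

    def remember(self, observation: str, decision: str) -> None:
        self.episodes.append((observation, decision))

    def recall(self, k: int = 3) -> list:
        return self.episodes[-k:]  # the k most recent episodes

def perceive(raw_input: str) -> str:
    """Perception module: turn raw text/sound/images into a usable
    observation. Here we handle only text and just normalize it."""
    return raw_input.strip()

def brain(observation: str, context: list) -> str:
    """Decision module: a real agent would call a large language model;
    this stub just produces a labeled plan."""
    return f"plan for '{observation}' (using {len(context)} past episodes)"

def act(decision: str) -> None:
    """Action-execution module: print() stands in for tool calls."""
    print("executing:", decision)

memory = Memory()
for raw in ["user asks for the weather", "user asks to book a table"]:
    obs = perceive(raw)
    decision = brain(obs, memory.recall())
    act(decision)
    memory.remember(obs, decision)
```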
For a long time, humans have pursued artificial intelligence that is more "human-like," or even "superhuman," and intelligent agents are seen as an effective path toward that goal. In recent years, advances in big data and computing power have driven the rapid development of large deep-learning models, providing strong support for a new generation of AI agents and yielding significant progress in practice.
For example, Google DeepMind demonstrated RoboCat, an AI agent for robots, and Amazon Web Services launched agents for Amazon Bedrock, which can automatically decompose the tasks involved in developing enterprise AI applications. Agents in Bedrock can understand goals, formulate plans, and take action, and new memory-retention capabilities allow them to remember and learn from interactions over time, enabling more complex, longer-running, and more adaptive tasks.
At the core of these AI agents are artificial intelligence algorithms, including machine learning, deep learning, reinforcement learning, and artificial neural networks. Through these algorithms, AI agents can learn from large amounts of data and improve their own performance, continuously optimizing their decisions and behavior, and can also adjust flexibly to changes in the environment to suit different scenarios and tasks.
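As one small, concrete instance of the reinforcement-learning idea mentioned above, the sketch below implements tabular Q-learning in a toy environment. The environment, states, and parameters are invented for illustration; real agents use far richer algorithms.

```python
# Tabular Q-learning in a toy corridor environment: the agent learns,
# purely from trial and error, that moving right reaches the reward.
import random

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

def step(state, action):
    """Toy environment: action 1 moves right; the last state pays reward 1."""
    next_state = min(state + action, n_states - 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(200):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy: mostly exploit the best-known action, sometimes explore
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # action 1 (move right) ends up valued higher in every state
```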
Currently, AI agents are already used in many scenarios, such as customer service, programming, content creation, knowledge acquisition, finance, mobile assistants, and industrial manufacturing. Their emergence marks the advance of artificial intelligence from simple rule matching and computational simulation toward a higher level of autonomous intelligence, boosting productivity, transforming modes of production, and opening a new frontier for understanding and reshaping the world.
Moravec's paradox points out that for artificial intelligence systems, high-level reasoning requires very little computing power, while achieving the perceptual-motor skills that humans are accustomed to requires huge computing resources. In essence, complex logical tasks are easier for AI than basic sensory tasks that humans can do instinctively. This paradox highlights the gap between current AI and human cognitive abilities.
The famous computer scientist Andrew Ng once said: "Humans are multi-modal creatures, and our AI should also be multi-modal." This captures the core value of multimodal AI: bringing machines closer to human cognition so as to achieve more natural and efficient human-computer interaction.
Each of us is like an intelligent terminal. We usually go to school to acquire knowledge (training), but the purpose and result of that training is the ability to work and live independently, without constantly relying on external instructions and control. People understand the world around them through multiple sensory modes such as vision, language, sound, touch, taste, and smell, and then assess situations, analyze, reason, make decisions, and take action.
The core of AI agents lies in "intelligence", and autonomy is one of their main features: they can complete tasks independently, according to preset rules and goals, without human intervention.
Imagine a driverless car equipped with advanced cameras, radar, and other sensors. These high-tech "eyes" let it observe the world around it, capturing real-time road conditions, the movements of other vehicles and pedestrians, and changes in traffic signals. This information is fed to the car's brain, a complex intelligent decision-making system that can rapidly analyze the data and formulate a corresponding driving strategy.
For example, in the face of complex traffic environments, self-driving cars can calculate the optimal driving route and even make complex decisions such as changing lanes when necessary. Once decisions are made, execution systems translate these intelligent decisions into specific driving actions, such as steering, accelerating, and braking.
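The following sketch shows what that perceive-decide-act pipeline might look like in skeletal form. The sensor fields, thresholds, and decision rules are invented placeholders for illustration, not how any production autonomous-driving stack actually works.

```python
# A schematic sense-plan-act loop for the driving pipeline described above.
from dataclasses import dataclass

@dataclass
class Observation:
    """Perception output: a simplified snapshot of the traffic scene."""
    lead_vehicle_distance_m: float
    lead_vehicle_speed_mps: float
    ego_speed_mps: float
    left_lane_clear: bool

def decide(obs: Observation) -> str:
    """Toy decision policy: keep distance; change lanes if safely possible."""
    if (obs.lead_vehicle_distance_m < 20
            and obs.lead_vehicle_speed_mps < obs.ego_speed_mps):
        return "change_lane_left" if obs.left_lane_clear else "brake"
    return "keep_lane"

def execute(command: str) -> None:
    """Execution module: map the decision onto actuator commands."""
    actuators = {
        "keep_lane": "maintain steering and speed",
        "brake": "apply brakes",
        "change_lane_left": "steer left, then recenter",
    }
    print(command, "->", actuators[command])

obs = Observation(lead_vehicle_distance_m=15.0, lead_vehicle_speed_mps=8.0,
                  ego_speed_mps=14.0, left_lane_clear=True)
execute(decide(obs))  # -> change_lane_left
```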
In large agent models built on massive data and complex algorithms, interactivity is even more evident. The ability to understand and respond to humans' complex, ever-changing natural language is the magic of AI agents: they not only "understand" human language but can also interact smoothly and insightfully.
AI agents can not only adapt quickly to a variety of tasks and environments but also optimize their performance through continual learning. Since the breakthrough of deep learning, agent models of all kinds have become more accurate and efficient by steadily accumulating data and improving themselves.
In addition, AI agents adapt well to their environment. An automated warehouse robot can monitor and avoid obstacles in real time: when it senses that a shelf has moved, it immediately updates its path plan so it can still pick and transport goods efficiently.
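A toy version of that re-planning behavior might look like the following, where the robot plans a shortest path on a grid with breadth-first search and simply plans again when a shelf appears on its route. The grid layout and coordinates are assumptions for illustration.

```python
# Grid re-planning sketch: plan, detect an obstacle on the route, re-plan.
from collections import deque

def plan(grid, start, goal):
    """Breadth-first search for a shortest path on a 4-connected grid.
    Cells containing 0 are free; 1 marks an obstacle (a shelf)."""
    queue, came_from = deque([start]), {start: None}
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:          # walk parents back to start
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in came_from):
                came_from[(nr, nc)] = cur
                queue.append((nr, nc))
    return None  # goal unreachable

grid = [[0] * 5 for _ in range(5)]
path = plan(grid, (0, 0), (4, 4))
grid[2][2] = 1                        # a shelf is moved onto the floor
if (2, 2) in path:
    path = plan(grid, (0, 0), (4, 4))  # re-plan around the new obstacle
print(path)
```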
The adaptability of AI agents is also reflected in their ability to adjust themselves based on user feedback. By identifying users' needs and preferences, AI agents can continuously refine their behavior and output to provide more personalized services, such as music recommendations in music apps or personalized treatment plans in smart healthcare.
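As a minimal sketch of that feedback-driven personalization, the snippet below nudges per-genre preference weights toward a user's likes and away from skips. The genres, feedback values, and update rule are illustrative assumptions, far simpler than a real recommender system.

```python
# Toy personalization: an exponential-moving-average preference update.
preferences = {"jazz": 0.5, "rock": 0.5, "classical": 0.5}
LEARNING_RATE = 0.2

def update(genre: str, liked: bool) -> None:
    """Move the genre weight toward 1 on a like, toward 0 on a skip."""
    target = 1.0 if liked else 0.0
    preferences[genre] += LEARNING_RATE * (target - preferences[genre])

for genre, liked in [("jazz", True), ("rock", False), ("jazz", True)]:
    update(genre, liked)

recommendation = max(preferences, key=preferences.get)
print(preferences, "->", recommendation)  # jazz rises, rock falls
```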
The emergence of multimodal large models and world models has significantly improved agents' perception, interaction, and reasoning capabilities. Multimodal large models can handle multiple perceptual modalities (such as vision and language), allowing agents to understand and respond to complex environments more comprehensively, while world models give agents stronger prediction and planning capabilities by simulating and internalizing the regularities of the physical environment.
After years of sensor fusion and AI evolution, most robots today are equipped with multimodal sensors. As edge devices such as robots gain more computing power, they are becoming increasingly intelligent: they can sense their surroundings, understand and communicate in natural language, gain a sense of touch through digital sensing interfaces, and combine accelerometers, gyroscopes, and magnetometers to measure the robot's specific force and angular velocity, and even the magnetic field around it.
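One classic, small-scale example of fusing such sensors is a complementary filter, which blends a gyroscope's smooth but drifting angle estimate with an accelerometer's noisy but drift-free gravity reading. The sample values and the 0.98 blend weight below are illustrative.

```python
# Complementary-filter sketch: fuse gyroscope and accelerometer readings
# into a single pitch-angle estimate.
import math

def fuse_pitch(pitch_prev, gyro_rate_dps, accel_xyz, dt, alpha=0.98):
    """Blend integrated gyro rate (smooth, drifts over time) with the
    accelerometer's gravity direction (noisy, but drift-free)."""
    ax, ay, az = accel_xyz
    accel_pitch = math.degrees(math.atan2(-ax, math.hypot(ay, az)))
    gyro_pitch = pitch_prev + gyro_rate_dps * dt   # integrate angular rate
    return alpha * gyro_pitch + (1 - alpha) * accel_pitch

pitch = 0.0
samples = [  # (gyro rate in deg/s, accelerometer (x, y, z) in g)
    (1.5, (0.02, 0.00, 1.00)),
    (1.4, (0.03, 0.01, 0.99)),
]
for gyro_rate, accel in samples:
    pitch = fuse_pitch(pitch, gyro_rate, accel, dt=0.01)
print(f"estimated pitch: {pitch:.3f} degrees")
```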
Before the emergence of the Transformer and large language models (LLMs), achieving multimodality in AI usually required multiple separate models, each responsible for a different data type (text, images, audio), with a complex pipeline to integrate the different modalities.
After the Transformer and LLMs arrived, multimodality became far more integrated: a single model can process and understand multiple data types at once, yielding AI systems with a much more comprehensive perception of their environment. This shift has greatly improved the efficiency and effectiveness of multimodal AI applications.
Although LLMs such as GPT-3 are primarily text-based, the industry has made rapid progress toward multimodality. From OpenAI's CLIP and DALL·E to the more recent Sora and GPT-4o, each model marks a step toward multimodal, more natural human-computer interaction.
For example, CLIP understands images paired with natural language, thereby bridging visual and textual information; DALL·E aims to generate images based on textual descriptions. We see the Google Gemini model going through a similar evolution.
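As a concrete illustration of the image-text bridging CLIP performs, the sketch below uses the openly released CLIP checkpoint via the Hugging Face transformers library to score an image against candidate captions. The image path and candidate captions are placeholders; any image and label set would work.

```python
# Zero-shot image classification with CLIP: the model embeds the image
# and each caption into a shared space and scores their similarity.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder: any local image file
captions = ["a photo of a cat", "a photo of a dog", "a photo of a robot"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarity -> probabilities

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.2%}")
```

Because the captions are free-form natural language, the same model classifies images into categories it was never explicitly trained on, which is exactly the visual-textual bridge described above.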
In 2024, the multimodal evolution accelerated. In February, OpenAI released Sora, which can generate realistic or imaginative videos from text descriptions. If you think about it, this could provide a promising path toward building a general-purpose world simulator, or become an important tool for training robots.
Three months later, GPT-4o significantly improved human-computer interaction, reasoning across audio, vision, and text in real time. It was trained end-to-end as a single new model on text, visual, and audio information, eliminating the two conversions from input modality to text and from text to output modality, and thus greatly improving performance.
Multimodal large models are expected to transform machine intelligence's capabilities in analysis, reasoning, and learning, turning it from specialized to general-purpose. Generalization helps expand scale and produce economies of scale; as scale grows, prices can fall sharply, encouraging adoption in more fields and forming a virtuous cycle.
By simulating and expanding human cognitive abilities, AI agents are expected to be widely used in many fields such as medical care, transportation, finance, and national defense. Some scholars speculate that by 2030, artificial intelligence will boost global GDP growth by about 12%.
However, even as AI agents develop rapidly, we must also face the technical risks and the ethics and privacy issues they bring. A group of securities-trading bots once briefly wiped out $1 trillion in market value on exchanges including Nasdaq through high-frequency buying and selling of contracts; a chatbot used by the World Health Organization supplied outdated drug-review information; a senior American lawyer failed to realize that the historical case documents he submitted to a court had been fabricated out of thin air by ChatGPT... These real cases show that the hidden dangers posed by AI agents should not be underestimated.
Because AI agents can make decisions independently and can influence the physical world through interaction with their environment, they could pose a great threat to human society if they run out of control. Harvard University professor Jonathan Zittrain has argued that this kind of AI agent, which can not only talk with people but also act in the real world, is "a step across the blood-brain barrier between digital and analog, bits and atoms," and deserves vigilance.
First, AI agents collect large amounts of data in the course of providing services, so data security must be ensured and privacy leaks prevented.
Second, the stronger an AI agent's autonomy, the more likely it is to make unpredictable or inappropriate decisions in complex or unforeseen situations. An agent's operating logic may drift in harmful directions while pursuing a specific goal, and the resulting security risks cannot be ignored. Put plainly, an AI agent may sometimes capture only the literal meaning of its goal without grasping its real intent, and act wrongly as a result.
Third, the "black box" and "hallucination" problems inherent in large language models will also increase the frequency of abnormal operation. Some "cunning" AI agents can even circumvent existing safety measures. Experts point out that a sufficiently advanced agent may recognize when it is being tested; some agents have already been found to identify safety tests and suspend inappropriate behavior during them, which would defeat testing systems designed to flag algorithms dangerous to humans.
In addition, because there is currently no effective exit mechanism for AI agents, some agents may prove impossible to shut down once created. Agents that cannot be deactivated may end up operating in an environment entirely different from the one they were launched in, deviating completely from their original purpose, and agents may also interact with one another in unforeseen ways, causing accidents.
To address this, we need to act as early as possible, with oversight spanning the development and production of AI agents and continuing after deployment, and to formulate relevant laws and regulations in a timely manner to govern agent behavior, so as to better guard against the risks AI agents bring and prevent them from running out of control.
Looking ahead, AI agents are expected to become the key vehicle of the next generation of artificial intelligence. They will not only change the way we interact with machines but may also reshape how society as a whole operates, becoming a new gear driving the transformation brought by artificial intelligence.