Academic research depends on efficient literature search, but existing search engines struggle with complex, specialized queries. For example, a query about non-stationary reinforcement learning restricted to specific algorithms (such as UCB methods) demands stronger search and analysis capabilities. Researchers often spend a lot of time and effort manually combing through huge academic databases. This article introduces PaSa, an autonomous academic paper search agent based on large language models (LLMs), developed by ByteDance Research Institute and Peking University to address this problem.
In academic research, literature search is an important and complex way of obtaining information. Researchers need search tools that can handle complex queries in specialized fields in order to meet fine-grained research needs. However, existing academic search platforms, such as Google Scholar, often struggle with such queries. For example, a specialized query about non-stationary reinforcement learning using UCB methods requires stronger computational and analytical capabilities. In addition, researchers often have to spend a lot of time and effort manually browsing huge academic databases when conducting literature reviews.
Several studies have explored the application of large language models (LLMs) to academic paper search and scientific discovery, yet traditional search tools still have difficulty meeting complex, specialized research needs. Many studies focus on developing LLM agents through optimization frameworks and prompt engineering. Although methods such as the AGILE RL framework have significantly improved the overall capabilities of agents, an autonomous and accurate academic paper search solution has not yet been found, which leaves a significant gap in the field.
Recently, researchers from ByteDance Research Institute and Peking University jointly proposed PaSa, an LLM-based paper search agent. PaSa can autonomously execute complex search strategies, including tool calls, paper reading, and reference selection, with the aim of generating comprehensive and accurate results for complex academic queries. To optimize PaSa's performance, the research team created AutoScholarQuery, a synthetic dataset containing 35,000 fine-grained academic queries, and established RealScholarQuery as a benchmark for evaluating the agent's performance on real-world queries. The system uses reinforcement learning to enhance its search capabilities, addressing the main limitations of existing academic search methods.
The PaSa system consists of two LLM agents, a crawler and a selector, that work together to perform a comprehensive academic paper search. The crawler first analyzes the user's query and generates multiple fine-grained search queries to retrieve relevant papers, which it adds to a dedicated paper queue. The crawler then processes each queued paper, identifies and explores key citations that may expand the scope of the search, and dynamically adds newly discovered related papers to the queue. The selector then evaluates whether each paper meets the requirements of the original query.
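The crawler and selector workflow described above can be sketched roughly as follows. This is a minimal illustration, not PaSa's actual implementation: the `Paper` class, `run_search`, the toy corpus, and the `toy_*` helpers are all hypothetical names introduced here, and the keyword matching only stands in for what would be LLM calls and a real search tool.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Paper:
    paper_id: str
    title: str
    citations: tuple = ()  # ids of the papers this paper cites


def run_search(
    user_query: str,
    expand_query: Callable[[str], Iterable[str]],
    search: Callable[[str], Iterable[Paper]],
    lookup: Callable[[str], Optional[Paper]],
    is_relevant: Callable[[str, Paper], bool],
    max_papers: int = 50,
) -> list:
    """Crawl papers via search and citation expansion, then filter them."""
    queue = deque()
    seen = set()

    def enqueue(paper: Paper) -> None:
        # Skip duplicates and stop growing once the (assumed) budget is hit.
        if paper.paper_id not in seen and len(seen) < max_papers:
            seen.add(paper.paper_id)
            queue.append(paper)

    # Crawler, step 1: turn the user query into several search queries.
    for sub_query in expand_query(user_query):
        for paper in search(sub_query):
            enqueue(paper)

    # Crawler, step 2: process queued papers and follow their citations.
    collected = []
    while queue:
        paper = queue.popleft()
        collected.append(paper)
        for cited_id in paper.citations:
            cited = lookup(cited_id)
            if cited is not None:
                enqueue(cited)

    # Selector: keep only papers that satisfy the original query.
    return [p for p in collected if is_relevant(user_query, p)]


# Toy stand-ins (assumptions): a real system would call an LLM and a search API.
CORPUS = {
    "p1": Paper("p1", "UCB for non-stationary bandits", ("p2", "p3")),
    "p2": Paper("p2", "Discounted upper confidence bound policies"),
    "p3": Paper("p3", "Image segmentation survey"),
}


def toy_expand(query: str) -> list:
    return query.split()


def toy_search(sub_query: str) -> list:
    return [p for p in CORPUS.values() if sub_query.lower() in p.title.lower()]


def toy_select(query: str, paper: Paper) -> bool:
    title = paper.title.lower()
    return "ucb" in title or "upper confidence bound" in title


results = run_search(
    "non-stationary UCB", toy_expand, toy_search, CORPUS.get, toy_select
)
print([p.paper_id for p in results])
```

In this toy run, `p2` is never returned by keyword search and is reached only through the citation list of `p1`, which illustrates why citation expansion can widen coverage. The selector then drops `p3`, which was crawled but does not match the query.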
Experimental results show that PaSa-7b performs strongly on multiple benchmarks. On the AutoScholarQuery test set, PaSa-7b improves recall by 9.64% over PaSa-GPT-4o. Compared with Google-based baselines, PaSa-7b improves recall by between 33.80% and 42.64%. In the more challenging RealScholarQuery scenario, PaSa-7b shows a 30.36% gain in recall and a 4.25% gain in precision.
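For readers unfamiliar with the metrics, the sketch below shows the standard definitions of precision and recall for a single query. The function name and the example paper ids are illustrative assumptions, and this article does not state exactly how the reported gains were aggregated across queries.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query using set overlap."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of returned papers that are actually relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant papers that were actually returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 papers returned, 3 truly relevant, 2 in common.
precision, recall = precision_recall(
    ["p1", "p2", "p3", "p4"], ["p1", "p2", "p5"]
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A search agent that explores more citations tends to raise recall, while a stricter selector tends to raise precision, which is why both numbers are reported.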
Overall, the launch of PaSa marks an important advance in academic paper search technology and provides an effective solution for information retrieval in academic research. By combining large language models with reinforcement learning, PaSa can greatly reduce the time and effort researchers invest in literature reviews, and gives them an efficient tool for dealing with an increasingly large and complex body of academic literature.
Code: https://github.com/bytedance/pasa
Paper: https://arxiv.org/abs/2501.10120
Key points:
**PaSa is an intelligent academic paper search agent jointly launched by ByteDance and Peking University researchers.**
**The system consists of two LLM agents, a crawler and a selector, and can independently execute complex search strategies.**
**Experimental results show that PaSa-7b performs better than existing search methods in multiple benchmark tests, significantly improving the efficiency and accuracy of paper search.**
PaSa has the potential to substantially change how researchers search the literature. By improving the efficiency and accuracy of academic paper search, it can save researchers considerable time and energy, letting them focus on more important research work. Its further development and application are worth watching.