With the rapid development of network science and technology, people are becoming increasingly dependent on search engines. In the 21st century, as network resources grow ever more abundant and the demand for online information keeps rising, search technology occupies the commanding heights of the Internet. Today, people routinely use search engines to find multimedia material, the latest news, maps, and other information.
1. The basic principles of search engines
A search engine is a system that collects web page information, builds a database from it, and provides query services.
1.1 Structure of search engines
Web page collection crawls pages with web spiders, following the links in each page to reach further pages. In this way a large number of pages are gathered, then compressed and stored in the repository. Spider programs crawl the web continuously to keep the stored information timely and valid.
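As an illustration of the collection step, here is a minimal breadth-first spider sketch in Python using only the standard library. The seed URL, page limit, and politeness delay are assumptions made for the example; a production spider would also honor robots.txt, compress the pages, and persist them.

```python
# Minimal breadth-first web spider sketch (illustrative; the page limit
# and politeness delay are assumptions, not values from this article).
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10, delay=1.0):
    """Fetch pages breadth-first, following the links found in each page."""
    frontier, seen, store = deque([seed]), {seed}, {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                      # skip unreachable pages
        store[url] = html                 # in practice: compress and persist
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        time.sleep(delay)                 # be polite to servers
    return store

# pages = crawl("https://example.com/")  # example seed (assumption)
```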
Preprocessing performs link analysis on the collected pages, calculates page importance, extracts keywords, and builds an index database. The database's architecture must be conducive to searching, and the information it contains should be as comprehensive as possible.
Service refers to serving users: when a user enters a keyword, relevant information is quickly located in the index database according to that keyword and returned to the user.
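To make the preprocessing and service steps concrete, here is a minimal Python sketch of an inverted index and a keyword lookup against it. The three toy documents and the whitespace tokenizer are assumptions made for the example.

```python
# Toy inverted index: preprocessing builds it, the service layer queries it.
# The documents and the whitespace tokenizer are illustrative assumptions.
from collections import defaultdict

docs = {
    1: "search engines crawl and index the web",
    2: "web spiders crawl pages by following links",
    3: "users query the index with keywords",
}

# Preprocessing: map each keyword to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Service: look up each query keyword and intersect the posting sets.
def search(query):
    postings = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("crawl web"))  # -> {1, 2}
```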
1.2 Classification of search engines
Search engines can be divided into three categories: full-text search engines, directory search engines, and meta-search engines.
Full-text search engines use web spiders to crawl pages, extract their information, and store it in a database. At query time they match the keywords entered by the user against the database and return the results. This is the most commonly used type of search engine; Google and Baidu fall into this category.
Directory search engines classify the collected resources in a certain way, eventually building a large directory system. Users browse the directory layer by layer until they find the information they want. Strictly speaking, a directory search engine is not a true search engine. Yahoo and Sina fall into this category.
A metasearch engine is an engine that calls other search engines, so it can cover more resources and provide more comprehensive service. The most commonly used include Dogpile and Vivisimo, and, domestically, Star Search.
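To illustrate the idea, here is a minimal metasearch sketch in Python: a query is fanned out to several engines and the results are merged and de-duplicated. The two fetch functions are stubs standing in for real engine APIs, which the article does not specify.

```python
# Metasearch sketch: fan a query out to several engines and merge the
# results. The fetch functions are stubs standing in for real engine APIs.
def fetch_engine_a(query):
    return ["http://a.example/1", "http://shared.example/x"]

def fetch_engine_b(query):
    return ["http://shared.example/x", "http://b.example/2"]

def metasearch(query, engines):
    merged, seen = [], set()
    for engine in engines:
        for url in engine(query):      # call each underlying engine
            if url not in seen:        # de-duplicate across engines
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("search engines", [fetch_engine_a, fetch_engine_b]))
```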
These three types of search engine suit different situations and have their own strengths and weaknesses. Full-text search engines are generally used for comprehensive searches: their advantages are a large amount of information, timely updates, and no need for manual intervention; their disadvantage is that the sheer volume of information they process makes filtering difficult. Directory search engines are mostly oriented to websites, offering both directory browsing and direct retrieval; their manual curation helps improve the accuracy of search results, but it also brings high maintenance costs, slow updates, and a smaller amount of information. Because metasearch engines can query several other search engines, they are particularly suitable when a high recall rate is required; however, the underlying engines currently differ in how they build their index databases and perform retrieval, which greatly limits the retrieval effectiveness of metasearch tools.
2. Several key technologies for search engine implementation
2.1 Spiders
Web spiders can be implemented in the following ways:
(1) Breadth-first. A breadth-first algorithm visits links in the order they are encountered. It is the simplest of all spider strategies.
(2) Depth-first. Following the depth-priority idea, the spider computes the similarity between each candidate page and the search topic according to selected criteria and follows the link with the highest similarity. The similarity is usually computed with the cosine measure (see the sketch after this list).
(3) Page-rating based. The page rating is combined with the page content to score the documents found so far, and the link with the highest score is chosen as the next search target.
(4) InfoSpider. InfoSpider uses evolved keyword tables and neural networks to compute the similarity between web pages and the topic, and picks the next target from the results. The cost of retrieving a document adjusts an agent's energy, and the energy level determines whether the agent dies, reproduces, or survives.
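As a sketch of the similarity-driven strategies (2) and (3), the following Python code keeps the crawl frontier in a priority queue ordered by a relevance score, so the most promising link is always fetched next. The tiny in-memory "web" and the keyword-overlap score (a crude stand-in for cosine similarity) are assumptions made for the example.

```python
# Best-first crawl frontier sketch: pages are fetched in order of a
# relevance score rather than discovery order. The in-memory "web" and
# the keyword-overlap score are toy assumptions for the example.
import heapq

WEB = {  # page -> (text, outgoing links)
    "p0": ("search engine home", ["p1", "p2"]),
    "p1": ("cooking recipes", ["p3"]),
    "p2": ("web search and indexing", ["p3"]),
    "p3": ("crawler and index internals", []),
}

def score(text, topic):
    """Crude stand-in for cosine similarity: keyword overlap ratio."""
    words, topic_words = set(text.split()), set(topic.split())
    return len(words & topic_words) / len(topic_words)

def best_first_crawl(seed, topic):
    # heapq is a min-heap, so push negated scores to pop the best link first.
    frontier, visited, order = [(-1.0, seed)], set(), []
    while frontier:
        _, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        text, links = WEB[page]
        for link in links:
            if link not in visited:
                heapq.heappush(frontier, (-score(WEB[link][0], topic), link))
    return order

print(best_first_crawl("p0", "web search index"))  # prefers p2 over p1
```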
2.2 Judging the importance of web pages
There are two main methods for judging the importance of web pages: one is based on links, and the other is based on similarity.
Link-based computation assumes a credible relationship between the link information and the importance of the linked object. The following measures are often used in practice:
(1) In-degree: the number of web pages that contain links pointing to this page;
(2) Out-degree: the number of links from this page to other pages;
(3) PageRank: the probability that a user is visiting this page at any given moment.
These link-based methods are widely used and very effective (see the sketch below).
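As an illustration of measure (3), here is a minimal PageRank sketch computed by power iteration; in-degree and out-degree fall directly out of the same adjacency data. The three-page link graph and the damping factor of 0.85 are assumptions made for the example.

```python
# PageRank sketch via power iteration on a tiny link graph (the graph and
# the damping factor 0.85 are illustrative assumptions).
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

out_degree = {p: len(links) for p, links in graph.items()}
in_degree = {p: 0 for p in graph}
for links in graph.values():
    for target in links:
        in_degree[target] += 1

def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}          # start from a uniform surfer
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in graph}
        for page, links in graph.items():
            share = rank[page] / len(links)     # spread rank over out-links
            for target in links:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

print(in_degree, out_degree)
print(pagerank(graph))  # C gets the highest rank: it has the most in-links
```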
Similarity-based computation generally uses the vector space model: the query string and the text are converted into vectors, and the similarity between them is then evaluated, typically with the cosine measure.
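A minimal sketch of this vector space computation, assuming plain term-frequency vectors (no IDF weighting) and a whitespace tokenizer:

```python
# Vector space model sketch: represent the query and documents as term-
# frequency vectors and rank by cosine similarity. The documents and the
# bare TF weighting (no IDF) are simplifying assumptions.
import math
from collections import Counter

def cosine(u, v):
    common = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in common)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

docs = ["web spiders crawl web pages",
        "the index answers keyword queries",
        "spiders follow links between pages"]
query = "web spiders"

q_vec = Counter(query.split())
ranked = sorted(docs, key=lambda d: cosine(Counter(d.split()), q_vec),
                reverse=True)
print(ranked[0])  # -> "web spiders crawl web pages"
```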
2.3 The search engine hardware system
The hardware system is the backbone of the whole engine. To provide fast query responses, it generally adopts a distributed architecture; Google's servers, for example, are distributed around the world, and parallel techniques are used to speed up execution. In addition, the hardware design of the index database is very important and is critical to data access speed.
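The following Python sketch illustrates the distributed idea: the index is partitioned into shards that are searched in parallel and whose partial results are merged. The in-process shards and thread pool are stand-ins for real remote index servers, which the article does not describe in detail.

```python
# Sketch of the distributed-query idea: the index is partitioned into
# shards that are searched in parallel and the partial results merged.
# The in-process shards and thread pool stand in for real remote servers.
from concurrent.futures import ThreadPoolExecutor

shards = [  # each shard holds the postings for part of the collection
    {"web": {1, 2}, "index": {2}},
    {"web": {5}, "spider": {4, 5}},
]

def search_shard(shard, term):
    return shard.get(term, set())

def distributed_search(term):
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term), shards))
    result = set()
    for part in partials:   # merge the per-shard hit sets
        result |= part
    return result

print(distributed_search("web"))  # -> {1, 2, 5}
```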
3. Development trends of search engines
The search engines of the future will have the following characteristics:
(1) Able to collect almost all information on the Internet;
(2) Able to block illegal information;
(3) Improved recall and precision;
(4) Able to recognize not only text search terms but also images, audio, video, and so on;
(5) Faster information updates;
(6) Convenient cross-database querying;
(7) A user-friendly, personalized interactive interface;
(8) Intelligent search;
(9) Great progress in mobile search.
4. Summary
This article has explained search engines in detail, analyzed the implementation of their key technologies, and outlined future development trends. As technology develops and users' needs grow, search engines will become ever more intelligent, efficient, and practical.