I have been busy studying website optimization recently, and on a whim I spent some time looking into how search engines work. This article should be a real gain for SEOers: only by understanding a search engine's mechanisms and principles can you earn better rankings.
The technical problems a search engine has to solve roughly break down into: the spider program, classification and indexing, the vocabulary, ranking-algorithm factors, database indexing and optimization, and database structure.
1. Spider. A spider can be written in either C or PHP. Baidu's spiders are mostly written in C: C supports multiple database connection interfaces, runs faster than PHP, and gives better control over the low level. Even so, I plan to use PHP; I need to save time and cannot afford to learn C right now. If efficiency becomes a problem later I can switch to C, and the database can stay the same, since C can also connect to MySQL. PHP has its pros and cons: writing a spider in it should not be a big problem, but its biggest weakness is that it can be very slow. The issues that come up while crawling are the crawl order, how to record a fetch that fails or times out, and when to schedule the next crawl. A brand-new search-engine database contains no URLs at all, so a large number of URLs have to be seeded; a for loop over the English letters can generate candidates automatically, but domain names also contain hyphens and digits, and those can only be entered by hand, because generating them in a loop would probably produce mostly failures. The fetched source also has to be inspected to see whether its encoding is utf-8 or gb2312; my search engine only wants to index Simplified Chinese. If a fetch times out, the failure is recorded and the URL is retried about ten days later; after three consecutive timeouts the URL is removed from the database. A rough sketch of this fetch-and-record step follows below.
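Here is a minimal sketch of that fetch-and-record step in PHP with cURL and mysqli. The `urls` table and its `fail_count` / `next_fetch` columns are my own placeholder names, not a finished design.

```php
<?php
// Minimal sketch of one fetch attempt.
// Assumes a MySQL table `urls` with columns: id, url, fail_count, next_fetch.

function fetch_page(string $url): ?string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,   // give up on slow connections
        CURLOPT_TIMEOUT        => 30,   // total timeout per page
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}

$db  = new mysqli('localhost', 'user', 'pass', 'spider');
$row = $db->query("SELECT id, url, fail_count FROM urls WHERE next_fetch <= NOW() LIMIT 1")
          ->fetch_assoc();

$html = fetch_page($row['url']);

if ($html === null) {
    // Failed or timed out: retry in about ten days, drop the URL after 3 consecutive failures.
    if ($row['fail_count'] + 1 >= 3) {
        $db->query("DELETE FROM urls WHERE id = " . (int)$row['id']);
    } else {
        $db->query("UPDATE urls SET fail_count = fail_count + 1,
                    next_fetch = NOW() + INTERVAL 10 DAY WHERE id = " . (int)$row['id']);
    }
} else {
    // Only keep Simplified Chinese pages; normalise gb2312/gbk to utf-8 before storing.
    $charset = mb_detect_encoding($html, ['UTF-8', 'GB2312', 'GBK'], true) ?: 'UTF-8';
    if ($charset !== 'UTF-8') {
        $html = mb_convert_encoding($html, 'UTF-8', $charset);
    }
    // ... parse links, store the content, reset fail_count, schedule the next crawl ...
}
?>
```

Crawl order and link extraction would sit on top of this loop; the timeout bookkeeping is the part sketched here.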
2. Index creation is a very hard problem. Baidu and Google can build distributed indexes across their own server farms; I do not have that many servers, so I want to try another approach: static pages. I noticed earlier that querying a relatively uncommon word on Baidu or Google takes about 0.2 seconds while a common word takes only about 0.1 seconds, and repeating the same query a second time is much faster still; that is presumably the effect of the index. If the index sits in memory, reads are very fast, but I only have one server, and even holding just the index for 50,000 common query terms would strain it: a result page is at least 20 KB, so 50,000 pages come to roughly 20 KB × 50,000 ≈ 1 GB, and that is only the first page of each of the 50,000 terms. If users page through results, memory will certainly not be enough; and if only the first page is kept in memory, paging through results gains nothing in speed. So I plan to go fully static: simulate a query for each of the 50,000 terms and generate static pages, keep the first page of every term in memory, and put the later pages on the hard disk. If all the pages could fit in memory, this problem would already be solved. A sketch of the pre-generation step is below.
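A rough sketch of pre-generating the static first page for each term. The `terms` and `pages` tables, the RAM-disk directory, and the naive LIKE query are all assumptions of mine, not the final design.

```php
<?php
// Sketch: pre-generate a static HTML result page for each common query term,
// so the real query runs once offline instead of at user request time.

$db    = new mysqli('localhost', 'user', 'pass', 'spider');
$terms = $db->query("SELECT term FROM terms LIMIT 50000");

while ($row = $terms->fetch_assoc()) {
    $term   = $row['term'];
    $result = $db->query(
        "SELECT url, title FROM pages WHERE title LIKE '%" .
        $db->real_escape_string($term) . "%' LIMIT 10"
    );

    $html = "<html><body><h1>" . htmlspecialchars($term) . "</h1><ul>";
    while ($hit = $result->fetch_assoc()) {
        $html .= "<li><a href='" . htmlspecialchars($hit['url']) . "'>"
               . htmlspecialchars($hit['title']) . "</a></li>";
    }
    $html .= "</ul></body></html>";

    // First page goes to a directory served from a RAM disk; later pages would go to disk.
    file_put_contents('/ramdisk/results/' . md5($term) . '_1.html', $html);
}
?>
```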
3. Vocabulary. There are thousands of Chinese characters, and there are at least 3,000 commonly used Chinese characters. It is estimated that there are 20,000 commonly used words composed of it. How to add this thesaurus? In what format should it be stored? CSV file, database, or text file? Previously, I thought about finding the thesaurus file of Kingsoft PowerWord and trying to copy it directly. This method has not been successful yet.
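A sketch of importing a word list, assuming a plain UTF-8 text file with one word per line; `wordlist.txt` and the `dictionary` table are made-up names, not a PowerWord format.

```php
<?php
// Sketch: load a dictionary file (one word per line, UTF-8) into a `dictionary` table.

$db = new mysqli('localhost', 'user', 'pass', 'spider');
$db->query("CREATE TABLE IF NOT EXISTS dictionary (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    word VARCHAR(32) NOT NULL UNIQUE
) DEFAULT CHARSET=utf8mb4");

$stmt = $db->prepare("INSERT IGNORE INTO dictionary (word) VALUES (?)");
foreach (file('wordlist.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $word) {
    $word = trim($word);
    $stmt->bind_param('s', $word);
    $stmt->execute();   // IGNORE skips duplicates thanks to the UNIQUE key
}
?>
```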
4. All the factors that affect ranking should live in one table. Some are fixed factors, properties of the website itself; others are variable factors that change with the words the user types, or with the time of day, the season, and so on. The fixed factors are stored in a table and a total score is pre-computed for every site; part of the variable score is generated ahead of time, and the rest is calculated after the user's input arrives. A sketch of combining the two parts is below.
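A sketch of how the two parts could be combined at query time. The weights, the `fixed_score` column, and the `query_score()` helper are invented examples, not a finished scoring formula.

```php
<?php
// Sketch: final score = pre-computed fixed (per-site) score + query-dependent score.

function query_score(array $page, string $query): float {
    $score = 0.0;
    if (stripos($page['title'], $query) !== false) $score += 10;  // query appears in title
    if (stripos($page['body'],  $query) !== false) $score += 3;   // query appears in body
    return $score;
}

$db    = new mysqli('localhost', 'user', 'pass', 'spider');
$query = '减肥';   // example user input

$result = $db->query("SELECT url, title, body, fixed_score FROM pages LIMIT 1000");
$ranked = [];
while ($page = $result->fetch_assoc()) {
    $ranked[$page['url']] = $page['fixed_score'] + query_score($page, $query);
}
arsort($ranked);                                   // highest score first
print_r(array_slice($ranked, 0, 10, true));        // top 10 results
?>
```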
5. There is currently no good solution for database indexing. There must not be too many indexes, because too many will hurt speed.
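As a rule of thumb I would only index the handful of columns that queries actually filter on. A sketch, using the `sites` table laid out under point 6 below:

```php
<?php
// Keep indexes few: every extra index slows down the inserts and updates made by the spider.
// Column names follow the `sites` table sketched under point 6 below.
$db = new mysqli('localhost', 'user', 'pass', 'spider');
$db->query("CREATE INDEX idx_homepage ON sites (homepage_url)");   // looked up on every crawl
$db->query("CREATE INDEX idx_category ON sites (site_category)");  // used when filtering by class
?>
```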
6. Database structure. This is critical. The structure has to be finalized before the site's front-end interface is built, and it must leave an interface for future upgrades, such as adding algorithm factors or changing fields to optimize query statements. The preliminary structure is this: one to three tables store website information. The first field is an auto-increment primary key, the second is the homepage address of the site, followed in order by the domain registration time, the time the site was collected, the time of the last snapshot, the total number of pages included, the number of bytes on the homepage, the domain-name class (com/cn/org/net/gov/edu), the total number of backlinks, the website category (1-10 to start, which a portal could expand to 30), and so on. A sketch of such a table is below.
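A minimal sketch of that site-information table as a MySQL CREATE TABLE; every field name and type here is my own guess at the list above.

```php
<?php
// Sketch of the preliminary site-information table (names and types are placeholders).
$db = new mysqli('localhost', 'user', 'pass', 'spider');
$db->query("CREATE TABLE IF NOT EXISTS sites (
    id              INT AUTO_INCREMENT PRIMARY KEY,  -- auto-increment primary key
    homepage_url    VARCHAR(255) NOT NULL,           -- homepage address of the site
    domain_reg_time DATETIME,                        -- domain registration time
    collected_time  DATETIME,                        -- when the site was collected
    snapshot_time   DATETIME,                        -- time of the last snapshot
    page_count      INT DEFAULT 0,                   -- total pages included
    homepage_bytes  INT DEFAULT 0,                   -- size of the homepage in bytes
    domain_class    ENUM('com','cn','org','net','gov','edu'),
    backlink_count  INT DEFAULT 0,                   -- total number of backlinks
    site_category   TINYINT DEFAULT 1                -- 1-10 now, room to grow to 30
) DEFAULT CHARSET=utf8mb4");
?>
```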
Reproduced from www.jianfeiyiqi.com; please credit the source with a link when reprinting.