The Internet offers abundant resources, but finding the relevant information efficiently is difficult. Building a search engine is one of the most effective ways to address this problem. This article first describes the system architecture of an Internet search engine and then explains its three components in detail: the network robot, the index engine, and the Web server. To gain a deeper understanding of the technology, I also implemented a search engine of my own, a news search engine. It starts from specified web pages, follows their hyperlinks to fetch and parse further pages, and indexes each news item it finds before adding it to the index database. The Web server then accepts client requests and searches the index database for matching news. In the chapters on each component, besides explaining the core techniques, I illustrate them with the implementation code of the news search engine, accompanied by figures and explanatory text.
Table of Contents

Summary
Chapter 1 Introduction
Chapter 2 The structure of search engines
    2.1 System Overview
    2.2 Composition of search engines
        2.2.1 Network robot
        2.2.2 Indexing and Search
        2.2.3 Web server
    2.3 Main indicators and analysis of search engines
    Section 2.4
Chapter 3 Network Robot
    3.1 What is a network robot
    3.2 Structural analysis of network robots
        3.2.1 How to parse HTML
        3.2.2 Spider program structure
        3.2.3 How to construct a Spider program
        3.2.4 How to improve program performance
        3.2.5 Code analysis of network robots
    Section 3.3
Chapter 4 Indexing and Search Based on LUCENE
    4.1 What is LUCENE full text search
    4.2 Principle analysis of LUCENE
        4.2.1 Implementation mechanism of full-text retrieval
        4.2.2 Lucene's indexing efficiency
        4.2.3 Chinese word segmentation mechanism
    4.3 Combination of LUCENE and SPIDER
    Section 4.4
Chapter 5 TOMCAT-based WEB server
    5.1 What is a TOMCAT-based WEB server
    5.2 User interface design
        5.3.1 Client design
        5.3.2 Server design
    5.3 Deploy the project on TOMCAT
    Section 5.4
Chapter 6 Search Engine Strategy
    6.1 Introduction
    6.2 Topic-oriented search strategy
        6.2.1 Guide words
        6.2.3 Authoritative web pages and central web pages
    Section 6.3
Reference