While observing and analyzing our website logs, we found that many pages were being crawled repeatedly by spiders, which is bad for site optimization. So how do we prevent website pages from being crawled repeatedly by spiders?
1. Use the robots.txt file to block these pages. The specific method is as follows:

    Disallow: /page/              # Restrict crawling of WordPress pagination

If your site needs it, you can also add the following rules to avoid generating too many duplicate pages:

    Disallow: /category/*/page/*  # Restrict crawling of category pagination
    Disallow: /tag/               # Restrict crawling of tag pages
    Disallow: */trackback/        # Restrict crawling of Trackback content
    Disallow: /category/*         # Restrict crawling of all category lists

What is a spider? Also called a crawler, it is really just a program. Its job is to follow the URLs of your website and read information layer by layer, do some simple processing, and feed the results back to a backend server for centralized processing. We must understand the preferences of spiders in order to optimize a website well. The sketch below shows how a well-behaved spider consumes the rules above; after that, let's walk through how spiders work.
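Here is a minimal sketch, using Python's standard urllib.robotparser, of how a polite crawler checks these rules before fetching a URL; example.com is a placeholder domain. Note that Python's parser follows the original robots.txt specification, so wildcard rules such as Disallow: /category/*/page/* are matched literally as prefixes rather than as patterns.

    import urllib.robotparser

    # Load and parse the site's robots.txt (example.com is a placeholder)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A well-behaved spider checks every URL against the rules before fetching
    for url in ("https://example.com/page/2/",        # blocked by Disallow: /page/
                "https://example.com/tag/seo/",       # blocked by Disallow: /tag/
                "https://example.com/hello-world/"):  # not matched, so allowed
        verdict = "allowed" if rp.can_fetch("*", url) else "blocked"
        print(url, "->", verdict)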
2. Spiders and dynamic pages

Spiders run into trouble when processing dynamic pages, that is, pages generated automatically by programs. As the Internet has developed, more and more scripting languages are used for development, and more and more kinds of dynamic pages are built with languages such as JSP, ASP, and PHP. Spiders find it difficult to process pages generated by these scripting languages; to handle them fully, a spider would need script engines of its own, which is why optimizers always stress avoiding JS code as much as possible. When optimizing a website, remove unnecessary script code to make crawling easier and to avoid repeated crawling of the page, as the sketch below illustrates.
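To show why script-generated content causes trouble, here is a minimal sketch of what a basic spider actually sees: it parses the raw HTML source with Python's standard html.parser, so a link that only exists after a browser runs the JavaScript never appears. The page snippet is invented for illustration.

    from html.parser import HTMLParser

    # Raw HTML exactly as a simple spider receives it from the server; the
    # second link would only exist after a browser executes the script.
    page = """
    <a href="/static-page/">Static link</a>
    <script>
      document.write('<a href="/js-generated-page/">JS link</a>');
    </script>
    """

    class LinkCollector(HTMLParser):
        """Collects href values from <a> tags in the raw source."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href")

    collector = LinkCollector()
    collector.feed(page)
    print(collector.links)  # ['/static-page/'] -- the JS-generated link is invisible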
3. The spider's update cycle

Website content changes frequently, whether through content updates or template changes, and spiders must constantly recrawl pages to keep up. Spider developers set an update cycle for the crawler so that it rescans the site at a specified interval and compares pages to decide which ones need updating: whether the home page title has changed, which pages are new, which pages have become expired dead links, and so on. A strong search engine constantly tunes this update cycle, because it has a large impact on the engine's recall. If the cycle is too long, accuracy and completeness suffer and some newly created pages cannot be found in searches; if it is too short, the technical implementation becomes harder and bandwidth suffers, wasting server resources. The sketch below illustrates the idea.
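A minimal sketch of this update-cycle logic, assuming a simple single-site crawler: each page is refetched only after a fixed interval, and a content hash decides whether it actually changed. The fetch_page helper and the one-day interval are illustrative assumptions, not any real engine's values.

    import hashlib
    import time

    UPDATE_CYCLE = 24 * 3600        # assumed recrawl interval: once a day
    index = {}                      # url -> (content_hash, last_crawl_time)

    def fetch_page(url):
        """Placeholder for a real HTTP fetch; returns the page body as text."""
        raise NotImplementedError

    def recrawl(url):
        digest, last = index.get(url, (None, 0.0))
        if time.time() - last < UPDATE_CYCLE:
            return "too soon"       # still inside the update cycle, skip
        body = fetch_page(url)
        new_digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        index[url] = (new_digest, time.time())
        if digest is None:
            return "new page"       # first time this URL is seen
        return "updated" if new_digest != digest else "unchanged"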
4. Spider’s non-repeated crawling strategy
The number of pages on a website is very large, and crawling them is a huge undertaking: it consumes a great deal of bandwidth, hardware resources, time, and so on. If the same page is frequently crawled repeatedly, it not only greatly reduces the system's efficiency but also causes problems such as poor accuracy. Search engine systems therefore design a no-repeat crawling strategy, which ensures that the same page is crawled only once within a certain period of time; a sketch follows.
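A minimal sketch of such a no-repeat policy, assuming a simple single-process crawler: a frontier queue plus a map of last-crawl times ensures each URL is fetched at most once inside the window. The one-hour window is an assumed value.

    import time
    from collections import deque

    NO_REPEAT_WINDOW = 3600.0       # assumed: crawl each URL at most once per hour
    frontier = deque()              # URLs waiting to be fetched
    last_crawled = {}               # url -> timestamp of the last fetch

    def enqueue(url):
        """Queue a URL unless it is pending or was fetched inside the window."""
        if url in frontier:
            return False            # already queued, do not duplicate
        if time.time() - last_crawled.get(url, 0.0) < NO_REPEAT_WINDOW:
            return False            # crawled too recently, skip it
        frontier.append(url)
        return True

    def crawl_next(fetch):
        """Fetch the next queued URL and record when it was crawled."""
        if not frontier:
            return None
        url = frontier.popleft()
        last_crawled[url] = time.time()
        return fetch(url)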
That concludes this introduction to preventing repeated crawling of website pages.