Browse the FTP of many websites and you will find a robots.txt file. Many webmasters know only that it is a file that restricts spider access. Does it do anything else? Let's find out together.
What is a robots file? It is a communication bridge between search engines and websites: a plain-text file whose syntax both sides have agreed on. Every time a search engine crawls a website, it checks this file first, like a key to the front door. If the file does not exist, crawling of the site is unrestricted; if it does exist, the spider crawls according to the rules the file lays out.
Some webmasters may ask: when we build a website, we obviously want search engines to index it, so why restrict crawling? During a crawl, a search engine works through the entire site, and your site may contain scraped content, or near-duplicate pages with no substantial value; once the search engine crawls those, its evaluation of your site drops sharply, and the SEO effect is lost. The robots file lets you tell the spider which pages you do not want it to visit, and as a bonus it indirectly reduces the load on your server.
There are several things to note about this file:
1. The file name must be spelled exactly, in all lowercase, and the extension must be .txt.
2. The file must be placed in the root directory of the website, for example http://www.taofengyu.com/robots.txt, and it must be accessible at that address (the short test script after this list shows one way to check).
3. The syntax of the file's content must be correct. The two most commonly used directives are User-agent and Disallow:
User-agent: * means the rules that follow apply to every search engine spider. If you do not want Baidu to index your site, replace * with "baiduspider", and whatever Disallow then lists will not be crawled or indexed by Baidu's spider. To block the entire site, write "Disallow: /". To block everything inside a folder, write "Disallow: /admin/". To block every path that merely starts with admin, write "Disallow: /admin", and so on. To block one specific file, say the index.htm file in the admin folder, write "Disallow: /admin/index.htm". If nothing follows Disallow (no "/" at all), every page of the site may be crawled and indexed. A complete sample file follows this list.
Generally, there is no need for spiders to crawl back-end management files, program files, database files, style sheets, template files, some of the site's images, and so on.
4. The Disallow directive must appear in the robots file; it is an essential part of a valid file.
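To pull the rules in point 3 together, here is a minimal sample robots.txt. It is only a sketch: the folder and file names (admin, data, index.htm) are illustrative assumptions, not paths from any real site. Lines starting with # are comments:

    User-agent: *
    # block everything inside the admin folder
    Disallow: /admin/
    # block every path that starts with /data, files and folders alike
    Disallow: /data
    # block one specific file
    Disallow: /admin/index.htm

    # rules for Baidu's spider only: shut it out of the whole site
    User-agent: baiduspider
    Disallow: /

Each User-agent line opens a group of rules for the named spider, and each Disallow line beneath it names a path prefix that spider must skip.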
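If you want to check your rules before relying on them, Python's standard library includes a robots.txt parser. This is a minimal sketch, assuming Python 3 is available and using this site's own URL purely as an example:

    from urllib import robotparser

    # fetch and parse the live robots.txt (it must sit in the site root)
    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.taofengyu.com/robots.txt")
    rp.read()

    # ask whether a given spider may fetch a given URL
    print(rp.can_fetch("*", "http://www.taofengyu.com/admin/index.htm"))  # False if /admin/ is disallowed
    print(rp.can_fetch("*", "http://www.taofengyu.com/"))                 # True if the home page is open

If can_fetch returns the answers you expect, well-behaved spiders should read the file the same way.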
That's all on why robots files matter and how to write them. I hope it is of some use to everyone.
This article comes from Taofengyu Student Supplies Network (http://www.taofengyu.com/). Please credit the source when reprinting and respect the author's work.