I promised Ah Bin an article a long time ago as thanks for his help, but I never got around to writing it until now. A few days ago I saw Zhuo Shao ask a question about robots.txt, so I put this together for everyone. Let's talk about robots.txt. The robots.txt file sits in the root directory of a website and is the first file a search engine looks at when it visits the site. When a search spider arrives at a site, it first checks whether robots.txt exists in the root directory. If it does, the spider determines its crawling scope from the contents of the file; if the file does not exist, the spider can access every page on the site that is not password protected. Every website should have a robots.txt file: it tells search engines which parts of the site must not be crawled and which pages are welcome to be crawled and indexed.
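As a minimal sketch before we get into the individual uses (the domain here is only a placeholder for your own), a robots.txt that lets every crawler access everything would live at http://www.example.com/robots.txt and contain nothing more than an empty Disallow rule:
User-agent: *
Disallow: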
Several uses of robots.txt:
1. Block all search engines from crawling the site. If your website is purely private and you don't want many people to know about it, such as a personal blog you write only for yourself, you can use robots.txt to block all search engines:
User-agent: *
Disallow: /
2. If you only want one particular search engine to crawl your site, robots.txt can handle that as well. For example, if I only want my website included in Baidu and not in any other search engine, I can set it like this:
User-agent: Baiduspider
Allow: /
User-agent: *
Disallow: /
3. You can use wildcards to configure the site in finer detail. For example, if I don't want search engines to crawl any of my images, I can use $ to anchor the match at the end of the URL. The common image formats are BMP, JPG, GIF and JPEG, so the settings would be:
User-agent: *
Disallow: /*.bmp$
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.jpeg$
4. You can also use the * wildcard to block related URLs. When a site does not want search engines to crawl its dynamic addresses, this wildcard can be used for pattern matching. Dynamic URLs typically contain a "?" (for example, /news.php?id=100), so we can match on that character to block them:
User-agent: *
Disallow: /*?*
5. If the website has been redesigned and an entire folder no longer exists, you should consider blocking the whole folder with robots.txt. For example, if the ab folder was deleted during the redesign, it can be set like this:
User-agent: *
Disallow: /ab/
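One small detail worth noting here (still using the hypothetical ab folder from above): the trailing slash matters. Disallow: /ab/ only blocks URLs inside that folder, while Disallow: /ab would also block any URL that merely starts with /ab, such as /about.html:
User-agent: *
# Blocks /ab/page.html but not /about.html
Disallow: /ab/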
6. If there is a folder on the site that you do not want included, but some content inside it should still be indexed, you can combine Disallow with Allow. For example, the ab folder on my website is not allowed to be crawled by search engines, but the cd directory inside ab may be crawled. In that case robots.txt can be set like this:
User-agent: *
Disallow: /ab/
Allow: /ab/cd
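A note on rule order (a hedged aside, since parsers differ): modern crawlers such as Googlebot pick the most specific matching rule regardless of where it appears, but older parsers may read rules top to bottom, so the same pair is often written with the Allow line first to be safe:
User-agent: *
Allow: /ab/cd
Disallow: /ab/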
7. The location of the sitemap can be declared in robots.txt, which helps the website get indexed:
Sitemap: <sitemap location>
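For example (the URL below is only a placeholder for your own sitemap address):
Sitemap: http://www.example.com/sitemap.xml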
8. Sometimes you will find that even though robots.txt is set up, a blocked URL still appears in the index. The reason is that the search engine's spider discovered the URL through links pointing to it and lists the address itself. Google generally lists such URLs without a title or description, while Baidu may show them with a title and description, which is why many people say their robots.txt "has no effect". What actually happens is that the link is indexed but the content of the page is not crawled.
The homepage of a website carries the highest weight, and weight is passed along through links. We set up robots.txt so that more of that weight flows to the pages that need it, while the pages that do not need to be crawled and indexed by search engines are blocked.
Editor in charge: Chen Long. Author: Shitou Peng's personal space.