The latest discovery: Baidu Spider is a fool! Recently I noticed that Baidu was indexing my website very slowly: the homepage snapshot only refreshed every few days, and the other pages were hardly indexed at all. Depressing! Really depressing! So I opened the site's IIS log, checked what Baidu Spider had been doing, and was shocked. I had made a major discovery: Baidu Spider really is a fool!
1. First, let's see just how foolish Baidu Spider is. Below is the record of its activity on my website.
1. 2009-06-03 21:26:05 W3SVC962713505 218.60.130.19 GET /robots.txt - 80 - 123.125.64.15 Baiduspider+(+http://www.baidu.com/search/spider.htm) 404 0 64 (Note: 404 means robots.txt was not found)
2. 2009-06-03 21:26:49 W3SVC962713505 218.60.130.19 GET /index.asp - 80 - 123.125.64.15 Baiduspider+(+http://www.baidu.com/search/spider.htm) 200 0 64 (Note: 200 means the homepage file index.asp was fetched successfully)
From this you can see what Baidu Spider did: it first asked the site for robots.txt, and since that file did not exist, it fetched the homepage index.asp, compared it with the copy already in Baidu's index, found that nothing had changed, and left. Like most webmasters, who doesn't want Baidu to refresh the snapshots of their indexed pages regularly? It seems the only way is to write a proper robots.txt and use it to lead Baidu Spider all around my site.
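The two log entries above can be checked mechanically. Here is a small sketch that scans IIS W3C log lines for Baiduspider visits and pulls out the requested path and status code (the field positions assume the exact log layout shown above):

```python
# Scan IIS W3C log lines for Baiduspider visits (a sketch; field
# positions assume the exact log layout shown above).
log_lines = [
    "2009-06-03 21:26:05 W3SVC962713505 218.60.130.19 GET /robots.txt - 80 - 123.125.64.15 Baiduspider+(+http://www.baidu.com/search/spider.htm) 404 0 64",
    "2009-06-03 21:26:49 W3SVC962713505 218.60.130.19 GET /index.asp - 80 - 123.125.64.15 Baiduspider+(+http://www.baidu.com/search/spider.htm) 200 0 64",
]

hits = []
for line in log_lines:
    fields = line.split()
    if any("Baiduspider" in f for f in fields):
        # field 5 is the requested path, third-from-last is the HTTP status
        hits.append((fields[5], fields[-3]))

print(hits)  # [('/robots.txt', '404'), ('/index.asp', '200')]
```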
2. Write a robots.txt and take Baidu on a tour of your site.
The robots.txt file really must be written. Do you all know exactly how to write one? If not, let me go over it again.
Example 1. Disable all search engines from accessing any part of the website
User-agent: *
Disallow: /
Example 2. Allow all robots to access
(Or you can also create an empty file "/robots.txt")
User-agent: *
Disallow:
or
User-agent: *
Allow: /
(My note: this step is necessary. Do not just create an empty file; that is only fobbing Baidu off. It is best to write the rules out explicitly as shown above.)
Example 3. Only ban Baiduspider from accessing your website
User-agent: Baiduspider
Disallow: /
Example 4. Only allow Baiduspider to access your website
User-agent: Baiduspider
Disallow:
User-agent: *
Disallow: /
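You can sanity-check a ruleset like Example 4 with Python's standard urllib.robotparser before uploading it (a sketch; a real crawler's matcher may differ in the details):

```python
from urllib import robotparser

# Example 4's rules: only Baiduspider may crawl; everyone else is shut out.
rules = """\
User-agent: Baiduspider
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Baiduspider", "http://example.com/index.asp"))  # True
print(rp.can_fetch("Googlebot", "http://example.com/index.asp"))    # False
```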
Example 5. Prohibit spiders from accessing specific directories
In this example, the website has three directories that search engines may not access. Note that each directory must be declared on its own line; they cannot be combined as "Disallow: /cgi-bin/ /tmp/".
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
Example 6. Allow access to some URLs in a specific directory
User-agent: *
Allow: /cgi-bin/see
Allow: /tmp/hi
Allow: /~joe/look
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
Example 7. Use "*" to restrict access to URLs
This blocks every URL with the ".htm" suffix under the /cgi-bin/ directory (including subdirectories).
User-agent: *
Disallow: /cgi-bin/*.htm
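Note that Python's standard robotparser ignores the "*" and "$" wildcards, so to test Example 7 you need a hand-rolled matcher. Here is a minimal sketch (my own helper, not any official API) that translates such a pattern into a regular expression:

```python
import re

def robots_pattern_to_regex(pattern):
    # A trailing "$" anchors the match at the end of the URL;
    # "*" matches any run of characters; everything else is literal.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile(body + ("$" if anchored else ""))

blocked = robots_pattern_to_regex("/cgi-bin/*.htm")
print(bool(blocked.match("/cgi-bin/page.htm")))      # True
print(bool(blocked.match("/cgi-bin/sub/page.htm")))  # True
print(bool(blocked.match("/index.htm")))             # False
```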
Example 8. Use "$" to restrict access to URLs
Only URLs with the ".htm" suffix are allowed to be accessed.
User-agent: *
Allow: .htm$
Disallow: /
Example 9. Disable access to all dynamic pages in the website
User-agent: *
Disallow: /*?*
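Read as a wildcard pattern, "/*?*" means: a "/", then anything, then a literal "?", then anything, i.e. any URL whose path carries a query string. A quick check with a plain regular expression (again, the stdlib robotparser would ignore the wildcards):

```python
import re

# "/*?*" as a regex: "/", anything, a literal "?", anything.
dynamic = re.compile(r"/.*\?.*")

print(bool(dynamic.match("/news.asp?id=123")))  # True
print(bool(dynamic.match("/news/123.html")))    # False
```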
Example 10. Prohibit Baiduspider from crawling all images on the website
Only web pages may be crawled; images may not.
User-agent: Baiduspider
Disallow: .jpg$
Disallow: .jpeg$
Disallow: .gif$
Disallow: .png$
Disallow: .bmp$
Example 11. Only allow Baiduspider to crawl web pages and .gif format images
Web pages and images in .gif format may be crawled, but images in other formats may not.
User-agent: Baiduspider
Allow: .gif$
Disallow: .jpg$
Disallow: .jpeg$
Disallow: .png$
Disallow: .bmp$
Example 12. Only prohibit Baiduspider from grabbing .jpg format images
User-agent: Baiduspider
Disallow: .jpg$
Now take a look at the robots.txt I wrote for my own site, for your reference:
User-agent: *
Disallow: /admin/
Disallow: /Soft/
Allow: /images/
Allow: /html/
Allow: .htm$
Allow: .php$
Allow: .asp$
Allow: .gif$
Allow: .jpg$
Allow: .jpeg$
Allow: .png$
Allow: .bmp$
Allow: /
Explanation:
1. Allow all search engines to index the site
2. Block the /admin/ directory; it is the website's back end, so of course it is off limits
3. Block sensitive directories such as /Soft/ (note that robots.txt paths are case-sensitive)
4. Allow access to the /images/ directory
5. Allow access to the /html/ directory
6. Allow access to all .htm, .php, and .asp files
7. Allow crawling of images in gif, jpg, jpeg, png, and bmp formats
8. Allow crawling of files in the website's root directory
OK, now upload robots.txt to your website's root directory and wait for Baidu Spider to come again. When it does, this good guide will lead that fool all around your site. This article was collected and published by the MOFHOT foreign trade clothing wholesale network, www.mofhot.com. Please keep the link when reprinting on A5. Thank you; publishing an article is not easy.