Many webmasters look for a way to keep spiders from crawling certain pages on their websites, and they usually do this with the robots.txt file. While this is indeed good practice, it also creates a problem: confusion over what robots.txt actually does when you block Google, Yahoo!, MSN, or other search engine spiders. Here is a brief explanation of the three options; a short example of each follows the list:
Blocking with robots.txt: the spider will not fetch the URL, but the URL itself can still be indexed and appear in search engine results pages.
Blocking with a meta noindex tag: the page can still be crawled, but it will not be listed in the search results.
Blocking by adding nofollow to the links on the page: this is not a very smart move, because other links elsewhere can still lead spiders to the page and get it indexed. (You can do this if you do not mind, or if you want to conserve crawl time, but do not expect it to keep the page off the search engine results pages.)
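To make the three options concrete, here is a minimal sketch of each; the /private/ directory and the example URL are placeholders for illustration, not paths taken from this article:

    # robots.txt: forbid crawling of a directory (the URL can still show up in results)
    User-agent: *
    Disallow: /private/

    <!-- meta tag in the page's head: the page can be crawled but will not be listed -->
    <meta name="robots" content="noindex">

    <!-- nofollow on an individual link: asks spiders not to follow this one link -->
    <a href="http://www.example.com/private/page.html" rel="nofollow">a page we do not want followed</a>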
Here is a simple example. Although crawling is restricted in robots.txt, the page will still appear in Google's search results.
(robots.txt files are also valid for subdomains)
We can see that the /library/nosearch/ directory of about.com has been blocked. The following figure shows the results when we search Google for URL addresses in this directory:
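The article does not quote about.com's robots.txt itself, but a rule blocking such a directory would typically look like this sketch:

    User-agent: *
    Disallow: /library/nosearch/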
Notice that Google still lists 2,760 results for these blocked URLs. It did not crawl the pages, so all it shows is a bare link address with no description and no title, because Google could not see their content.
Now imagine you have a large number of pages that you do not want search engines to crawl. These URLs will still accumulate link equity and other ranking signals, but because spiders cannot crawl them, the links flowing out of those pages can never be seen or followed. See the image below:
Here are two convenient methods:
1. Conserve this link equity by adding nofollow to links that point to directories blocked in robots.txt (see the sketch after this list).
2. If you know that these blocked pages receive steady link equity (especially from external links), consider removing them from robots.txt and using a meta noindex,follow tag instead, so the pages stay out of the results while the link equity passes through them, and the spider's time is freed up to retrieve the pages on your site that actually need it.
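As a rough sketch of both remedies (the /nosearch/ paths below are placeholders):

    <!-- Remedy 1: keep the directory blocked in robots.txt, and do not pass link equity to it -->
    <a href="/nosearch/page.html" rel="nofollow">blocked page</a>

    <!-- Remedy 2: remove the page from robots.txt so it can be crawled; the tag below, placed in the
         page's head, keeps it out of the results while still letting its outgoing links be followed -->
    <meta name="robots" content="noindex,follow">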
This article comes from Reamo's personal SEO and online promotion blog: http://www.aisxin.cn. Please indicate the source when reprinting.