As you know, you can't always rely on search engine spiders to behave efficiently when crawling and indexing your site. Left entirely to their own devices, spiders can generate a lot of duplicate content, treat important pages as junk, index URLs that should never be shown to users, and cause other problems. Fortunately, there are tools that let us control spiders' activity within a website, such as the meta robots tag, robots.txt, the canonical tag, and so on.
Today I will talk about the limitations of these robot control techniques. To keep spiders away from a certain page, webmasters sometimes stack multiple robot control techniques on top of each other. Unfortunately, these techniques can conflict with one another; worse, one restriction can prevent the search engine from ever seeing another.
So, what happens when a page is blocked by robots.txt and at the same time carries a noindex tag or a canonical tag?
Quick review
Before we get into the topic, let's review the main robot-restriction techniques:
Meta robots tag
The meta robots tag gives search engine robots page-level instructions about indexing and link following. It should be placed in the head of the HTML document.
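For example, a page that should stay out of the index while its links may still be followed would carry the standard "noindex, follow" directives inside the head section. The code is like this:
<meta name="robots" content="noindex, follow" />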
Canonical tag
The canonical tag is a page-level meta tag placed in the HTML header of a web page. It tells search engines which URL is the preferred, canonical version. Its purpose is to keep duplicate content out of the index while consolidating the ranking weight of the duplicate pages onto the canonical page.
The code is like this:
<link rel="canonical" href="http://example.com/quality-wrenches.htm" />
X-Robots-Tag
Since 2007, Google and other search engines have supported the X-Robots-Tag, an HTTP response header that gives spiders crawling and indexing directives. It is especially useful for controlling the indexing of non-HTML files, such as PDF files, where a meta tag cannot be used.
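As an illustration, on an Apache server with mod_headers enabled (this is one common setup, not the only way), you could send the header for every PDF like this:
<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
The PDF response then includes the header line X-Robots-Tag: noindex, which spiders treat just like a meta robots "noindex".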
Robots.txt
robots.txt tells search engines which parts of a website they may access, but it does not guarantee that a specific page will not be crawled and indexed. It is really only worth using when something on the site genuinely needs to be blocked from robots; for keeping individual pages out of the index, I always recommend the meta robots "noindex" tag instead.
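For reference, a minimal robots.txt that blocks all robots from a directory (the /private/ path is hypothetical) looks like this:
User-agent: *
Disallow: /private/
Note that this only asks spiders not to fetch those URLs; as discussed below, the URLs themselves can still end up in the index.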
Avoiding conflicts
It is unwise to restrict robot access with two of the following methods at the same time:
· Meta Robots 'noindex'
· Canonical Tag (when pointing to a different URL)
· Robots.txt Disallow
· X-Robots-Tag
However much you want to keep a page out of the search results, one approach is always better than two. Let's look at what happens when multiple robot control techniques are applied to a single URL.
Meta Robots 'noindex' and Canonical tags
If your goal is to pass one URL's authority to another URL and you have no better way to do it, then use only the canonical tag. Don't make trouble for yourself by adding the meta robots "noindex" as well. If you use both techniques, search engines may never see your canonical tag at all: the "noindex" keeps the robot from acting on the canonical tag, so the intended weight transfer is lost!
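In other words, a combination like the following (the URL is hypothetical) works against itself, because the "noindex" stops the engine before the canonical hint can take effect:
<meta name="robots" content="noindex" />
<link rel="canonical" href="http://example.com/preferred-page.htm" />
Drop the first line and keep only the canonical tag if consolidating weight is the goal.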
Meta Robots 'noindex' & X-Robots-Tag 'noindex'
These tags are redundant. When both are placed on the same page, I can see only negative effects on SEO. If you can add "noindex" via the meta robots tag in the HTML, you should not also send the X-Robots-Tag header.
Robots.txt Disallow & Meta Robots 'noindex'
This is the most common conflict I've seen:
The reason I prefer the meta robots "noindex" is that it effectively keeps a page out of the index while still passing weight through to the deeper pages it links to. That is a win-win approach. A robots.txt Disallow, by contrast, completely prevents search engines from seeing the content of the page (and the valuable internal links on it), yet it does not prevent the URL itself from being indexed. What good is that? I once wrote a separate article on this topic.
If both are used, the robots.txt Disallow guarantees that spiders never see the Meta Robots 'noindex'. You suffer all the drawbacks of the Disallow and miss out on all the benefits of 'noindex'.
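To make the failure concrete, consider this hypothetical pair. In robots.txt:
User-agent: *
Disallow: /old-page.htm
And on /old-page.htm itself:
<meta name="robots" content="noindex" />
The spider is turned away by the Disallow before it ever downloads the HTML, so the "noindex" is never read, and the URL can still show up in the index as a bare, contentless listing.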
Article source: www.leadseo.cn (Shanghai Leadseo, website optimization experts). Please retain the source when reprinting. Thank you!
Editor in charge: Chen Long. Author: frank12.