Most webmasters have at least heard of the robots.txt file, and many have written one themselves. I actually haven't written one myself so far. It's not that I can't; I simply feel there is nothing on this blog that needs to be kept away from spiders, and the odds of dead links on a small personal blog are low enough that they don't need special handling. Still, knowing how to write a robots.txt file is a skill every independent webmaster should master, and its uses are quite broad. Here is a detailed introduction, which also serves as a review for myself.
What is a robots.txt file
As the file name suggests, it has a .txt extension, so it is a plain text file that you can open in Notepad. "Robots" here means search engine robots, that is, spiders. As the name implies, this file is written specifically for spiders to read. Its job is to tell a spider which directories or pages do not need to be crawled, and it can also block a particular spider's access outright. Note that the file must be placed in the root directory of the website, so that a spider can read its contents as early as possible.
The role of the robots file
In practice, the most common use of the robots file is to block dead links within a website. Everyone should know that too many dead links hurt a site's weight. Cleaning up dead links is not difficult, but it takes time, and if a site has many of them it becomes laborious. This is where the robots file proves useful: we can write those dead links into the file in the proper format to keep spiders from crawling them, and clean them up gradually later. The file can also directly block URLs or files that a webmaster does not want spiders to crawl. Blocking spiders themselves is, by comparison, a less common use.
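For example, supposing the dead links are /old-post.html and /archive/2010.html (hypothetical paths used purely for illustration), the entries written into robots.txt would look like this:

User-agent: *
Disallow: /old-post.html
Disallow: /archive/2010.html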
How to write a robots file
This part matters most. If you get the syntax wrong, a URL you meant to block may still be crawled, or a page you wanted crawled may be blocked without your noticing it in time, and either mistake can cost you dearly. First, we need to know the two tags, Allow and Disallow: one permits crawling and the other forbids it. Their functions are easy to understand.
User-agent: *
Disallow:
or
User-agent: *
Allow:
Both of these snippets mean that everything is allowed to be crawled. In practice, Disallow is the tag used to block URLs and files; Allow is mainly worth using when only a few parts of your site should be crawled. The spider name goes after User-agent:. Everyone should be familiar with the mainstream search engine spider names; let's take the Soso spider, Sosospider, as an example.
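As a minimal sketch of that Allow-based approach, assuming (purely for illustration) that only a /blog/ directory should be open to crawling, the file might read as follows. Note that Allow is an extension honored by the major engines rather than part of the original robots.txt standard, and engines such as Google give the longer, more specific Allow rule precedence over the blanket Disallow:

User-agent: *
Allow: /blog/
Disallow: /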
When we want to block the Soso spider:
User-agent: sosospider
Disallow: /
Notice that, compared with the allow-everything example above, this blocking rule differs only by a single "/", yet the meaning changes completely. So be careful when writing: one stray slash can block a spider from your entire site without your realizing it. Also note that in the User-agent: line, a "*" in place of a spider name means the rule applies to all spiders.
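Putting the two together, a sketch of a file that shuts out only the Soso spider while leaving all other spiders unrestricted would be:

User-agent: sosospider
Disallow: /

User-agent: *
Disallow: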
To prohibit a directory from being crawled by search engine spiders, the setting code is as follows:
User-agent: *
Disallow: /directory/
Note the "/" at the end of the directory name when blocking a directory. Without the trailing "/", the rule blocks the directory page itself as well as the pages under it; with the trailing "/", it blocks only the content pages inside that directory. These two cases must be clearly distinguished (see the sketch after the next snippet). If you want to block multiple directories, write them like this:
User-agent: *
Disallow: /directory1/
Disallow: /directory2/
Each directory needs its own Disallow line; they cannot be merged into the form /directory1/directory2/, which would instead refer to a single subdirectory path.
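To make the trailing-slash distinction concrete, here is a sketch using a hypothetical /images directory; the two rules appear together only for contrast, and in a real file you would pick one:

User-agent: *
# Without the slash: blocks /images itself plus any URL whose path begins with /images
Disallow: /images
# With the slash: blocks only pages under the directory, such as /images/photo.jpg
Disallow: /images/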
If you want to prevent spiders from accessing a certain type of file, for example to block crawling of images in .jpg format, you can set:
User-agent: *
Disallow: /*.jpg$
(The * and $ wildcards are an extension supported by the major search engines rather than part of the original standard; * matches any sequence of characters, $ anchors the end of the URL, and the pattern should begin with a /.)
The above is Shanghai SEO Xiaoma's summary of how to write a robots file for an entire site. It covers the main rule types and the precautions to take, and says less about more targeted writing such as blocking specific spiders; but once you understand the meaning of Allow and Disallow, you can derive many other rules by thinking them through. There is also the robots meta tag, which is written into individual web pages, though it is not used much in general.
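As a minimal illustration of that meta tag approach, a page that should be neither indexed nor have its links followed would carry this standard tag in its <head> section:

<meta name="robots" content="noindex, nofollow">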
The above was compiled by Shanghai SEO Xiaoma, http://www.mjlseo.com/. Please credit the source when reprinting. Thank you.