The Internet keeps growing in appeal, and the popularity of the World Wide Web is at its peak. Publishing company information and conducting e-commerce on the Internet has evolved from a fashion into a necessity. As a Web master, you may know HTML, JavaScript, Java, and ActiveX well, but do you know what a Web Robot is? Do you know how Web Robots relate to the home pages you design?
Wanderers on the Internet --- Web Robot
Sometimes you will find, inexplicably, that the content of your home page has been indexed by a search engine, even though you have never had any contact with it. This is exactly what a Web Robot does. Web Robots are programs that can traverse the hypertext structure of large numbers of Internet URLs and recursively retrieve all the content of a website. These programs are sometimes called "spiders", "Web Wanderers", "web worms", or Web crawlers. Well-known search engine sites on the Internet, such as Lycos, Webcrawler, and Altavista, as well as Chinese search engine sites such as Polaris, NetEase, and GOYOYO, run specialized Web Robot programs to collect information.
A Web Robot is like an uninvited guest. Whether you care about it or not, it will faithfully carry out its master's instructions, working tirelessly across the World Wide Web. It will, of course, also visit your home page, retrieve its content, and generate whatever record format it needs. Perhaps you would like some of your home page content to be known to the world, while other content should never be seen or indexed. Should you simply let robots "run rampant" through your home page space, or can you direct and control their whereabouts? The answer, of course, is that you can. After reading the rest of this article, you will be able to act like a traffic officer, laying out road signs that tell Web Robots how to search your home page: which parts may be searched, and which may not be accessed.
In fact, Web Robots can understand your instructions.
Do not think that Web Robots roam around without organization or control. Most Web Robot software provides two methods by which site administrators or web content authors can restrict a robot's whereabouts:
1. Robots Exclusion Protocol
Administrators of Internet sites can create a specially formatted file on the site to indicate which parts of the site may be accessed by robots. This file is placed in the root directory of the site, i.e. http://.../robots.txt .
2. Robots META tag
A web page author can use a special HTML META tag to indicate whether a web page may be indexed, analyzed, or followed for links.
These methods work with most Web Robots, but whether they are actually implemented depends on each Robot's developer; they are not guaranteed to be effective for every Robot. If you urgently need to protect your content, you should consider additional protection methods, such as password protection.
Using Robots Exclusion Protocol
When a Robot visits a Web site, for example http://www.sti.net.cn/ , it first checks for the file http://www.sti.net.cn/robots.txt . If this file exists, the Robot analyzes it according to the following record format:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
to determine whether it should retrieve the site's files. These records are meant only for Web Robots; ordinary visitors will probably never see this file, so do not add HTML statements like <img src=*> or greetings such as "How do you do?" and "Where are you from?" to it.
There can be only one "/robots.txt" file on a site, and every letter of the file name must be lowercase. Each separate "Disallow" line in a robot record indicates a URL prefix that you do not want robots to access. Each URL must occupy its own line; malformed lines such as "Disallow: /cgi-bin/ /tmp/" must not appear. Also, blank lines must not appear within a record, because a blank line is the separator between multiple records.
The User-agent line gives the name of the Robot or other agent. On the User-agent line, '*' has a special meaning: all robots.
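To see how a robot actually interprets such a record, you can use Python's standard-library urllib.robotparser. This is just an illustrative sketch: the site URL and the bot name "AnyBot" are placeholders, not names from any real robot.

```python
from urllib.robotparser import RobotFileParser

# The example record from above, exactly as a robot would read it
record = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

rp = RobotFileParser()
rp.parse(record.splitlines())

# Any robot may fetch the home page...
print(rp.can_fetch("AnyBot", "http://www.sti.net.cn/index.html"))    # True
# ...but nothing under a disallowed prefix
print(rp.can_fetch("AnyBot", "http://www.sti.net.cn/cgi-bin/query")) # False
```

Note that "Disallow" lines are prefix matches: any URL whose path begins with a listed prefix is blocked for the matching User-agent.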
Here are a few robots.txt examples.
To deny all robots access to the entire server:
User-agent: *
Disallow: /
To allow all robots to access the entire site:
User-agent: *
Disallow:
Alternatively, simply create an empty "/robots.txt" file.
To close off parts of the server to all robots:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
To exclude one specific robot:
User-agent: BadBot
Disallow: /
To allow only one robot to visit:
User-agent: WebCrawler
Disallow:
User-agent: *
Disallow: /
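The last example is the subtlest of the four, so it is worth verifying against the same standard-library parser. A quick sketch (the URLs and the name "OtherBot" are placeholders):

```python
from urllib.robotparser import RobotFileParser

# "Only allow one robot": WebCrawler gets everything, everyone else gets nothing
rules = [
    "User-agent: WebCrawler",
    "Disallow:",            # empty Disallow means "allow everything"
    "",                     # blank line separates the two records
    "User-agent: *",
    "Disallow: /",          # all other robots are shut out entirely
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("WebCrawler", "http://example.com/page.html"))  # True
print(rp.can_fetch("OtherBot", "http://example.com/page.html"))    # False
```

A robot first looks for a record naming its own User-agent and falls back to the "*" record only if none matches, which is why the specific record must come before (or at least alongside) the catch-all.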
Finally, here is the robots.txt from the http://www.w3.org/ site:
# For use by search.w3.org
User-agent: W3Crobot/1
Disallow:
User-agent: *
Disallow: /Member/ # This is restricted to W3C Members only
Disallow: /member/ # This is restricted to W3C Members only
Disallow: /team/ # This is restricted to W3C Team only
Disallow: /TandS/Member # This is restricted to W3C Members only
Disallow: /TandS/Team # This is restricted to W3C Team only
Disallow: /Project
Disallow: /Systems
Disallow: /Web
Disallow: /Team
Using Robots META tag
The Robots META tag lets HTML authors indicate whether a page may be indexed and whether its links may be followed to find further files. Currently, only some robots implement this feature.
The format of the Robots META tag is:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Like other META tags, it should be placed in the HEAD area of the HTML file:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>
...
Robots META tag directives are separated by commas. The available directives are [NO]INDEX and [NO]FOLLOW. The INDEX directive indicates whether an indexing robot may index the page; the FOLLOW directive indicates whether a robot may follow the page's links. The defaults are INDEX and FOLLOW. For example:
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
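From a robot's point of view, honoring these tags means scanning the HEAD for a meta tag named "robots" and splitting its content on commas. A minimal sketch using Python's standard-library html.parser (the class name and the sample page are illustrative, not part of any real crawler):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag on a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)  # attribute names arrive lowercased
        if attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            # Split "noindex,nofollow" into individual directives
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]

page = """<html><head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>Example</title>
</head><body>...</body></html>"""

parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # ['noindex', 'nofollow']
```

A well-behaved indexing robot would then skip indexing when "noindex" appears in the list and skip the page's links when "nofollow" does.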
A good Web site administrator takes robot management into account, so that robots serve the site's pages without compromising their security.