When pages reachable through different URLs share largely the same content, the phenomenon is called "duplicate content." If a site carries a lot of duplicate content, search engines will conclude that its overall value is low, so we should try to avoid duplicate content of every kind.
On dynamic websites, duplicate content is often caused by URL parameters, and URL rewriting actually makes the problem worse (which is rather ironic, haha). With the original parameterized URLs, a search engine can often work out that the duplication comes from URL parameters and handle it automatically; URL rewriting hides those parameters, so the search engine can no longer recognize them. For example:
Original URL:
http://www.freeflying.com/articles.aspx?id=231&catelog=blog
http://www.freeflying.com/articles.aspx?id=231&catelog=news
URL after URL rewriting:
http://www.freeflying.com/blog/231.html
http://www.freeflying.com/news/231.html
The pages these URLs point to are actually identical: both show the article with id=231, which happens to be referenced by both the blog column and the news column. For various reasons, the final URLs still end up as shown above.
There are two ways to deal with this. One is to use the robots protocol to "exclude" one of the URLs; the other is to permanently redirect one URL to the other with a 301 redirect (a sketch of the 301 approach follows below).
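As a quick aside, here is a minimal sketch of the 301 approach, assuming the /news/ version should redirect to the canonical /blog/ version; the paths are only illustrative, and since classic ASP.NET has no built-in permanent redirect, the 301 is written by hand:
code
void Application_BeginRequest(object sender, EventArgs e)
{
    HttpContext context = HttpContext.Current;
    string path = context.Request.Url.LocalPath.ToLower();

    // Hypothetical rule: /news/231.html duplicates /blog/231.html,
    // so send a permanent redirect to the canonical URL.
    if (path == "/news/231.html")
    {
        context.Response.StatusCode = 301;
        context.Response.Status = "301 Moved Permanently";
        context.Response.AddHeader("Location", "http://www.freeflying.com/blog/231.html");
        context.Response.End();
    }
}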
Today we will talk about the robots protocol first. Simply put, "robot" refers to a search engine crawler; for Google we also call it a "spider". Spiders are quite polite: before crawling your content, they first ask for your opinion, and that conversation takes place through the robots protocol. In terms of implementation, there are two ways:
1. Add a robots.txt text to the root directory of the website, such as:
#static content, forbid all the pages under the "Admin" folder
User-agent: *
Disallow: /Admin
The line starting with # is a comment;
User-agent specifies the search engine; * means the rule applies to all search engines, and you can also target a specific one, e.g. User-agent: googlebot;
Disallow specifies the directories or pages that must not be accessed. Note: 1. this text is case-sensitive; 2. the value must start with "/", which denotes the website root directory;
In keeping with the purpose of this series, we focus on ASP.NET technology here; for more notes on robots.txt itself, see http://www.googlechinawebmaster.com/2008/03/robotstxt.html
But how do we generate this file dynamically (and there are actually quite a few scenarios that call for it)? The first thing that comes to mind may be I/O: writing a txt file into the root directory... but there is another way: use a generic handler (an .ashx file). The code is as follows:
code
<%@ WebHandler Language="C#" Class="Handler" %>

using System;
using System.Web;

public class Handler : IHttpHandler {

    public void ProcessRequest(HttpContext context) {
        HttpResponse response = context.Response;
        response.Clear();

        //response.ContentType = "text/plain"; for some unknown reason, IE6 cannot display the page if this line is present.

        //In real use, the following two lines should be generated dynamically from the database.
        response.Write("User-agent: * \n");
        response.Write("Disallow: /news/231.html \n");

        //Append the content of a static robots file, which holds the blocked entries that never change.
        response.WriteFile("~/static-robots.txt");
        response.Flush();
    }

    public bool IsReusable {
        get {
            return false;
        }
    }
}
A generic handler implements IHttpHandler. In the earlier UrlRewrite section we talked about HttpModule; in fact, the ASP.NET application life cycle has the concept of a "pipeline": an HTTP request passes through the "filtering/processing" of one HttpModule after another and finally reaches the "processor" part, an HttpHandler. The HttpModules and the HttpHandler together form this "pipeline".
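To make the pipeline idea concrete, here is a minimal sketch of a custom HttpModule; the class name LoggingModule and the response header are purely illustrative, and the module would still need to be registered in web.config:
code
using System;
using System.Web;

// A minimal HttpModule: it sits in the pipeline and "filters" every request
// before any HttpHandler (Page, ashx, ...) gets to process it.
public class LoggingModule : IHttpModule
{
    public void Init(HttpApplication application)
    {
        // Hook the BeginRequest stage of the pipeline.
        application.BeginRequest += new EventHandler(OnBeginRequest);
    }

    private void OnBeginRequest(object sender, EventArgs e)
    {
        HttpApplication application = (HttpApplication)sender;
        // For illustration only: stamp a response header with the requested path.
        application.Context.Response.AppendHeader("X-Requested-Path",
            application.Context.Request.Url.LocalPath);
    }

    public void Dispose() { }
}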
If this is unfamiliar, look at the source of Page and you will find that Page also implements IHttpHandler, so *.aspx files are the most commonly used HttpHandler. But a Page is not just an HttpHandler; it also carries the complex page life cycle events, so to save resources I often use a custom, more lightweight *.ashx file to get simple jobs done. Besides generating a txt file as above, we can also use it to generate CAPTCHA images (jpg files), xml files, and so on.
The next thing to do is the URL rewrite itself:
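As an example of such a lightweight task, here is a sketch of an .ashx handler that emits a small XML document; the element names and values are purely illustrative, and in practice they would be generated from the database:
code
<%@ WebHandler Language="C#" Class="XmlHandler" %>

using System;
using System.Web;

public class XmlHandler : IHttpHandler {

    public void ProcessRequest(HttpContext context) {
        HttpResponse response = context.Response;
        response.Clear();
        response.ContentType = "text/xml";

        // Illustrative content only; real data would be read from the database.
        response.Write("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n");
        response.Write("<articles>\n");
        response.Write("  <article id=\"231\" catelog=\"blog\" />\n");
        response.Write("</articles>");
        response.Flush();
    }

    public bool IsReusable {
        get { return false; }
    }
}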
code
void Application_BeginRequest(object sender, EventArgs e)
{
    // Runs at the beginning of every request (not at application startup)
    HttpContext context = HttpContext.Current;
    string currentLocation = context.Request.Url.LocalPath;
    if (currentLocation.ToLower() == "/website1/robots.txt")
    {
        context.RewritePath("~/Handler.ashx");
    }
}
In this way, the spider will think that there is indeed a robots.txt file in the root directory of the website.
2. Add the META tag to the page that needs to be blocked.
<meta id="meta" name="robots" content="noindex,nofollow" />
noindex means the page must not be indexed;
nofollow means the links on the page must not be "followed" (this will be explained in detail in the SEO Hack article).
That is what it looks like on a static page. If you need to generate it dynamically, it is quite simple:
code
protected void Page_Load(object sender, EventArgs e)
{
    HtmlMeta meta = new HtmlMeta();
    meta.Name = "robots";
    meta.Content = "noindex,nofollow";
    this.Header.Controls.Add(meta);
}
Description, keywords, and so on can also be set through meta tags, and the technique is exactly the same (a sketch follows below).
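For instance, here is a minimal sketch of setting a per-article description and keywords in Page_Load; the content strings are placeholders for whatever your data layer provides (e.g. a hypothetical article.Summary and article.Tags):
code
protected void Page_Load(object sender, EventArgs e)
{
    // Hypothetical: look up the article from the database using the id parameter,
    // then fill the meta tags with article-specific text.
    HtmlMeta description = new HtmlMeta();
    description.Name = "description";
    description.Content = "A short, article-specific summary goes here"; // e.g. article.Summary

    HtmlMeta keywords = new HtmlMeta();
    keywords.Name = "keywords";
    keywords.Content = "asp.net, seo, robots";                           // e.g. article.Tags

    this.Header.Controls.Add(description);
    this.Header.Controls.Add(keywords);
}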
So, how do we choose between the two methods? Some of my suggestions:
1. Prefer robots.txt, because it reduces the load on the website (only slightly, haha): once the spider has read robots.txt it no longer requests the blocked pages. With the meta approach, the spider must first request the page and only then decide not to index it, so the HTTP request has already been made and server-side resources have already been spent. Moreover, if too many pages are blocked via meta, the spider may form a poor impression of the site and reduce or even abandon crawling and indexing it;
2. robots.txt entries are matched as prefixes from left to right; there is no regular-expression matching. So sometimes we have no choice but to use the meta approach, for example with the URLs from the beginning of this article (see the sketch after this list):
http://www.freeflying.com/blog/231.html
http://www.freeflying.com/news/231.html
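In that case, one option is to decide at run time whether the page was reached through its canonical column and emit noindex,nofollow otherwise. A minimal sketch, assuming a hypothetical rewriting rule that still passes a catelog value to the page and treating "blog" as the canonical column:
code
protected void Page_Load(object sender, EventArgs e)
{
    // Hypothetical: the rewriting layer passes the column name in the query string.
    string catelog = Request.QueryString["catelog"];

    // Only the "blog" version is treated as canonical; block the duplicates.
    if (!string.Equals(catelog, "blog", StringComparison.OrdinalIgnoreCase))
    {
        HtmlMeta meta = new HtmlMeta();
        meta.Name = "robots";
        meta.Content = "noindex,nofollow";
        this.Header.Controls.Add(meta);
    }
}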
Finally, some notes:
1. Do not use the same Keywords and Description on every page. This is an easy mistake to make: articles.aspx is a single page, but once URL parameters are added it becomes thousands of pages, and if the Keywords and Description are hard-coded on that one page, then thousands of pages will share the same Keywords and Description!
2. Try to avoid URL-based session IDs. When cookies are disabled on the client, ASP.NET can be configured to fall back to a cookieless session ID embedded in the URL, which looks like this:
http://www.freeflying.com/(S(c3hvob55wirrndfd564))/articles.aspx