Duplicate content: causes and solutions

Author: Eve Cole | Updated: 2010-12-16 17:49:00

Search engines like Google have a problem, and they call it "duplicate content": the same content shows up at several addresses on your website, and they don't know which address to show in their results. The problem gets worse when people link to all the different versions of that content. This article will help you understand the different causes of duplicate content and how to fix each one.

Causes of duplicate content

1. Misunderstanding the concept of a URL

2. Session IDs

3. URL tracking parameters

4. Content scraping and content aggregation

5. Parameter order

6. Comment pagination

7. Print page

8. www vs. no www

Conceptual solution: a "canonical" URL

1. Identify duplicate content

2. Google Webmaster Tools

3. Searching for titles or snippets

Practical steps to resolve duplicate content

1. Avoid duplicate content

2. 301 redirect

3. Use the rel="canonical" tag

4. Link to original content

Summary: Duplicate content can and should be addressed

You can think of duplicate content as standing at a crossroads where the road signs point in two different directions for the same destination. Which road should you take? To make matters worse, the destinations at the end of each road may even differ slightly. As a reader you don't care, because you just want the content, but a search engine has to pick one version to show in its results because it doesn't want to show the same content twice.

For example, your article about keyword X appears at http://www.example.com/keyword-x/ and the same content also appears at http://www.example.com/article-category/keyword-x/. This situation is not fictitious; it happens in many CMSs. Now suppose your article is picked up and reposted by several people: some of them link to the first URL and some link to the second. This is where duplicate content becomes a problem. If all of those links pointed to one URL, your chance of ranking on the first page of results for that keyword would be much higher.

Causes of duplicate content

There are many causes of duplicate content. Most of them are technical: it is not very common for someone to deliberately put the same content in two different places without indicating which is the original, and most people would feel uncomfortable doing so. The technical causes are plentiful, though. They mostly arise because programmers don't think like a browser or a user, let alone a search engine spider; they think like programmers. Take the article mentioned earlier, which appears at http://www.example.com/keyword-x/ and http://www.example.com/article-category/keyword-x/. If you ask the programmer, he will say it exists only once.

Misunderstanding the concept of a URL

So are programmers crazy? No, he is just speaking a different language. The website you see is probably database driven. In that database there is only one article, but the website software allows that one article to be retrieved through several URLs. To the programmer, the article's unique identifier is its ID in the database, not its URL. To a search engine, the URL is the unique identifier of a piece of content. If you explain this to your programmer, he will understand the cause of the problem. Then, like most programmers I work with, he will wonder why search engines are so stupid that they can't work this out themselves, which is just another wrong line of thinking.
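The mismatch between the two views can be sketched in a few lines of Python. The routing table, the article ID 42, and the article text below are all made up for illustration; a real CMS will look different.

```python
# Hypothetical data: one article row in the database, keyed by its ID.
ARTICLES = {42: "Everything about keyword X"}

# Hypothetical routing table: two URL paths resolve to the same row.
ROUTES = {
    "/keyword-x/": 42,
    "/article-category/keyword-x/": 42,
}


def render(path: str) -> str:
    """Return the article body served at the given URL path."""
    return ARTICLES[ROUTES[path]]


# The programmer's view: one article, because there is one ID.
unique_ids = set(ROUTES.values())

# The search engine's view: two URLs that return identical content.
duplicate_paths = [path for path in ROUTES if render(path) == ARTICLES[42]]
```

Here unique_ids has one entry while duplicate_paths has two, which is exactly the disagreement described above.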

Session ID

You often want to keep track of your visitors, for example to store the items they put in their shopping cart. To do that, you give them a session. A session is basically a brief history of what a visitor has done on your site, and it can include things like the items in a shopping cart. To keep that session as the visitor clicks from one page to another, its identifier, the session ID, has to be stored somewhere. The most common solution is a cookie, but search engines usually do not store cookies.

When that happens, some systems fall back to putting the session ID in the URL. Every internal link on the site then gets the session ID appended, and because the session ID is unique to each session, every link becomes a new URL and therefore duplicate content.
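As a rough illustration of that fallback, the sketch below appends a session ID to a link the way a cookieless session system might. The parameter name sid and the two session IDs are assumptions; your system may use a different name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_session_id(url: str, session_id: str) -> str:
    """Append a session ID to a URL (the parameter name `sid` is assumed)."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("sid", session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


link = "http://www.example.com/keyword-x/"

# Two visits by a crawler without cookies get two different sessions,
# so the same article is seen under two different URLs.
crawl_one = add_session_id(link, "a1b2c3")
crawl_two = add_session_id(link, "d4e5f6")
```

Both crawl_one and crawl_two serve the same article, but as strings they are different URLs, and every new session adds another one.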

Using tracking and sorting URL parameters

Another cause of duplicate content is the use of URL parameters that do not change the content of the page, such as those in tracking links. To a search engine, http://www.example.com/keyword-x/ and http://www.example.com/keyword-x/?source=rss are not the same URL. The latter lets you track where visitors came from, but it may also make it harder for you to rank well, which is a very unwelcome side effect.

This applies not only to tracking parameters but to every parameter you add to a URL that does not change the main content of the page. A parameter that changes the order of products on a page or displays a different sidebar also leads to duplicate content.
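One way to see which URLs collapse into the same page is to strip the parameters that do not change the main content. The sketch below assumes such parameters are known in advance; source comes from the example above, while sort is a hypothetical sorting parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that never change the main content of a page.
NON_CONTENT_PARAMS = {"source", "sort"}


def strip_non_content_params(url: str) -> str:
    """Drop tracking and sorting parameters, keeping everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in NON_CONTENT_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


tracked = "http://www.example.com/keyword-x/?source=rss"
plain = strip_non_content_params(tracked)
```

Any two URLs that are equal after this step are candidates for duplicate content. Which parameters belong in the list depends entirely on your own site.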

Content scraping and content aggregation

While most causes of duplicate content are your own fault, or at least your site's, sometimes another website uses your content without your consent. Such sites do not always link to your original article, so search engines do not know which is the original and have to deal with yet another version of the same article.

The more popular your site becomes, the more scrapers you will attract, and the bigger this problem gets.

Parameter order

Another common cause is a CMS that does not use clean URLs but URLs like /?id=1&cat=2, where id refers to the article and cat to the category. In most website systems the URL /?cat=2&id=1 renders exactly the same content, but to a search engine the two are completely different URLs.
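The difference between "same parameters" and "same URL" can be checked with Python's standard library. This is only a sketch; the host www.example.com is borrowed from the earlier examples.

```python
from urllib.parse import parse_qsl, urlsplit


def same_parameters(url_a: str, url_b: str) -> bool:
    """True if both URLs share host and path and carry the same
    parameters, regardless of the order the parameters appear in."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    same_location = (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)
    return same_location and sorted(parse_qsl(a.query)) == sorted(parse_qsl(b.query))


first = "http://www.example.com/?id=1&cat=2"
second = "http://www.example.com/?cat=2&id=1"

# As plain strings the URLs differ, which is how a search engine sees them.
differ_as_strings = first != second
# Compared parameter by parameter, they request the same page.
same_page = same_parameters(first, second)
```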

Comment pagination

WordPress, like some other systems, has an option to paginate comments. This duplicates the article content across the article URL itself and the article URL plus /comment-page-1/, /comment-page-2/, and so on.

Print page

If your CMS generates printer-friendly pages and links to them from your article pages, in most cases Google will find them unless you specifically block them. Which version should Google show: the page with ads and surrounding content, or the page with just your article?

WWW vs. non-WWW

This is one of the oldest causes, but search engines still sometimes get it wrong: WWW vs. non-WWW duplicate content, which occurs when both versions of your site are accessible.

A less common situation is HTTP vs. HTTPS duplicate content, where the same content is served over both.

Conceptual solution: a "canonical" URL

As we have seen, having several URLs lead to the same content is a problem, but it can be solved. A person can usually tell you quite easily what the correct URL for an article should be. The funny thing is that sometimes, when you ask three people in the same company, you get three different answers.


In those cases the question has to be settled, because in the end there can be only one URL. That correct URL for an article is what search engines call the canonical URL.

Identify duplicate content

You may not know whether you have a duplicate content problem on your site. Here are some ways to find out.

Google Webmaster Tools

Google Webmaster Tools is a great tool for identifying duplicate content. Go to Google Webmaster Tools, select your site, and check Diagnostics -> HTML Suggestions. There you will see a report of pages with duplicate titles and duplicate descriptions.

If pages have duplicate titles or duplicate descriptions, that is almost never a good thing. Clicking an entry shows you which URLs share the title or description, which helps you identify the problem. The catch is that if your article about "keyword X" appears in two categories, the titles may differ, for example "Keyword X - Category X - Example Site" and "Keyword X - Category Y - Example Site". Google will not report those as duplicate titles, but you can find them by searching.

Searching for titles or snippets

There are several search operators that are very helpful in cases like this. If you want to find all the URLs on your website that contain your "Keyword X" article, you can enter the following query in the Google search box:

site:example.com intitle:"Keyword X"


Google will show you all the pages on example.com that contain that keyword in their title. The more specific the phrase you search for in the title, the easier it is to find duplicate content and weed it out. You can use the same method to find copies of your content on other people's websites. For example, if the full title of your article is "Keyword X - why it is awesome", you can search for:

intitle:"Keyword X - why it is awesome"


Google will return all the sites that have a page with that title. It is sometimes also worth searching for one or two complete sentences from your article, because some scrapers change the title. In some cases, when you run such a search, Google shows a notice at the end of the results saying that it has omitted some entries very similar to the ones already displayed.

This is a sign that Google is already filtering duplicates out of the results, which is obviously not good. Click the link in that notice to see the omitted results and check whether any of them point to problems you can fix.
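If you can fetch your own pages, a complementary check is to group URLs by a fingerprint of their body text. This is not something the tools above do for you; it is a minimal sketch of a different technique (exact-match hashing after whitespace and case normalization), and the pages dictionary stands in for whatever your own crawl returns.

```python
import hashlib
from collections import defaultdict


def fingerprint(body: str) -> str:
    """Hash the body text after lowercasing and collapsing whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Return groups of URLs whose normalized body text is identical."""
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[fingerprint(body)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl result: URL -> extracted article text.
pages = {
    "http://www.example.com/keyword-x/": "Keyword X is awesome.",
    "http://www.example.com/article-category/keyword-x/": "Keyword X   is awesome.",
    "http://www.example.com/other/": "Something else entirely.",
}
duplicates = find_duplicates(pages)
```

This only catches copies that are identical after normalization. Pages that differ in a sidebar or in a few words would need a fuzzier comparison.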

Practical steps to resolve duplicate content

Once you have decided which URL is the canonical URL for your article, you have to start the process of canonicalization (yes, I know I'm being verbose and have said this several times). This basically means letting search engines know about the canonical version of a page and having them find it as quickly as possible. There are four ways to do this:

1. Don’t create duplicate content

2. Redirect duplicate content to canonical URLs

3. Add a canonical link element to the duplicate page

4. Add a hyperlink to the canonical URL on the duplicate content page

Avoid duplicate content

Some of the causes of duplicate content described above have very simple fixes.

1. Session IDs in your URLs?

These can often simply be disabled in your system's settings.

2. Duplicate printer-friendly pages?

These are completely unnecessary; use a print stylesheet instead.

3. Comment pagination in WordPress?

You can simply disable comment pagination in the settings.

4. Parameters in a different order?

Ask your programmer to write code that always outputs parameters in the same order (this is often called a URL factory).

5. Tracking parameters?

In most cases you can use hash fragments (#) instead of parameters to track marketing campaigns.

6. WWW vs. non-WWW?

Pick one version and stick with it by redirecting the other to it. You can also set your preferred version in Google Webmaster Tools.

If your problem is not that easy to fix, it may still be worth the effort, because preventing duplicate content from appearing in the first place is by far the best solution.
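Fixes 4 and 5 can be combined in a small URL factory. The sketch below is one possible shape for it, not a standard API: build_url, its arguments, and the default host are all made up for illustration.

```python
from urllib.parse import urlencode


def build_url(path, params=None, tracking=None, host="http://www.example.com"):
    """Build every internal link through one function so that
    parameters always come out in the same (alphabetical) order and
    tracking data goes into the #fragment instead of the query string."""
    url = host + path
    if params:
        url += "?" + urlencode(sorted(params.items()))
    if tracking:
        url += "#" + urlencode(sorted(tracking.items()))
    return url


# The same parameters always yield the same URL, whatever order they arrive in.
article_a = build_url("/", {"id": 1, "cat": 2})
article_b = build_url("/", {"cat": 2, "id": 1})

# Tracking data is carried after the hash.
rss_link = build_url("/keyword-x/", tracking={"source": "rss"})
```

Keep in mind that browsers do not send the part after # to the server, so tracking data placed there has to be read by a script running in the page.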

301 redirect duplicate content

In some cases it is impossible to prevent the system you use from creating wrong URLs for your content, but you can redirect them. If that does not make sense to you (which I can understand), keep it in mind for when you talk to your programmers. Also, once you do fix a duplicate content issue, make sure you redirect all the old duplicate URLs to the proper canonical URLs.
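Redirects are usually configured in the web server or the CMS, but the logic is simple enough to sketch as a small piece of Python WSGI middleware. The canonical host www.example.com, the http scheme, and the article_app placeholder are assumptions for illustration.

```python
CANONICAL_HOST = "www.example.com"  # assumed choice: the WWW version


def redirect_to_canonical(app):
    """WSGI middleware: answer with a 301 whenever the request arrives
    on any host other than the canonical one."""
    def middleware(environ, start_response):
        host = environ.get("HTTP_HOST", CANONICAL_HOST)
        if host != CANONICAL_HOST:
            # Rebuild the same path and query on the canonical host
            # (the http scheme is assumed here).
            location = "http://" + CANONICAL_HOST + environ.get("PATH_INFO", "/")
            query = environ.get("QUERY_STRING", "")
            if query:
                location += "?" + query
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return app(environ, start_response)
    return middleware


def article_app(environ, start_response):
    """Placeholder application that serves the article."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Keyword X article"]


application = redirect_to_canonical(article_app)
```

A request for example.com/keyword-x/ is answered with a 301 pointing at http://www.example.com/keyword-x/, while requests on the canonical host pass straight through to the application.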

Use rel="canonical"

Sometimes you do not want to, or cannot, get rid of a duplicate version of an article, even though you know it lives at the wrong URL. For that particular situation, search engines introduced the canonical link element. It is placed in the <head> section of the duplicate page, and its href attribute points to the canonical URL of the article.

Search engines process this more slowly than a 301 redirect, so if you can use a 301 redirect, that is preferable, as Google's John Mueller has mentioned: http://www.seroundtable.com/google-canonical-tag-vs-301-redirect-12611.html
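For reference, the element itself is a single link tag. The small Python helper below renders one; the function name canonical_link is made up, and the URL is the example used throughout this article.

```python
from html import escape


def canonical_link(url: str) -> str:
    """Render the canonical link element for a page's <head> section."""
    return '<link rel="canonical" href="{}" />'.format(escape(url, quote=True))


tag = canonical_link("http://www.example.com/keyword-x/")
# tag is: <link rel="canonical" href="http://www.example.com/keyword-x/" />
```

The same tag goes on every duplicate version of the page, always pointing at the one canonical URL.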

Link back to your original version

If you cannot do any of the above, perhaps because you do not control the "head" section of the site where your content appears, adding a link back to the original article at the top or bottom of the page is always a good idea. You may also want to do this in your RSS feed by adding a link back to the article in each item. Some scrapers will filter those links out, but others will leave them in, and if Google finds several links pointing to your article, it will soon work out that yours is the canonical version.

Summary: Duplicate content can and should be addressed

Duplicate content happens everywhere. I have yet to come across a website with more than 1,000 pages that did not have at least a small duplicate content problem. It is something you need to keep an eye on at all times. But it is fixable, and the rewards can be great: your quality content may climb in the rankings once you get rid of duplicate content. Of course, you first need to identify the problems and then help your programmers come up with solutions for them.

Translation author: zhipeng

Article source: Lightyear Forum ( http://www.gnbase.com/thread-474-1.html )

Original English text: http://yoast.com/articles/duplicate-content/

Note: The article is reprinted in Webmaster Home with the authorization of zhipeng, Lightyear Forum. If you need to reprint, please indicate the source and link of the article.