How do search engines determine whether a page's article content is original?

Author: Eve Cole  Update Time: 2011-06-29 16:44:06

I currently run a non-mainstream website whose content is collected (scraped) from other sites. Indexing went fine at first, but the site was banned soon afterward: out of tens of thousands of pages, only a few dozen are still indexed by Baidu. I know that relying on collected content forever is not an option, but with limited manpower, adding articles by hand one by one is unrealistic. So I searched for how search engines decide whether content is original, and unfortunately found very little on the subject. Then I thought about it from a search engineer's perspective and broke out in a cold sweat, because judging originality turns out to be all too easy. Below I lay out the analysis in the order I thought it through, for reference.

Let me use the following article as an example. Title: Nanhao Beijing Technology Co., Ltd. is a professional manufacturer of cursor readers. Content: The cursor reader developed by Nanhao Technology has fast card reading, excellent quality and good service. Our company address is in XXXX, Beijing. Assume the spider reached our website by following a hyperlink and reached this article page through in-site links. The search engine's analysis now begins.

1. Title analysis. Many pages today show obvious signs of optimization, with a string of long-tail keywords appended to the title. Those trailing keywords should only tell the engine what the page is about; if the engine analyzed the whole title, it would see too much repetition, which would clearly be the wrong approach. So there is presumably a truncation step, for example taking only the first 40 characters as the content to analyze. Assume that what the engine ends up with is: Nanhao Beijing Technology Co., Ltd. is a professional cursor reader.
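The truncation step guessed at above can be sketched in a few lines of Python. This is only an illustration: the 40-character limit is the figure used in this article's example, not a documented engine setting, and the function name truncate_title and the long-tail keywords in the sample title are made up for the demo.

```python
# Assumed cutoff, taken from the article's example; not a known engine setting.
TITLE_LIMIT = 40


def truncate_title(title: str, limit: int = TITLE_LIMIT) -> str:
    """Keep only the first `limit` characters of a title for analysis."""
    return title[:limit].strip()


# A title padded with long-tail keywords (the extra keywords are hypothetical).
title = (
    "Nanhao Beijing Technology Co., Ltd. is a professional manufacturer "
    "of cursor readers, cursor reader price, cursor reader supplier"
)
print(truncate_title(title))
```

Whatever the real limit is, the point is the same: keywords stuffed in after the cutoff would never reach the uniqueness check.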

The first step is to judge whether the title is unique. How? Don't worry, there is a way. We know that engines classify pages by term entries, so where do the entries come from? Simple: from the related search terms.

[Figure: related search term entries]

Using the related search terms, the engine analyzes the truncated title and matches it piece by piece against its database. For example, it takes the term "cursor reader" from the title and looks up what is stored under that related search term. If the same title already exists in the database, the title is considered not unique, and the article content has to be matched as well. Once "cursor reader" has been matched, the engine extracts the next term, "Nanhao Beijing", and matches again, and so on, until it has covered every keyword it believes the title contains.

The title match has two possible outcomes. First, the title database does not yet contain this title, and the content still needs to be examined. Second, the title already exists in the title database, and the content needs to be examined as well.
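The title check described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: TITLE_INDEX stands in for the engine's title database keyed by related search terms, and extract_terms and title_is_unique are hypothetical helper names. A real engine would use proper word segmentation and fuzzy matching rather than exact string comparison.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical stand-in for the title database: related search term -> stored titles.
TITLE_INDEX: Dict[str, Set[str]] = {
    "cursor reader": {
        "Cursor reader buying guide",
        "Nanhao Beijing Technology Co., Ltd. is a professional cursor reader",
    },
    "nanhao beijing": {
        "Nanhao Beijing Technology Co., Ltd. is a professional cursor reader",
    },
}


def extract_terms(title: str, known_terms: Iterable[str]) -> List[str]:
    """Return the known related search terms that occur in the title."""
    lowered = title.lower()
    return [term for term in known_terms if term in lowered]


def title_is_unique(title: str, index: Dict[str, Set[str]]) -> bool:
    """Check the title against stored titles, one related search term at a time."""
    normalized = title.strip().lower()
    for term in extract_terms(title, index):
        for stored in index[term]:
            if stored.strip().lower() == normalized:
                return False  # same title already in the database
    return True


print(title_is_unique(
    "Nanhao Beijing Technology Co., Ltd. is a professional cursor reader",
    TITLE_INDEX,
))
```

Either result leads on to the content check; the title result is just one more signal.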

2. Content analysis. The basic idea should be similar to the title analysis, but there are differences. The content carries more complex and more varied information than the title, so it requires more complex algorithms.

As mentioned above, our content is: the cursor reader developed by Nanhao Technology has fast card reading, excellent quality and good service. Our company address is in XXXX, Beijing. Because article content is generally long, it cannot be analyzed keyword by keyword; the engine has to analyze and match a sentence or a paragraph at a time. Even so, the matching should still be limited to the part of the article database reached through the related search terms in the title.

First, the general method: randomly pick a field (a text fragment) of random length, then look at the content immediately before and after it. If the engine's content database contains the same field and the passages before and after it are also the same, the article is suspected of plagiarism, that is, of being unoriginal. This check is usually repeated several times. If, say, 9 out of 10 checks find the same surrounding content in the existing content database, and the title is the same as well, the article will be deemed unoriginal.

Let's simulate it below.

On the first pass, the engine picks "Cursor reader reads cards quickly" and goes to the article database through the related search terms. In the stored article, the field before it is "Technology Research and Development" and the field after it is "Excellent Quality". These two fields are taken and matched against our current page. If the same content is found, the result is recorded as 0; if nothing similar is found, it is recorded as 1. That completes one match.

Next, the engine picks "company address", repeats the operation, and again gets a 0 or a 1, and so on until the number of matching rounds set by the engine is reached. If 7, 8, or all 10 out of 10 rounds find the same content, the article will be considered not original...
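The simulation above can be turned into a small runnable sketch. It is my own simplification, not any engine's actual algorithm: fragments are produced by splitting on punctuation instead of cutting random-length fields, matching is exact, and the names split_fragments, match_once and looks_unoriginal as well as the defaults of 10 rounds and a threshold of 7 are assumptions taken from the numbers used in this article.

```python
import random
import re
from typing import List, Optional, Tuple


def split_fragments(text: str) -> List[str]:
    """Split text into lowercase fragments at punctuation (a crude stand-in for fields)."""
    parts = re.split(r"[,.;!?]", text)
    return [p.strip().lower() for p in parts if p.strip()]


def neighbors(frags: List[str], i: int) -> Tuple[Optional[str], Optional[str]]:
    """Return the fragments directly before and after position i (None at the edges)."""
    before = frags[i - 1] if i > 0 else None
    after = frags[i + 1] if i + 1 < len(frags) else None
    return before, after


def match_once(page: List[str], stored_docs: List[List[str]], rng: random.Random) -> int:
    """One round: 0 if a stored article has the same fragment with the same context, else 1."""
    i = rng.randrange(len(page))
    target, context = page[i], neighbors(page, i)
    for doc in stored_docs:
        for j, frag in enumerate(doc):
            if frag == target and neighbors(doc, j) == context:
                return 0
    return 1


def looks_unoriginal(page_text: str, stored_texts: List[str], rounds: int = 10,
                     threshold: int = 7, seed: Optional[int] = None) -> bool:
    """Repeat the check `rounds` times; flag the page if at least `threshold` rounds score 0."""
    rng = random.Random(seed)
    page = split_fragments(page_text)
    if not page:
        return False
    stored_docs = [split_fragments(t) for t in stored_texts]
    results = [match_once(page, stored_docs, rng) for _ in range(rounds)]
    return results.count(0) >= threshold


article = ("The cursor reader developed by Nanhao Technology has fast card reading, "
           "excellent quality and good service. Our company address is in XXXX, Beijing.")
print(looks_unoriginal(article, [article], seed=1))
```

A copied article scores 0 in every round, so it is flagged no matter which fragments happen to be sampled. That is why collected content is so easy to catch with this kind of check.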

Going a step further: if the article is judged original, the engine adds 1 to that domain's entry in its domain weight database. Evidently, the more original articles a site publishes, the higher its weight climbs and the better it ranks. A5 and chinaZ are examples.
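The "+1" bookkeeping guessed at above amounts to a per-domain counter. The sketch below is purely illustrative: domain_weight and record_page are hypothetical names, the article path in the sample URL is made up, and real ranking systems certainly weigh far more than a simple count.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical "domain weight database": domain -> number of original articles seen.
domain_weight: Counter = Counter()


def record_page(url: str, is_original: bool) -> int:
    """Add 1 to the domain's weight when a page is judged original; return the current weight."""
    domain = urlparse(url).netloc.lower()
    if is_original:
        domain_weight[domain] += 1
    return domain_weight[domain]


# The path below is invented for the demo.
print(record_page("http://www.nanhaokeji.com/article/1.html", True))
```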

In my view, by matching keywords between the title and the content, running enough matching rounds, and boldly widening the range of the database being matched, an engine can tell whether an article is original. Processors keep getting faster and cheaper, search engine engineers are highly educated, algorithms keep improving, and experience keeps accumulating. For a search engine, judging whether an article is original is as easy as chopping cabbage.

I was fine until I thought it through, and then I was truly shocked. My conclusion: collection sites are doomed! Content should be original, or at the very least the titles should be changed. If I find the time, I will share how to write pseudo-original articles that engines cannot detect.

The above is only my simple analysis; the actual algorithms are certainly far more complicated. It is for reference only! One more ad: http://www.nanhaokeji.com , a website I operate, is looking for friendly links. Corporate websites are preferred. Its PR was just updated: 1. QQ: 419844484; please mention "friend link" when adding me.

Editor in charge: Chen Long  Author: feelingseas