Generally speaking, whether a word or phrase can serve as a keyword for an article depends mainly on how well it reflects the article's central idea. The relevance between a keyword and an article is a measure of exactly that: how well the chosen word or phrase reflects the central idea or theme of a given article. Keyword extraction is influenced by a word's position in the article, its frequency of occurrence, and its semantic characteristics. So how do search engines determine the relevance between keywords and articles? Here I offer a few opinions of my own, in the hope of prompting discussion and receiving everyone's guidance.
Personally, I think search engines analyze the relevance between keywords and articles through the following steps:
First: The search engine first purifies the web pages to be analyzed.
Web page purification mainly removes template noise such as the many useless advertisements and navigation bars, along with content that carries no meaning for analysis, such as JavaScript scripts and CSS. We don't know which algorithm search engines actually use, but my personal guess is that the page is divided into blocks, the importance of each block is measured to find the blocks that contain the thematic content, and the content of those blocks is then extracted. How search engines determine the importance of a page block is another topic.
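To make the block idea concrete, here is a minimal sketch in Python. It is only an illustration of my guess, not what any search engine actually does. The tag lists, the class name BlockCollector, the function extract_main_text, and the scoring rule (text length discounted by the share of link text) are all assumptions of mine.

```python
from html.parser import HTMLParser

# Assumed tag sets: content to drop entirely, and tags that start a new block.
SKIP_TAGS = {"script", "style"}
BLOCK_TAGS = {"div", "p", "nav", "header", "footer", "article", "section"}


class BlockCollector(HTMLParser):
    """Splits a page into rough text blocks and tracks link text per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._link_depth = 0   # > 0 while inside <a>
        self._start_block()

    def _start_block(self):
        self.blocks.append({"text": [], "link_chars": 0})

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._start_block()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1
        elif tag in BLOCK_TAGS:
            self._start_block()

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        block = self.blocks[-1]
        block["text"].append(text)
        if self._link_depth:
            block["link_chars"] += len(text)


def extract_main_text(html):
    """Returns the text of the block with the most non-link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    best_text, best_score = "", 0.0
    for block in parser.blocks:
        total_chars = sum(len(piece) for piece in block["text"])
        if total_chars == 0:
            continue
        link_density = block["link_chars"] / total_chars
        score = total_chars * (1.0 - link_density)
        if score > best_score:
            best_text, best_score = " ".join(block["text"]), score
    return best_text
```

Navigation bars and footers are mostly link text, so they score close to zero under this rule, while a paragraph of ordinary prose scores high. A real system would certainly use richer signals than this.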
Second: Perform word segmentation on the extracted content
Personally, I think the search engine uses some algorithm to make a rough segmentation of the content into words, first obtaining the N segmentation results with the highest probability. It then uses a role annotation method to identify unregistered (out-of-vocabulary) words and calculate their probabilities. The unregistered words are added to the word graph of the segmentation and treated like ordinary words. Finally, dynamic programming selects the N segmentation and annotation results with the highest probability, and these are recorded.
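The dynamic programming part can be sketched as follows. This is a simplified illustration under my own assumptions: it keeps only the single best segmentation instead of the N best, it uses a toy unigram dictionary (TOY_PROBS, with made-up probabilities), and it replaces role annotation with a crude fallback that treats any unknown single character as a very unlikely word.

```python
import math

# Hypothetical word probabilities, invented purely for illustration.
TOY_PROBS = {
    "key": 0.01, "word": 0.02, "keyword": 0.015,
    "ran": 0.01, "rank": 0.01, "king": 0.005, "ranking": 0.01,
}


def segment(text, word_prob, unknown_prob=1e-8, max_len=10):
    """Returns the maximum-probability segmentation of text (unigram model)."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob:
                logp = math.log(word_prob[word])
            elif len(word) == 1:
                # Fallback for unregistered characters (an assumption).
                logp = math.log(unknown_prob)
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("keywordranking", TOY_PROBS))
```

With this toy dictionary, "keyword" plus "ranking" beats splits such as "key", "word", "ranking" because fewer, more probable words give a higher total probability.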
Third: Remove meaningless words from the preliminary word segmentation results.
The search engine analyzes the segmentation results from the second step and removes words that carry little substance, such as modal particles, adjectives, and certain other words. It also treats single-character words as not expressing enough information on their own and filters them out. Stop word removal is done by building a stop word list. Once these meaningless words are removed, what remains are meaningful words worth analyzing.
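A minimal sketch of this filtering step might look like the following. The stop word list here is a tiny made-up sample, and the names STOP_WORDS and filter_tokens are my own. A real list would be far longer and language specific.

```python
# A tiny, hypothetical stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "oh", "ah", "very"}


def filter_tokens(tokens, stop_words=STOP_WORDS, min_length=2):
    """Drops stop words and tokens shorter than min_length characters."""
    return [
        token for token in tokens
        if token.lower() not in stop_words and len(token) >= min_length
    ]


print(filter_tokens(["The", "title", "of", "a", "page", "is", "x", "important"]))
```

The min_length check stands in for the single-character filter described above.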
Fourth: Determine and analyze the weight of keywords
After word segmentation and purification of the article are complete, all of the article's keywords need to be analyzed. My idea is that the search engine represents the text as an N-dimensional feature vector, in which each component consists of a keyword and its weight. It is generally believed that the weight of a keyword in a text is determined jointly by three factors: word frequency, position, and word meaning. The influence of frequency and position on a word or phrase can be computed by certain algorithms, and the word-meaning weight is likewise calculated by a fixed algorithm. The search engine applies its chosen algorithm to these keywords to obtain the final result.
I believe the search engine arrives at its final result after working through the steps above. What follows is how I think the search engine carries out the analysis specifically, which is only my personal opinion:
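As a rough illustration of the feature vector idea, the sketch below combines three per-keyword scores into one weight per keyword. The linear combination and the coefficients alpha, beta and gamma are purely my assumptions. Nothing here is known about how a real search engine combines these factors.

```python
def keyword_vector(freq_score, pos_score, sem_score,
                   alpha=0.4, beta=0.4, gamma=0.2):
    """Combines frequency, position and meaning scores into keyword weights.

    Each argument maps keyword -> score. The result maps keyword -> weight,
    which can be read as one component of the text's feature vector.
    """
    keywords = sorted(set(freq_score) | set(pos_score) | set(sem_score))
    return {
        kw: alpha * freq_score.get(kw, 0.0)
            + beta * pos_score.get(kw, 0.0)
            + gamma * sem_score.get(kw, 0.0)
        for kw in keywords
    }


vector = keyword_vector(
    freq_score={"dvd": 1.0, "player": 0.5},
    pos_score={"dvd": 1.0},
    sem_score={"player": 1.0},
)
print(vector)
```

A keyword missing from one of the three inputs simply contributes zero for that factor.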
First: Weighting based on keyword position
Where a keyword appears in a document plays an important role in how much weight the search engine gives it on that page. For example, search engines treat the domain name as the most stable feature of a website, so a domain name containing the keyword DVD has an inherent advantage when users search for DVD. The title is the most valuable resource of a page. Because the title is shown to users in the browser title bar, search engines regard it as the most important and most concise summary of the document. Giving keywords appropriate prominence in the title is very helpful for improving rankings.
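Position weighting can be sketched as a simple lookup of where the keyword occurs. The field names and the numeric weights in FIELD_WEIGHTS are hypothetical values I chose only to reflect the ordering described above, and the substring match is deliberately crude.

```python
# Hypothetical weights: domain and title count for more than body text.
FIELD_WEIGHTS = {"domain": 3.0, "title": 2.0, "body": 1.0}


def position_score(keyword, fields, field_weights=FIELD_WEIGHTS):
    """Sums the weights of every field that contains the keyword."""
    needle = keyword.lower()
    return sum(
        weight
        for name, weight in field_weights.items()
        if needle in fields.get(name, "").lower()
    )


page = {
    "domain": "dvd-shop.example",
    "title": "Buy DVD players online",
    "body": "We sell players and discs.",
}
print(position_score("DVD", page))
```

Here "DVD" is found in both the domain and the title, so it scores higher than a word that appears only in the body.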
Second: Weighting based on keyword frequency
How often the various keywords appear in a web page is a very important factor. Personally, I think that although the position and frequency of a keyword strongly influence its weight, a high frequency alone does not make a word suitable as a keyword. To give a simple example, suppose we are optimizing an article for "United States". The word appears very frequently and in very important positions. Even so, it cannot be given a high weight, because "United States" also appears widely in other documents, where it is likewise frequent and prominently placed. Words that have a high frequency but are not suitable as keywords should therefore be given less weight.
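The "United States" example is essentially the intuition behind TF-IDF weighting, which I use here as a stand-in since we do not know what search engines actually compute. The sketch below multiplies a term's frequency in the document by a smoothed inverse document frequency over a small corpus. The function name tf_idf and the sample documents are my own.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Scores each term in doc_tokens by term frequency times smoothed IDF.

    corpus is a list of token lists and should include doc_tokens itself.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        doc_freq = sum(1 for tokens in corpus if term in tokens)
        # Smoothed IDF: a term found in every document gets log(1) = 0.
        idf = math.log((1 + n_docs) / (1 + doc_freq))
        scores[term] = (count / total) * idf
    return scores


article = ["united states", "tariff", "united states", "steel", "united states"]
corpus = [
    article,
    ["united states", "election", "vote"],
    ["united states", "film", "award"],
]
print(tf_idf(article, corpus))
```

Although "united states" is the most frequent term in the article, it appears in every document of the corpus, so its score drops to zero while "tariff" and "steel" keep a positive weight.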
Third: The distance between important keywords in the document
In my own analysis, the distance between important keywords in a document should also be an important factor in measuring the relevance between keywords and articles.
I believe that after the search engine performs the series of steps above, it gives the article a score for each keyword. When a user searches for a keyword, an article with a high score has a much greater chance of ranking first. Of course, this leaves out the influence of external links. These are just some personal views on search engines and are not necessarily correct. I hope we can learn together. Finally, the copyright of this article belongs to Guangzhou Abortion Hospital: http://www.gzrlw.net/ . You are welcome to reprint it, but please keep the link. Thank you for your understanding and cooperation!
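One simple way to measure this distance is the smallest number of token positions between any occurrence of one keyword and any occurrence of the other. The sketch below does that in a single pass. Turning the distance into a score with 1 / distance is just an assumption for illustration, and the function names are mine.

```python
def min_distance(tokens, kw_a, kw_b):
    """Returns the smallest gap in token positions between kw_a and kw_b.

    Returns None if either keyword never appears.
    """
    last_seen = {kw_a: None, kw_b: None}
    best = None
    for index, token in enumerate(tokens):
        if token not in last_seen:
            continue
        other = kw_b if token == kw_a else kw_a
        if last_seen[other] is not None:
            gap = index - last_seen[other]
            if best is None or gap < best:
                best = gap
        last_seen[token] = index
    return best


def proximity_score(tokens, kw_a, kw_b):
    """Maps the distance to a score in (0, 1]; closer keywords score higher."""
    distance = min_distance(tokens, kw_a, kw_b)
    return 0.0 if distance is None else 1.0 / distance


words = "cheap dvd player and another dvd for the old player".split()
print(min_distance(words, "dvd", "player"), proximity_score(words, "dvd", "player"))
```

Adjacent keywords get the maximum score of 1.0, and the score falls off as the keywords drift apart.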
Thanks to siyi8473 for his contribution