TF-IDF is well known to many SEO professionals. It is a weighting technique commonly used in information retrieval and text mining. Applied to web page analysis, it weights the keywords on a page relative to a large collection of pages. For a given search term, the resulting keyword weights of each candidate page give the final ranking algorithm a quantitative basis.
First, take a look at the TF*IDF formula: TF*IDF value = TF × IDF = (1 + log TF(t,d)) × IDF(t) = (1 + log TF(t,d)) × log(N / DF(t)), where TF(t,d) is the number of times term t occurs in document d. Note that the worked example below uses the simpler log TF(t,d), without the added 1; this does not change how the pages order for a given term. Why should we analyze this formula? Because the greater a page's TF-IDF value for an index term, the more relevant the page's text is to that term and the more weight the search engine can give it, which strongly supports a better ranking for the page later on.
In TF*IDF, TF (term frequency) indicates how often a term occurs in a document. IDF (inverse document frequency) grows as the number of documents containing term t shrinks: the fewer documents contain t, the larger its IDF, which shows that t discriminates well between categories. As a formula, IDF(t) = log(N / DF(t)), where DF(t) is the number of documents containing a given search term t, and N is the total number of web pages on the Internet. All logarithms in this article are base 10.
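To make the definitions above concrete, here is a minimal Python sketch of the weighting. It assumes base-10 logarithms and treats a term that does not occur (count 0) as having weight 0. The function names and the sample numbers are illustrative assumptions, not part of any library.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: log10(N / DF(t))."""
    if doc_freq <= 0:
        raise ValueError("doc_freq must be positive")
    return math.log10(n_docs / doc_freq)


def tf_weight(term_count: int) -> float:
    """Log-scaled term frequency: 1 + log10(TF(t,d)), or 0 if the term is absent."""
    return 1 + math.log10(term_count) if term_count > 0 else 0.0


def tf_idf(term_count: int, n_docs: int, doc_freq: int) -> float:
    """TF*IDF weight of one term in one document."""
    return tf_weight(term_count) * idf(n_docs, doc_freq)


# Hypothetical numbers: a term that occurs 10 times on a page
# and appears in 1,000 out of 1,000,000 pages.
print(round(tf_idf(10, 1_000_000, 1_000), 2))
```

A rarer term (smaller doc_freq) or a more frequent one on the page (larger term_count) both raise the weight, which is the behavior the formula is meant to capture.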
These concepts are hard to grasp in the abstract, so let me give you a worked example.
Using TF-IDF to explain the "SEO diagnosis" ranking phenomenon
Take the web page ranking for the keyword "SEO diagnosis". We counted how often this phrase and its component words appear on three of the top ten websites:
Ranked second is A5's SEO diagnosis page, where "SEO" appears 41 times, "diagnosis" 46 times, and "SEO diagnosis" 20 times;
Ranked third is the website of a company in Changsha, where "SEO" appears 12 times, "diagnosis" 4 times, and "SEO diagnosis" once;
My Smell the Rose blog ranks tenth. Of the three websites it has the highest word frequency for "SEO", at 84; "diagnosis" appears 7 times and "SEO diagnosis" 4 times.
A Baidu search for "SEO diagnosis" shows about 1,530,000 pages. "SEO" and "diagnosis" each reach Baidu's displayed upper limit of about 100,000,000 results. Taking N = 1,000 billion (10^12), the TF*IDF values of the three keywords on the three web pages are calculated as follows:
1. First calculate the IDF values of the three words:
SEO: IDF = log(N / DF(t)) = log(10^12 / 10^8) = log(10000) = 4
Diagnosis: IDF = log(N / DF(t)) = log(10^12 / 10^8) = log(10000) = 4
SEO diagnosis: IDF = log(N / DF(t)) = log(10^12 / 1,530,000) = 7 - log 15.3 ≈ 5.8, rounded to 6
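The IDF step can be checked with a few lines of Python. This is only a sketch under the article's assumptions: N is taken as 10^12, and the document frequencies are the Baidu result counts quoted above (the variable names are my own).

```python
import math

N = 10**12  # assumed total number of pages (1,000 billion)

# Approximate Baidu result counts quoted in the article.
doc_freq = {
    "SEO": 100_000_000,
    "diagnosis": 100_000_000,
    "SEO diagnosis": 1_530_000,
}

# IDF(t) = log10(N / DF(t))
idf = {term: math.log10(N / df) for term, df in doc_freq.items()}

for term, value in idf.items():
    print(f"{term}: IDF = {value:.2f}")
```

The phrase "SEO diagnosis" is found on far fewer pages than either single word, so it gets the largest IDF of the three.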
2. Calculate the TF values of the three words:
The TF value of the keyword "SEO" for the three sites:
Changsha: TF = log(TF(t,d)) = log 12 ≈ 1.08
A5: TF = log(TF(t,d)) = log 41 ≈ 1.61
Smell the Rose: TF = log(TF(t,d)) = log 84 ≈ 1.92
The TF value of the keyword "diagnosis" for the three sites:
Changsha: TF = log(TF(t,d)) = log 4 ≈ 0.60
A5: TF = log(TF(t,d)) = log 46 ≈ 1.66
Smell the Rose: TF = log(TF(t,d)) = log 7 ≈ 0.85
The TF value of the keyword "SEO diagnosis" for the three sites:
Changsha: TF = log(TF(t,d)) = log 1 = 0
A5: TF = log(TF(t,d)) = log 20 ≈ 1.30
Smell the Rose: TF = log(TF(t,d)) = log 4 ≈ 0.60
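The TF values above can be recomputed from the raw word counts. The sketch below follows the worked example in using log10 of the count directly, without the added 1; the dictionary layout is just one convenient way to hold the numbers. Values computed this way may differ in the second decimal place from figures rounded by hand.

```python
import math

# Word counts observed on each page, as listed in the article.
counts = {
    "Changsha": {"SEO": 12, "diagnosis": 4, "SEO diagnosis": 1},
    "A5": {"SEO": 41, "diagnosis": 46, "SEO diagnosis": 20},
    "Smell the Rose": {"SEO": 84, "diagnosis": 7, "SEO diagnosis": 4},
}

# TF = log10(count); every count here is at least 1, so the log is defined.
log_tf = {
    site: {term: math.log10(count) for term, count in terms.items()}
    for site, terms in counts.items()
}

for site, terms in log_tf.items():
    for term, value in terms.items():
        print(f"{site:<15} {term:<14} TF = {value:.2f}")
```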
3. The TF*IDF value of each word on each website is its TF value multiplied by its IDF value.
Comparing the resulting products, we can clearly see that my blog has the highest TF*IDF value for "SEO", and A5 Webmaster Network has the highest TF*IDF values for "diagnosis" and "SEO diagnosis".
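The whole calculation can be reproduced in one short script. This is a sketch under the same assumptions as the worked example: N = 10^12, base-10 logarithms, and TF taken as log10 of the raw count. It uses exact logarithms, so the IDF of "SEO diagnosis" comes out at about 5.8 rather than the rounded 6. All names are illustrative.

```python
import math

N = 10**12  # assumed total number of pages

DOC_FREQ = {"SEO": 100_000_000, "diagnosis": 100_000_000, "SEO diagnosis": 1_530_000}

COUNTS = {
    "Changsha": {"SEO": 12, "diagnosis": 4, "SEO diagnosis": 1},
    "A5": {"SEO": 41, "diagnosis": 46, "SEO diagnosis": 20},
    "Smell the Rose": {"SEO": 84, "diagnosis": 7, "SEO diagnosis": 4},
}


def tf_idf(count: int, doc_freq: int, n_docs: int = N) -> float:
    """Simplified weight used in the worked example: log10(count) * log10(N / DF)."""
    return math.log10(count) * math.log10(n_docs / doc_freq)


# TF*IDF of every term on every site.
table = {
    site: {term: tf_idf(count, DOC_FREQ[term]) for term, count in terms.items()}
    for site, terms in COUNTS.items()
}

# Site with the highest TF*IDF for each term.
best = {term: max(table, key=lambda site: table[site][term]) for term in DOC_FREQ}

terms = list(DOC_FREQ)
print(f"{'site':<16}" + "".join(f"{t:>15}" for t in terms))
for site, row in table.items():
    print(f"{site:<16}" + "".join(f"{row[t]:>15.2f}" for t in terms))
print(best)
```

The printed table plays the role of the comparison described above: one row per website, one column per keyword.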
If you look purely at the relevance calculated from the TF*IDF values, A5 Webmaster Network has the highest value for the phrase "SEO diagnosis" and should get the best ranking. My blog should rank between the other two (the day before yesterday it did), and the Changsha site should come last. The actual results are somewhat different: the Changsha site ranks third and my blog tenth. This shows that other, more important factors affect page ranking, such as the overall weight of the website, the weight and quality of the individual web page, external links, and user interaction (i.e. user experience), all of which we need to consider.
In addition, compare the TF*IDF values within each website. For the Changsha site and my Smell the Rose blog, "SEO" has the highest value, so improving their rankings depends heavily on the keyword "SEO", and the "SEO" ranking plays a decisive role. For A5 Webmaster Network, the phrase "SEO diagnosis" itself plays the decisive role, and fluctuations in the ranking of the keyword "SEO" affect it less. There is some basis for this. The day before yesterday my blog ranked third for "SEO diagnosis", at a time when it ranked on page 10 for the keyword "SEO". Now it has dropped to page 23 for "SEO", and its "SEO diagnosis" ranking has dropped to tenth. So studying TF*IDF more closely can help us discover many keyword ranking phenomena and formulate targeted SEO strategies.
Of course, this calculation assumes an ideal state, but it can still explain the causes of some SEO phenomena. If we master the basic idea of the TF*IDF algorithm and apply it to website optimization, we should be able to optimize websites better. On my blog, for example, reducing the influence of the word "SEO" on the page's ranking may give better control over its ranking for the keyword "SEO diagnosis".
This article was published by Xu Ziyu, editor of Hangzhou SEO ( http://www.soxunseo.com ) Search Network. Everyone is welcome to reprint. Please keep this link when reprinting. Thank you for your cooperation!
(Editor: Yang Yang)