[Preface] There are many ways to obtain data for website analysis: for example, using server log data, or installing monitoring software on the client. The page tagging method of obtaining data is different from both of these, but from the moment it appeared it impressed everyone and quickly became the mainstream approach. In fact, almost all the topics on my blog (http://www.chinawebanalytics.cn) are based on page tagging. Today's article takes friends once more through what page tagging website analysis is, and how the data in the Omniture SiteCatalyst or Google Analytics reports we read every day is actually captured.
Because I am on a business trip, I have less time to blog. This article is an excerpt from a book I’m currently writing on the basics of website analytics. I hope this book will be available to everyone next year.
[Main text]
When it comes to data capture for website analysis, you should first know one basic fact: the fundamental principles of page tagging website analysis and log-based website analysis are completely different. For the principles of the log method, please read this post: Principles, Advantages and Disadvantages of Server Log Method Website Analysis. A friend once left a message on Weibo saying that AWStats, Omniture, and WebTrends are all log analysis tools, and that since Omniture is delivered in the ASP model they are no different from one another. This view is a complete misunderstanding. In fact, all three tools are different. AWStats is a log analysis tool, and it is free. WebTrends started out as a pure log analysis tool but later added page tagging. Omniture SiteCatalyst was born as a page tagging tool, and to this day Omniture does not offer a log analysis tool.
So today we will only talk about how the page tagging method obtains data for website analysis. Let's start with a game.
What is page tagging
Have you played Blizzard's StarCraft (the original StarCraft)? I'm a big fan of this game. The Zerg Queen has a special ability: it can attach a parasite to an enemy unit, and from then on, wherever that unit goes, everything around it can be clearly seen by the Zerg player. A very loyal spy.
Or think of a bank. The cameras placed everywhere actually film our every move and then send the footage to a storage device.
So, to use an imperfect metaphor, a page tag is like a parasite "sprayed" onto the page, or a camera installed on the page: it records the visitor's every move on the page and then passes that information on to the organization or individual who needs to know about this website.
The figure below represents this process:
The page tag is the small red block in the figure. It is actually a snippet of JavaScript, executable by the browser, that is placed in the HTML source file of the page. When the page is downloaded to the visitor's browser, the JavaScript tag embedded in the page is executed, just like the parasite in StarCraft taking hold, or the camera being switched on.
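To make this concrete, here is a minimal sketch of what such a tag can look like, written in the classic (ga.js-era) Google Analytics syntax; the account ID UA-XXXXX-X is a placeholder, not a real one.

```javascript
// Placed inside a <script> element in the page's HTML source.
var _gaq = _gaq || [];                      // command queue shared by all tags on the page
_gaq.push(['_setAccount', 'UA-XXXXX-X']);   // which analytics account the data belongs to
_gaq.push(['_trackPageview']);              // record one page view when the tag executes
```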
After the page tag's JavaScript has executed, the visitor's interactions on the page are faithfully and continuously sent to the server of the website analysis tool that the tag belongs to, exactly the way a camera sends its footage to an image storage server. Once the analysis tool's server receives the data, it processes it further, translates it into graphs, tables and data files that people can read and analyze, and presents them on a friendly user interface. Our commonly used Google Analytics collects data in exactly this way.
As you can see, the page tagging method is fundamentally different from the logging method:
1. The logging method extracts data from log files that the server produces anyway; the page tagging method requires deliberately adding a small "spy unit" to the page, which usually means relying on a third party to collect the data.
2. Because of this extra little "spy unit", the page tagging method requires modifying the HTML source of the page, while the logging method does not.
3. The logging method passively waits for you to process the data; if you never process it, the data remains a faithful but inert record. The page tagging method actively sends the data and automatically preprocesses it, leaving it ready for you to analyze.
Let's add a little history here. In the early days of the Internet, websites were small and simple in structure, and the logging method dominated. But the Internet developed too fast: the software, hardware and logical architecture of websites quickly grew more complex, the problems the logging method had to overcome kept multiplying, the difficulty of implementation rose almost exponentially, and people needed an easier approach. With the popularity of JavaScript and the emergence of SaaS (Software as a Service), the page tagging method appeared. It is simple to implement and does not require handling massive log files, so the efficiency of data management and processing improved greatly, and it quickly became the first choice of many webmasters. Precisely because of these advantages (simplicity, high data readability, low management overhead), page tagging has become the mainstream data collection method in website analytics, and my blog focuses entirely on this method rather than on the logging method.
An interesting aside: the difference between tracking code and tracking tags
In day-to-day website analysis work, we often mix up two different things: the tracking code (Tracking Code) and the tracking tag (Tracking Tag). They are in fact different, and distinguishing them strictly helps us communicate more precisely.
Code refers to executable program statements, so tracking code is executable program code written for tracking purposes. The most typical example is the Google Analytics JavaScript tracking code we add to a page.
A tag, on the other hand, is an identifier attached to the object being tracked. It is not a program statement and cannot be executed, but it can be recognized by the program and used to determine specific attributes of the tracked object. For example, in the URL http://www.chinawebanalytics.cn/?utm_campaign=newbook&utm_source=tsinghua&utm_medium=press, the part "?utm_campaign=newbook&utm_source=tsinghua&utm_medium=press" is a tag. A tag can also be a complete URL.
Put simply: what can be executed is tracking code; what cannot be executed, and only identifies, is a tracking tag.
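To illustrate the relationship, here is a small sketch: the tracking tag is plain data in the URL, and the tracking code is the program that reads it. This is only an illustration, not the actual parser used by Google Analytics; the function name readTrackingTag is made up.

```javascript
// Split the query string of a tagged URL into named fields.
function readTrackingTag(url) {
  var query = url.split('?')[1] || '';
  var fields = {};
  query.split('&').forEach(function (pair) {
    var parts = pair.split('=');
    fields[decodeURIComponent(parts[0])] = decodeURIComponent(parts[1] || '');
  });
  return fields;
}

// The tagged URL from the example above:
var tag = readTrackingTag(
  'http://www.chinawebanalytics.cn/?utm_campaign=newbook&utm_source=tsinghua&utm_medium=press'
);
// tag.utm_campaign === 'newbook', tag.utm_source === 'tsinghua', tag.utm_medium === 'press'
```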
How the page tagging method works
We now understand the basic principle of page tagging; next, let's look in detail at how it collects, transmits and finally presents data to us. Understanding this process is very helpful when we implement website analysis tracking in practice.
Step 1: The page tracking code is loaded and executed by the browser
The prerequisite for the page tagging method to work is that a piece of JavaScript tracking code is added to every page of the website that needs to be tracked. When a user opens such a page, the server (or a cache) responds to the request and delivers the page, tracking code included, to the user's browser. As soon as the browser receives the tracking code, it starts executing it.
Step 2: The complete tracking code is executed
When the tracking code on the page executes, it does not perform all the tracking work by itself; instead, it requests the complete tracking code from the server of the website analysis tool it belongs to. The complete tracking code is fairly large, so it is packaged into a .js file and stored outside the web page. When the on-page tracking code requests this external file, it too is delivered to the browser and executed there, and only then is the full tracking functionality in place.
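The usual way the small on-page tag requests the complete code is to create a script element that points at the external .js file and let the browser fetch and execute it asynchronously. The sketch below is the classic ga.js loader pattern.

```javascript
(function () {
  var ga = document.createElement('script');
  ga.type = 'text/javascript';
  ga.async = true;                                   // do not block page rendering
  ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') +
           '.google-analytics.com/ga.js';            // the complete tracking code
  var s = document.getElementsByTagName('script')[0];
  s.parentNode.insertBefore(ga, s);                  // inject it next to the first script on the page
})();
```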
Taking the Google Analytics tracking on my own blog (CWA, Web Analytics in China, http://www.chinawebanalytics.cn) as an example, several things happen while the complete tracking code executes:
1. It detects various attributes of the client, such as the browser version, operating system version and screen resolution, and records the exact time the page view occurred, the source of the visit (Traffic Source), and so on.
2. It creates a cookie in the user's browser. What are cookies? Please see this post: Defending Cookies - Without Cookies, We Have Nothing, and this one: How Much Impact Do JavaScript and Cookies Have on GA?. If you don't want to read them, that's fine. Simply put, a cookie records the key information about the user's visit to this website; the next time the user browses the site, the record in the cookie serves as the reference for the new visit, allowing the analysis tool to determine whether this visit is a repeat visit, whether the visitor is new, and many other important things. Cookies are required by the page tagging method, which means that if the browser disables cookies, page tagging will not work properly. For Google Analytics' cookie settings, see this article: Website Analytics Metrics, Their Meanings and What You Don't Know (2).
3. If a cookie was already set in this visitor's browser earlier, the tracking code rewrites the parts of the old cookie that need updating, ensuring that the cookie always records the corresponding visit behavior.
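The following sketch shows, purely for illustration, the kind of client information the complete tracking code can read from the browser and how a simple first-party cookie might be set to recognize the visitor later. This is not Google Analytics' internal code, and the cookie name _demo_visitor is made up.

```javascript
// Information readable from the browser environment.
var clientInfo = {
  userAgent:  navigator.userAgent,                 // browser and operating system
  language:   navigator.language,
  screenSize: screen.width + 'x' + screen.height,  // screen resolution
  referrer:   document.referrer,                   // where this page view came from
  timestamp:  new Date().getTime()                 // when the page view happened
};

// If no visitor cookie exists yet, create one with a random visitor id.
if (document.cookie.indexOf('_demo_visitor=') === -1) {
  var visitorId = Math.floor(Math.random() * 1e9);
  document.cookie = '_demo_visitor=' + visitorId +
                    '; path=/; max-age=' + 60 * 60 * 24 * 365 * 2;  // keep for two years
}
```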
Step 3: Data is sent to the website analysis tool's server
When the tracking code has collected all the information, it transmits the data back to the website analysis tool's server. The data is not sent directly (that is, not with the POST method; if you are not familiar with the POST and GET methods of the HTTP protocol, feel free to skip the remarks in parentheses). Instead, it is sent by requesting a 1×1 pixel transparent GIF image from the tool's server (that is, still with the GET method). Sounds a little strange, right? In fact, when this 1×1 pixel request is issued, all the collected data is attached to it as request parameters, so the analysis tool receives and stores the data along with the request.
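A simplified sketch of this "1×1 pixel GIF" trick follows: the collected data is appended to the image URL as query parameters, so requesting the image delivers the data to the analytics server with an ordinary GET. The endpoint mimics Google Analytics' __utm.gif, the function name sendViaPixel is made up, and only a few of the many real parameters are shown.

```javascript
function sendViaPixel(endpoint, data) {
  var params = [];
  for (var key in data) {
    if (data.hasOwnProperty(key)) {
      params.push(encodeURIComponent(key) + '=' + encodeURIComponent(data[key]));
    }
  }
  var img = new Image(1, 1);                   // the browser fetches the image immediately
  img.src = endpoint + '?' + params.join('&'); // the data travels as query parameters
}

sendViaPixel('http://www.google-analytics.com/__utm.gif', {
  utmsr: screen.width + 'x' + screen.height,   // screen resolution
  utmdt: document.title,                       // page title
  utmr:  document.referrer || '-'              // referrer ("-" when the visit is direct)
});
```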
Step 4: The website analysis tool's server records the data
After the website analysis tool's server receives the data, it stores it in a large data file. The way this file records data is very similar to the log file (Log File) we mentioned earlier, so we also call it a Log File here; the difference is that this Log File does not contain the operating data of the analysis tool's own server, but the data of the website being tracked.
Each data line (one data entry) in this Log File contains a great deal of information about a particular page view (PageView), including but not limited to the following (using a Google Analytics Log File record as an example):
1. The date and time when the page access occurred;
2. The title of the page visited;
3. The source of the visitor (whether it is linked from a certain website, through a search engine, through direct access, etc.);
4. The number of times this visitor visits this website;
5. The geographical location of the visitor’s IP address;
6. Visitor client attributes, such as operating system, browser, screen resolution, etc.
Once these records are included in the logs of the analysis tool server, the data collection process is complete. The following example is a row of data recorded in the Google Analytics server (please note that it is not real data):
123.121.215.51 www.chinawebanalytics.cn – [31/Jan/2010:20:45:26 -0600] "GET
/__utm.gif?utmwv=1&utmn=699988832&utmcs=utf-8&utmsr=1680×1050&utmsc=32-bit&utmul=enus&
utmje=1&utmfl=8.0&utmcn=1&utmdt=%E7%BD%91%E7%AB%99%E5%88%86%E6%9E%90%E5%9C
%A8%E4%B8%AD%E5%9B%BD%E2%80%94%E2%80%94%E4%BB%8E%E5%9F%BA%E7%A1%80
%E5%88%B0%E5%89%8D%E6%B2%BF&utmhid=2006742654&utmr=-
&utmp=/ HTTP/1.1" 200 35 "http://www.chinawebanalytics.cn/" "Mozilla/5.0 (compatible; MSIE 6.0;
Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
"__utma=453698521.699988832.235456888.235456888.235456888.1; __utmb=453698521;
__utmc=453698521;
__utmz=453698521.235456888.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none)"
The data above looks messy, but some clues can be read from it. For example, we can see that the visitor's IP address is 123.121.215.51, the domain visited is my blog www.chinawebanalytics.cn, and the visit started at 8:45:26 pm on January 31, 2010. Looking further along the line, you can also see the visitor's operating system and browser.
As for what __utma, __utmb, __utmc and __utmz stand for, you will understand after reading this article: Website Analytics Metrics, Their Meanings and What You Don't Know (2).
Step 5: The website analysis tool processes the data
Once the data has been recorded in the Log File on the analysis tool's server, the pipeline moves on. The next step is to process the record lines in these Log Files. Each record line contains specific data elements, called fields, such as the visitor's IP, the time of the visit, the browser and its version, and so on; these data elements are split out and stored in the corresponding fields, becoming the "semi-finished product" of the data we will eventually view.
Then the semi-finished data is further filtered according to criteria set by people in the website analysis tool. Records that do not pass the filters are excluded, and the remaining data is arranged into the items prepared for generating reports. All of this data is stored in the tool's dedicated databases, waiting to be extracted and used at any time.
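As a rough illustration of this filtering step (not any tool's actual pipeline), the sketch below keeps or drops "semi-finished" records according to user-defined criteria; the field names and the internal IP address are made up for the example.

```javascript
var records = [
  { ip: '123.121.215.51', page: '/',      title: 'Home'  },
  { ip: '10.0.0.8',       page: '/admin', title: 'Admin' }
];

// Keep only the rows that satisfy every filter function.
function applyFilters(rows, filters) {
  return rows.filter(function (row) {
    return filters.every(function (keep) { return keep(row); });
  });
}

// Example filter: exclude traffic from an internal IP range before reporting.
var reportRows = applyFilters(records, [
  function (row) { return row.ip.indexOf('10.') !== 0; }
]);
// reportRows now contains only the first record.
```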
Step 6: The report is generated
Once the data has been processed, the whole process is nearly at an end. When a user requests a specific report in the website analytics tool, the data fields are further calculated, organized and arranged in a predefined (or user-defined) format in preparation for the report. We cannot see this process, but it is where the subtlety of a tool's algorithms lies. Moreover, how the algorithms are defined also affects how some basic website analysis metrics are defined, which directly affects the actual values those metrics output. This is an important reason why different website analysis tools report different numbers for the same website.
Subsequently, the prepared data items are pushed on to the tool's UI (User Interface) server, where the specific graphs, tables and figures are generated and then delivered to the user's browser or client, becoming a report we can easily understand.
The whole process is actually not complicated, but the analysis tool has to handle a very large amount of data. When a website's traffic is particularly heavy, the tool bears a heavy load, which is why many page tagging analysis tools charge fees based on the traffic of the tracked website.
Advantages of the page tagging method for website analysis
Page tagging has many advantages, making it a mainstream method of obtaining data for website analysis.
1. Not affected by caching
Unlike the logging method, which is vulnerable to the effects of caching, the page tagging method does not need to worry about it at all. Because the tag's code sits in the page source file, even if the page is cached by a proxy server or stored in the client's browser cache, the tag is saved along with it and is still executed whenever the browser loads the page.
So if you visit several pages of a website in succession and then click the browser's "Back" button to return to the previous page, under the page tagging method that return adds one more page view to the page; under the log file method, a new page view may not be recorded because of caching. The page tagging method can therefore record the visitor's journey more accurately.
2. Ability to record “client interactions”
As mentioned before, page tagging works by executing JavaScript code on the client, so in theory every move the visitor makes on the page opened in the browser can be recorded. For Flash, JavaScript or other Web 2.0 applications of the "client-side interaction" type, page tags can also be attached to the various interactions of these applications and accurately record when they occur.
As web pages become more and more interactive, this advantage of page tagging becomes very obvious. There are already many tools that use page tagging specifically to track client interactions on the page, which shows that tracking client interactions is no longer optional but has become an important part of measuring website performance.
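For example, an in-page interaction can be tagged with the classic ga.js event-tracking call, using the _gaq command queue from the earlier sketch; the category, action and label values and the "play-button" element are made-up examples, not part of any real site.

```javascript
var playButton = document.getElementById('play-button');   // assumes such a button exists on the page
playButton.onclick = function () {
  _gaq.push(['_trackEvent', 'Video', 'Play', 'homepage-intro']);   // record the interaction, not a page view
};
```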
3. Relatively accurate visitor records
Page tagging relies on cookies to record and identify visitors. Some page tagging tools use cookies and IP addresses together to identify visitors, while the logging method relies on IP addresses alone.
It should be emphasized that identifying visitors with cookies cannot be 100% accurate either (in fact, perfection does not exist; Stephen Hawking said that 100% perfection does not exist in the universe, otherwise the universe itself would not exist). But compared with relying solely on IP addresses, a cookie adds an extra identification mechanism that is bound to the client's browser and stores more identifying information, so visitor counts based on cookies are certainly more accurate than counts based on IP. To be fair, until a new method appears (none is in sight yet), the page tagging method with cookie technology provides the most accurate visitor data currently available.
In addition, the page tagging method is largely unaffected by the robots and spiders that crawl the website, since crawlers generally do not execute the JavaScript tag. So, leaving malicious cheating aside, the data recorded by this method can be regarded as data from "people" visiting the website. For a non-commercial site like my blog, I don't really care about robots crawling it; but if you have advanced SEO needs, you should still use log analysis software to see how search engine robots visit your website.
4. Better real-time performance
Like the logging method, the page tagging method collects data in real time: a visit occurs, triggers the tag on the page, and the data is captured and sent to the tool's server. The difference lies in processing: the log method's data processing is not real-time, whereas data from the page tagging method is processed shortly after it reaches the tool's server (sometimes even in real time) and then turned into reports. The page tagging method therefore has quite good timeliness. For example, Omniture SiteCatalyst reports are delayed by only a few hours; Google Analytics used to lag by one to two days, but now only by a few hours. Delays like these have little impact on analysis, and the data can be treated as approximately real time.
5. Data storage and transfer issues no longer exist
Unlike the logging method, which requires keeping large numbers of log files, data collected by the page tagging method can, if you wish, be stored entirely on the analysis tool provider's servers. That removes both the extra hardware cost of buying log storage devices and the software cost of managing log files. Another chore that disappears is feeding log files into log analysis software; sometimes this is not as simple as clicking a file in the tool's import interface, but requires developing a specialized program. And when there are mirror servers and similar setups, the page tagging method can simply ignore them, whereas merging the data under the log method is not so simple.
Okay, this week's homework has been handed in, and now it's your turn. I really want to see your comments and feedback. I wish you all a happy new week!
Author: Song Xing
Article source: http://www.chinawebanalytics.cn/pag-tagging-data-acquire/