To analyze website data, you first need to know where that data comes from.
When a user visits a website, the browser sends a request to the server, and the server records each request as a separate entry in its log. These log files are the rawest form of website data.
First, look at an Apache log entry:
10.1.1.95 - user [18/Mar/2005:12:21:42 +0800] "GET /stats/awstats.pl?config=user HTTP/1.1" 200 899 "http://10.1.1.1/pv/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Maxthon)"
This is a standard Apache access record in the combined log format.
The line consists of 9 items. In the example above, one item is blank, filled by a "-" placeholder, but the line is still divided into 9 items.
· The first item is the address of the remote host, i.e. the IP of the visitor's machine. The server sends its reply back to this IP.
· The second item is blank here, filled by the "-" placeholder, and in practice it almost always is. This position records the visitor's identification: not a login name, but an email address or other unique identifier returned by identd or, in the early days, directly by the browser. Back then this field often contained the viewer's email address, but because people harvested it to collect addresses and send spam, virtually every browser removed the feature long ago. So today the chance of seeing an email address in the second item of a log is slim to none.
· The third item, "user" in this example, records the name the visitor provided when authenticating. If some content on the site requires authentication, this item is filled in; for the majority of sites, which require no login, it remains blank in most records of the log file.
· The fourth item is the time of the request. It is enclosed in square brackets and uses the so-called "common log format" time, also known as the "standard English format". The record in the example therefore shows a request made at 12:21:42 on March 18, 2005. The "+0800" at the end indicates that the server's time zone is 8 hours ahead of UTC, which is the offset you will see on servers in China.
· The fifth item is perhaps the most useful in the entire record: it tells us what kind of request the server received. Its typical format is "method resource protocol".
In the above example the method is GET; other methods that appear frequently are POST and HEAD. Many other methods are legal, but these are the three main ones.
The resource is the document, or URL, that the browser requested from the server. In this example the browser requested "/stats/awstats.pl?config=user".
The protocol is usually HTTP followed by a version number; here it is HTTP/1.1.
· The sixth item is the status code. It tells us whether the request succeeded or what kind of error was encountered. Most of the time this value is 200, meaning the server responded to the browser's request successfully and everything is normal. Generally speaking, status codes starting with 2 indicate success, codes starting with 3 mean the request was redirected to another location for some reason, codes starting with 4 indicate some kind of client-side error, and codes starting with 5 indicate that the server encountered an error.
· The seventh item is the total number of bytes sent to the client. It tells us whether the transfer was interrupted (i.e. whether this value matches the size of the file). Adding these values up across log records, as in the sketch after this list, tells you how much data the server sent in a day, a week, or a month.
· The eighth item is the referrer: the URL of the page the visitor was on when making the request. Here it is "http://10.1.1.1/pv/", the default page of the pv directory on 10.1.1.1. In most cases that default page is a file with the name specified by the DirectoryIndex directive in httpd.conf (typically index.html).
· The ninth item is the user-agent string, which records the client's details: browser, version, and operating system.
That is the anatomy of an Apache log record.
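To make the nine fields concrete, here is a minimal parsing sketch in TypeScript. It is not taken from any particular tool: the regular expression, the LogRecord field names, and the helper functions are all illustrative assumptions. It splits a combined-format line into its nine items, classifies the status code by its leading digit, and sums the byte counts as suggested for the seventh field.

```typescript
// A minimal sketch of parsing Apache combined-format lines.
// The regex and field names here are illustrative, not an official parser.
const COMBINED =
  /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/;

interface LogRecord {
  host: string;      // 1: remote host (visitor IP)
  ident: string;     // 2: identd answer, almost always "-"
  authUser: string;  // 3: authenticated user name, or "-"
  time: string;      // 4: e.g. "18/Mar/2005:12:21:42 +0800"
  request: string;   // 5: "method resource protocol"
  status: number;    // 6: HTTP status code
  bytes: number;     // 7: bytes sent to the client (0 when logged as "-")
  referrer: string;  // 8: page the visitor came from
  userAgent: string; // 9: browser identification
}

function parseLine(line: string): LogRecord | null {
  const m = COMBINED.exec(line);
  if (!m) return null; // line is not in combined format
  return {
    host: m[1], ident: m[2], authUser: m[3], time: m[4], request: m[5],
    status: Number(m[6]),
    bytes: m[7] === '-' ? 0 : Number(m[7]),
    referrer: m[8], userAgent: m[9],
  };
}

// The leading digit of the status code gives its class.
function statusClass(code: number): string {
  switch (Math.floor(code / 100)) {
    case 2: return 'success';
    case 3: return 'redirect';
    case 4: return 'client error';
    case 5: return 'server error';
    default: return 'other';
  }
}

// Summing the seventh field over a day's lines gives bytes served that day.
function totalBytes(lines: string[]): number {
  let sum = 0;
  for (const line of lines) {
    const rec = parseLine(line);
    if (rec) sum += rec.bytes;
  }
  return sum;
}

const sample =
  '10.1.1.95 - user [18/Mar/2005:12:21:42 +0800] ' +
  '"GET /stats/awstats.pl?config=user HTTP/1.1" 200 899 ' +
  '"http://10.1.1.1/pv/" ' +
  '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Maxthon)"';
const rec = parseLine(sample);
console.log(rec && `${rec.host} ${statusClass(rec.status)} ${rec.bytes}`);
// -> "10.1.1.95 success 899"
```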
IIS logs are similar. The main differences are that the identd field, which was always empty anyway, is replaced by the content of the cookie sent or received, and a protocol substatus code is recorded in addition to the status code.
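For reference, a W3C extended log from IIS starts with a header that names its columns. The selection of fields below is illustrative, since the field list is configurable per site; note cs(Cookie) and sc-substatus, the two additions just mentioned:

```
#Software: Microsoft Internet Information Services 6.0
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Cookie) cs(Referer) sc-status sc-substatus sc-win32-status
```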
As you can see, most of the data we want to analyze can be obtained from these logs, but problems remain. When the user clicks the browser's back or forward button, the browser checks its cache first and re-requests the page from the server only if it is not cached. Whether such a back or forward navigation shows up in the server log therefore depends entirely on how the page is written and on the state of the client machine.
When analyzing raw logs, note also that iframes and other embedded pages are requested separately, so opening one page does not necessarily produce exactly one request. This is another drawback of raw logs.
At the same time, these records exist mainly for tracking server status and server security, so some data useful for analysis is simply not recorded:
· The navigation relationships between pages are not recorded as such; there is no reliable record of which page a user moved from to reach which page within a visit.
· A given visit cannot be distinguished and attributed to a particular user, especially on sites that do not require login.
· Operations on the page cannot be recorded, in particular clicks.
So some websites developed their own recording methods, usually using JavaScript or a request for a one-pixel image to capture this information.
In this way several more pieces of information get recorded: the referrer of the source page, a session number, a cookie number, and the data generated by clicks. And this data can be written directly into a database. A minimal sketch of such a beacon follows.
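To show how such client-side collection typically looks, here is a browser-side sketch in TypeScript. It is again illustrative: the /t.gif endpoint, the uid cookie name, and the query parameter names are hypothetical, not from the article. It reports the referrer, a cookie-based visitor ID, and click data by requesting a one-pixel image whose query string the server can log straight into a database.

```typescript
// Browser-side tracking sketch. The /t.gif endpoint, the "uid" cookie
// name, and the query parameter names are hypothetical placeholders.
function getVisitorId(): string {
  // Reuse the visitor ID stored in a cookie, or create a new one.
  const match = document.cookie.match(/(?:^|;\s*)uid=([^;]+)/);
  if (match) return match[1];
  const id = Math.random().toString(36).slice(2);
  document.cookie = `uid=${id}; path=/; max-age=${60 * 60 * 24 * 365}`;
  return id;
}

function beacon(event: string, extra: Record<string, string> = {}): void {
  const params = new URLSearchParams({
    u: location.href,          // page being viewed
    r: document.referrer,      // source page (may be empty, e.g. https -> http)
    uid: getVisitorId(),       // cookie-based visitor ID
    ev: event,                 // event name, e.g. "pageview" or "click"
    t: Date.now().toString(),  // client timestamp
    ...extra,
  });
  // Requesting a 1x1 image delivers the data to the server, which can
  // log the query string directly into a database.
  new Image(1, 1).src = '/t.gif?' + params.toString();
}

// Report the page view, and report every click with its target element.
beacon('pageview');
document.addEventListener('click', (e) => {
  const target = e.target as HTMLElement;
  beacon('click', { tag: target.tagName, id: target.id });
});
```

An image request is used here because it works in every browser and needs no special permissions, which is why the one-pixel GIF became a common transport for this kind of data.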
Using this method does make analysis easier and increases the information available to analyze, but it sacrifices a certain degree of accuracy. There are gains and there are losses:
· The first issue is what can be recorded at all. Since the record is generated on the client, a server error means 100% of that data is lost: if the server does not respond at all, no data can be sent. Moreover, since the JavaScript must execute before anything is transmitted, some loss always occurs. Generally, when the server is in decent shape, an accuracy rate of about 98% is considered acceptable.
· Source-page data will still be lost. Because of how page jumps and protocols interact, a certain share of referrers go missing. More troublesome, HTTPS pages are transmitted over an encrypted protocol, so no matter what method is used, the referrer is lost when the visitor moves from an HTTPS page to an HTTP page.
· It is strongly affected by the page's language and protocol. Calls made on the page, Ajax, other JavaScript and the like can all affect the accuracy of the record.
· Finally, the code must be added to every page. Don't underestimate this: on a site with many pages it is a real problem, and any page that is forgotten skews the overall data.
· The machine's real IP may not be obtainable. The IP captured here can differ from the IP in the server log: when multiple machines share one IP, what is recorded is not the IP of the user's actual machine but the IP of the route through which they access the Internet.
To sum up: because the way data is collected interacts with the way the website itself is programmed in fairly complicated ways, you need to be cautious when analyzing website data. Failures and traps in the data can appear at any time.
Article source: Lance's notebook