I have written several collection (web-scraping) programs and studied the source code of many others, so I have some understanding of how they work. Let's start with the principle.
A collection program has two main steps:
1. Obtain the content of the page to be collected
2. Extract the required data from the retrieved code
1. Obtain the content of the collected page
The ASP methods I currently know for obtaining the content of a page are:
1. Use the serverXMLHTTP component to obtain data
The code is as follows:
Function GetBody(weburl)
 'Create the object
 Dim ObjXMLHTTP
 Set ObjXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")
 'Send the request (the third argument False makes it synchronous,
 'so the call only returns once the response has arrived)
 ObjXMLHTTP.Open "GET", weburl, False
 ObjXMLHTTP.Send
 'Get the result as raw bytes
 GetBody = ObjXMLHTTP.ResponseBody
 'Release the object
 Set ObjXMLHTTP = Nothing
End Function
Calling method:
GetBody(URL address of the file)
2. Use the XMLHTTP component to obtain data
Copy the code code as follows:
Function GetBody(weburl)
 'Create the object
 Dim Retrieval
 Set Retrieval = CreateObject("Microsoft.XMLHTTP")
 With Retrieval
  .Open "GET", weburl, False
  .Send
  GetBody = .ResponseBody
 End With
 'Release the object
 Set Retrieval = Nothing
End Function
Calling method:
GetBody(URL address of the file)
The data obtained this way is a raw byte stream; it still needs to be converted to text with the correct encoding before it can be used.
The code is as follows:
Function BytesToBstr(body, Cset)
 Dim objstream
 Set objstream = Server.CreateObject("Adodb.Stream")
 objstream.Type = 1   '1 = adTypeBinary
 objstream.Mode = 3   '3 = adModeReadWrite
 objstream.Open
 'Write the raw bytes, then rewind and read them back as text
 objstream.Write body
 objstream.Position = 0
 objstream.Type = 2   '2 = adTypeText
 objstream.Charset = Cset
 BytesToBstr = objstream.ReadText
 objstream.Close
 Set objstream = Nothing
End Function
Calling method: BytesToBstr(data to be converted, encoding) 'the encoding is commonly GB2312 or UTF-8
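The fetch-then-decode flow above (get the raw bytes first, then convert them using the page's charset) is language-independent. As a rough illustration, the same two steps can be sketched in Python; the sample bytes below are assumed data for demonstration, not output from a real request:

```python
# Fetch-then-decode sketch: a scraper first gets raw bytes
# (e.g. from urllib.request.urlopen(url).read()), then converts
# them to text using the page's declared charset.
def bytes_to_str(body: bytes, charset: str) -> str:
    """Rough Python equivalent of BytesToBstr: decode a raw byte response."""
    return body.decode(charset)

# Simulated GB2312-encoded response bytes (assumed sample data)
raw = "<title>Webpage title</title>".encode("gb2312")
html = bytes_to_str(raw, "gb2312")  # → "<title>Webpage title</title>"
```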
2. Extract the required data from the retrieved code
The methods I currently use are:
1. Use ASP's built-in Mid function to intercept the required data
The code is as follows:
Function body(wstr, start, over)
 Dim istart, iover
 'Find the position of the unique start tag of the required data
 istart = InStr(wstr, start)
 'Find the matching unique end tag, searching after the start tag
 iover = InStr(istart + Len(start), wstr, over)
 'Cut out everything from the start tag up to the end tag
 body = Mid(wstr, istart, iover - istart)
End Function
Calling method: body (content of the collected page, start tag, end tag)
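The same start-tag/end-tag interception can be sketched in Python with str.find, mirroring the Mid/InStr approach above:

```python
def body(wstr: str, start: str, over: str) -> str:
    """Cut out the text from the start tag up to (but not including)
    the end tag, like the Mid-based interception function above."""
    i = wstr.find(start)                 # position of the unique start tag
    j = wstr.find(over, i + len(start))  # matching end tag after it
    return wstr[i:j]

snippet = body("<title>Webpage title</title>", "<title>", "</title>")
# → "<title>Webpage title"
```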
2. Use regular expressions to obtain the required data
The code is as follows:
Function body(wstr, start, over)
 Dim xiaoqi, Matches, Match
 Set xiaoqi = New RegExp                'create the regex object
 xiaoqi.IgnoreCase = True               'ignore case
 xiaoqi.Global = True                   'search the whole text
 xiaoqi.Pattern = start & ".+?" & over  'non-greedy match between the two tags
 Set Matches = xiaoqi.Execute(wstr)     'run the match
 Set xiaoqi = Nothing
 body = ""
 For Each Match In Matches
  body = body & Match.Value             'concatenate every match
 Next
End Function
Calling method: body (content of the collected page, start tag, end tag)
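For comparison, the same non-greedy tag-to-tag match can be sketched in Python; re.escape is added here so that any regex metacharacters inside the tags are treated literally (an extra safeguard, not in the VBScript version):

```python
import re

def body(wstr: str, start: str, over: str) -> str:
    """Concatenate every non-greedy match between the two tags,
    like the VBScript RegExp version above."""
    pattern = re.escape(start) + ".+?" + re.escape(over)
    return "".join(m.group(0) for m in re.finditer(pattern, wstr, re.I | re.S))

result = body("<li>a</li><li>b</li>", "<li>", "</li>")
# → "<li>a</li><li>b</li>"
```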
A detailed walkthrough of a collection program:
1. Obtain the address of each page of the target site's paginated list
At present, most dynamic websites follow a rule in their paging addresses, for example:
dynamic page
First page: index.asp?page=1
Second page: index.asp?page=2
Third page: index.asp?page=3
.....
static page
First page: page_1.htm
Second page: page_2.htm
Third page: page_3.htm
.....
To obtain the address of every page of the site's paginated list, you only need to replace the changing part of the address with a variable, for example: page_<%=page%>.htm
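The substitution can be sketched in Python as well; the total page count here is an assumed parameter:

```python
# Build every list-page address by substituting the changing part,
# the same idea as page_<%=page%>.htm in the ASP template above.
total_pages = 3  # assumed page count
urls = [f"page_{page}.htm" for page in range(1, total_pages + 1)]
# → ["page_1.htm", "page_2.htm", "page_3.htm"]
```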
2. Obtain the content of the paginated list page of the collected website
3. Extract the URL links of the content pages from the list-page code
Most content-page links inside a paginated list also follow a fixed pattern, for example:
<a href=url1>Link 1</a> <br>
<a href=url2>Link 2</a> <br>
<a href=url3>Link 3</a> <br>
Use the following code to collect those URL links:
The code is as follows:
Set xiaoqi = New RegExp
xiaoqi.IgnoreCase = True
xiaoqi.Global = True
xiaoqi.Pattern = "<a href=.+?>.+?</a>"  'match each link tag non-greedily
Set Matches = xiaoqi.Execute(page list content)
Set xiaoqi = Nothing
url = ""
For Each Match In Matches
 url = url & Match.Value
Next
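A Python sketch of the same link extraction, using the example list-page markup shown above:

```python
import re

# Example list-page markup from the text above
list_html = (
    '<a href=url1>Link 1</a> <br>'
    '<a href=url2>Link 2</a> <br>'
    '<a href=url3>Link 3</a> <br>'
)

# Collect every <a ...>...</a> tag, like the VBScript loop above
links = re.findall(r"<a href=.+?>.+?</a>", list_html, re.I)
# → ['<a href=url1>Link 1</a>', '<a href=url2>Link 2</a>', '<a href=url3>Link 3</a>']
```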
4. Obtain the content of each collected content page, and cut out the required data from it according to the extraction tags.
Because the pages are dynamically generated, most content pages share the same HTML tags, and we can extract the required parts of the content based on these regular tags.
For example:
Every page has a webpage title, <title>Webpage title</title>. You can use the Mid-based interception function written above to get the value between <title> and </title>, or use the regular-expression version.
Example: body("<title>Webpage title</title>", "<title>", "</title>")