I have written several collection (web-scraping) programs and studied the source code of many others, so I have some understanding of how they work. Let's start with the principle.
A collection program has two main steps:
1. Obtain the content of the page to be collected
2. Extract the required data from the retrieved code
1. Obtain the content of the collected page
The ASP methods I currently know for obtaining the content of a page are:
1. Use the serverXMLHTTP component to obtain data
The code is as follows:
Function GetBody(weburl)
 'Create the object
 Dim ObjXMLHTTP
 Set ObjXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")
 'Send the request (the third argument False makes it synchronous,
 'so the call only returns once the response has arrived)
 ObjXMLHTTP.Open "GET", weburl, False
 ObjXMLHTTP.Send
 'Get the result as raw bytes
 GetBody = ObjXMLHTTP.ResponseBody
 'Release the object
 Set ObjXMLHTTP = Nothing
End Function
Calling method:
GetBody(URL address of the file)
2. Use the XMLHTTP component to obtain data
Copy the code code as follows:
Function GetBody(weburl)
 'Create the object
 Dim Retrieval
 Set Retrieval = CreateObject("Microsoft.XMLHTTP")
 With Retrieval
  .Open "GET", weburl, False
  .Send
  GetBody = .ResponseBody
 End With
 'Release the object
 Set Retrieval = Nothing
End Function
Calling method:
GetBody(URL address of the file)
The data obtained this way is a raw byte stream; it still needs to be converted to text with the correct encoding before it can be used.
The code is as follows:
Function BytesToBstr(body, Cset)
 Dim objstream
 Set objstream = Server.CreateObject("Adodb.Stream")
 objstream.Type = 1   '1 = adTypeBinary
 objstream.Mode = 3   '3 = adModeReadWrite
 objstream.Open
 'Write the raw bytes, then rewind and read them back as text
 objstream.Write body
 objstream.Position = 0
 objstream.Type = 2   '2 = adTypeText
 objstream.Charset = Cset
 BytesToBstr = objstream.ReadText
 objstream.Close
 Set objstream = Nothing
End Function
Calling method: BytesToBstr(data to be converted, encoding) 'the encoding is commonly GB2312 or UTF-8
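The fetch-then-decode flow above (get the raw bytes first, then convert them using the page's charset) is language-independent. As a rough illustration, the same two steps can be sketched in Python; the sample bytes below are assumed data for demonstration, not output from a real request:

```python
# Fetch-then-decode sketch: a scraper first gets raw bytes
# (e.g. from urllib.request.urlopen(url).read()), then converts
# them to text using the page's declared charset.
def bytes_to_str(body: bytes, charset: str) -> str:
    """Rough Python equivalent of BytesToBstr: decode a raw byte response."""
    return body.decode(charset)

# Simulated GB2312-encoded response bytes (assumed sample data)
raw = "<title>Webpage title</title>".encode("gb2312")
html = bytes_to_str(raw, "gb2312")  # → "<title>Webpage title</title>"
```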
2. Extract the required data from the retrieved code
The methods I currently use are:
1. Use ASP's built-in Mid function to intercept the required data
The code is as follows:
Function body(wstr, start, over)
 Dim istart, iover
 'Find the position of the unique start tag of the required data
 istart = InStr(wstr, start)
 'Find the matching unique end tag, searching after the start tag
 iover = InStr(istart + Len(start), wstr, over)
 'Cut out everything from the start tag up to the end tag
 body = Mid(wstr, istart, iover - istart)
End Function
Calling method: body (content of the collected page, start tag, end tag)
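The same start-tag/end-tag interception can be sketched in Python with str.find, mirroring the Mid/InStr approach above:

```python
def body(wstr: str, start: str, over: str) -> str:
    """Cut out the text from the start tag up to (but not including)
    the end tag, like the Mid-based interception function above."""
    i = wstr.find(start)                 # position of the unique start tag
    j = wstr.find(over, i + len(start))  # matching end tag after it
    return wstr[i:j]

snippet = body("<title>Webpage title</title>", "<title>", "</title>")
# → "<title>Webpage title"
```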
2. Use regular expressions to obtain the required data
The code is as follows:
Function body(wstr, start, over)
 Dim xiaoqi, Matches, Match
 Set xiaoqi = New RegExp                'create the regex object
 xiaoqi.IgnoreCase = True               'ignore case
 xiaoqi.Global = True                   'search the whole text
 xiaoqi.Pattern = start & ".+?" & over  'non-greedy match between the two tags
 Set Matches = xiaoqi.Execute(wstr)     'run the match
 Set xiaoqi = Nothing
 body = ""
 For Each Match In Matches
  body = body & Match.Value             'concatenate every match
 Next
End Function
Calling method: body (content of the collected page, start tag, end tag)
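For comparison, the same non-greedy tag-to-tag match can be sketched in Python; re.escape is added here so that any regex metacharacters inside the tags are treated literally (an extra safeguard, not in the VBScript version):

```python
import re

def body(wstr: str, start: str, over: str) -> str:
    """Concatenate every non-greedy match between the two tags,
    like the VBScript RegExp version above."""
    pattern = re.escape(start) + ".+?" + re.escape(over)
    return "".join(m.group(0) for m in re.finditer(pattern, wstr, re.I | re.S))

result = body("<li>a</li><li>b</li>", "<li>", "</li>")
# → "<li>a</li><li>b</li>"
```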
A detailed walkthrough of a collection program:
1. Obtain the address of each page of the target site's paginated list
At present, most dynamic websites follow a rule in their paging addresses, for example:
dynamic page
First page: index.asp?page=1
Second page: index.asp?page=2
Third page: index.asp?page=3
.....
static page
First page: page_1.htm
Second page: page_2.htm
Third page: page_3.htm
.....
To obtain the address of every page of the site's paginated list, you only need to replace the changing part of the address with a variable, for example: page_<%=page%>.htm
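The substitution can be sketched in Python as well; the total page count here is an assumed parameter:

```python
# Build every list-page address by substituting the changing part,
# the same idea as page_<%=page%>.htm in the ASP template above.
total_pages = 3  # assumed page count
urls = [f"page_{page}.htm" for page in range(1, total_pages + 1)]
# → ["page_1.htm", "page_2.htm", "page_3.htm"]
```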
2. Obtain the content of the paginated list page of the collected website
3. Extract the URL links of the content pages from the list-page code
Most content-page links inside a paginated list also follow a fixed pattern, for example:
<a href=url1>Link 1</a> <br>
<a href=url2>Link 2</a> <br>
<a href=url3>Link 3</a> <br>
Use the following code to collect those URL links:
The code is as follows:
Set xiaoqi = New RegExp
xiaoqi.IgnoreCase = True
xiaoqi.Global = True
xiaoqi.Pattern = "<a href=.+?>.+?</a>"  'match each link tag non-greedily
Set Matches = xiaoqi.Execute(page list content)
Set xiaoqi = Nothing
url = ""
For Each Match In Matches
 url = url & Match.Value
Next
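A Python sketch of the same link extraction, using the example list-page markup shown above:

```python
import re

# Example list-page markup from the text above
list_html = (
    '<a href=url1>Link 1</a> <br>'
    '<a href=url2>Link 2</a> <br>'
    '<a href=url3>Link 3</a> <br>'
)

# Collect every <a ...>...</a> tag, like the VBScript loop above
links = re.findall(r"<a href=.+?>.+?</a>", list_html, re.I)
# → ['<a href=url1>Link 1</a>', '<a href=url2>Link 2</a>', '<a href=url3>Link 3</a>']
```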
4. Obtain the content of each collected content page, and cut out the required data from it according to the extraction tags.
Because the pages are dynamically generated, most content pages share the same HTML tags, and we can extract the required parts of the content based on these regular tags.
For example:
Every page has a webpage title, <title>Webpage title</title>. You can use the Mid-based interception function written above to get the value between <title> and </title>, or use the regular-expression version.
Example: body("<title>Webpage title</title>", "<title>", "</title>")