Four Applications of Regular Expressions in Web Page Processing

Author：Eve Cole Update Time：2009-06-30 16:08:08

Regular expressions provide an efficient and convenient way to match string patterns. Almost all high-level languages either support regular expressions directly or offer ready-made libraries for them. This article uses common processing tasks in the ASP environment as examples to introduce techniques for applying regular expressions.

1. Check the format of passwords and email addresses.

Our first example demonstrates a basic function of regular expressions: describing arbitrarily complex strings abstractly. That is, regular expressions give programmers a formal notation that can describe any string pattern the application encounters with only a small amount of code. For example, for people who are not engaged in technical work, the password format requirements might be stated as follows: the first character of the password must be a letter, the password must be at least 4 and no more than 15 characters long, and the password cannot contain characters other than letters, numbers, and underscores.

As programmers, we must convert this natural language description of the password format into a form that the ASP page can understand and apply to reject invalid passwords. The regular expression describing this password format is: ^[a-zA-Z]\w{3,14}$.

In ASP applications, we can write the password verification process as a reusable function, as shown below:

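Function TestPassword(strPassword)
    Dim re
    Set re = New RegExp

    re.IgnoreCase = False   ' The pattern lists both letter cases explicitly
    re.Global = False       ' A single match attempt is enough for validation
    ' A letter first, then 3 to 14 letters, digits, or underscores
    re.Pattern = "^[a-zA-Z]\w{3,14}$"

    TestPassword = re.Test(strPassword)
End Function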

Below we compare the regular expression for checking the password format with the natural language description:

The first character of the password must be a letter: the regular expression description is "^[a-zA-Z]", where "^" indicates the beginning of the string, and the hyphen tells RegExp to match all characters in the specified range.

The password must be at least 4 characters and no more than 15 characters: the regular expression description is "{3,14}".

The password cannot contain characters other than letters, numbers, and underscores: the regular expression description is "\w".

A few notes: {3,14} means that the preceding pattern matches at least 3 but no more than 14 characters (together with the first character, that makes 4 to 15 characters). Note that the syntax within curly braces is extremely strict and does not allow spaces on either side of the comma. Adding spaces changes the meaning of the regular expression and causes errors during password format verification. In addition, note the "$" character at the end of the regular expression. The $ character makes the regular expression match through to the end of the string, ensuring that a valid password is not followed by any additional characters.
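To make the role of the trailing "$" concrete, the following sketch runs the pattern with and without it against a few sample strings. The helper name MatchesPattern, the constants, and the sample passwords are assumptions chosen for illustration; they are not part of the original example.

```vbscript
' Hypothetical helper: returns True when strText matches strPattern.
Function MatchesPattern(strPattern, strText)
    Dim re
    Set re = New RegExp
    re.IgnoreCase = False
    re.Global = False
    re.Pattern = strPattern
    MatchesPattern = re.Test(strText)
End Function

Const ANCHORED = "^[a-zA-Z]\w{3,14}$"    ' the full pattern from the article
Const UNANCHORED = "^[a-zA-Z]\w{3,14}"   ' the same pattern without the trailing $

Dim okPlain, okTooShort, okDigitFirst, okTrailingJunk, looseTrailingJunk
okPlain = MatchesPattern(ANCHORED, "a_b1")                 ' True: a letter plus 3 word characters
okTooShort = MatchesPattern(ANCHORED, "abc")               ' False: only 3 characters in total
okDigitFirst = MatchesPattern(ANCHORED, "1abc")            ' False: starts with a digit
okTrailingJunk = MatchesPattern(ANCHORED, "abcd!!")        ' False: "!" is not a word character
looseTrailingJunk = MatchesPattern(UNANCHORED, "abcd!!")   ' True: the first 4 characters already match
```

In an ASP page the same code would sit inside <% %> and the results could be written out with Response.Write.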

Similar to password format verification, checking the legitimacy of email addresses is also a very common problem. Using regular expressions to perform simple email address verification can be implemented as follows:

<%
Dim re
Set re = New RegExp

re.Pattern = "^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$"
' user@example.com is a placeholder address
Response.Write re.Test("user@example.com")
%>
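The word "simple" matters here. The sketch below wraps the same pattern in a function and probes a few addresses to show what it accepts and rejects. The function name IsSimpleEmail and all sample addresses are assumptions made for illustration.

```vbscript
' Hypothetical wrapper around the simple email pattern shown above.
Function IsSimpleEmail(strEmail)
    Dim re
    Set re = New RegExp
    re.IgnoreCase = False
    re.Global = False
    re.Pattern = "^\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,3}$"
    IsSimpleEmail = re.Test(strEmail)
End Function

Dim r1, r2, r3, r4
r1 = IsSimpleEmail("user@example.com")         ' True
r2 = IsSimpleEmail("user@mail.example.com")    ' False: the domain part allows no dots
r3 = IsSimpleEmail("first.last@example.com")   ' False: \w does not match "."
r4 = IsSimpleEmail("user@example.info")        ' False: the suffix is limited to 2 or 3 letters
```

Whether these rejections are acceptable depends on the application; a stricter or looser pattern may be needed for real address lists.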

2. Extract specific parts of the HTML page

The main problem faced in extracting content from the HTML page is that we must find a way to accurately identify the part of the content we want. For example, the following is an HTML code snippet that displays news headlines:

<table border="0" width="11%" class="Somestory">
<tr>
<td width="100%">
<p align="center">Other content...</td>
</tr>
</table>
<table border="0" width="11%" class="Headline">
<tr>
<td width="100%">
<p align="center">War in Iraq! </td>
</tr>
</table>
<table border="0" width="11%" class="Someotherstory">
<tr>
<td width="100%">
<p align="center">Other content...</td>
</tr>
</table>

Looking at the code above, it is easy to see that the news headline is displayed by the table in the middle, whose class attribute is set to Headline. If the HTML page is very complex, you can use an additional feature available for Microsoft IE 5.0 and later to view only the HTML code of the selected part of the page. See http://www.microsoft.com/Windows/ie/WebAccess/default.ASP to learn more. For this example, we assume that this is the only table whose class attribute is set to Headline. Now we need to create a regular expression that finds this Headline table so that we can include the table in our own page. The first step is to write the code that sets up the regular expression object:

<%
Dim re, strHTML
Set re = new RegExp ' Create a regular expression object

re.IgnoreCase = true
re.Global = false 'End search after first match
%>

Now consider the area we want to extract: here, it is the entire <table> structure, including the closing tag and the text of the news headline. The search should therefore begin at the <table> start tag: re.Pattern = "<table.*(?=Headline)". This regular expression matches the start tag of the table and returns everything from the start of the tag up to "Headline" (newlines excluded). Here's how to return the matched HTML code:

'Put all matching HTML codes into the Matches collection
Set Matches = re.Execute(strHTML)

'Display all matching HTML codes
For Each Item in Matches
Response.Write Item.Value
Next

'Show only the first match
Response.write Matches.Item(0).Value

Run this code against the HTML fragment shown earlier. The regular expression returns a single match: <table border="0" width="11%" class=". The "(?=Headline)" in the expression is a lookahead that does not consume characters, so the value of the table's class attribute is not part of the match. The code to get the rest of the table is also quite simple: re.Pattern = "<table.*(?=Headline)(.|\n)*?</table>". Here, the "*" after "(.|\n)" matches zero or more of any character, including newlines, and the "?" minimizes the range matched by "*", that is, it matches as few characters as possible before the next part of the expression is found. </table> is the end tag of the table.

The "?" qualifier is very important: it prevents the expression from returning the code of other tables. For example, for the HTML code snippet given above, if you delete this "?", the returned content will be:

<table border="0" width="11%" class="Headline">
<tr>
<td width="100%">
<p align="center">War in Iraq! </td>
</tr>
</table>
<table border="0" width="11%" class="Someotherstory">
<tr>
<td width="100%">
<p align="center">Other content...</td>
</tr>
</table>

The returned content not only contains the <table> tag of the Headline table, but also the Someotherstory table. It can be seen that the "?" here is essential.
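The difference between the two quantifiers can be reproduced with a small sketch. The helper name FirstMatch and the shortened three-table sample string are assumptions made for illustration; the two patterns differ only in the "?" after "(.|\n)*".

```vbscript
' Hypothetical helper: returns the first match of strPattern in strText, or "" if there is none.
Function FirstMatch(strPattern, strText)
    Dim re, colMatches
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = False
    re.Pattern = strPattern
    Set colMatches = re.Execute(strText)
    If colMatches.Count > 0 Then
        FirstMatch = colMatches.Item(0).Value
    Else
        FirstMatch = ""
    End If
End Function

' Made-up, shortened stand-in for the three tables shown earlier.
Dim strSample
strSample = "<table class=""Somestory""><tr><td>Other</td></tr></table>" & vbCrLf & _
            "<table class=""Headline""><tr><td>News</td></tr></table>" & vbCrLf & _
            "<table class=""Someotherstory""><tr><td>Other</td></tr></table>"

Dim strLazy, strGreedy
strLazy = FirstMatch("<table.*(?=Headline)(.|\n)*?</table>", strSample)
strGreedy = FirstMatch("<table.*(?=Headline)(.|\n)*</table>", strSample)
' strLazy holds only the Headline table.
' strGreedy runs on to the last </table> and also contains the Someotherstory table.
```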

This example rests on some fairly idealized assumptions. In real applications the situation is often much more complicated, and writing the ASP code is particularly difficult when you have no control over the source HTML being used. The most effective approach is to spend more time analyzing the HTML around the content to be extracted, and to test frequently to make sure the extracted content is exactly what you need. You should also detect and handle the case where the regular expression matches nothing in the source HTML page. Content can change very quickly, so don't end up with ridiculous errors on your page just because someone else has changed the format of the content.
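One way to handle the no-match case is sketched below. The function name GetHeadlineTable, the fallback parameter, and the sample strings are assumptions made for illustration; the pattern is the one developed above.

```vbscript
' Hypothetical helper: returns the Headline table, or strFallback when the
' source page contains nothing that the pattern matches.
Function GetHeadlineTable(strHTML, strFallback)
    Dim re, colMatches
    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = False
    re.Pattern = "<table.*(?=Headline)(.|\n)*?</table>"
    Set colMatches = re.Execute(strHTML)
    If colMatches.Count = 0 Then
        GetHeadlineTable = strFallback
    Else
        GetHeadlineTable = colMatches.Item(0).Value
    End If
End Function

Dim strResult
' Simulates a redesigned source page that no longer has a Headline table.
strResult = GetHeadlineTable("<div class=""news"">Redesigned page</div>", _
                             "<p>No headline available.</p>")
' strResult now holds the fallback text instead of an empty or broken fragment.
```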

3. Parse text data files

There are many formats and types of data files. XML documents, structured text and even unstructured text often become data sources for ASP applications. An example we'll look at below is a structured text file using qualifiers. Qualifiers (such as quotation marks) indicate that the parts of the string are inseparable, even if the string contains delimiters that separate the record into fields.

The following is a simple structured text file:

last name, first name, phone number, description
Sun, Wukong, 312 555 5656, ASP is very good
Pig, Bajie, 847 555 5656, I am a film producer.

This file is very simple. The first line is the title and the following two lines are the records separated by commas. Parsing this file is also very simple, just split the file into lines (according to newline symbols), and then split each record according to fields. However, if we add commas to a certain field content:

last name, first name, phone number, description
Sun, Wukong, 312 555 5656, I like ASP, VB and SQL
Pig, Bajie, 847 555 5656, I am a movie producer.

Parsing the first record now goes wrong: to a parser that only recognizes comma delimiters, its last field appears to contain two fields. To avoid this type of problem, fields containing delimiters must be surrounded by qualifiers. Single quotes are a commonly used qualifier. After adding the single quote qualifier to the above text file, its content is as follows:
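The failure of the delimiter-only approach is easy to reproduce. In the sketch below, the function name CountFieldsNaive is an assumption made for illustration; the two input lines come from the unqualified file above.

```vbscript
' Naive parser: treats every comma as a field delimiter.
Function CountFieldsNaive(strLine)
    Dim arrFields
    arrFields = Split(strLine, ",")
    CountFieldsNaive = UBound(arrFields) + 1
End Function

Dim nHeader, nRecord
nHeader = CountFieldsNaive("last name, first name, phone number, description")    ' 4
nRecord = CountFieldsNaive("Sun, Wukong, 312 555 5656, I like ASP, VB and SQL")  ' 5: one field too many
```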

last name, first name, phone number, description
Sun, Wukong, 312 555 5656, 'I like ASP, VB and SQL'
Pig, Bajie, 847 555 5656, 'I am a film producer'

Now we can determine which commas are delimiters and which are field content: we only need to treat the commas that appear inside the quotation marks as field content. The next thing we have to do is implement a regular expression parser that decides when to split fields at a comma and when to treat a comma as field content.

The problem here is slightly different than what most regular expressions face. Usually we look at a small portion of text to see if it matches a regular expression. But here, we can only reliably tell what is inside quotes after considering the entire line of text.

Here's an example illustrating the problem. Randomly extract half a line of content from a text file and get: 1, beach, black, 21, ', dog, cat, duck, ', . In this example, because there is other data to the left of "1", it is extremely difficult to parse its content. We don't know how many single quotes precede this data fragment, so we can't determine which characters are within the quotes (text within quotes cannot be split during parsing). If the data fragment is preceded by an even number (or no) single quotes, then "', dog, cat, duck, '" is a quoted string and is indivisible. If the number of preceding quotes is an odd number, then "1, beach, black, 21, '" is the end of a string and is indivisible.
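Once the whole line is available, the parity rule can be checked directly. The sketch below is not the regular expression approach itself; it counts the quotes that precede a position, and the function name IsInsideQuotes is an assumption made for illustration.

```vbscript
' Hypothetical helper: True when the character at position lngPos (1-based)
' of a complete line lies inside a pair of single quotes.
Function IsInsideQuotes(strLine, lngPos)
    Dim strBefore, lngQuotes
    strBefore = Left(strLine, lngPos - 1)
    ' Count the single quotes that precede the position.
    lngQuotes = Len(strBefore) - Len(Replace(strBefore, "'", ""))
    ' An odd count means a quote has been opened but not yet closed.
    IsInsideQuotes = (lngQuotes Mod 2 = 1)
End Function

Dim strLine, blnFirst, blnLast
strLine = "Sun, Wukong, 312 555 5656, 'I like ASP, VB and SQL'"
blnFirst = IsInsideQuotes(strLine, InStr(strLine, ","))     ' False: the first comma is a delimiter
blnLast = IsInsideQuotes(strLine, InStrRev(strLine, ","))   ' True: the last comma is field content
```

For a line whose quotes are balanced, counting the quotes before a comma and counting the quotes after it lead to the same answer.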

Therefore, the regular expression must analyze the entire line of text, taking into account how many quotation marks appear, to determine whether a character is inside or outside a pair of quotation marks. The expression is: ,(?=([^']*'[^']*')*(?![^']*')). This regular expression first finds a comma, then looks ahead and ensures that the number of single quotes after the comma is either even or zero. It is based on the following reasoning: if the number of single quotes after the comma is even, then the comma is outside any quoted string. The following table gives more detailed explanations:

,          find a comma
(?=        look ahead and require this pattern:
(          start a new group
[^']*'     zero or more non-quote characters, followed by a quote
[^']*'     zero or more non-quote characters, followed by a quote; together with the previous row this matches a pair of quotes
)*         end the group and match it (a pair of quotes) zero or more times
(?!        look ahead and exclude this pattern:
[^']*'     zero or more non-quote characters, followed by a quote
)          end the excluded pattern
)          end the lookahead

Below is a VBScript function that accepts a string parameter, splits the string according to the comma delimiter and the single quote qualifier, and returns the resulting array:

Function SplitAdv(strInput)
    Dim objRE
    Set objRE = New RegExp

    ' Set up the RegExp object
    objRE.IgnoreCase = True
    objRE.Global = True   ' Replace every delimiter comma, not only the first
    objRE.Pattern = ",(?=([^']*'[^']*')*(?![^']*'))"

    ' The Replace method replaces the delimiter commas with Chr(8), the
    ' backspace character, which is very unlikely to appear in a string.
    ' Then we split the string on Chr(8) and return the resulting array.
    SplitAdv = Split(objRE.Replace(strInput, Chr(8)), Chr(8))
End Function
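The fields returned by a split like this still carry leading spaces and the enclosing quotes. A possible follow-up step is sketched below; the function name ParseRecord and the trimming and unquoting steps are assumptions made for illustration, built on the same pattern and the same Chr(8) delimiter.

```vbscript
' Hypothetical helper: splits one record on delimiter commas, trims each field,
' and strips the enclosing single quotes.
Function ParseRecord(strLine)
    Dim re, arrFields, i, strField
    Set re = New RegExp
    re.Global = True
    re.Pattern = ",(?=([^']*'[^']*')*(?![^']*'))"
    ' Chr(8) (backspace) serves as a temporary delimiter.
    arrFields = Split(re.Replace(strLine, Chr(8)), Chr(8))
    For i = 0 To UBound(arrFields)
        strField = Trim(arrFields(i))
        If Len(strField) >= 2 And Left(strField, 1) = "'" And Right(strField, 1) = "'" Then
            strField = Mid(strField, 2, Len(strField) - 2)
        End If
        arrFields(i) = strField
    Next
    ParseRecord = arrFields
End Function

Dim arrRecord
arrRecord = ParseRecord("Sun, Wukong, 312 555 5656, 'I like ASP, VB and SQL'")
' arrRecord holds 4 fields; arrRecord(3) is "I like ASP, VB and SQL"
```

To process a whole file, the text could first be split into lines with Split(strFile, vbCrLf) and each line passed to this function.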

In short, using regular expressions to parse text data files has the advantages of high efficiency and shortening development time. It can save a lot of time in analyzing files and extracting useful data based on complex conditions. In a rapidly evolving environment where there will still be a lot of traditional data available, knowing how to construct efficient data analysis routines will be a valuable skill.

4. String replacement

In the last example, we are going to look at the replacement function of VBScript regular expressions. ASP is often used to dynamically format text obtained from various data sources. Utilizing the power of VBScript regular expressions, ASP can dynamically change matching complex text. Highlighting some words by adding HTML tags is a common application, such as highlighting search keywords in search results.

To illustrate how this is done, let's look at an example that highlights all ".NET" in a string. This string can be obtained from anywhere, such as a database or other Web site.

<%
Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True

' Regular expression pattern:
' look for any word or URL ending with ".NET".
regEx.Pattern = "(\b[a-zA-Z\._]+?\.NET\b)"

' String used to test the replacement function
strText = "Microsoft has created a new website www.ASP.NET."

' Call the Replace method of the regular expression.
' $1 means insert the matched text at the current position.
Response.Write regEx.Replace(strText, _
"<b style='color: #000099; font-size: 18pt'>$1</b>")
%>

There are several important points to note in this example. The entire regular expression is enclosed in a pair of parentheses, which capture the matched text for later use; the replacement text refers to it as $1. Up to 9 such captures can be used per replacement, referenced as $1 through $9. The Replace method of the regular expression object is different from the Replace function of VBScript itself: it takes only two parameters, the text to be searched and the replacement text.
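A replacement that uses more than one capture might look like the following sketch, which rearranges the parts of a date. The function name ReorderDates and the sample text are assumptions made for illustration.

```vbscript
' Hypothetical helper: rewrites yyyy-mm-dd dates as dd/mm/yyyy.
Function ReorderDates(strText)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.Pattern = "(\d{4})-(\d{2})-(\d{2})"
    ' $1 = year, $2 = month, $3 = day
    ReorderDates = re.Replace(strText, "$3/$2/$1")
End Function

Dim strOut
strOut = ReorderDates("Updated 2009-06-30")   ' "Updated 30/06/2009"
```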

In this example, to highlight the searched ".NET" strings, we surround them with bold tags and other style attributes. Using this search and replace technology, we can easily add the function of highlighting search keywords to the website search program, or automatically add links to other pages for keywords that appear on the page.
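A keyword highlighter for search results could be sketched along the same lines. The function name HighlightWord is an assumption made for illustration, and so is the restriction to keywords made of word characters, which avoids having to escape regular expression metacharacters in user input.

```vbscript
' Hypothetical helper: wraps every whole-word occurrence of strWord in <b> tags.
' Assumption: strWord consists of word characters only, so it can be placed in
' the pattern without escaping; any other keyword leaves the text unchanged.
Function HighlightWord(strText, strWord)
    Dim reCheck, re
    Set reCheck = New RegExp
    reCheck.Pattern = "^\w+$"
    If Not reCheck.Test(strWord) Then
        HighlightWord = strText
        Exit Function
    End If

    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True
    re.Pattern = "\b(" & strWord & ")\b"
    ' $1 re-inserts the matched text, so the original capitalization is kept.
    HighlightWord = re.Replace(strText, "<b>$1</b>")
End Function

Dim strMarked
strMarked = HighlightWord("ASP and asp.net use VBScript", "asp")
' strMarked is "<b>ASP</b> and <b>asp</b>.net use VBScript"
```

In an ASP page, text that comes from users or other sites would normally be passed through Server.HTMLEncode before the tags are added.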

Conclusion

I hope that the regular expression techniques introduced in this article have given you ideas about when and how to apply regular expressions. Although the examples in this article are written in VBScript, regular expressions are also useful in ASP.NET. They are one of the main mechanisms for form validation with server-side controls, and they are available to the entire .NET Framework through the System.Text.RegularExpressions namespace.