Sometimes you don't know in advance how many characters to match. To handle this uncertainty, regular expressions support qualifiers (more commonly called quantifiers). A qualifier specifies how many times a given component of a regular expression must occur for the match to succeed.
The following table describes the qualifiers and their meanings:
Character | Description |
---|---|
* | Matches the preceding subexpression zero or more times. For example, 'zo*' matches z and zoo. * is equivalent to {0,}. |
+ | Matches the preceding subexpression one or more times. For example, 'zo+' matches zo and zoo, but not z. + is equivalent to {1,}. |
? | Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches do and does. ? is equivalent to {0,1}. |
{n} | n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in Bob, but matches the two o's in food. |
{n,} | n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in Bob, but matches all the o's in Foooood. 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'. |
{n,m} | Both m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in fooooood. 'o{0,1}' is equivalent to 'o?'. Note that no spaces are allowed between the comma and the numbers. |
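The rows of the table can be checked with a short script. The following is a minimal sketch in modern JavaScript rather than classic JScript; the helper `firstMatch` and the object `quantifierResults` are names made up for this illustration.

```javascript
// Hypothetical helper: returns the first matched substring, or null if there is no match.
const firstMatch = (pattern, text) => {
  const m = text.match(pattern);
  return m ? m[0] : null;
};

// Sample strings are built with repeat() so the number of o's is explicit.
const fiveOs = "F" + "o".repeat(5) + "d"; // "Foooood"
const sixOs = "f" + "o".repeat(6) + "d";  // "fooooood"

const quantifierResults = {
  star: firstMatch(/zo*/, "z"),            // zero o's is enough, so "z" matches
  plus: firstMatch(/zo+/, "z"),            // at least one o is required, so no match
  optional: firstMatch(/do(es)?/, "does"), // the optional group is matched when present
  exactly: firstMatch(/o{2}/, "Bob"),      // a single o is not enough
  atLeast: firstMatch(/o{2,}/, fiveOs),    // takes every o in the run
  range: firstMatch(/o{1,3}/, sixOs),      // stops after three o's
};

console.log(quantifierResults);
```

Running it prints one entry per qualifier, which makes it easy to compare the behavior against the descriptions in the table.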
In a large input document, the number of chapters can easily exceed nine, so you need a way to handle two-digit or three-digit chapter numbers. Qualifiers provide this capability. The following JScript regular expression matches chapter titles with any number of digits:
/Chapter [1-9][0-9]*/
The following VBScript regular expression performs the same match:
Chapter [1-9][0-9]*
Note that the qualifier appears after the range expression. It therefore applies to the entire range expression, which in this case specifies only the digits 0 through 9.
The '+' qualifier is not used here, because a digit is not necessarily required in the second or subsequent position. The '?' character is not used either, because it would limit chapter numbers to two digits. At least one digit must be matched after 'Chapter' and the space character.
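To see how the pattern behaves on chapter numbers of different lengths, here is a small sketch in modern JavaScript. The sample strings and the names `anyDigits`, `chapterSamples`, and `anyDigitsMatches` are invented for this illustration.

```javascript
// The pattern from the text: one non-zero digit followed by zero or more digits.
const anyDigits = /Chapter [1-9][0-9]*/;

// Hypothetical sample titles with one, two, and three digits, plus a "Chapter 0".
const chapterSamples = ["Chapter 7", "Chapter 42", "Chapter 128", "Chapter 0"];

// Collect the matched text for each sample, or null when nothing matches.
const anyDigitsMatches = chapterSamples.map((s) => {
  const m = s.match(anyDigits);
  return m ? m[0] : null;
});

console.log(anyDigitsMatches);
```

The first three samples match in full, while "Chapter 0" yields null because the first digit must be in the range 1 to 9.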
If you know that the document has at most 99 chapters, you can use the following JScript expression to require at least one digit, but no more than two.
/Chapter [0-9]{1,2}/
For VBScript, the following regular expressions can be used:
Chapter [0-9]{1,2}
The disadvantage of the expression above is that for a chapter number greater than 99, it still matches only the first two digits. Another drawback is that someone could create a Chapter 0 and it would still match. A better JScript expression for matching a two-digit number is:
/Chapter [1-9][0-9]?/
or
/Chapter [1-9][0-9]{0,1}/
For VBScript, the following expression is equivalent to the above:
Chapter [1-9][0-9]?
or
Chapter [1-9][0-9]{0,1}
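The difference between the two expressions can be sketched as follows in modern JavaScript. The names `looseTwoDigits`, `strictTwoDigits`, and `twoDigitChecks` are made up for this example.

```javascript
// First version from the text: any one or two digits, including a leading 0.
const looseTwoDigits = /Chapter [0-9]{1,2}/;
// Improved version from the text: the first digit must be 1 through 9.
const strictTwoDigits = /Chapter [1-9][0-9]?/;

const twoDigitChecks = {
  // "Chapter 0" is accepted by the loose pattern but rejected by the strict one.
  looseAcceptsZero: looseTwoDigits.test("Chapter 0"),
  strictAcceptsZero: strictTwoDigits.test("Chapter 0"),
  // Neither pattern is anchored, so both match only the first two digits of 123.
  looseOn123: "Chapter 123".match(looseTwoDigits)[0],
  strictOn123: "Chapter 123".match(strictTwoDigits)[0],
};

console.log(twoDigitChecks);
```

Note that the stricter pattern rules out Chapter 0 but, on its own, still matches only "Chapter 12" inside "Chapter 123".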
The '*', '+', and '?' qualifiers are all greedy; that is, they match as much text as possible. Sometimes that is not what you want. Sometimes you want the smallest possible match.
For example, you might want to search an HTML document for a chapter title enclosed in an H1 tag. In the document, the text may have the following form:
<H1>Chapter 1 – Introduction to Regular Expressions</H1>
The following expression matches everything from the opening less-than sign (<) to the greater-than sign (>) at the end of the closing H1 tag.
/<.*>/
The VBScript regular expression is:
<.*>
If you want to match only the opening H1 tag, the following non-greedy expression matches only <H1>.
/<.*?>/
or
<.*?>
Placing a '?' after a '*', '+', or '?' qualifier changes the expression from a greedy match to a non-greedy, or minimal, match.
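The contrast between greedy and non-greedy matching can be shown with the H1 example from the text. This is a minimal sketch in modern JavaScript; the variable names `heading`, `greedyMatch`, and `lazyMatch` are chosen for this illustration only.

```javascript
// The sample heading from the text.
const heading = "<H1>Chapter 1 – Introduction to Regular Expressions</H1>";

// Greedy: .* runs to the end of the line and backs up to the last '>'.
const greedyMatch = heading.match(/<.*>/)[0];

// Non-greedy: .*? stops at the first '>' it can reach.
const lazyMatch = heading.match(/<.*?>/)[0];

console.log(greedyMatch); // the entire heading, including both tags
console.log(lazyMatch);   // only the opening tag
```

With the greedy form the match covers the whole line, while the non-greedy form returns just `<H1>`.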
So far, the examples have only looked for chapter titles wherever they appear. Any occurrence of the string 'Chapter' followed by a space and a number could be a real chapter title or a cross reference to another chapter. Because real chapter titles always appear at the beginning of a line, you need a way to find only the titles and not the cross references.
Locators (more commonly called anchors) provide this function. A locator can fix a regular expression to the beginning or end of a line. You can also create regular expressions that match only within a word, or only at the beginning or end of a word. The following table lists the locators and their meanings:
Character | Description |
---|---|
^ | Matches the position at the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'. |
$ | Matches the position at the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. |
\b | Matches a word boundary, that is, the position between a word character and a non-word character such as a space. |
\B | Matches a non-word boundary. |
Qualifiers cannot be used with locators. Because there cannot be more than one position immediately before or after a newline or word boundary, expressions such as '^*' are not allowed.
To match text at the beginning of a line, use the '^' character at the beginning of the regular expression. Don't confuse this use of '^' with its use inside a bracket expression; the two are entirely different.
To match the text at the end of a line of text, use the '$' character at the end of the regular expression.
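The following sketch shows '^' and '$' in modern JavaScript, where the `m` flag plays the role of the Multiline property mentioned in the table. The sample text and all variable names are invented for this example.

```javascript
// Hypothetical two-line input: a cross reference on line 1, a title on line 2.
const twoLines = "See Chapter 2\nChapter 3";

// Without the m flag, ^ matches only at the very start of the input.
const matchesWithoutM = /^Chapter [1-9]/.test(twoLines);

// With the m flag, ^ also matches right after the line break.
const matchesWithM = /^Chapter [1-9]/m.test(twoLines);

// With the g and m flags, $ matches before the line break and at the end of input.
const digitsAtLineEnd = twoLines.match(/[0-9]$/gm);

// Applying a qualifier to a locator is rejected when the pattern is compiled.
let quantifiedAnchorRejected = false;
try {
  new RegExp("^*");
} catch (err) {
  quantifiedAnchorRejected = err instanceof SyntaxError;
}

console.log(matchesWithoutM, matchesWithM, digitsAtLineEnd, quantifiedAnchorRejected);
```

The pattern is built with `new RegExp` in the last step so that the invalid expression fails at run time inside the `try` block instead of preventing the whole script from parsing.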
To use a locator when looking for chapter titles, the following JScript regular expression matches a chapter title with up to two digits at the beginning of a line:
/^Chapter [1-9][0-9]{0,1}/
The equivalent VBScript regular expression is:
^Chapter [1-9][0-9]{0,1}
A real chapter title not only appears at the beginning of a line, it is also the only content on that line, so it must end at the end of the line as well. The following expression ensures that the match covers only chapter titles and not cross references. It does so by matching only when the text spans from the beginning to the end of a line.
/^Chapter [1-9][0-9]{0,1}$/
For VBScript, use:
^Chapter [1-9][0-9]{0,1}$
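As a rough illustration, the fully anchored expression can be run over a few lines of text in modern JavaScript. The sample lines and the names `chapterText` and `titleLines` are hypothetical; the `g` and `m` flags are added so that every line is checked.

```javascript
// Hypothetical document: two real titles and two lines that only mention a chapter.
const chapterText = [
  "Chapter 1",
  "As described in Chapter 2, qualifiers can be greedy.",
  "Chapter 12",
  "Chapter 3 continues the discussion",
].join("\n");

// ^ and $ require the title to be the only content on its line.
const titleLines = chapterText.match(/^Chapter [1-9][0-9]{0,1}$/gm);

console.log(titleLines);
```

Only "Chapter 1" and "Chapter 12" are returned. The cross reference in the second line fails the '^' check, and the fourth line fails the '$' check because text follows the number.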
Matching word boundaries works a little differently, but it adds an important capability to regular expressions. A word boundary is the position between a word and a space. A non-word boundary is any other position. The following JScript expression matches the first three characters of the word 'Chapter', because they appear right after a word boundary:
/\bCha/
For VBScript:
\bCha
The position of the '\b' operator is critical here. If it is at the beginning of the string to be matched, the expression looks for a match at the beginning of a word; if it is at the end of the string, the expression looks for a match at the end of a word. For example, the following expression matches 'ter' in the word 'Chapter', because it appears right before a word boundary:
/ter\b/
as well as
ter\b
The following expression will match 'apt' because it is in the middle of 'Chapter', but will not match 'apt' in 'aptitude':
/\Bapt/
as well as
\Bapt
This is because 'apt' appears at a non-word boundary in the word 'Chapter', but at a word boundary in the word 'aptitude'. For the non-word boundary operator, position is not important, because the match is not tied to the beginning or end of a word.
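The three word-boundary expressions can be tried out as follows. This is a small sketch in modern JavaScript, and the object name `boundaryChecks` is made up for this example.

```javascript
const boundaryChecks = {
  // \b before "Cha": matches because "Cha" starts the word.
  startOfWord: "Chapter".match(/\bCha/)[0],
  // \b after "ter": matches because "ter" ends the word.
  endOfWord: "Chapter".match(/ter\b/)[0],
  // \B before "apt": succeeds inside "Chapter", where "apt" follows another letter.
  insideChapter: /\Bapt/.test("Chapter"),
  // \B before "apt": fails in "aptitude", where "apt" sits at the start of the word.
  insideAptitude: /\Bapt/.test("aptitude"),
};

console.log(boundaryChecks);
```

In a JavaScript regular expression literal the operators are written with a backslash, as `\b` and `\B`.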