Regular expression experience

Author：Eve Cole Update Time：2009-06-05 16:24:34

The suggestions in this article focus mainly on the readability of regular expressions. Developing these habits will make you think more clearly about the design and structure of your expressions, which helps reduce bugs and maintenance effort. You will appreciate that all the more if you are the one who maintains the code. Read through them and keep them in mind the next time you write a regular expression.
Regular expressions are difficult to write, difficult to read, and difficult to maintain. They often match unexpected text or miss valid text. These problems stem from the power and expressiveness of regular expressions: every metacharacter packs so much capability and nuance that the code becomes impossible to decipher without mental gymnastics.
Many tools include features that make regular expressions easier to read and write, but those features are rarely used. For many programmers, writing regular expressions is a black art: they stick to the features they know and hope for the best. If you are willing to adopt the five habits discussed in this article, you will be able to design regular expressions with far less trial and error.
This article uses Perl, PHP, and Python for its code examples, but the advice applies to almost any regular expression (regex) implementation.

1. Use spaces and comments.

Most programmers have no trouble using whitespace and indentation in the code around a regular expression. If they did not, their peers, and even laypeople, would laugh at them. Almost everyone knows that cramming code into one line makes it hard to read, write, and maintain. Why should regular expressions be any different?
Most regular expression tools have an extended whitespace feature that lets programmers spread a regular expression over several lines and add a comment at the end of each line. Why do so few programmers take advantage of it? Perl 6 regular expressions use extended whitespace by default. Don't wait for your language to make it the default; turn it on yourself.
The one thing to remember about extended whitespace is that it tells the regular expression engine to ignore whitespace in the pattern. If you need to match a space, you have to specify it explicitly.
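As a small Python illustration of that rule (the sample strings are made up for this sketch), whitespace inside an extended-whitespace pattern is dropped, so a literal space has to be spelled out, for example as a character class:

```python
import re

# Under re.VERBOSE the space between "foo" and "bar" is ignored,
# so this pattern matches "foobar", not "foo bar".
ignored = re.compile(r"foo bar", re.VERBOSE)

# To match a real space, state it explicitly; a character class
# containing a space survives extended-whitespace mode.
explicit = re.compile(r"foo [ ] bar", re.VERBOSE)

print(bool(ignored.fullmatch("foobar")))    # True
print(bool(ignored.fullmatch("foo bar")))   # False
print(bool(explicit.fullmatch("foo bar")))  # True
```

The same idea applies to the x modifier in Perl and PHP shown next.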
In Perl, add x to the end of the regular expression, so "m/foo|bar/" becomes:

m/
    foo
    |
    bar
/x

In PHP, add x to the end of the regular expression, so "/foo|bar/" becomes:

"/
    foo
    |
    bar
/x"

In Python, pass the "re.VERBOSE" flag to the compile function:

import re

pattern = r'''
    foo
    |
    bar
'''

regex = re.compile(pattern, re.VERBOSE)

Whitespace and comments become more important when you handle more complex regular expressions. Suppose the following regular expression is used to match phone numbers in the United States:

\(?\d{3}\)? ?\d{3}[-.]\d{4}

This regular expression matches phone numbers such as "(314) 555-4000". Do you think it matches "314-555-4000" or "555-4000"? The answer is that it matches neither. Writing the pattern on one line hides both its flaws and its design decisions: the area code is required, and the only separator allowed between the area code and the prefix is an optional space.
Breaking the pattern into several lines and adding comments exposes these flaws and makes them easier to fix.
In Perl, it looks like this:

/
    \(?      # optional parentheses
    \d{3}    # required area code
    \)?      # optional parentheses
    [-\s.]?  # the separator can be a dash, whitespace, or a period
    \d{3}    # three-digit prefix
    [-.]     # another separator
    \d{4}    # four-digit line number
/x

The rewritten regex now has an optional separator after the area code, so it matches "314-555-4000". The area code is still required, however. Another programmer who needs to make the area code optional can see at a glance that it currently is not, and a small change solves the problem.
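As a rough sketch of what that small change could look like in Python: the area code and the separator after it are wrapped in one optional non-capturing group. This variant is an illustration, not part of the original pattern, and the name `phone` is invented here.

```python
import re

# Hypothetical variant of the commented pattern: the area code and the
# separator that follows it form one optional group.
phone = re.compile(r"""
    (?:             # start of the optional area-code group
        \(?         # optional opening parenthesis
        \d{3}       # area code
        \)?         # optional closing parenthesis
        [-\s.]?     # optional separator: dash, whitespace, or period
    )?              # the whole area-code group may be absent
    \d{3}           # three-digit prefix
    [-.]            # separator
    \d{4}           # four-digit line number
""", re.VERBOSE)

for text in ["314-555-4000", "(314) 555-4000", "555-4000"]:
    print(text, bool(phone.fullmatch(text)))
```

Because each piece sits on its own commented line, the change is confined to two added lines, which is the point of the habit.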

2. Write tests.

There are three levels of testing, and each level adds a layer of reliability to your code. First, think carefully about what text you need to match and whether you can tolerate false matches. Second, test the regular expression against sample data. Finally, formalize the tests in a test suite.
Deciding what to match is really about balancing false matches against missed matches. If your regex is too strict, it will miss some valid matches; if it is too loose, it will produce false matches. Once a regular expression is released into production code, you may not notice either kind of error. Consider the phone number example above, which would match the text "800-555-4000 = -5355". False matches are hard to detect, so it is important to plan ahead and test well.
Continuing with the phone number example, if you are validating a phone number in a web form, you may be satisfied with ten digits in any format. If you want to extract phone numbers from a large body of text, however, you may need to carefully exclude false matches.
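One possible way to tighten the pattern for extraction from running text is to refuse matches that touch neighbouring digits. The sketch below is an assumption rather than part of the original pattern: it adds negative lookarounds around the phone regex, and the sample text is invented.

```python
import re

# Same pieces as the phone pattern, plus two guards added for this
# sketch: no digit may sit directly before or after the match.
strict_phone = re.compile(r"""
    (?<!\d)         # not preceded by a digit
    \(?\d{3}\)?     # area code, optionally in parentheses
    [-\s.]?         # optional separator
    \d{3}           # three-digit prefix
    [-\s.]          # separator
    \d{4}           # four-digit line number
    (?!\d)          # not followed by a digit
""", re.VERBOSE)

text = "Call 314-555-4000, ref 1234-123-12345."
print(strict_phone.findall(text))   # ['314-555-4000']
```

The guards only reject digit runs that are too long; whether that trade-off is right depends on the text being searched.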
When thinking about the data you want to match, write down some test cases, then write some code that runs your regular expression against them. For any complex regular expression, it is best to write a small test program, which can take the following form.
In Perl:

#!/usr/bin/perl
use strict;
use warnings;

my @tests = ( "314-555-4000",
              "800-555-4400",
              "(314)555-4000",
              "314.555.4000",
              "555-4000",
              "aasdklfjklas",
              "1234-123-12345"
            );

foreach my $test (@tests) {
    if ( $test =~ m/
            \(?      # optional parentheses
            \d{3}    # required area code
            \)?      # optional parentheses
            [-\s.]?  # the separator can be a dash, whitespace, or a period
            \d{3}    # three-digit prefix
            [-\s.]   # another separator
            \d{4}    # four-digit line number
        /x ) {
        print "Matched on $test\n";
    }
    else {
        print "Failed match on $test\n";
    }
}

In PHP:

<?php
$tests = array( "314-555-4000",
                "800-555-4400",
                "(314)555-4000",
                "314.555.4000",
                "555-4000",
                "aasdklfjklas",
                "1234-123-12345" );

$regex = "/
    \(?      # optional parentheses
    \d{3}    # required area code
    \)?      # optional parentheses
    [-\s.]?  # the separator can be a dash, whitespace, or a period
    \d{3}    # three-digit prefix
    [-\s.]   # another separator
    \d{4}    # four-digit line number
/x";

foreach ($tests as $test) {
    if (preg_match($regex, $test)) {
        echo "Matched on $test\n";
    }
    else {
        echo "Failed match on $test\n";
    }
}
?>

In Python:

import re

tests = ["314-555-4000",
         "800-555-4400",
         "(314)555-4000",
         "314.555.4000",
         "555-4000",
         "aasdklfjklas",
         "1234-123-12345"]

pattern = r'''
    \(?      # optional parentheses
    \d{3}    # required area code
    \)?      # optional parentheses
    [-\s.]?  # the separator can be a dash, whitespace, or a period
    \d{3}    # three-digit prefix
    [-\s.]   # another separator
    \d{4}    # four-digit line number
'''

regex = re.compile(pattern, re.VERBOSE)

for test in tests:
    # search() scans the whole string, like the Perl and PHP versions;
    # match() would only try at the start of the string.
    if regex.search(test):
        print("Matched on", test)
    else:
        print("Failed match on", test)

Running the test code will reveal another problem: it matches "1234-123-12345".
Ideally, you would integrate these tests into a test suite for the entire application. Even if you don't have a test suite yet, your regular expression tests are a good foundation for one, and now is a good time to start. Even if it is not the right time to create one, you should still run the tests after every change to the regular expression. A little time spent here will save you a lot of trouble.
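A minimal sketch of what such a formal test could look like, assuming Python's standard unittest module; the class and test names are invented for illustration. The known false match is marked as an expected failure so it stays visible until the pattern is fixed.

```python
import re
import unittest

# The phone pattern from the test scripts above, compiled once.
PHONE = re.compile(r"""
    \(?\d{3}\)?     # area code, optionally in parentheses
    [-\s.]?         # optional separator
    \d{3}           # three-digit prefix
    [-\s.]          # separator
    \d{4}           # four-digit line number
""", re.VERBOSE)


class PhoneRegexTest(unittest.TestCase):
    def test_accepts_valid_numbers(self):
        for text in ["314-555-4000", "(314)555-4000", "314.555.4000"]:
            with self.subTest(text=text):
                self.assertIsNotNone(PHONE.search(text))

    def test_rejects_non_numbers(self):
        for text in ["555-4000", "aasdklfjklas"]:
            with self.subTest(text=text):
                self.assertIsNone(PHONE.search(text))

    @unittest.expectedFailure  # known bug: the pattern still matches this
    def test_rejects_overlong_number(self):
        self.assertIsNone(PHONE.search("1234-123-12345"))


# Run the cases programmatically so the sketch also works when pasted
# into a larger script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PhoneRegexTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Once the pattern is tightened, removing the expectedFailure marker turns the last case into a regular regression test.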

3. Group the alternation operator.

The alternation operator (|) has low precedence, which means that it often alternates over more than the programmer intended. For example, a regular expression to extract email addresses from text might look like this:

^CC:|To:(.*)

This attempt is incorrect, but the bug often goes unnoticed. The intent is to find lines starting with "CC:" or "To:" and then capture the email addresses on the rest of the line.
Unfortunately, the regular expression captures nothing from lines starting with "CC:", and it may capture random text if "To:" appears in the middle of a line. In plain terms, it matches a line starting with "CC:" but captures nothing, or it matches any line containing "To:" and captures the rest of that line. It usually captures plenty of email addresses, so nobody notices the bug.
If that were the real intention, you should add parentheses to make it explicit:

(^CC:)|(To:(.*))

If the real intention is to capture the rest of any line starting with "CC:" or "To:", then the correct regular expression is:

^(CC:|To:)(.*)

This is a common incomplete-match bug, and you will avoid it if you make a habit of grouping your alternations.
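The difference between the first and the last pattern can be checked with a short Python sketch; the two sample lines are made up for illustration.

```python
import re

# Made-up sample lines for this sketch.
line_cc = "CC: alice@example.com"
line_mid = "Reply To: bob@example.com"

ungrouped = re.compile(r"^CC:|To:(.*)")
grouped = re.compile(r"^(CC:|To:)(.*)")

# Ungrouped: the "^CC:" branch matches, but its capture group is unset.
print(ungrouped.search(line_cc).group(1))         # None
# Ungrouped: "To:" matches mid-line because "^" applies only to "CC:".
print(repr(ungrouped.search(line_mid).group(1)))  # ' bob@example.com'
# Grouped: both prefixes are anchored and the rest of the line is captured.
print(repr(grouped.search(line_cc).group(2)))     # ' alice@example.com'
print(grouped.search(line_mid))                   # None
```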

4. Use lazy quantifiers.

Many programmers avoid lazy quantifiers such as "*?", "+?", and "??", even though they make expressions easier to write and understand.
Lazy quantifiers match as little text as possible, which helps the match land exactly where you want it. If you write "foo(.*?)bar", the quantifier stops at the first "bar" it encounters, not the last. This matters if you want to capture "###" from "foo###bar+++bar". A greedy quantifier would capture "###bar+++", which can cause a lot of trouble. With lazy quantifiers, you also spend far less time assembling character classes for new regular expressions.
Lazy quantifiers are most valuable when you know the structure of the context surrounding the text you want to capture.
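The example from the text, run in Python as a quick check (variable names are illustrative):

```python
import re

text = "foo###bar+++bar"

lazy = re.search(r"foo(.*?)bar", text)    # stops at the first "bar"
greedy = re.search(r"foo(.*)bar", text)   # runs on to the last "bar"

print(lazy.group(1))    # ###
print(greedy.group(1))  # ###bar+++
```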

5. Use available delimiters.

Perl and PHP often use a forward slash (/) to mark the beginning and end of a regular expression. Python uses a pair of quotation marks. If you insist on forward slashes in Perl and PHP, you have to escape every slash in the expression; if you use plain quotes in Python, you have to escape every backslash (\). Choosing different delimiters or quotes lets you avoid about half of the escaping in a regular expression. This makes expressions easier to read and reduces the bugs caused by forgetting to escape a character.
Perl and PHP allow any non-alphanumeric, non-whitespace character to be used as a delimiter. If you switch to a different delimiter, you no longer have to escape the forward slash when matching URLs or HTML tags (such as "http://" or "<br/>").
For example, "/http:\/\/(\S)*/" can be written as "#http://(\S)*#".
Common delimiters are "#", "!", and "|". If you use square brackets, angle brackets, or curly braces, just keep them matched. Here are some examples of common delimiters:
#…#
!…!
{…}
s|…|…| (Perl only)
s[…][…] (Perl only)
s<…>/…/ (Perl only)
In Python, a regular expression is first treated as an ordinary string. If you use plain quotes as delimiters, you have to escape every backslash. You can avoid this by using a raw string (r''). If you also use triple single quotes together with the "re.VERBOSE" option, the pattern can include newlines. For example, regex = "(\\w+)(\\d+)" can be written in the following form:

regex = r'''
    (\w+)
    (\d+)
'''
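A short sketch to confirm that the spellings are interchangeable; the variable names and the sample input are invented for illustration.

```python
import re

escaped = "(\\w+)(\\d+)"    # ordinary string: every backslash doubled
raw = r"(\w+)(\d+)"         # raw string: backslashes written once
verbose = r'''
    (\w+)   # one or more word characters
    (\d+)   # one or more digits
'''

# The first two spellings produce the same string.
print(escaped == raw)   # True

# The triple-quoted form compiles to an equivalent pattern under re.VERBOSE.
sample = "abc123"       # made-up input
print(re.match(raw, sample).groups())
print(re.match(verbose, sample, re.VERBOSE).groups())
```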