Key concepts behind advanced regular expression techniques

Author: Eve Cole Update Time: 2009-11-18 17:51:24

The original English text comes from Smashing Magazine. Translated by Benhuoer. Please indicate the source when reprinting.

Regular expressions (abbreviated regex) are a powerful way to find specific information in a large body of text. They work by describing patterns of characters. Unfortunately, simple regular expressions are not powerful enough for some advanced applications. If the pattern you need to match is complex, you may need advanced regular expression techniques.

This article introduces advanced regular expression techniques. We have selected eight commonly used concepts and explain each with examples. Each example shows a simple way to meet a fairly complex requirement. If you are not yet comfortable with the basics of regular expressions, please read this article, this tutorial, or the Wiki entry first.
The regular expression syntax used here is the Perl-compatible syntax used by PHP.

1. Greed/Laziness

All quantifiers (the operators that allow a pattern to repeat) are greedy by default. They match as much of the target string as possible, so the match result is as long as possible. Unfortunately, this is not always what we want, which is why "lazy" quantifiers exist. Adding "?" after a greedy quantifier makes it match as little as possible instead. The "U" modifier has a similar effect on the whole pattern: it inverts greediness, so every quantifier is lazy by default. Understanding the difference between greedy and lazy matching is the basis for using advanced regular expressions.
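
As a rough sketch of the two ways to make a quantifier lazy, the snippet below runs the same pattern with a trailing "?" and with the U modifier. The sample string and variable names are made up for illustration and do not come from the original article.

```php
<?php
// Hypothetical sample text, chosen only for this illustration.
$subject = '"first" and "second"';

// Greedy (default): .* runs on to the last double quote.
preg_match('/".*"/', $subject, $greedy);

// Lazy via "?": .*? stops at the nearest closing quote.
preg_match('/".*?"/', $subject, $lazy);

// Lazy via the U modifier: quantifiers are lazy by default.
preg_match('/".*"/U', $subject, $ungreedy);

// With U set, a trailing "?" switches the quantifier back to greedy.
preg_match('/".*?"/U', $subject, $regreedy);

echo $greedy[0], PHP_EOL;    // "first" and "second"
echo $lazy[0], PHP_EOL;      // "first"
echo $ungreedy[0], PHP_EOL;  // "first"
echo $regreedy[0], PHP_EOL;  // "first" and "second"
```

Because U flips the meaning of "?" for the whole pattern, it is usually easier to read a pattern that marks individual quantifiers with "?" instead.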

greedy operator

The * operator matches the preceding expression zero or more times, and it is greedy. Consider the following example:

preg_match( '/<h1>.*<\/h1>/', '<h1>This is a title.</h1><h1>This is another one.</h1>', $matches );

The period (.) matches any character except a newline, and the asterisk (*) repeats it, so the expression matches an opening h1 tag, a closing h1 tag, and everything in between. The result in $matches[0] is:

<h1>This is a title.</h1><h1>This is another one.</h1>

The entire string is returned. Because the * operator is greedy, .* consumes as much of the line as it can, including the closing h1 tag in the middle, and gives back only enough characters for the final closing h1 tag to match.
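
The remark about newlines matters in practice. The following is a minimal sketch, assuming a hypothetical subject in which the two headings sit on separate lines: without the s modifier the dot cannot cross the line break, so even the greedy pattern stays on the first line, while with s the dot also matches newlines and the match spans both lines.

```php
<?php
// Hypothetical subject: the two headings are separated by a newline.
$html = "<h1>This is a title.</h1>\n<h1>This is another one.</h1>";

// Default: "." does not match "\n", so the greedy match stays on line one.
preg_match('/<h1>.*<\/h1>/', $html, $oneLine);

// With the s modifier, "." also matches "\n" and the match spans both lines.
preg_match('/<h1>.*<\/h1>/s', $html, $bothLines);

echo $oneLine[0], PHP_EOL;         // <h1>This is a title.</h1>
var_dump($bothLines[0] === $html); // bool(true)
```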

lazy operator

Modify the pattern above slightly by adding a question mark (?) after the asterisk to make the quantifier lazy:

/<h1>.*?<\/h1>/

Now the expression is satisfied as soon as it reaches the first closing h1 tag, so it matches only the first heading.
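
To see the two behaviours side by side, here is a small sketch that runs both patterns over a one-line string like the one in the example, using preg_match_all to collect every non-overlapping match. The variable names are illustrative only.

```php
<?php
$html = '<h1>This is a title.</h1><h1>This is another one.</h1>';

// Greedy: a single match that swallows both headings.
$greedyCount = preg_match_all('/<h1>.*<\/h1>/', $html, $greedy);

// Lazy: one match per heading.
$lazyCount = preg_match_all('/<h1>.*?<\/h1>/', $html, $lazy);

echo $greedyCount, PHP_EOL; // 1
echo $lazyCount, PHP_EOL;   // 2
print_r($lazy[0]);          // the two headings as separate array entries
```
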
Another greedy quantifier with similar behavior is {n,}, which repeats the preceding pattern n or more times. Without a question mark, it matches as many repetitions as possible. With one, it matches as few as possible (though never fewer than n).

# Create the string
$str = 'hihihi oops hi';
# Match with the greedy {n,} operator
preg_match( '/(hi){2,}/', $str, $matches ); # $matches[0] will be 'hihihi'
# Match with the lazy {n,}? operator
preg_match( '/(hi){2,}?/', $str, $matches ); # $matches[0] will be 'hihi'
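
One point worth illustrating: "as few as possible" means as few as the rest of the pattern allows, not always exactly n. The sketch below is an assumption-labelled variation on the example above (the trailing space in the second pattern is added here and is not part of the original article).

```php
<?php
$str = 'hihihi oops hi';

// Lazy on its own: stops at the minimum of two repetitions.
preg_match('/(hi){2,}?/', $str, $alone);

// Lazy followed by a space: two repetitions are not followed by a space,
// so the engine adds a third repetition to let the rest of the pattern match.
preg_match('/(hi){2,}? /', $str, $withSpace);

echo $alone[0], PHP_EOL;               // hihi
echo '[', $withSpace[0], ']', PHP_EOL; // [hihihi ]
```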

Source: Benhuoer