How to quickly master regular expressions? Learn regular expression syntax through the AST!

Author：Eve Cole Update Time：2022-07-27 16:23:18

A regular expression is a pattern that describes operations on strings. It is an important but complex tool for processing text data. So how can you master regular expressions quickly? This article recommends one learning method: through the AST. Hope it helps everyone!

Regular expressions are mainly used to process strings; they make it very convenient to match, extract, and replace text.

However, regular expressions are still somewhat difficult to learn. Concepts such as greedy matching, non-greedy matching, capturing subgroups, and non-capturing subgroups are hard to understand not only for beginners but also for many people who have been working for several years.

So what is the best way to learn regular expressions? How can you master them quickly?

I recommend a way of learning regular expressions that I think works very well: learning through the AST.

Conceptually, a regular expression is matched by parsing the pattern string into an AST and then using that AST to match the target string.

All the information in the pattern string is stored in the AST after parsing. AST stands for abstract syntax tree. As the name suggests, it is a tree organized according to the grammatical structure, so from the structure of the AST you can easily see which syntax regular expressions support.

How to view the AST of a regular expression?

You can view it visually on the website astexplorer.net: switch the parser language to RegExp and it will display the AST of a regular expression.

As mentioned before, an AST is a tree organized according to the grammar, so the various pieces of syntax can easily be worked out from its structure.

So let's learn the syntax piece by piece from the perspective of the AST:

/abc/

Let's start with a simple one. A regular expression such as /abc/ matches the string 'abc'.

Its AST contains three Char nodes whose values are a, b, and c, and whose type is simple. Matching then traverses the AST and matches these three characters one by one.

We can test it with the exec API.

In the result, the 0th element is the matched string, index is the starting index of the match, and input is the input string.
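As a minimal sketch of that test, the snippet below runs /abc/ with exec and reads the three fields just described. The input string 'xxabcxx' and the variable name are assumptions chosen for illustration, not taken from the original test.

```javascript
// Run the pattern against a sample input (the input string is an assumption).
const abcResult = /abc/.exec('xxabcxx');

console.log(abcResult[0]);    // the matched string
console.log(abcResult.index); // the index in the input where the match starts
console.log(abcResult.input); // the original input string
```

If the pattern does not occur in the input at all, exec returns null instead of a result array.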

Now let's try special characters:

/\d\d\d/

/\d\d\d/ matches three digits. \d is a metacharacter (meta char), a character with special meaning supported by regular expressions.

The AST shows that although these are also Char nodes, their type is meta rather than simple.

The \d metacharacter matches any digit.
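A quick check of the \d metacharacter might look like this. The sample inputs are assumptions made up for the sketch.

```javascript
// \d matches any single digit, so /\d\d\d/ needs three digits in a row.
const digitsResult = /\d\d\d/.exec('abc123def');
console.log(digitsResult[0]);    // the three consecutive digits that matched
console.log(digitsResult.index); // where they start in the input

// Only two consecutive digits here, so there is no match and exec returns null.
const tooFewDigits = /\d\d\d/.exec('abc12def');
console.log(tooFewDigits);
```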

Which characters are meta chars and which are simple chars can be seen at a glance in the AST.

/[abc]/

Regular expressions support specifying a set of characters with [], which matches any one of the characters in the set.

The AST shows that the characters are wrapped in a CharacterClass node, meaning a character class: it matches any one character it contains.

Testing confirms this.
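A small sketch of the character class test, with sample inputs that are assumptions for illustration:

```javascript
// [abc] matches exactly one character, which may be a, b, or c.
const classResult = /[abc]/.exec('xbz');
console.log(classResult[0]);    // the single character from the class that matched
console.log(classResult.index); // its position in the input

// None of a, b, c appears in this input, so exec returns null.
const classMiss = /[abc]/.exec('xyz');
console.log(classMiss);
```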

/a{1,3}/

Regular expressions support specifying how many times a character is repeated, using the form {from,to}.

For example, /b{1,3}/ means the character b is repeated 1 to 3 times, and /[abc]{1,3}/ means the character class a/b/c is repeated 1 to 3 times.

In the AST, this syntax is called Repetition.

It has a quantifier attribute that represents the quantifier. Here its type is range, from 1 to 3.

Regular expressions also support shorthand for some quantifiers: + means one or more times, * means zero or more times, and ? means zero or one time.

These appear in the AST as different types of quantifier.
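The quantifiers above can be compared side by side with exec. This is only a sketch; the patterns other than /a{1,3}/ and all input strings are assumptions chosen to make each quantifier's effect visible.

```javascript
// {1,3}: at least one and at most three repetitions, so only three of the five a's are taken.
const rangeMatch = /a{1,3}/.exec('aaaaa');
console.log(rangeMatch[0]);

// +: one or more repetitions.
const plusMatch = /a+/.exec('baaa');
console.log(plusMatch[0], plusMatch.index);

// *: zero or more repetitions, so it can succeed with an empty match at index 0.
const starMatch = /a*/.exec('bbb');
console.log(JSON.stringify(starMatch[0]), starMatch.index);

// ?: zero or one repetition, so at most one a is taken after b.
const optionalMatch = /ba?/.exec('baa');
console.log(optionalMatch[0]);
```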

Some readers may ask: what does the greedy attribute on the quantifier mean?

This attribute indicates whether the Repetition uses greedy or non-greedy matching.

If you add a ? after the quantifier, greedy becomes false, which means it has switched to non-greedy matching.

So what do greedy and non-greedy mean?

Let's look at an example with the input acbac.

By default, Repetition matching is greedy: it keeps matching as long as the condition is still satisfied, so the whole of acbac can be matched.

Adding a ? after the quantifier switches to non-greedy matching, and only the first one is matched.

That is greedy and non-greedy matching. From the AST we can clearly see that greedy and non-greedy apply to the repetition syntax. The default is greedy; add a ? after the quantifier to switch to non-greedy.
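The difference can be reproduced with a short sketch. The input acbac comes from the text above, but the exact pattern used in the original test is not shown, so /[abc]+/ and its non-greedy form /[abc]+?/ are assumptions.

```javascript
// Greedy (default): + keeps consuming characters from the class for as long as it can.
const greedyMatch = /[abc]+/.exec('acbac');
console.log(greedyMatch[0]);

// Non-greedy: +? stops as soon as the minimum (one character) has been matched.
const lazyMatch = /[abc]+?/.exec('acbac');
console.log(lazyMatch[0]);
```

Both patterns start matching at index 0; they differ only in how much of the input the repetition consumes.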

/(aaa)bbb(ccc)/

Regular expressions support putting part of the matched string into a subgroup with (), so that it is returned separately.

Take a look at the AST.

The corresponding node is called Group.

You will find that it has a capturing attribute, which defaults to true.

What does this mean?

This is the syntax for capturing subgroups.

If you don't want to capture a subgroup, you can write (?:aaa) instead.

Now capturing has become false.

What is the difference between capturing and non-capturing?

Let's try it.

It turns out that the capturing attribute of a Group indicates whether the subgroup's content is extracted.

We can see from the AST that capturing applies to subgroups. The default is capturing, which means the content of the subgroup is extracted. You can switch to non-capturing with ?:, and then the content of the subgroup is not extracted.
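To see the difference in the exec result, here is a minimal sketch. The input 'aaabbbccc' and the variable names are assumptions for illustration.

```javascript
// Both groups capture: the result holds the full match followed by each subgroup.
const capturing = /(aaa)bbb(ccc)/.exec('aaabbbccc');
console.log([...capturing]);

// The first group is non-capturing: it still has to match, but it is not extracted.
const nonCapturing = /(?:aaa)bbb(ccc)/.exec('aaabbbccc');
console.log([...nonCapturing]);
```

The full match (element 0) is identical in both cases; only the list of extracted subgroups changes.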

We are now familiar with using the AST to understand regular expression syntax, so let's look at something a bit more difficult:

/bbb(?=ccc)/

Regular expressions support lookahead assertions through the syntax (?=xxx), which is used to check whether a string is followed by a certain other string.

In the AST, this syntax is called Assertion, and its type is lookahead: it looks ahead and matches only when the text that follows is the given string.

What does this mean? Why would you write it this way? How does it differ from /bbb(ccc)/ and /bbb(?:ccc)/?

Let’s try it:

It can be seen from the results:

/bbb(ccc)/ matches ccc as a subgroup and extracts it, because subgroups are captured by default.

/bbb(?:ccc)/ matches ccc as a subgroup but does not extract it, because ?: makes the subgroup non-capturing.

/bbb(?=ccc)/ also does not extract the ccc subgroup, which shows that it is non-capturing too. The difference from ?: is that ccc does not appear in the match result at all.

This is the nature of a lookahead assertion: it requires that a string be followed by a certain other string, the corresponding subgroup is non-capturing, and the asserted string does not appear in the match result.

If the string is not followed by the asserted string, there is no match.
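The three patterns can be compared directly with exec. This is a sketch; the inputs 'bbbccc' and 'bbbddd' are assumptions for illustration.

```javascript
// Capturing group: ccc is part of the match and is also extracted as a subgroup.
const withCapture = /bbb(ccc)/.exec('bbbccc');
console.log([...withCapture]);

// Non-capturing group: ccc is part of the match but is not extracted.
const withoutCapture = /bbb(?:ccc)/.exec('bbbccc');
console.log([...withoutCapture]);

// Lookahead: ccc must follow, but it is neither extracted nor part of the match.
const lookahead = /bbb(?=ccc)/.exec('bbbccc');
console.log([...lookahead]);

// bbb is not followed by ccc here, so the lookahead fails and exec returns null.
const lookaheadMiss = /bbb(?=ccc)/.exec('bbbddd');
console.log(lookaheadMiss);
```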

/bbb(?!ccc)/

Change ?= to ?! and the meaning changes. Take a look at the AST.

It is still a lookahead assertion, but it now has an additional negative attribute set to true.

The meaning is clear: originally the assertion required that a certain string come next; after negation, it requires that the string not come next.

So the matching result is exactly the opposite.

Now it matches only when it is not followed by that string. This is a negative lookahead assertion.
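A short sketch of the negative lookahead, again with assumed inputs 'bbbddd' and 'bbbccc':

```javascript
// bbb is followed by ddd, not ccc, so the negative lookahead succeeds.
const negativeHit = /bbb(?!ccc)/.exec('bbbddd');
console.log([...negativeHit]);

// bbb is followed by ccc, so the negative lookahead rejects the match.
const negativeMiss = /bbb(?!ccc)/.exec('bbbccc');
console.log(negativeMiss);
```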

/(?<=aaa)bbb/

Since there is a lookahead assertion, there is naturally a lookbehind assertion too: it matches only if the text is preceded by a certain string.

In the same way, it can be negated with (?<!aaa).

The AST for (?<=aaa) is easy to guess: it is a lookbehind assertion.

The AST for (?<!aaa) adds a negative attribute.
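Both lookbehind forms can be tried the same way. This sketch assumes a JavaScript engine that supports lookbehind (it was added in ES2018), and the inputs 'aaabbb' and 'cccbbb' are made up for illustration.

```javascript
// Positive lookbehind: bbb matches only when aaa comes immediately before it.
const behindHit = /(?<=aaa)bbb/.exec('aaabbb');
console.log(behindHit[0], behindHit.index); // aaa is not part of the match

const behindMiss = /(?<=aaa)bbb/.exec('cccbbb');
console.log(behindMiss);

// Negative lookbehind: bbb matches only when aaa does NOT come immediately before it.
const notBehindHit = /(?<!aaa)bbb/.exec('cccbbb');
console.log(notBehindHit[0], notBehindHit.index);

const notBehindMiss = /(?<!aaa)bbb/.exec('aaabbb');
console.log(notBehindMiss);
```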

Lookahead and lookbehind assertions are the hardest regular expression syntax to understand. Isn't it much easier to understand them through the AST?

Summary:

Regular expressions are a very convenient tool for processing strings, but they are still somewhat difficult to learn. Many people are confused by syntax such as greedy matching, non-greedy matching, capturing subgroups, non-capturing subgroups, lookahead assertions, and lookbehind assertions.

I recommend learning regular expressions through the AST. An AST is an object tree organized according to the grammatical structure, and the various pieces of syntax become clear from the names and attributes of its nodes.

For example, we have clarified through AST:

Repetition syntax (Repetition) takes the form character + quantifier. The default is greedy matching (greedy is true), which means matching as much as possible. Adding a ? after the quantifier switches to non-greedy matching, which matches as little as possible.

Subgroup syntax (Group) is used to extract part of the matched string. The default is capturing (capturing is true), which means the content is extracted. You can switch to non-capturing with (?:xxx), which only matches and does not extract.

Assertion syntax (Assertion) requires that a certain string appear after or before the match. It is divided into lookahead and lookbehind assertions, written (?=xxx) and (?<=xxx) respectively. You can replace = with ! to express negation (negative is true), which means exactly the opposite.

Which understands syntax more deeply: the documentation or the compiler?

No need to ask: it must be the compiler!

So it is naturally better to learn syntax from the syntax tree that the parser produces than from the documentation.

This is true for regular expressions, and it is also true for learning other grammars. If you can learn the syntax through the AST, you don't need to read the documentation.