In-depth analysis of Java regular expressions

Author：Eve Cole Update Time：2024-11-18 15:36:01

1. Regular expressions (regex) are a powerful tool for string processing and can replace StringTokenizer. They are popular on Unix, and Perl makes even heavier use of them.
They are mainly used for string matching, searching, and replacement. For example, checking an IP address (each number must be less than 256) is easy with a regular expression, as is extracting large numbers of email addresses from web pages (which is how spammers harvest them) or extracting links from web pages. The java.util.regex package contains Pattern (a compiled regular expression) and Matcher (the result of matching a string against a pattern).

/*
* Tells whether this string matches the given regular expression (also a string).
*/
System.out.println("abc".matches("..."));//Each "." represents a character

/*
 * Replace all digits in the string with "-". Without regex you would have to check each character with charAt;
 * "\\d" represents any digit and is the same as "[0-9]";
 * "\\D" represents any non-digit and is the same as "[^0-9]"
 */
System.out.println("ab54564654sbg48746bshj".replaceAll("[0-9]", "-"));//prints ab--------sbg-----bshj

2. Pattern and Matcher

/*
 * compile turns the given regular expression into a Pattern (compiling takes time, so do it once and reuse the result); {3} means exactly three times.
 * X{n} X, exactly n times
 * X{n,} X, at least n times
 * X{n,m} X, at least n times, but not more than m times
 */
Pattern p = Pattern.compile("[a-z]{3}");
Matcher m = p.matcher("ggs");//Creates a matcher that will match the given input against this pattern. Internally a finite-state automaton is built (see compiler theory)
//The input to matcher and matches is actually a CharSequence (an interface); String implements it, so a String can be passed (polymorphism)
System.out.println(m.matches());//true; "ggss" would not match
//You can always write "ggs".matches("[a-z]{3}") directly, but the form above is more efficient when the pattern is reused, and Pattern and Matcher offer many more methods
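
The efficiency point above can be sketched as follows: the pattern is compiled once into a constant and reused for every call. The class name LowercaseTriple and the helper isThreeLowercase are made up for this illustration.

```java
import java.util.regex.Pattern;

class LowercaseTriple {
    // Compiled once and reused; a Pattern is immutable and safe to share.
    private static final Pattern THREE_LOWER = Pattern.compile("[a-z]{3}");

    // Hypothetical helper: true if the whole input is exactly three lowercase letters.
    static boolean isThreeLowercase(CharSequence input) {
        return THREE_LOWER.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isThreeLowercase("ggs"));   // true
        System.out.println(isThreeLowercase("ggss"));  // false
        // StringBuilder is a CharSequence too, so it can be matched without conversion.
        System.out.println(isThreeLowercase(new StringBuilder("abc"))); // true
    }
}
```

A fresh Matcher is created per call because a Matcher carries state and should not be shared between threads.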

3. Metacharacters in regex: ". * + ?". (Eclipse tip: Ctrl + Shift + "/" comments out the selection; Ctrl + Shift + "\" removes the comment.)

"a".matches(".");//true, "." represents any single character, including a Chinese character
"aa".matches("aa");//true, an ordinary string can also be used as a regular expression
/*
 * true, "*" means the preceding element occurs zero or more times,
 * so "a*" checks whether the string consists of nothing but the character a.
 */
"aaaa".matches("a*");
"".matches("a*");//true
"aaa".matches("a?");//false, "?" means once or not at all
"".matches("a?");//true
"a".matches("a?");//true
"544848154564113".matches("\\d{3,100}");//true
//This is the simplest IP check; it cannot reject numbers above 255
"192.168.0.aaa".matches("\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}");//false
"192".matches("[0-2][0-9][0-9]");//true

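The simple IP check above accepts numbers over 255. One possible way to close that gap, shown here only as a sketch, is to spell out the allowed ranges of one octet with alternation. The names Ipv4Check, OCTET, and isIpv4 are invented for the example, and rejecting leading zeros is an assumption, not something the text requires.

```java
import java.util.regex.Pattern;

class Ipv4Check {
    // One octet in decimal: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
    // (?: ... ) groups the alternatives without capturing them.
    private static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)";
    private static final Pattern IPV4 =
            Pattern.compile(OCTET + "(?:\\." + OCTET + "){3}");

    // Hypothetical helper: true if s is four dot-separated octets, each 0-255.
    static boolean isIpv4(String s) {
        return IPV4.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIpv4("192.168.0.1"));    // true
        System.out.println(isIpv4("192.168.0.256"));  // false
        System.out.println(isIpv4("192.168.0.aaa"));  // false
    }
}
```
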
4. [abc] matches any one of the characters a, b, or c; [^abc] matches one character other than a, b, or c (it must still be exactly one character, so an empty string returns false); [a-zA-Z] is equivalent to "[a-z]|[A-Z]", a single uppercase or lowercase letter; [A-Z&&[ABS]] is an intersection: an uppercase letter that is also one of A, B, or S.

//Inside a character class only && is an operator (intersection). A single & or | is an ordinary character,
//so [A-Z&[ABS]] and [A-Z|[ABS]] are unions that also contain the literal & or |. That is why | and || behave the same while & and && do not.
System.out.println("C".matches("[A-Z&&[ABS]]"));//false
System.out.println("C".matches("[A-Z&[ABS]]"));//true
System.out.println("A".matches("[A-Z&&[ABS]]"));//true
System.out.println("A".matches("[A-Z&[ABS]]"));//true
System.out.println("C".matches("[A-Z|[ABS]]"));//true
System.out.println("C".matches("[A-Z||[ABS]]"));//true
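
To make the character class operations easier to compare, here is a small sketch showing union, intersection, and subtraction (an intersection with a negated class) side by side. The class CharClassOps and the helper isLowerConsonant are hypothetical names used only for illustration.

```java
class CharClassOps {
    // Hypothetical helper: true if s is a single lowercase consonant,
    // that is, a-z with the vowels subtracted.
    static boolean isLowerConsonant(String s) {
        return s.matches("[a-z&&[^aeiou]]");
    }

    public static void main(String[] args) {
        System.out.println("S".matches("[a-c[R-T]]"));    // true: union of a-c and R-T
        System.out.println("B".matches("[A-Z&&[ABS]]"));  // true: intersection
        System.out.println("C".matches("[A-Z&&[ABS]]"));  // false
        System.out.println(isLowerConsonant("b"));        // true
        System.out.println(isLowerConsonant("e"));        // false
        System.out.println("&".matches("[A-Z&[ABS]]"));   // true: a single & is a literal
        System.out.println("|".matches("[A-Z|[ABS]]"));   // true: | is a literal too
    }
}
```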

5. \w word characters: [a-zA-Z_0-9], useful when matching user names; \s whitespace characters: [ \t\n\x0B\f\r]; \S non-whitespace characters: [^\s]; \W non-word characters: [^\w].

" \n\t\r".matches("\\s{4}");//true
" ".matches("\\S");//false
"a_8".matches("\\w{3}");//true
//"+" means once or more
"abc888&^%".matches("[a-z]{1,3}\\d+[&^#%]+");//true
/*
 * To match a single backslash character:
 * the Java literal "\" does not compile (CE), because the backslash escapes the closing quote;
 * the regex "\\" compiles, but the regex engine then sees a lone backslash and throws at runtime;
 * so the regex must be written as "\\\\".
 */
System.out.println("\\".matches("\\\\"));//true
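
When the text to match is meant literally, counting backslashes by hand can be avoided. The sketch below uses Pattern.quote and Matcher.quoteReplacement, both from java.util.regex; the class LiteralBackslash and the helper replaceLiteral are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralBackslash {
    // Hypothetical helper: replace every literal occurrence of target with replacement.
    // Pattern.quote disables regex metacharacters in the target;
    // Matcher.quoteReplacement disables $ and \ in the replacement.
    static String replaceLiteral(String input, String target, String replacement) {
        return input.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String path = "C:\\temp\\a.txt";                        // the text C:\temp\a.txt
        System.out.println(replaceLiteral(path, "\\", "/"));    // C:/temp/a.txt
        System.out.println("\\".matches(Pattern.quote("\\")));  // true
    }
}
```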

6. POSIX character class (US-ASCII only)

\p{Lower} Lowercase alphabetic character: [a-z]
\p{Upper} Uppercase alphabetic character: [A-Z]
\p{ASCII} All ASCII: [\x00-\x7F]
\p{Alpha} Alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} Decimal digit: [0-9]
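
As a rough illustration of the POSIX classes listed above, the sketch below checks that a string is pure ASCII and contains at least one lowercase letter, one uppercase letter, and one digit. The rule itself and the names PosixClasses and hasMixedAscii are invented for the example.

```java
class PosixClasses {
    // Hypothetical rule: ASCII only, with at least one lowercase letter,
    // one uppercase letter, and one digit.
    static boolean hasMixedAscii(String s) {
        return s.matches("\\p{ASCII}+")
                && s.matches(".*\\p{Lower}.*")
                && s.matches(".*\\p{Upper}.*")
                && s.matches(".*\\p{Digit}.*");
    }

    public static void main(String[] args) {
        System.out.println(hasMixedAscii("Abc123"));        // true
        System.out.println(hasMixedAscii("abc123"));        // false: no uppercase letter
        System.out.println(hasMixedAscii("Abc123\u00e9"));  // false: contains a non-ASCII character
    }
}
```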

7. Boundary matcher
^ Beginning of a line
$ End of a line
\b Word boundary
\B Non-word boundary
\A Beginning of input
\G End of the previous match
\Z End of input, except for the final terminator (if any)
\z End of input

"hello world".matches("^h.*");//true, ^ is the beginning of a line
"hello world".matches(".*ld$");//true, $ is the end of a line
"hello world".matches("^h[a-z]{1,3}o\\b.*");//true, \b is a word boundary
"helloworld".matches("^h[a-z]{1,3}o\\b.*");//false, there is no word boundary after "hello"

" \n".matches("^[\\s&&[^\\n]]*\\n$");//true, a blank line: only whitespace other than newline, then the newline

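A common use of the \b boundary is finding a word only where it stands alone. The following sketch counts whole-word occurrences; BoundaryDemo and countWholeWord are hypothetical names, and Pattern.quote is used so the searched word is taken literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: count occurrences of word that have a word boundary on both sides.
    static int countWholeWord(String text, String word) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "javascript" does not count (no boundary after "java"), "Java" differs in case,
        // and "-java" counts because "-" is not a word character.
        System.out.println(countWholeWord("java javascript Java-java", "java")); // 2
    }
}
```
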
8. matches, find, and lookingAt. After a successful find, m.start() and m.end() return the start index and the end index (exclusive) of the match just found; if nothing was found, calling them throws IllegalStateException.

Pattern p = Pattern.compile("\\d{3,5}");
String s = "133-34444-333-00";
Matcher m = p.matcher(s);
m.matches();//false, matches tries to match the entire string
m.reset();
/*
 * With the reset call above, the four find calls below return true, true, true, false.
 * Without it, the third find also returns false.
 * The reason is as follows:
 * matches reads "133", reaches the first "-" and fails, but those four characters have already been consumed.
 * The first find then starts at 34444 and the second finds 333, because find looks for the next matching subsequence.
 * reset makes the matcher give back the characters that matches consumed.
 * In summary: call reset between matches and find, because they affect each other.
 */
m.find();
m.find();
m.find();//Attempts to find the next subsequence of the input that matches the pattern
m.find();
/*
 * lookingAt attempts to match the input against the pattern, starting at the beginning of the region.
 * The author of Thinking in Java criticized this method severely, because the name does not make clear where matching starts.
 * All four calls below return true, because each one starts from the beginning.
 */
m.lookingAt();
m.lookingAt();
m.lookingAt();
m.lookingAt();
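
To see what start() and end() report for each find, the sketch below collects every match together with its span. It runs on a fresh matcher, so no reset is needed. FindSpans and digitRuns are names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindSpans {
    // Hypothetical helper: "start-end:text" for every run of 3 to 5 digits.
    static List<String> digitRuns(String s) {
        Matcher m = Pattern.compile("\\d{3,5}").matcher(s);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            // start() is inclusive, end() is exclusive
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(digitRuns("133-34444-333-00")); // [0-3:133, 4-9:34444, 10-13:333]
    }
}
```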

9. String replacement

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestRegexReplacement {

    public static void main(String[] args) {

        Pattern p = Pattern.compile("java", Pattern.CASE_INSENSITIVE);//The second argument is an int flag; this one means "case insensitive"
        Matcher m = p.matcher("Java java hxsyl Ilovejava java JaVaAcmer");
        while (m.find()) {
            System.out.println(m.group());//Prints every occurrence of java, ignoring case
        }

        String s = m.replaceAll("Java");//String has a replaceAll method too
        System.out.println(s);

        m.reset();//Required: replaceAll has already run the matcher to the end of the input
        StringBuffer sb = new StringBuffer();
        int i = 0;
        /*
         * Replace the odd-numbered occurrences of java with "Java" and the even-numbered ones with "java"
         */
        while (m.find()) {
            i++;
            //Java does not convert int to boolean, so write (i & 1) == 1 rather than just i & 1
            if ((i & 1) == 1) {
                m.appendReplacement(sb, "Java");
            } else {
                m.appendReplacement(sb, "java");
            }
        }

        m.appendTail(sb);//Append the rest of the input after the last java found
        System.out.println(sb);//Without the reset, only Acmer is printed
    }
}
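
A replacement string can also refer back to what was matched. In the sketch below, $1 stands for capture group 1, so each occurrence keeps its original casing while being wrapped in brackets. The class KeepCaseReplace and the helper bracketJava are hypothetical.

```java
import java.util.regex.Pattern;

class KeepCaseReplace {
    private static final Pattern JAVA = Pattern.compile("(java)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: wrap every "java" (any casing) in brackets.
    // $1 in the replacement is the text captured by group 1.
    static String bracketJava(String input) {
        return JAVA.matcher(input).replaceAll("[$1]");
    }

    public static void main(String[] args) {
        System.out.println(bracketJava("Java java hxsyl Ilovejava java JaVaAcmer"));
        // [Java] [java] hxsyl Ilove[java] [java] [JaVa]Acmer
    }
}
```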

10. Grouping

/*
 * Groups are formed by parentheses and numbered by their left parenthesis, counting from the left:
 * the first left parenthesis opens group 1. Group 0 is the whole match.
 */
Pattern p = Pattern.compile("(\\d{3,5})([a-z]{2})");
String s = "123aaa-77878bb-646dd-00";
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
    System.out.println(m.group(1));//The digits of each match
    System.out.println(m.group(2));//The letters of each match
}
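
Numbered groups get hard to follow as a pattern grows. Assuming Java 7 or later, groups can be named with (?<name>...) and read back with group("name"). The sketch below reuses the pattern from above; NamedGroups, describe, and the group names digits and letters are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroups {
    private static final Pattern P =
            Pattern.compile("(?<digits>\\d{3,5})(?<letters>[a-z]{2})");

    // Hypothetical helper: renders each match as "digits|letters;".
    static String describe(String s) {
        Matcher m = P.matcher(s);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group("digits")).append('|').append(m.group("letters")).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("123aaa-77878bb-646dd-00")); // 123|aa;77878|bb;646|dd;
    }
}
```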

11. Capture emails from web pages

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * Eclipse tip: if you need a method that does not exist yet, write the call with the method name first,
 * then press Ctrl + 1 and pick the quick fix that creates the method.
 */
public class EmailSpider {

    public static void main(String[] args) {
        //try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("F:\\regex.html"))) {
            String line;
            while ((line = br.readLine()) != null) {
                solve(line);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void solve(String line) {
        //A regex that does not do what you intended is not a compile error, because to the compiler it is just a string.
        Pattern p = Pattern.compile("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+");
        Matcher m = p.matcher(line);

        while (m.find()) {
            System.out.println(m.group());
        }
    }
}
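
The matching step can be tried without a file on disk. The sketch below pulls addresses out of a plain string and returns them as a list. The pattern is a simplified equivalent of the one above and is not a full email validator; EmailExtract and extract are hypothetical names, and the addresses are made-up sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtract {
    // Simplified pattern: name characters, @, host characters, a dot, and a final word.
    private static final Pattern EMAIL = Pattern.compile("[\\w.-]+@[\\w.-]+\\.\\w+");

    // Hypothetical helper: every substring of text that looks like an email address.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("Contact: alice@example.com, bob.smith@mail.example.org!"));
        // [alice@example.com, bob.smith@mail.example.org]
    }
}
```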

12. Code statistics

/*
 * Count the blank lines, comment lines, and code lines in source files.
 * startsWith and endsWith in String can do part of this job too.
 * A project manager using this might also count the characters per line, or check whether each line ends with { or ;, to discourage padding.
 */
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class CoderCount {

    static long normalLines = 0;
    static long commentLines = 0;
    static long whiteLines = 0;

    public static void main(String[] args) {
        File f = new File("D:\\share\\src");
        File[] codeFiles = f.listFiles();
        if (codeFiles == null) {
            //listFiles returns null when the path is not a readable directory
            System.out.println("Not a directory: " + f);
            return;
        }
        for (File child : codeFiles) {
            if (child.getName().matches(".*\\.java$")) {
                solve(child);
            }
        }

        System.out.println("normalLines:" + normalLines);
        System.out.println("commentLines:" + commentLines);
        System.out.println("whiteLines:" + whiteLines);
    }

    private static void solve(File f) {
        BufferedReader br = null;
        boolean comment = false;
        try {
            br = new BufferedReader(new FileReader(f));
            String line = "";
            while ((line = br.readLine()) != null) {
                /*
                 * Trim because some comment lines have a tab in front of them.
                 * Do not write br.readLine().trim() in the loop condition:
                 * after the last line readLine returns null and trim would throw a NullPointerException.
                 */
                line = line.trim();
                //readLine has already removed the line terminator
                if (line.matches("^[\\s&&[^\\n]]*$")) {
                    whiteLines++;
                } else if (line.startsWith("/*") && !line.endsWith("*/")) {
                    commentLines++;
                    comment = true;
                } else if (line.startsWith("/*") && line.endsWith("*/")) {
                    commentLines++;
                } else if (comment) {
                    commentLines++;
                    if (line.endsWith("*/")) {
                        comment = false;
                    }
                } else if (line.startsWith("//")) {
                    commentLines++;
                } else {
                    normalLines++;
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

}
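
The classification logic is easier to check when it is separated from file reading. The sketch below is one possible variation, not the program above: it takes one line at a time and, unlike the original, counts a blank line inside a block comment as a comment line. LineCounter and accept are hypothetical names.

```java
class LineCounter {
    int code;
    int comment;
    int blank;
    private boolean inBlock; // true while inside an unfinished /* ... */ comment

    // Classify one raw source line and update the counters.
    void accept(String raw) {
        String line = raw.trim();
        if (inBlock) {
            comment++;
            if (line.endsWith("*/")) {
                inBlock = false;
            }
        } else if (line.isEmpty()) {
            blank++;
        } else if (line.startsWith("//")) {
            comment++;
        } else if (line.startsWith("/*")) {
            comment++;
            inBlock = !line.endsWith("*/");
        } else {
            code++;
        }
    }

    public static void main(String[] args) {
        LineCounter c = new LineCounter();
        String[] lines = {"/*", " * doc", "", " */", "int x = 1;", "", "// note"};
        for (String l : lines) {
            c.accept(l);
        }
        System.out.println(c.code + " " + c.comment + " " + c.blank); // 1 5 1
    }
}
```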

13. Quantifiers
The quantifiers are ?, *, +, and {n,m}. They are greedy by default; each also has a reluctant form (add ?) and a possessive form (add +).

//A group is added to make the quantified part easier to see.
Pattern p = Pattern.compile("(.{3,10})[0-9]");
String s = "aaaa5bbbb6";//The length is 10
Matcher m = p.matcher(s);
/*
 * This prints 0----10. The default is greedy: it swallows all 10 characters first, finds that the digit cannot match, gives one back, and then matches;
 * With Pattern.compile("(.{3,10}?)[0-9]") the quantifier is reluctant: it swallows three characters first, finds no match, keeps swallowing one more at a time until it matches, and prints 0----5;
 * With Pattern.compile("(.{3,10}+)[0-9]") the quantifier is possessive: it also swallows 10 characters first but never gives any back, so nothing matches.
 * The possessive form is mainly used where efficiency matters, at the risk of missing matches.
 */
if (m.find()) {
    System.out.println(m.start() + "----" + m.end());
} else {
    System.out.println("Not match!");
}
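
The three modes can be compared directly by running each variant against the same input. This is a minimal sketch; QuantifierModes and firstSpan are hypothetical names, and the patterns omit the group since it does not affect the span.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {
    // Hypothetical helper: "start-end" of the first match, or "no match".
    static String firstSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + "-" + m.end() : "no match";
    }

    public static void main(String[] args) {
        String s = "aaaa5bbbb6";
        System.out.println(firstSpan(".{3,10}[0-9]", s));   // 0-10     (greedy)
        System.out.println(firstSpan(".{3,10}?[0-9]", s));  // 0-5      (reluctant)
        System.out.println(firstSpan(".{3,10}+[0-9]", s));  // no match (possessive)
    }
}
```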

14. Supplement (non-capturing group)

//Groups that begin with (? do not capture. (?=a) is a lookahead: it only checks that the next character is a, without consuming it
Pattern p = Pattern.compile("(?=a).{3}");
/*
 * Prints a66: the match is required to start with a. This could also be written as Pattern.compile("[a].{2}");
 * Pattern.compile(".{3}(?!a)") does not mean "three characters ending in something other than a" (that would be .{2}[^a]);
 * it means the character after the match is not a (negative lookahead). It prints 44a and 66b, so this form is not used often;
 * Pattern.compile(".{3}(?=a)") prints 444, because (?=a) looks ahead without consuming.
 * A lookahead placed in front checks a character that becomes part of the match; placed at the back, the character it checks is not part of the match.
 */
String s = "444a66b";
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}
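
For comparison, the sketch below runs several (? constructs on the same input: positive and negative lookahead, a lookbehind (?<=a), and a plain non-capturing group (?:...), which groups without adding to the group count. LookaroundDemo and all are names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Hypothetical helper: every match of regex in input, in order.
    static List<String> all(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "444a66b";
        System.out.println(all("(?=a).{3}", s));   // [a66]      lookahead in front
        System.out.println(all(".{3}(?=a)", s));   // [444]      lookahead at the back
        System.out.println(all(".{3}(?!a)", s));   // [44a, 66b] negative lookahead
        System.out.println(all("(?<=a).{2}", s));  // [66]       lookbehind
        // (?:...) groups without capturing, so only ([a-z]) is counted.
        System.out.println(Pattern.compile("(?:\\d{3})([a-z])").matcher(s).groupCount()); // 1
    }
}
```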

15. Back Reference

Pattern p = Pattern.compile("(\\d\\d)\\1");
/*
 * Prints true. \\1 means "the same text that group 1 matched"; with the input 1213 it would print false.
 * With Pattern.compile("(\\d(\\d))\\2") the input would have to be 122.
 */
String s = "1212";
Matcher m = p.matcher(s);
System.out.println(m.matches());
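
A typical use of a back reference is spotting an accidentally repeated word. The sketch below is only an illustration; DoubledWords and doubled are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubledWords {
    // A word, some whitespace, then the same word again (\1), as whole words.
    private static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Hypothetical helper: the words that appear twice in a row.
    static List<String> doubled(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = DOUBLED.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(doubled("this is is a test test case")); // [is, test]
    }
}
```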

16. Flags and their embedded shorthand
By default "." does not match line terminators. The one flag worth remembering is CASE_INSENSITIVE; case-insensitive matching can also be enabled through the embedded flag expression (?i).
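
The two ways of setting a flag can be sketched as follows. Pattern.DOTALL and its embedded form (?s) are not mentioned in the text; they are added here as the standard way to let "." match line terminators. FlagDemo and matchesWithFlags are hypothetical names.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Hypothetical helper: full match of input against regex compiled with the given flags.
    static boolean matchesWithFlags(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("JAVA".matches("java"));       // false
        System.out.println("JAVA".matches("(?i)java"));   // true: embedded flag
        System.out.println(matchesWithFlags("java", Pattern.CASE_INSENSITIVE, "JAVA")); // true
        System.out.println("a\nb".matches("a.b"));        // false: "." skips line terminators
        System.out.println("a\nb".matches("(?s)a.b"));    // true: embedded DOTALL
        // Flags are int constants and can be combined with |.
        System.out.println(matchesWithFlags("a.b", Pattern.DOTALL | Pattern.CASE_INSENSITIVE, "A\nB")); // true
    }
}
```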