Regular expressions define patterns of strings.
Regular expressions can be used to search, edit or manipulate text.
Regular expressions are not tied to any one programming language, although each language's implementation differs in subtle ways.
Java regular expressions are most similar to Perl's.
The java.util.regex package mainly includes the following three classes:
Pattern class:
A Pattern object is a compiled representation of a regular expression. The Pattern class has no public constructor. To create a Pattern object, call its public static compile method, which returns a Pattern object. This method accepts a regular expression as its first parameter.
Matcher class:
A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher has no public constructor. To obtain a Matcher object, call the matcher method of a Pattern object.
PatternSyntaxException:
PatternSyntaxException is an unchecked exception class that indicates a syntax error in a regular expression pattern.
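The relationship between Pattern and Matcher can be sketched as follows. This is a minimal illustration, not a canonical usage pattern; the class name PatternBasics and the helper containsDigit are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class PatternBasics {
    // Returns true if the input contains at least one digit.
    static boolean containsDigit(String input) {
        Pattern p = Pattern.compile("\\d"); // static factory instead of a constructor
        Matcher m = p.matcher(input);       // the Matcher is obtained from the Pattern
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(containsDigit("QT3000"));    // prints true
        System.out.println(containsDigit("no digits")); // prints false
    }
}
```

Note that the backslash is doubled in the Java string literal: the regular expression is \d, but the source code must spell it "\\d".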
Capturing groups are a way to treat multiple characters as a single unit, created by grouping characters within parentheses.
For example, the regular expression (dog) creates a single group containing "d", "o", and "g".
Capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression ((A)(B(C))), there are four such groups:
((A)(B(C)))
(A)
(B(C))
(C)
You can check how many groups an expression has by calling the groupCount method of the matcher object. The groupCount method returns an int giving the number of capturing groups in the matcher's pattern.
There is also a special group (group 0) which always represents the entire expression. Group 0 is not included in the value returned by groupCount.
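The numbering rule can be checked with a short sketch. The class name GroupCountDemo and the helper groups are hypothetical; the sketch assumes the input "ABC" so that the whole expression ((A)(B(C))) matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for group numbering.
class GroupCountDemo {
    // Returns group 0 through group groupCount() for a full match,
    // or an empty array if the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("((A)(B(C)))", "ABC");
        System.out.println("groupCount: " + (g.length - 1)); // prints groupCount: 4
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + i + ": " + g[i]);
        }
    }
}
```

Groups 0 and 1 both hold "ABC" here, group 2 holds "A", group 3 holds "BC", and group 4 holds "C".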
The following example shows how to find a string of digits from a given string:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMatches
{
    public static void main( String args[] ){
        // Search the string according to the specified pattern
        String line = "This order was placed for QT3000! OK?";
        // The backslash must be doubled inside a Java string literal
        String pattern = "(.*)(\\d+)(.*)";
        // Create Pattern object
        Pattern r = Pattern.compile(pattern);
        // Now create matcher object
        Matcher m = r.matcher(line);
        if (m.find( )) {
            System.out.println("Found value: " + m.group(0) );
            System.out.println("Found value: " + m.group(1) );
            System.out.println("Found value: " + m.group(2) );
        } else {
            System.out.println("NO MATCH");
        }
    }
}
The compilation and running results of the above example are as follows:
Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT300
Found value: 0
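In the output above, group 2 holds only "0" because the first (.*) is greedy: it takes as many characters as it can and leaves \d+ a single digit. A minimal sketch, assuming the goal is to capture the whole run of digits, makes the first group reluctant by writing .*? instead. The class name LazyDigits and the helper firstDigitRun are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variation on the example above, using a reluctant first group.
class LazyDigits {
    // Returns the first run of digits in the line, or null if there is none.
    static String firstDigitRun(String line) {
        Matcher m = Pattern.compile("(.*?)(\\d+)(.*)").matcher(line);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        System.out.println("Found value: " + firstDigitRun(line)); // prints Found value: 3000
    }
}
```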
Character | Description |
---|---|
\ | Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, "n" matches the character "n", while "\n" matches a newline character. The sequence "\\" matches "\", and "\(" matches "(". |
^ | Matches the beginning of the input string. If the MULTILINE flag is set on the pattern, ^ also matches the position after a line terminator such as "\n" or "\r". |
$ | Matches the end of the input string. If the MULTILINE flag is set on the pattern, $ also matches the position before a line terminator such as "\n" or "\r". |
* | Matches the preceding character or subexpression zero or more times. For example, "zo*" matches "z" and "zoo". * is equivalent to {0,}. |
+ | Matches the preceding character or subexpression one or more times. For example, "zo+" matches "zo" and "zoo" but not "z". + is equivalent to {1,}. |
? | Matches the preceding character or subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}. |
{n} | n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but does match both o's in "food". |
{n,} | n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*". |
{n,m} | n and m are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". Note: you cannot insert spaces between the comma and the numbers. |
? | When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), matching is "non-greedy". A non-greedy pattern matches the shortest possible string, while the default greedy pattern matches the longest possible string. For example, in the string "oooo", "o+?" matches only a single "o", while "o+" matches all the o's. |
. | Matches any single character except a line terminator such as "\r" or "\n". To match any character including line terminators, use a pattern such as "[\s\S]". |
(pattern) | Matches pattern and captures the matching subexpression. The captured text can be retrieved by group number, for example with the group method of the Matcher. To match a literal parenthesis, use "\(" or "\)". |
(?:pattern) | A subexpression that matches pattern but does not capture the match, that is, a non-capturing group whose match is not stored for later use. This is useful when combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'. |
(?=pattern) | Positive lookahead. Matches at a position where the text that follows matches pattern. It is a non-capturing match, that is, the match cannot be retrieved later. For example, 'Windows (?=95|98|NT|2000)' matches "Windows " in "Windows 2000" but not in "Windows 3.1". Lookaheads do not consume characters: after a match, the search for the next match starts immediately after the previous match, not after the characters that made up the lookahead. |
(?!pattern) | Negative lookahead. Matches at a position where the text that follows does not match pattern. It is a non-capturing match, that is, the match cannot be retrieved later. For example, 'Windows (?!95|98|NT|2000)' matches "Windows " in "Windows 3.1" but not in "Windows 2000". Lookaheads do not consume characters: after a match, the search for the next match starts immediately after the previous match, not after the characters that made up the lookahead. |
x|y | Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food". |
[xyz] | Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain". |
[^xyz] | Negated character set. Matches any character that is not enclosed. For example, "[^abc]" matches "p", "l", "i", and "n" in "plain". |
[a-z] | Character range. Matches any character within the specified range. For example, "[a-z]" matches any lowercase letter in the range "a" through "z". |
[^a-z] | Negated character range. Matches any character not within the specified range. For example, "[^a-z]" matches any character that is not in the range "a" through "z". |
\b | Matches a word boundary, that is, the position between a word character and a non-word character such as a space. For example, "er\b" matches the "er" in "never", but not the "er" in "verb". |
\B | Matches a non-word boundary. "er\B" matches the "er" in "verb", but not the "er" in "never". |
\cx | Matches the control character indicated by x. For example, \cM matches Control-M, a carriage return character. x is normally an uppercase letter in the range A-Z. |
\d | Matches a digit character. Equivalent to [0-9]. |
\D | Matches a non-digit character. Equivalent to [^0-9]. |
\f | Matches a form feed. Equivalent to \x0c and \cL. |
\n | Matches a newline. Equivalent to \x0a and \cJ. |
\r | Matches a carriage return. Equivalent to \x0d and \cM. |
\s | Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \f\n\r\t\x0B]. |
\S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\x0B]. |
\t | Matches a tab. Equivalent to \x09 and \cI. |
\v | Matches a vertical tab (\x0b, \cK). Since Java 8, \v matches any vertical whitespace character, of which the vertical tab is one. |
\w | Matches any word character, including the underscore. Equivalent to "[A-Za-z0-9_]". |
\W | Matches any non-word character. Equivalent to "[^A-Za-z0-9_]". |
\xn | Matches n, where n is a hexadecimal escape code. The hexadecimal escape code must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions. |
\num | Matches num, where num is a positive integer. A backreference to a captured match. For example, "(.)\1" matches two consecutive identical characters. |
\n | Identifies a backreference or an octal escape code. If \n is preceded by at least n capturing subexpressions, n is a backreference. In Java, \1 through \9 are always treated as backreferences, and octal escapes are written with a leading zero (\0n, where n is an octal digit 0-7). |
\nm | Identifies a backreference or an octal escape code. If \nm is preceded by at least nm capturing subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal digit m. In Java, an octal value is written \0nm, where n and m are octal digits (0-7). |
\nml | When n is an octal digit (0-3) and m and l are octal digits (0-7), matches the octal escape code nml. In Java this is written with a leading zero, as \0nml. |
\un | Matches n, where n is a Unicode character represented as a four-digit hexadecimal number. For example, \u00A9 matches the copyright symbol (©). |
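A few of the constructs in the table can be tried out with a small sketch. The class name SyntaxSamples and the helper firstMatch are hypothetical; the patterns and inputs are the ones used as examples in the table, plus "hello" for the backreference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class exercising a few entries from the syntax table.
class SyntaxSamples {
    // Returns the text of the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy versus non-greedy quantifier
        System.out.println(firstMatch("o+", "oooo"));   // prints oooo
        System.out.println(firstMatch("o+?", "oooo"));  // prints o
        // Positive lookahead: the version number is checked but not consumed
        System.out.println(firstMatch("Windows (?=95|98|NT|2000)", "Windows 2000") != null); // prints true
        System.out.println(firstMatch("Windows (?=95|98|NT|2000)", "Windows 3.1") != null);  // prints false
        // Backreference to group 1: two consecutive identical characters
        System.out.println(firstMatch("(.)\\1", "hello")); // prints ll
        // Word boundary
        System.out.println(firstMatch("er\\b", "never")); // prints er
        System.out.println(firstMatch("er\\b", "verb"));  // prints null
    }
}
```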
The index methods provide index values that show exactly where in the input string a match was found:
No. | Method and description |
---|---|
1 | public int start() Returns the start index of the previous match. |
2 | public int start(int group) Returns the start index of the subsequence captured by the given group during the previous match operation. |
3 | public int end() Returns the offset after the last character matched. |
4 | public int end(int group) Returns the offset after the last character of the subsequence captured by the given group during the previous match operation. |
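The group-specific overloads can be illustrated with a short sketch. The class name GroupOffsets, the helper span, and the sample input are hypothetical choices for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for start(int) and end(int).
class GroupOffsets {
    // Returns {start, end} of the given group in the first match, or null if no match.
    static int[] span(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(group), m.end(group) };
    }

    public static void main(String[] args) {
        String regex = "(\\w+)@(\\w+)";
        String input = "mail bob@example now";
        for (int g = 0; g <= 2; g++) {
            int[] s = span(regex, input, g);
            System.out.println("group " + g + ": " + s[0] + " to " + s[1]);
        }
        // prints:
        // group 0: 5 to 16
        // group 1: 5 to 8
        // group 2: 9 to 16
    }
}
```

As with String.substring, the end offset is exclusive, so input.substring(start, end) returns exactly the captured text.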
The search methods examine the input string and return a boolean indicating whether the pattern was found:
No. | Method and description |
---|---|
1 | public boolean lookingAt() Attempts to match the input sequence, starting at the beginning of the region, against the pattern. |
2 | public boolean find() Attempts to find the next subsequence of the input sequence that matches the pattern. |
3 | public boolean find(int start) Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index. |
4 | public boolean matches() Attempts to match the entire region against the pattern. |
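The find(int start) overload is the least obvious of the four, so here is a minimal sketch of how it might be used. The class name FindFrom and the helper nextMatchStart are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for find(int start).
class FindFrom {
    // Returns the start index of the first match at or after 'from', or -1 if none.
    static int nextMatchStart(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        // find(int) resets the matcher and begins searching at the given index
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        String regex = "\\bcat\\b";
        String input = "cat cat cat cattie cat";
        System.out.println(nextMatchStart(regex, input, 0));  // prints 0
        System.out.println(nextMatchStart(regex, input, 1));  // prints 4
        System.out.println(nextMatchStart(regex, input, 12)); // prints 19
        System.out.println(nextMatchStart(regex, input, 20)); // prints -1
    }
}
```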
The replacement methods replace text in the input string:
No. | Method and description |
---|---|
1 | public Matcher appendReplacement(StringBuffer sb, String replacement) Implements a non-terminal append-and-replace step. |
2 | public StringBuffer appendTail(StringBuffer sb) Implements a terminal append-and-replace step. |
3 | public String replaceAll(String replacement) Replaces every subsequence of the input sequence that matches the pattern with the given replacement string. |
4 | public String replaceFirst(String replacement) Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string. |
5 | public static String quoteReplacement(String s) Returns a literal replacement string for the specified string. The returned string works as a literal replacement when passed to the appendReplacement method of the Matcher class. |
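Replacement strings give special meaning to the dollar sign (group references such as $1) and the backslash, which is why quoteReplacement exists. The following sketch contrasts the two uses; the class name QuoteReplacementDemo and both helpers are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for group references and quoteReplacement.
class QuoteReplacementDemo {
    // Replaces every match of regex with the literal text, even if the
    // text contains a dollar sign or a backslash.
    static String replaceLiteral(String input, String regex, String literal) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.replaceAll(Matcher.quoteReplacement(literal));
    }

    // Swaps two space-separated words using the group references $1 and $2.
    static String swapWords(String input) {
        return Pattern.compile("(\\w+) (\\w+)").matcher(input).replaceAll("$2 $1");
    }

    public static void main(String[] args) {
        System.out.println(replaceLiteral("price: X", "X", "$5")); // prints price: $5
        System.out.println(swapWords("hello world"));              // prints world hello
    }
}
```

Without quoteReplacement, the "$5" in the first call would be read as a reference to capturing group 5, which does not exist in that pattern.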
Here is an example of counting the number of occurrences of the word "cat" in an input string:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMatches
{
    // The backslashes must be doubled inside a Java string literal
    private static final String REGEX = "\\bcat\\b";
    private static final String INPUT =
        "cat cat cat cattie cat";
    public static void main( String args[] ){
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(INPUT); // Get the matcher object
        int count = 0;
        while(m.find()) {
            count++;
            System.out.println("Match number "+count);
            System.out.println("start(): "+m.start());
            System.out.println("end(): "+m.end());
        }
    }
}
The compilation and running results of the above example are as follows:
Match number 1
start(): 0
end(): 3
Match number 2
start(): 4
end(): 7
Match number 3
start(): 8
end(): 11
Match number 4
start(): 19
end(): 22
You can see that this example uses word boundaries to ensure that the letters "c" "a" "t" are not just a substring of a longer word. It also provides some useful information about where in the input string the match occurred.
The start method returns the index of the first character of the previous match, and the end method returns the index of the last matched character plus one.
The matches and lookingAt methods both attempt to match an input sequence against a pattern. The difference between them is that matches requires the entire sequence to match, while lookingAt does not.
Both methods always start matching at the beginning of the input string.
We use the following example to explain this function:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMatches
{
    private static final String REGEX = "foo";
    private static final String INPUT = "fooooooooooooooooo";
    private static Pattern pattern;
    private static Matcher matcher;
    public static void main( String args[] ){
        pattern = Pattern.compile(REGEX);
        matcher = pattern.matcher(INPUT);
        System.out.println("Current REGEX is: "+REGEX);
        System.out.println("Current INPUT is: "+INPUT);
        System.out.println("lookingAt(): "+matcher.lookingAt());
        System.out.println("matches(): "+matcher.matches());
    }
}
The compilation and running results of the above example are as follows:
Current REGEX is: foo
Current INPUT is: fooooooooooooooooo
lookingAt(): true
matches(): false
The replaceFirst and replaceAll methods are used to replace text that matches a regular expression. The difference is that replaceFirst replaces the first match and replaceAll replaces all matches.
The following example explains this functionality:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMatches
{
    private static String REGEX = "dog";
    private static String INPUT = "The dog says meow. " +
                                  "All dogs say meow.";
    private static String REPLACE = "cat";
    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // get a matcher object
        Matcher m = p.matcher(INPUT);
        INPUT = m.replaceAll(REPLACE);
        System.out.println(INPUT);
    }
}
The compilation and running results of the above example are as follows:
The cat says meow. All cats say meow.
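For comparison, a minimal sketch of replaceFirst on the same input, which changes only the first occurrence. The class name ReplaceFirstDemo and the helper replaceFirstDog are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical companion to the replaceAll example above.
class ReplaceFirstDemo {
    // Replaces only the first occurrence of "dog" with "cat".
    static String replaceFirstDog(String input) {
        Matcher m = Pattern.compile("dog").matcher(input);
        return m.replaceFirst("cat");
    }

    public static void main(String[] args) {
        String input = "The dog says meow. All dogs say meow.";
        System.out.println(replaceFirstDog(input));
        // prints The cat says meow. All dogs say meow.
    }
}
```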
The Matcher class also provides the appendReplacement and appendTail methods, which build the replaced text step by step as each match is found:
The following example shows how they work together:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexMatches
{
    private static String REGEX = "a*b";
    private static String INPUT = "aabfooaabfooabfoob";
    private static String REPLACE = "-";
    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // Get the matcher object
        Matcher m = p.matcher(INPUT);
        StringBuffer sb = new StringBuffer();
        while(m.find()){
            // Append the text before the match, then the replacement
            m.appendReplacement(sb,REPLACE);
        }
        // Append whatever follows the last match
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}
The compilation and running results of the above example are as follows:
-foo-foo-foo-
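The main reason to use appendReplacement and appendTail rather than replaceAll is that the replacement can be computed separately for each match. A minimal sketch, assuming the goal is to upper-case every match of the same a*b pattern; the class name AppendReplacementDemo and the helper upperCaseMatches are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: a per-match computed replacement.
class AppendReplacementDemo {
    // Returns the input with every match of regex converted to upper case.
    static String upperCaseMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement guards against '$' or backslashes in the computed text
            m.appendReplacement(sb, Matcher.quoteReplacement(m.group().toUpperCase()));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseMatches("a*b", "aabfooaabfooabfoob"));
        // prints AABfooAABfooABfooB
    }
}
```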
PatternSyntaxException is an unchecked exception class that indicates a syntax error in a regular expression pattern.
The PatternSyntaxException class provides the following methods to help determine what went wrong.
No. | Method and description |
---|---|
1 | public String getDescription() Returns the description of the error. |
2 | public int getIndex() Returns the index of the error in the pattern. |
3 | public String getPattern() Returns the erroneous regular expression pattern. |
4 | public String getMessage() Returns a multiline string containing the description of the syntax error and its index, the erroneous regular expression pattern, and a visual indication of the error index within the pattern. |
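These accessors can be seen in action by compiling a deliberately malformed pattern. This is a minimal sketch; the class name SyntaxErrorDemo, the helper tryCompile, and the sample pattern "(abc" with its unclosed parenthesis are choices made for this illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class for inspecting a PatternSyntaxException.
class SyntaxErrorDemo {
    // Returns the exception thrown while compiling regex, or null if it compiles.
    static PatternSyntaxException tryCompile(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        PatternSyntaxException e = tryCompile("(abc"); // unclosed group
        if (e != null) {
            System.out.println("Description: " + e.getDescription());
            System.out.println("Index: " + e.getIndex());
            System.out.println("Pattern: " + e.getPattern());
            System.out.println("Message: " + e.getMessage());
        }
    }
}
```

Because the exception is unchecked, the try/catch is optional for the compiler; it is worth adding whenever the pattern comes from user input or configuration.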