codechunker

Posted on Apr 29, 2020 • Edited on May 2, 2020

INTRODUCTION TO REGEX EXPRESSIONS FOR JAVA DEVELOPERS

#beginners #java #tutorial #codenewbie

According to oracle doc,
Regular Expressions popularly called Regex are ways to describe a set of strings based on common characteristics shared by each string in the set. They can be used to search, edit, or manipulate text and data.

In layman’s term, Regular Expressions are ways that help you to validate or match strings according to a particular format. In other words, you can use it to check if a particular input by a user is correct or use it to retrieve or extract values that match a particular sequence.

IMPORTANCE/USAGES OF REGEX
• Validate a user’s input.
• Search text in a given data. (can be used in Text Editors)
• Parser in compilers
• Syntax highlighting, packet sniffers etc.

To have a thorough basic understanding of Regex we need to understand the following concepts:

QUANTIFIERS ( * + ? {n} . )
The quantifiers in Java ( and in most programming languages) consists of , +, ?, {} .
a) * matches zero or more instances of its preceding pattern. e.g. abc* means the text must match 'ab' followed by zero or more of 'c' i.e. the text may not have ‘c’ attached to it and the text may also have one or more ‘c’ . The preceding pattern in this case is ‘ab’. *TRY IT**

b) + matches one or more instances of its preceding pattern e.g abc+ means that the text must have ‘ab’ followed by one or more ‘c’. so abc is the least you can have that is correct, abcc is also correct. TRY IT

c) ? matches zero or one occurrences of the pattern e.g abc? Means the text can either be abc or ab. TRY IT.

d) {n} matches the exact number (n) specified in the expression. e.g a{2}bc means that you can only have two ‘a’ followed by one ‘b’ and one ‘c’. TRY IT

e) {n, } matches at least the number specified. That means you must have n number or more of the preceding expression e.g ab{2,}c means you must have a, followed by two or more ‘b’ and then c. TRY IT

f) {n,m} matches between n and m (inclusive) of the pattern. This means you can have between to m occurrences of the preceding pattern. e.g ab{2,5}c means abbc, abbbc, abbbbc, abbbbbc are all correct. Anything out of these is wrong. TRY IT.

g) ‘.’ matches all non-white space characters

OPERATORS(| [] ^ )
Pipe (|) is used as ‘OR’. So it matches the expression on the left or the expression on the right. e.g abc|abd means the text should be either abc or abd. TRY IT.

[] means the text should match any of the characters in the angle brackets e.g a[bc] means the text should be ‘a’ followed by either ‘b’ or ‘c’.
[0-9]% means the text should be any number from 0 to 9 followed by ‘%’
[a-zA-Z] means any character as long as it between a-z or A-Z.
[abc] means the text or string should be either a, or b or c.
Adding ‘^’ to any of the expressions negates the meaning of that expression e.g [^abc] means the string should be any character as long as it is not a or b or c. TRY IT.

GROUPING AND BACK REFERENCES
Another interesting thing about regex is that you can group your expressions and probably reuse them when necessary. To group elements in expressions, use round brackets e.g ().
A Back Reference stores the part of the string that matched the group. You can reference a particular group by using the sign $. $1, $2…represent group 1, group 2 etc. the default group is $0 which is the actual text itself. For example, lets remove all the spaces in a sentence .The regex for this example would be the code snippet below:

private static String backReference(String text) {
    String pattern = "(\\w+)([\\s])";
    return text.replaceAll(pattern, "$1");
}

The expression in the code above has 2 groups: (\w+) and (\s). So we are saying the text has to be a bunch of words(\w+) and spaces(\s+). After that, we replace the whole text with just group 1($1) which will just remove the spaces because all we need now is just words. Referencing $1 in the back referencing.

NOTE: In java, we need to escape the character class (\w and \s) with another slash otherwise you would have a syntax error.

LOOK AHEAD/LOOK BEHIND
This is a way to exclude a pattern. So you can say a string should be valid only if it doesn’t come before a particular string and vice versa. e.g. abc(?=d) will match ‘abc’ as long as it follows a ‘d’. but note that ‘d’ will be excluded. TRY IT.
A good real life application of this would be to get the value in any HTML tag e.g <h1>Hello World</h1>. The expression would be (?i)(<h1.*?>)(.*(?=<))() TRY IT. This returns Hello World
You can also use (?!pattern) as a negative look ahead e.g ab(?!c) will match ab if ‘ab’ is not followed by ‘b’.

GREEDY AND LAZY
The quantifiers above are said to be greedy in the sense that they’ll match as many occurrences as long as the match is still successful. e.g for the regex a.+c, you would expect it to mean the text should be ‘a’ followed by one or more non-white space character. So you would expect a possible match to be ‘abcccd’ or ‘abcc’ or ‘abbc’ as individual results. This is meant to be true but because of the greedy attitude, it will grab all the texts (abcccd abcc abbc) as one and return ‘abcccd abcc abbc’ as one result because if you notice, the first character is ‘a’ followed by one or more of any other character and it now ends with c which matches exactly a.+c

In order to restrict this “greedy attitude”, just add question mark (?) in front of the quantifier, this makes the quantifier a bit reluctant which means it will match as few occurrences as possible as long as the match is successful. So ab.+?c would return the results individually instead of grabbing the whole string and processing them as one. TRY IT.

A better application of this would be this: Assuming you want to get just the tags <h1> and </h1> seperately from <h1>Hello World</h1>, you would expect that the regex for it would be <.+> but in reality, it is going to grab the whole text (<h1>Hello World</h1>) and process it as one and still return the same text to you because it actually matches the expression you provided (the text should start with ‘<’, followed by one or more of any other character and then end ‘>’ ) which the text satisfies TRY IT.
Adding ‘?’ in front of the quantifier is sometimes called lazy.

In a nutshell, ‘Greedy’ means match longest possible string while ‘lazy’ means match shortest possible string.

NOTE: Be careful when using these quantifiers especially period(.) in expressions.

CHARACTER CLASSES
A character class is an escape sequence that represents a group of characters. Some predefined characters in java are listed below:

If you notice from the above table, the Uppercase letters negate the functions of the lowercase letters vice versa

NOW OVER TO THE WORLD OF JAVA
String class in Java comes with an inbuilt boolean method called matches that matches a string against a regex.

public static void main(String[] args) {
    String value = "12345";
    System.out.println("The Result is: "+value.matches("\\d{5}"));
}

The code snippet above would return ‘The Result is: true’ because the value (12345) matches exactly 5 characters. Any thing other than 5 would return ‘The Result is: false’.

Apart from matches method of the String class, most of the classes you would need for regex are found in java.util.regex package. They consist of three classes:
• Pattern: this is the compiled representation of of a regular expression. To use this, you must first call the static method (compile) in the pattern class which returns a Pattern object.
• Matcher: This is the engine that that interprets the pattern and performs match operations against an input string. To get an object, you have to invoke the matcher method on a Pattern object.
• PatternSyntaxException: this indicates error in a regular expression pattern

To Match a string using Pattern/Matcher Classes

public static void main(String[] args) {
    String value = "12345";
    String regex = "\\d{5}";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(value);
    System.out.println("The Result is: "+matcher.matches());
}

Now the question is why would I go through the stress of using Pattern/Matcher when I can just match the string against the regex (value.matches(regex)).
Well the truth is the implementation of method matches in class String using Pattern/Matcher under the hood. So, for every string you match, it creates a Pattern object before matching.

When to use Pattern/Matcher VS String.matches(String text)

public class PatternMatcher {
    public static void main(String[] args) {
        String [] texts = {"nozzle","punjabi","waterlogged","imprison","crux","numismatlogists","sultans","rambles","deprecating","aware","outfield","marlborough","guardrooms","roast","wattage","shortcuts","confidential","reprint","foxtrot","disposseslogsion","floodgate","unfriendliest","semimonthlies","dwellers","walkways","wastrels","dippers","engrlogossing","undertakings"};
         List<String> resultStringMatches = matchUsingStringMatches(texts);
        List<String> resultsPatternMatcher = matchUsingPatternMatcher(texts);
        System.out.println("The Result for String matches is: "+resultStringMatches.toString());
        System.out.println("The Result for pattern/matcher is: "+resultStringMatches.toString());
    }

    private static List<String> matchUsingPatternMatcher(String[] texts) {
        List<String> matchedList = new ArrayList<>();
        String pattern = "[a-zA-Z]*log[a-zA-Z]*";
        Pattern regexPattern = Pattern.compile(pattern);
        Matcher matcher;
        for (int i = 0; i < texts.length; i++) {
             matcher = regexPattern.matcher(texts[i]);
            if (matcher.matches()) {
                matchedList.add(texts[i]);
            }
        }
        return matchedList;
    }

    private static List<String> matchUsingStringMatches(String[] texts) {
        String pattern = "[a-zA-Z]*log[a-zA-Z]*";
        List<String> matchedList = new ArrayList<>();

        for (int i = 0; i < texts.length; i++) {
            //for each time it calls matches(pattern), it creates a pattern object
            if (texts[i].matches(pattern)) {
                matchedList.add(texts[i]);
            }
        }
        return matchedList;
    }
}

The code above matches all the elements in an array against a particular regex ([a-zA-Z]log[a-zA-Z]). The regex is meant to get all texts that have the word “log”. Before a match against a regex, that expression must be compiled (Pattern.compile);
The first method matchUsingPatternMatcher() compiles the pattern first before looking for matches while the second method matchUsingStringMatches() creates a new Pattern Object

(Pattern.compile()) for every element in the array which is expensive/costly and may cause memory leakage especially then the data/texts are too many.
So if you care about performance when dealing with large set of data, use the Pattern/Matcher class to compile first and use the instance to match the texts.

Thank you very much for reading. Please you can leave a comment, suggestion or correction on the comment section below.

DEV Community

INTRODUCTION TO REGEX EXPRESSIONS FOR JAVA DEVELOPERS

Top comments (0)

Read next

How to customize Strapi Dashboard / Home Page: The Right Way

How to Become a Successful Software Developer in 2024

Day 18: Deploying Docker to the Cloud

How to Optimize Your React Web App: 7 Essential Steps