Java’s Pattern class and regular expressions

One of the easiest things to get tripped up on is the syntax for creating regular expressions (regex) in Java using the Pattern class. The tl;dr version of how to do things is that you must use double-backslashes in the regular expression Strings you use to create a Pattern object; so something like \b would have to be written as "\\b". Read on for a more thorough explanation.

Double trouble

The key to understanding the tricky syntax is realizing that Java String literals also use backslashes to form escape sequences. Most people are familiar with this from, for example, constructing a String that spans multiple lines:

final String multiline = "A String...\nOn two lines";

When calling Pattern.compile, you pass in a String that is the regular expression, usually written as a String literal. However, regular expressions also use the backslash character to begin escape sequences. So, to ensure that the regular expression engine in Pattern receives the backslashes you intended, you must write every backslash in your regular expression as two backslashes in the String literal. Otherwise the Java compiler treats the single backslash as the start of a String escape sequence (or rejects it outright, as with "\d"), and the regular expression engine never sees it.

Or, put another way, if you wanted a String whose contents are \n, that is, a backslash followed by the letter ‘n’, you’d have to define it as:

final String newLineEscapeSequence = "\\n";

This is the gist of it: the backslashes need to reach the Pattern regular expression engine intact, so you have to create each literal backslash by writing a double-backslash in your String literal. This information is in the Pattern Javadoc, but it’s sort of buried beneath loads of regular expression syntax.
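To make the difference concrete, here is a minimal sketch that compares what the two literals actually contain. The class and member names (EscapeLevels, LINE_FEED, BACKSLASH_N, isSingleDigit) are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class EscapeLevels {
    // One character: the compiler turns the escape sequence into a line feed.
    static final String LINE_FEED = "\n";

    // Two characters: a backslash followed by the letter 'n'.
    static final String BACKSLASH_N = "\\n";

    // The literal "\\d" holds the two characters \d, which the regex engine
    // reads as "any digit". Writing "\d" instead would not even compile.
    static boolean isSingleDigit(final String input) {
        return Pattern.matches("\\d", input);
    }

    public static void main(final String[] args) {
        System.out.println(LINE_FEED.length());   // 1
        System.out.println(BACKSLASH_N.length()); // 2
        System.out.println(BACKSLASH_N);          // \n
        System.out.println(isSingleDigit("7"));   // true
    }
}
```

Printing the String is a quick sanity check: whatever shows up on the console is exactly what the regular expression engine will receive.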

Keep this in mind when constructing your regular expressions outside of Java in a tool like RegExr: every backslash needs doubling when you paste the expression into a Java String literal. These principles also apply when using other classes/methods that use Pattern, such as String.split() or Scanner.useDelimiter().
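As a rough sketch of how the same rule shows up in those other methods, the snippet below doubles the backslashes in the delimiters passed to String.split() and Scanner.useDelimiter(). The class name, method names and sample inputs are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical class name, used only for this illustration.
class DelimiterExamples {
    // The regex \s+ (one or more whitespace characters) is written "\\s+".
    static String[] splitOnWhitespace(final String input) {
        return input.split("\\s+");
    }

    // The regex \s*,\s* (a comma with optional surrounding whitespace)
    // is written "\\s*,\\s*".
    static List<String> readCommaSeparated(final String input) {
        final List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter("\\s*,\\s*");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(final String[] args) {
        // [The, quick, brown, fox]
        System.out.println(Arrays.toString(splitOnWhitespace("The quick  brown\tfox")));
        // [a, b, c]
        System.out.println(readCommaSeparated("a , b,c"));
    }
}
```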

An example

Here’s a simple example where we try to find the word “The” at the beginning of a String, followed by a word boundary matcher.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternExample {
  private static final String TEST_STRING =
      "The quick brown fox jumps over the lazy dog";

  public static void main(final String[] args) {

    // "\b" is a Java String escape, so the regex engine receives a backspace character.
    final Pattern wordBoundaryWrong = Pattern.compile("^The\b.*");
    Matcher matcher = wordBoundaryWrong.matcher(TEST_STRING);
    System.out.println(matcher.matches()); // false.

    // "\\b" yields a backslash followed by 'b', which the regex engine reads as a word boundary.
    final Pattern wordBoundaryCorrect = Pattern.compile("^The\\b.*");
    matcher = wordBoundaryCorrect.matcher(TEST_STRING);
    System.out.println(matcher.matches()); // true.
  }
}

The key point here is that the word boundary matcher (\b) must be passed in as a String literal of "\\b" so that the backslash reaches the regular expression engine. In the incorrect Pattern, the Java compiler turns "\b" into a single backspace character, so the expression looks for a backspace after “The” and the match fails.

I think the reason this concept is somewhat tricky is that you have to deal with two levels of escaping – the Java String literal syntax and the Regular Expression syntax.
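The two levels are easiest to see when you want to match a backslash itself. The sketch below assumes a made-up helper (toForwardSlashes in a hypothetical LiteralBackslash class) that rewrites a Windows-style path. The regular expression for one literal backslash is \\, and each of those two backslashes must be doubled again in the String literal, giving four.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class LiteralBackslash {
    // Level 1 (Java literal): "\\\\" compiles to a String of two backslashes.
    // Level 2 (regex): two backslashes match one literal backslash.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    static String toForwardSlashes(final String windowsPath) {
        return BACKSLASH.matcher(windowsPath).replaceAll("/");
    }

    public static void main(final String[] args) {
        // The literal "C:\\temp\\notes.txt" holds C:\temp\notes.txt
        System.out.println(toForwardSlashes("C:\\temp\\notes.txt")); // C:/temp/notes.txt
    }
}
```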
