Java’s Pattern class and regular expressions

One of the easiest things to get tripped up on in Java is the syntax for creating regular expressions (regex) with the Pattern class. The tl;dr version is that you must double every backslash in the String literals you use to create a Pattern object, so something like \b has to be written as "\\b". Read on for a more thorough explanation.

Double trouble

The key to understanding the tricky syntax is to realize that Java String literals also use backslashes to form escape sequences. Most people are familiar with this from, for example, constructing a String that spans multiple lines:

final String multiline = "A String...\nOn two lines";

When calling Pattern.compile, you typically pass in a String literal that contains the regular expression. However, regular expressions also use the backslash character to begin escape sequences. So, to ensure that the regular expression engine in Pattern receives the correct syntax, you must write every backslash in your regular expression as two backslashes in the literal. Otherwise the Java compiler consumes the single backslash as the start of a String escape sequence before the regex engine ever sees it, or rejects the literal outright if the sequence (such as \d) is not a valid String escape.

Or, put another way, if you wanted a String with the contents "\n", that is a String with a backslash followed by the letter ‘n’, you’d have to define it as:

final String newLineEscapeSequence = "\\n";

This is the gist of it: the backslashes must survive into the String that the Pattern regular expression engine receives, so you have to write each one as a double backslash in your String literal. This information is in the Pattern Javadoc, but it’s sort of buried beneath loads of regular expression syntax.
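To make the difference between the literal and the resulting String concrete, here is a minimal sketch. The class name EngineView and the constant DIGITS_REGEX are made up for illustration; only Pattern.compile, Pattern.pattern and Matcher.matches come from the JDK.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
final class EngineView {
  // The Java literal "\\d+" produces a String holding the three characters \d+.
  static final String DIGITS_REGEX = "\\d+";

  public static void main(final String[] args) {
    System.out.println(DIGITS_REGEX);           // prints \d+
    System.out.println(DIGITS_REGEX.length());  // prints 3

    // pattern() returns the String the regex engine was given.
    final Pattern digits = Pattern.compile(DIGITS_REGEX);
    System.out.println(digits.pattern());                  // prints \d+
    System.out.println(digits.matcher("2024").matches());  // prints true
  }
}
```

Printing the String is a quick way to check what the regex engine will actually receive.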

Keep this in mind when you build your regular expressions outside of Java in a tool like RegExr: a pattern that works there needs its backslashes doubled when you paste it into a Java String literal. These principles also apply to other classes and methods that use Pattern, such as String.split() or Scanner.useDelimiter().
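As a rough illustration of the same rule with String.split() and Scanner.useDelimiter(), the sketch below uses a made-up class name (DelimiterExamples), helper methods and sample inputs; the JDK calls themselves are standard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical class and helper names, used only for this illustration.
final class DelimiterExamples {
  // The regex \. matches a literal dot, so the Java literal is "\\.".
  static String[] splitOnDots(final String input) {
    return input.split("\\.");
  }

  // The regex \s*,\s* matches a comma with optional whitespace around it.
  static List<String> commaSeparated(final String input) {
    final List<String> tokens = new ArrayList<>();
    try (Scanner scanner = new Scanner(input)) {
      scanner.useDelimiter("\\s*,\\s*");
      while (scanner.hasNext()) {
        tokens.add(scanner.next());
      }
    }
    return tokens;
  }

  public static void main(final String[] args) {
    // prints www|example|com
    System.out.println(String.join("|", splitOnDots("www.example.com")));
    // prints [quick, brown, fox]
    System.out.println(commaSeparated("quick , brown,fox"));
  }
}
```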

An example

Here’s a simple example where we try to find the word “The” at the beginning of a String, followed by a word boundary.

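import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Assumption: Logger is the log4j 1.x class, whose getLogger(Class) and
// debug(Object) methods match the calls below.
import org.apache.log4j.Logger;

public class PatternExample {
  private static final Logger LOGGER =
      Logger.getLogger(PatternExample.class);
  private static final String TEST_STRING =
      "The quick brown fox jumps over the lazy dog";

  public static void main(final String[] args) {
    System.out.println(TEST_STRING);

    // Wrong: "\b" is the String escape for a backspace character, so the
    // pattern expects a backspace right after "The".
    final Pattern wordBoundaryWrong = Pattern.compile("^The\b.*");
    Matcher matcher = wordBoundaryWrong.matcher(TEST_STRING);
    LOGGER.debug(matcher.matches()); // false.

    // Correct: "\\b" reaches the regex engine as \b, the word boundary.
    final Pattern wordBoundaryCorrect = Pattern.compile("^The\\b.*");
    matcher = wordBoundaryCorrect.matcher(TEST_STRING);
    LOGGER.debug(matcher.matches()); // true.
  }
}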

The key point here is that the word boundary matcher (\b) must be written as the String literal "\\b" so that the backslash reaches the regex engine intact. In the incorrect Pattern, the compiler turns "\b" into a backspace character, so the pattern looks for a backspace after “The” and never matches.

I think the reason this concept is somewhat tricky is that you have to deal with two levels of escaping: the Java String literal syntax and the regular expression syntax.
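The two levels are easiest to see when you want to match a literal backslash. The regex for that is \\ (an escaped backslash), and each of those two backslashes must be doubled again in the Java literal, giving four. The following is only a sketch; the class name LiteralBackslash, the helper toForwardSlashes and the sample path are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical class and helper names, used only for this illustration.
final class LiteralBackslash {
  // Java literal "\\\\" -> String \\ -> regex that matches one backslash.
  static final Pattern BACKSLASH = Pattern.compile("\\\\");

  // Replaces every backslash in the input with a forward slash.
  static String toForwardSlashes(final String windowsPath) {
    return BACKSLASH.matcher(windowsPath).replaceAll("/");
  }

  public static void main(final String[] args) {
    final String path = "C:\\temp\\log.txt";     // holds C:\temp\log.txt
    System.out.println(path);                    // prints C:\temp\log.txt
    System.out.println(toForwardSlashes(path));  // prints C:/temp/log.txt
  }
}
```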
