Crafting regexes to avoid stack overflows

by sebastian hungerecker|

A developer climbs a ladder to add yet another `a|b` box to an already teetering stack

We've been working recently on adding rules to help write better regular expressions in Java. I've talked already about rules to find errors in character classes and with the use of boundary markers and overly complex regular expressions. Those rules help you make sure your regular expressions are accurate and maintainable. Today I will show you how to make sure that the regular expression won’t crash your program. 

As a brief reminder: regular expressions are a terse and powerful mechanism for matching patterns in strings. Part of the power of regular expressions is that a pattern can concisely describe a string that might be much larger than the original pattern. Sometimes much, much larger. Sometimes enough to overflow the stack and crash your application.

Due to the way regular expression matching is implemented in Java (and many other languages/libraries), matching a pattern may - depending on the regex - require stack space proportional to the length of the input. This means large inputs could cause the program to crash with a StackOverflowException when you try to use the regex.

We ran into this problem in our own code when analysis tried  to find comments containing at least one word with more than a given number of characters. For this we used a regex similar to #(.|\n)*\w{3,}, which would cause the analysis to crash on source files containing sufficiently long comments. To prevent problems like this for our future selves as well as for you, dear users, we implemented a rule which can detect problems like this:

java:S5998 - Regular expressions should not overflow the stack

An issue is raised on a repetition in a regex which could lead to a stack overflow.

In theory any regex is susceptible to stack overflow if it contains a repetition that contains some sort of branching, such as another repetition or an alternative. However, for some regular expressions the length of input required to make them crash is much larger than for others.

To account for this, our rule tries to estimate how much stack space a regular expression will consume relative to the input size and only raises issues on regular expressions whose stack consumption exceeds a configurable threshold.

One way to avoid stack consumption in regular expressions is to use possessive quantifiers. Possessive quantifiers are created by adding a + to a quantifier (e.g. x*+ instead of x*). Doing so disables backtracking. In addition to avoiding issues with catastrophic backtracking, making a quantifier possessive allows a pattern to be matched without consuming stack space. The problem is that sometimes backtracking is necessary and using possessive quantifiers to disable it may leave you with a regex that can never match any input.

Consider for example the regex I cited above: #(.|\n)*\w{3,}: If we try to fix the stack overflow by making the quantifier possessive, we end up with #(.|\n)*+\w{3,}. This regex can never match anything because any input that could be matched by \w will already have been matched by (.|\n)*+ and, being possessive, it won’t give it back. Luckily we have a rule that warns you about issues like this:

java:S5994 - Regex patterns following a possessive quantifier should not always fail

An issue is raised on an impossible-to-reach sub-pattern

So how would one properly fix this issue? In our case, we decided to get rid of the regex altogether since there wasn’t really a good reason to use a regex here. But the regex could also be salvaged easily enough by getting rid of the alternation. The intent of .|\n was to match any character, including line breaks, because . does not match line breaks by default. The proper way to match any character would be to enable the DOTALL flag, which can also be enabled from within the regex using (?s). So either Pattern.compile(“#.*\\w{3,}”, Pattern.DOTALL) or ”(?s)#.*\\{3,}” would work (the latter of which also works when using methods that don’t involve Pattern.compile).

Another regex susceptible to stack overflows would be something as simple as (a|b)*. Here the alternation can be removed by using a character class instead, which doesn’t require stack space to match: [ab]*. Luckily we have a rule that finds alternations that can be replaced by character classes and suggests just that replacement like in the following Spring Boot code:

java:S6035 - Single-character alternations in regular expressions should be replaced with character classes

An issue is raised on an alternation can can be replaced with a character class.

Regular expressions are a truly powerful feature. Because great power comes with great responsibility (and great opportunities for screwing things up, or confusing your colleagues and future-you), we now offer a total of 24 rules targeting regular expressions. They are all available today in SonarCloud. These three rules around avoiding stack overflows are available starting from SonarQube 8.7, but the rest are already released. You can see them all at rules.sonarsource.com.

Finally, note that some of the rules perform in-depth automata-based analyses on your regular expressions to identify issues as accurately as possible. This is an extremely promising approach and we will probably discuss this feature in a future blog post, because we truly believe it’s amazing. This will for sure bring A LOT to regex code quality in general, so stay tuned!

---

This is the third installment in a series on what can go wrong in writing Regular Expressions:

Something to add? Join us in the community!