Regular expressions present challenges even for not-so-regular developers
Regular expressions are a concise and powerful tool for processing text. However, they also come with a steep learning curve and plenty of opportunities to make mistakes.
This is the first in a series of posts about some specific pitfalls of Java regular expressions that can lead to bugs, code that’s hard to understand, or worse: code that could crash your application. In this series we will give you some examples of issues in real code caused by these pitfalls, and discuss strategies (and rules!) for writing better, more readable and maintainable regular expressions. In this post I’ll start with pitfalls related to a very common feature of regular expressions: character classes.
Note that writing this blog post has been made possible thanks to the group effort of the whole SonarSource Java analysis team. Transforming our initial ideas into such features is a great collective achievement, which I’ll now share with you, speaking for the team!
Character classes allow the regex engine to match only one out of several characters. For instance:
- The character class [xy] can match either an x or a y.
- You can also use ranges inside character classes: [e-p] matches any character between e and p.
- You can invert (negate) a character class with ^: starting the class with a single ^ negates everything that follows in the class. So [^a-z] matches any character that's not a lowercase ASCII letter.
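The three forms above can be tried out with a minimal sketch like the following. The class and method names are made up for this illustration; only `java.util.regex.Pattern` comes from the JDK.

```java
import java.util.regex.Pattern;

final class CharacterClassBasics {
    // A set: exactly one character, either 'x' or 'y'.
    static final Pattern X_OR_Y = Pattern.compile("[xy]");
    // A range: exactly one character from 'e' to 'p', inclusive.
    static final Pattern E_TO_P = Pattern.compile("[e-p]");
    // A negated class: exactly one character that is not a lowercase ASCII letter.
    static final Pattern NOT_LOWERCASE = Pattern.compile("[^a-z]");

    // matches() requires the whole input to match, so only one-character inputs can pass.
    static boolean matchesOne(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("x in [xy]: " + matchesOne(X_OR_Y, "x"));
        System.out.println("xy in [xy]: " + matchesOne(X_OR_Y, "xy"));
        System.out.println("k in [e-p]: " + matchesOne(E_TO_P, "k"));
        System.out.println("q in [e-p]: " + matchesOne(E_TO_P, "q"));
        System.out.println("A in [^a-z]: " + matchesOne(NOT_LOWERCASE, "A"));
        System.out.println("a in [^a-z]: " + matchesOne(NOT_LOWERCASE, "a"));
    }
}
```

Note that "xy" is rejected by [xy]: a character class always consumes a single character, never a sequence.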
Where it starts to get tricky is that some characters have different meanings inside character classes than they do outside. The best example of this is probably the hyphen/minus character -, which gains the special meaning of creating ranges when used inside a character class. To match a literal -, you can escape it as \- or move the - to the beginning or end of the character class. Another example is the quantifiers. For instance, outside a character class, * means "repeated any number of times". Inside a character class it just means "asterisk".
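Here is a small sketch of those context-dependent meanings. The patterns are illustrative ones chosen for this example, not taken from any project.

```java
import java.util.regex.Pattern;

final class SpecialCharsInClasses {
    // Hyphen in last position: a literal '-', not a range operator.
    static final Pattern DIGIT_OR_HYPHEN = Pattern.compile("[0-9-]");
    // Escaped hyphen between two characters: the three literals 'a', '-' and 'z'.
    static final Pattern A_HYPHEN_OR_Z = Pattern.compile("[a\\-z]");
    // Unescaped hyphen between two characters: the range 'a' to 'z'.
    static final Pattern A_TO_Z = Pattern.compile("[a-z]");
    // Inside a class, '*' and '+' are plain characters.
    static final Pattern STAR_OR_PLUS = Pattern.compile("[*+]");
    // Outside a class, '*' is a quantifier: zero or more 'a'.
    static final Pattern ANY_NUMBER_OF_A = Pattern.compile("a*");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("[0-9-] accepts '-': " + matches(DIGIT_OR_HYPHEN, "-"));
        System.out.println("[a\\-z] accepts 'm': " + matches(A_HYPHEN_OR_Z, "m"));
        System.out.println("[a-z] accepts 'm': " + matches(A_TO_Z, "m"));
        System.out.println("[*+] accepts '*': " + matches(STAR_OR_PLUS, "*"));
        System.out.println("a* accepts '*': " + matches(ANY_NUMBER_OF_A, "*"));
    }
}
```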
You probably think this "Character Classes" concept is easy and well understood by developers. However, after running our analyzer on a few GitHub open-source projects, we realized that it might not be the case at all. So let's take a look at real code and see how creative developers can be!
Problem 1: Wrong use of separators
There is a lot of confusion around the | character. Outside of a character class, it is an alternation operator. So it would allow you to select "red" or "blue", like so: red|blue. But inside a character class, it's just a normal character with no special behavior. For example in this “mobile-phone number” matcher:
The author should replace [3|4|5|7|8] with [34578] in the pattern.
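To see why the pipes matter, here is a minimal sketch. The surrounding pattern (a leading 1, then nine trailing digits) is an assumption made up for this illustration; only the character class itself comes from the text above.

```java
import java.util.regex.Pattern;

final class PipeInCharacterClass {
    // Hypothetical mobile-number pattern: '1', one character from the class, nine digits.
    // The '|' characters are literal members of the class, not separators.
    static final Pattern BUGGY = Pattern.compile("1[3|4|5|7|8]\\d{9}");
    // Same pattern with the pipes removed from the class.
    static final Pattern FIXED = Pattern.compile("1[34578]\\d{9}");

    static boolean accepts(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String valid = "13912345678";
        String withPipe = "1|912345678";
        System.out.println("buggy accepts valid number: " + accepts(BUGGY, valid));
        System.out.println("buggy accepts number with '|': " + accepts(BUGGY, withPipe));
        System.out.println("fixed accepts number with '|': " + accepts(FIXED, withPipe));
    }
}
```

Both versions accept the well-formed number; only the buggy one also lets a | through as the second character.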
Other developers make the same mistake with commas, as in this example:
And the negation symbol ^ should only be used at the beginning of the character class and not before each element, like in this NanoHTTPD code:
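As a hypothetical stand-in for both mistakes (these are not the patterns from the projects mentioned, just the smallest classes that show the effect), consider:

```java
import java.util.regex.Pattern;

final class SeparatorsInCharacterClass {
    // Commas used as "separators": ',' becomes a member of the class.
    static final Pattern COMMA_SEPARATED = Pattern.compile("[a,b,c]");
    static final Pattern COMPACT = Pattern.compile("[abc]");
    // Only the first '^' negates; the second one is a literal '^' that is excluded too.
    static final Pattern REPEATED_CARET = Pattern.compile("[^a^b]");
    static final Pattern SINGLE_CARET = Pattern.compile("[^ab]");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("[a,b,c] accepts ',': " + matches(COMMA_SEPARATED, ","));
        System.out.println("[abc] accepts ',': " + matches(COMPACT, ","));
        System.out.println("[^a^b] accepts '^': " + matches(REPEATED_CARET, "^"));
        System.out.println("[^ab] accepts '^': " + matches(SINGLE_CARET, "^"));
    }
}
```

The repeated caret does not break the negation of a and b; it silently adds ^ to the list of rejected characters.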
Problem 2: Wrong character
A more subtle potential bug is the uppercase and lowercase mix in character ranges, like in the Apache Camel code:
Do you see the bug? Not the wrong | use, the other one? Because of the second lower-case z, the range [A-z] matches characters in the ASCII table from A to Z, plus [, \, ], ^, _, `, and adds from a to z on top of that. Isn't it strange? So now it should take you only one second to find a bug in this Elasticsearch code which is commented "defined by RFC7230 section 3.2.6" for this expression:
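The sketch below is not the Elasticsearch expression. It only illustrates the [A-z] pitfall itself by listing the ASCII characters that [A-z] accepts and [A-Za-z] does not. The class and method names are invented for this example.

```java
import java.util.regex.Pattern;

final class MixedCaseRange {
    static final Pattern UPPER_A_TO_LOWER_Z = Pattern.compile("[A-z]");
    static final Pattern LETTERS_ONLY = Pattern.compile("[A-Za-z]");

    // Collects every ASCII character accepted by [A-z] but rejected by [A-Za-z].
    static String extraAsciiCharacters() {
        StringBuilder extras = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            boolean inWideRange = UPPER_A_TO_LOWER_Z.matcher(s).matches();
            boolean isLetter = LETTERS_ONLY.matcher(s).matches();
            if (inWideRange && !isLetter) {
                extras.append(c);
            }
        }
        return extras.toString();
    }

    public static void main(String[] args) {
        System.out.println("Extra characters matched by [A-z]: " + extraAsciiCharacters());
    }
}
```

The result is the six characters that sit between Z and a in the ASCII table.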
Unfortunately, RFC 7230 does not allow [, \, ] in HTTP header field values, so it's definitely a bug. A similar bug could also occur when you want to match the character - and forget to escape it or move it to the first or last position in the class (where it would lose its special meaning). Can you spot which - character is wrong in the following Jenkins code?
USERINFO_CHARS_REGEX = "[a-zA-Z0-9%-._~!$&'()*+,;=]";
It's the one in the range %-.: it does not match 3 characters but the 10 characters %&'()*+,-. and, because most of these characters also appear later in the character class, we know that the range %-. was not intentional. Luckily, the accidental range happens to include - itself, so this expression still accepts the characters it was meant to, but sometimes this confusion can have a bigger impact:
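One way to check what a range really covers is to enumerate the ASCII characters a class accepts. The sketch below does that for the range %-. on its own, and compares the Jenkins class with a variant where the hyphen is escaped. The variant and the helper names are assumptions for illustration, not a proposed patch.

```java
import java.util.regex.Pattern;

final class AccidentalRange {
    // The class as written in the snippet above.
    static final Pattern AS_WRITTEN =
            Pattern.compile("[a-zA-Z0-9%-._~!$&'()*+,;=]");
    // Hypothetical variant with the hyphen escaped so that it is a literal.
    static final Pattern HYPHEN_ESCAPED =
            Pattern.compile("[a-zA-Z0-9%\\-._~!$&'()*+,;=]");

    // Returns, in ASCII order, every ASCII character the pattern accepts.
    static String asciiMatches(Pattern pattern) {
        StringBuilder accepted = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            if (pattern.matcher(String.valueOf(c)).matches()) {
                accepted.append(c);
            }
        }
        return accepted.toString();
    }

    public static void main(String[] args) {
        System.out.println("[%-.] covers: " + asciiMatches(Pattern.compile("[%-.]")));
        System.out.println("Same ASCII set with escaped hyphen: "
                + asciiMatches(AS_WRITTEN).equals(asciiMatches(HYPHEN_ESCAPED)));
    }
}
```

On the ASCII range both classes accept the same characters, because everything the accidental range pulls in is either listed explicitly elsewhere in the class or was wanted anyway.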
String safetextRegex = "^[a-zA-Z0-9 .,;-_€@$äÄöÖüÜ!?#&=]+$";
Nice variable name, but unfortunately this character class is most probably not as safe as expected by its initial writer. Indeed, here ;-_ does not match 3 characters, but 37!
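The count is easy to verify, and so is the practical consequence. In the sketch below, the regex is the one quoted above, with its non-ASCII characters written as Unicode escapes so the source compiles regardless of file encoding. The "fixed" variant assumes the author wanted a literal hyphen and moves it to the end of the class; that intent is a guess.

```java
import java.util.regex.Pattern;

final class SafeTextRange {
    // Same regex as above; the euro sign and umlauts are written as Unicode escapes.
    static final Pattern AS_WRITTEN = Pattern.compile(
            "^[a-zA-Z0-9 .,;-_\u20AC@$\u00E4\u00C4\u00F6\u00D6\u00FC\u00DC!?#&=]+$");
    // Assumed intent: ';', '-' and '_' as three literals (hyphen moved to the end).
    static final Pattern HYPHEN_LAST = Pattern.compile(
            "^[a-zA-Z0-9 .,;_\u20AC@$\u00E4\u00C4\u00F6\u00D6\u00FC\u00DC!?#&=-]+$");

    // Counts the ASCII characters accepted by a single character class.
    static int asciiCount(String characterClass) {
        Pattern pattern = Pattern.compile(characterClass);
        int count = 0;
        for (char c = 0; c < 128; c++) {
            if (pattern.matcher(String.valueOf(c)).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("Characters in [;-_]: " + asciiCount("[;-_]"));
        System.out.println("As written accepts <script>: "
                + AS_WRITTEN.matcher("<script>").matches());
        System.out.println("Hyphen last accepts <script>: "
                + HYPHEN_LAST.matcher("<script>").matches());
    }
}
```

The range from ; to _ includes < and >, which is why a string like <script> passes the "safe text" check as written.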
And don't forget that a range works on single characters, not on numbers: a character class matches one and only one character. If you want to match the characters '0' '1' '2' '3', you can use [0-3]. But what do you think the following Apache Hadoop code is supposed to match?
The extra 1 could just be a redundancy and not a bug. But if the intention was to match an ACL number as defined by Intel, from acl0 to acl31, then it's a bug. Likewise, matching uppercase and lowercase letters requires two character ranges, [A-Za-z], and not only one like in this Apache Geode code:
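To make the "ranges are not numbers" point concrete, here is a sketch built on a hypothetical pattern, acl[0-31]. It is not the Hadoop or Geode code, and the numeric alternative shown is just one possible way to accept acl0 through acl31.

```java
import java.util.regex.Pattern;

final class AclNumberRange {
    // Hypothetical pattern: the class [0-31] is the range 0-3 plus a redundant '1',
    // so it accepts exactly one digit among 0, 1, 2 and 3.
    static final Pattern SINGLE_DIGIT = Pattern.compile("acl[0-31]");
    // One possible way to accept the numbers 0 to 31 (without leading zeros):
    // 0-9 and 10-29 via [12]?[0-9], then 30 and 31 via 3[01].
    static final Pattern ZERO_TO_THIRTY_ONE = Pattern.compile("acl(?:[12]?[0-9]|3[01])");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"acl3", "acl4", "acl31", "acl32"}) {
            System.out.println(name + ": class=" + matches(SINGLE_DIGIT, name)
                    + ", numeric=" + matches(ZERO_TO_THIRTY_ONE, name));
        }
    }
}
```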
Problem 3: Wrong regex operator
Sometimes alternations like (jpg|png|gif) are wrongly written using character classes. Can you spot the bug in the following Alibaba's Tangram source code?
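As a hypothetical stand-in for this kind of mistake (not the Tangram code itself), assume a file-name check where the extension alternation was written with square brackets instead of parentheses:

```java
import java.util.regex.Pattern;

final class ExtensionAlternation {
    // Character class: any run of the characters j, p, g, |, n, i, f after the dot.
    static final Pattern WITH_CLASS = Pattern.compile(".+\\.[jpg|png|gif]+");
    // Group with alternation: exactly one of the three extensions.
    static final Pattern WITH_GROUP = Pattern.compile(".+\\.(jpg|png|gif)");

    static boolean matches(Pattern pattern, String fileName) {
        return pattern.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"photo.png", "photo.gnp", "photo.p|g", "photo.bmp"}) {
            System.out.println(name + ": class=" + matches(WITH_CLASS, name)
                    + ", group=" + matches(WITH_GROUP, name));
        }
    }
}
```

The class version accepts the real extensions, which is why such a bug can survive testing, but it also accepts any shuffle of those letters and even the pipe character.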
Good to know: * and ? are just normal characters when used in character classes and lose their meaning as quantifiers. So in this next example, why would you add a ? inside a character class?
String VALUE = "[[^\"]?]+"; // anything but a " in ""
It's a complicated way to write [^\"]+, and probably the intention was actually to write [^\"]*.
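A quick comparison of the three spellings, as a minimal sketch (the class and constant names are invented for the example). In Java, a class nested inside another class forms a union, so [[^\"]?] is "anything but a double quote" plus a redundant ?.

```java
import java.util.regex.Pattern;

final class NestedClassWithQuantifier {
    // The expression from the snippet above: union of [^"] and a literal '?'.
    static final Pattern AS_WRITTEN = Pattern.compile("[[^\"]?]+");
    // The simpler equivalent: one or more characters that are not '"'.
    static final Pattern ONE_OR_MORE = Pattern.compile("[^\"]+");
    // The probable intent: zero or more characters that are not '"'.
    static final Pattern ZERO_OR_MORE = Pattern.compile("[^\"]*");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String value : new String[] {"a?b", "a\"b", ""}) {
            System.out.println("'" + value + "': written=" + matches(AS_WRITTEN, value)
                    + ", plus=" + matches(ONE_OR_MORE, value)
                    + ", star=" + matches(ZERO_OR_MORE, value));
        }
    }
}
```

The first two patterns agree on all three inputs; only the * version accepts the empty value.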
The above bugs were found by our new rule java:S5869 - Character classes in regular expressions should not contain the same character twice. The initial goal of this rule was to spot tiny misunderstandings like:
But in the end, the findings far exceeded our expectations and will ultimately prevent some very painful bugs in your applications. S5869 is available today in SonarQube, SonarCloud and SonarLint.
It was Voltaire who first said that with great power comes great responsibility. But what we've learned in implementing rules for regular expressions is that with the great power of regular expressions come great challenges in writing them well. In this post I talked about what we found with rule S5869, but it's only one of the regex rules we've been working on. Next time I'll talk about regex boundaries and complexity.
This is the first installment in a series on what can go wrong in writing Regular Expressions:
- Regular expressions present challenges even for not-so-regular developers
- Setting the right (regex) boundaries is important
- Crafting regexes to avoid stack overflows