Detecting and Preventing ReDoS Vulnerabilities, Part 2

JRoller post from 1. March 2011 (the original post is available on the internet archive):

In my previous post on ReDoS, I tried to explain the problem and proposed a solution for fixing regular expression engines by reducing the execution time and recursion depth of the execution. Now I want to explain some strategies to detect vulnerable expressions. In general, there are three approaches to achieve this:

  1. Fuzzing;
  2. Generating sample inputs for the expression from the syntax and testing those against the engine;
  3. Static analysis of the expression syntax.

All approaches have their pros and cons and I will dig into them in more detail.

Fuzzing

Fuzzing is a very simple approach by which the execution of an expression is tested against random samples of input. Fuzzing has been a very popular way of finding vulnerabilities in all sorts of cases. The advantage is that it is very simple to write a fuzzer. The disadvantage is that most samples don’t even faintly match the expression so it might take very long to find an input that exposes a vulnerability. The second problem is that you can find vulnerabilities but cannot show that an expression is not vulnerable. Microsoft’s SDL Regex Fuzzer is an example of such a tool.

Generating Sample Inputs from the Expression Syntax

This approach is similar to model checking, a technique used in program validation. In short, the tool checks every possible state the NFA can be in and thus verifies that the engine evaluates the expression in reasonable time.

In the case of regular expressions, all we need to do is parse the expression and generate an abstract syntax tree. From the syntax tree, we generate all possible input samples that match the expression. Then, we add an end character to each sample that forces the engine to go through all possible paths of evaluation and if the engine behaves well for all possible samples, we can show with some confidence that the expression is not vulnerable.

As an example, we want to check the following expression:

^a(b|c)$

Now we generate all possible input samples that match the expression:

ab
ac

Then, we add an end character to each sample such that it doesn’t match the expression:

abX
acX

Finally, we test each of this samples against the engine and thus can be pretty sure that the expression is not vulnerable.

The limitation here is that we assume that there is exactly one non matching end character that causes the engine to freeze. The assumption is somewhat justified by the examples discovered by several researchers but we can’t be sure that there isn’t an input that doesn’t match our assumption but still exploits a vulnerability.

I have recently written a tool that does exactly this. It’s called saferegex and I published it under Apache 2.0 license. Although the tool is still in early stages it finds vulnerable expression with usually reasonable effort. I had to apply several tweaks to reduce the number of generated samples but nontheless, it’s pretty quick compared to fuzzers. Here’s a sample session:

> java -jar saferegex.jar "abc|def"
2 samples found.

Tests: 2
Broken samples: 0
This expression is not vulnerable.

Now for a more interesting expression:

> java -jar saferegex.jar "(a|a?)+"
Testing: (a|a?)+
More than 10000 samples found.
******
This expression is vulnerable.
Sample input: aaaaaaaaaaaaaaaaaaaaaab

As the tool is in its early stages, the current implementation has several limitations. The most important one is that only basic regular expressions are partly supported. So there is currently no support for word boundaries, backreferences, POSIX and Unicode character classes. However, there are neither theoretical nor practical reasons why they couldn’t be implemented so the present solution serves pretty well as a proof of concept.

Static Analysis of the Expression Syntax

With this approach, the expression is parsed and transformed into an abstract syntax tree. Then, the syntax tree is analyzed according to some predefined patterns. The regular expression engine itself doesn’t get directly involved in the analysis. The advantage of this approach is that it’s much faster than all the other approaches. A necessary precondition for designing such a tool is that reliable patterns exist for expressions that lead to those vulnerabilities. In the moment, the following patterns are known and could be implemented in a static analysis tool:

  1. Overlapping repetitions such as a*ab*. Here, the character a matches the first expression and partly the second.
  2. Nested repetitions such as (a+)+.
  3. Repetitions with overlapping options such as (a|aa)+.

Another advantage of static analysis is that such a tool can point the developer to the cause of problem. This would make it easy for developers to fix the vulnerable expression. On the other hand, there are several problems with this approach that are all but easy to solve:

  • At first sight, the rules above look simple but it is all but trivial to generalize them and apply them to complex real world expressions.
  • The limitations mentioned above could lead to a high rate of false positives and negatives. I have analyzed some expressions that seem vulnerable according to the rules above but turned out to be safe as far as I can tell.
  • Regular expression engines differ in their implementations and expressions that are vulnerable on one engine aren’t necessary vulnerable on others.

Despite all the disadvantages, I think that static analysis is a very promising approach and the tool mentioned in the previous section is already doing the first step that’s required for static analysis: Parsing the expression and turning it into a syntax tree.

Conclusion

ReDoS vulnerabilities seem easy to detect at first but once you dig deeper into it, things are getting pretty complicated. I am convinced that in the next several months and years, new tools will be invented and existing ones will be improved to reliably detect those vulnerabilities. However, the safest prevention mechanism would be to fix the regular expression engines themselves, as I have illustrated in the previous post. Browsers such as Firefox already stop script execution if it takes too long so it already has a built in mechanism to cope with the problem to some degree. Now regular expression engines in Java and other languages should follow this path and interrupt execution automatically after some time.

One thought on “Detecting and Preventing ReDoS Vulnerabilities, Part 2

Leave a comment