I’ve been doing a fair whack of regex processing in Java recently. I must admit that having a regex library in the language is very useful, and Java’s java.util.regex.Pattern implements a very complete regex library (all the “Perl5” stuff, non-greedy quantifiers, Unicode character classes, etc).

So I was completely surprised when it started throwing StackOverflowError. The following code highlights the problem (reproduced on Sun’s JDK1.5.0-b2 and JDK1.4.2_01). Any pattern like (x|y)* (an alternative wrapped in a star) is implemented recursively by Pattern. Each matched alternative uses about 5 method calls on the stack.

Ai Carumba!

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexpFun {
    public static void main(String[] args) {

        CharSequence largeString = makeLargeString();
        Pattern p = Pattern.compile("(x|y)*");
        Matcher m = p.matcher(largeString);
        System.out.println("matches? " + m.matches()); // throws StackOverflowError
    }

    private static CharSequence makeLargeString() {
        final int size = 1000;
        StringBuffer largeString = new StringBuffer(size);
        for (int i = 0; i < size/2; i++) {
            largeString.append("xy");
        }
        return largeString;
    }
}

Update: I know I could use [xy]* instead of (x|y)*, but my x and y are complicated regular expressions themselves.