madbean

"It's not a damn RIGHT-SHIFT, it's two GT's": my fight with parsing Generic Java

19 Mar 2003

So, JDK1.5 is going to have parameterised types (I've spoken about this before). Yippee! But writing a parser for the new syntax almost broke me.

Quick refresher on parsing and lexing

When your source code gets compiled, or when XML or HTML is read by a program, it needs to be parsed. That is, the program has to go from thinking about the file as a sequence of bytes to thinking about it as some more complicated structure.

This "parsing" process is usually broken up into two phases; lexing then parsing.

Lexing turns a sequence of characters into a sequence of more manageable chunks. For example, this sequence of characters:

int i;
// a float
float j = 1.0;

gets turned into this sequence of tokens:

  • int an keyword
  • i an identifier/name
  • ; punctuation
  • float another keyword
  • j an identifier/name
  • = operator
  • 1.0 numeric literal
  • ; punctuation

Notice how the "superfluous" whitespace and comments got discarded, and we are left with higher-level "chunks"; 'tis all good.

Next comes parsing. This is usually done via parsing rules. Rules describe how tokens should be assembled together; rules are often written in terms of other rules, but there are always tokens "at the leaves". For example, the above tokens will match two applications of this "variable declaration" rule:

varDecl : (identifier | builtinType) IDENT ( "=" expr)? ;
identifier : IDENT ("." IDENT)* ;
builtinType :
    "void" | "boolean" | "byte" | "char" | "short" |
    "int" | "float" | "long" | "double" ;
expr : ( literal | ... // and so on
       )
... // and so on

I have used the ANTLR syntax. ANTLR is a parser generator: it takes the above description of the grammar and generates Java code that will actually lex and parse a sequence of bytes for you. (ANTLR can also generate C++ and C#, as well as generate a HTML description of your grammar.)

Summary: understanding complicated file structures (like a Java source file) often involves lexing and parsing. But instead of writing lexers and parsers ourselves, we use parser generators.

The "problem" with the generics syntax

Obviously, the javac compiler needs to be able to parse Java files in order to compile them to bytecode. But there are other applications that might need to do this too; for example Clover and Checkstyle. ANTLR comes bundled with an example Java grammar which puts you on a good footing for parsing Java code (and the parsers in Clover and Checkstyle are both based on this grammar).

Now, as one of the developers of Clover, it fell to me to investigate upgrading our parser to support generics. (I also gave a presentation to my local JUG on generics recently, so I was all primed and ready for this task.)

And the problem, in a nut shell, with parsing generics is code like this:

// This code is valid Java under JSR14
// two lists of lists of String
List < List < String > > list1 = ...;
List<List<String>> list2 = ...;

List1 is declared with a profusion of whitespace, and is easy to parse naively. List2, on the other hand, will not parse with a naive parser at all! Why? Consider this naive recursive rule for parsing a generic type:

genericType : identifier (typeParams)? ;
typeParams : "<" genericType ">" ;

This is a rule for parsing tokens; and list1 will parse because it has two distinguishable ">" tokens (greater-than or GT tokens). However, list2 does not have two distinguishable tokens; instead it has just one SR (shift-right) token! You would get a parser exception like "expecting GT but found SR".

Oh boy! That really threw my naive parser out the lexical window!

The new generics syntax using angle-brackets is concise and probably fits in with the whole curly-bracing, ">> instead of and, || instead of or", C++-derived ethos; but it sure adds complexity to the language, and makes a mess of Java's parse-ability.

Also note that there is a similar problem with the unsigned right-shift operator >>>. There are no problems with the relational >= or the shift-assignment operators >>= >>>= simply because you cannot assign to a type.

So how *do* you specify a parser for generic Java?

I posted that very question to the antlr-interest mailing list. I got some helpful hints from Monty (an ANTLR mailing-list stalwart). There are probably a few solutions, but the "quick win" for me looked like this:

// this works by allowing ">", ">>" or ">>>" to
// terminate a generic expression, and it need not match any
// at all.
// But "at the end of the day" we count to make sure there
// were the right amount.
genericType : identifier (typeParams)? ;
typeParams : "<" genericType (typeParamsCloser)?
    // put a special "semantic check" to make sure we saw the
    // right number of '>' at the end of the day.
    ;
typeParamsCloser : (">" | ">>" | ">>>")
    // put a special "action" here to count how many '>' we saw
    ;

You can seem my completed parser on the list.

  • Home
  • Blog