“It’s not a damn RIGHT-SHIFT, it’s two GT’s”: my fight with parsing Generic Java
# 2003-03-19 21:52:05 -0500 | Java | No CommentsSo, JDK1.5 is going to have parameterised types (I’ve spoken about this before). Yippee! But writing a parser for the new syntax almost broke me.
Quick refresher on parsing and lexing
When your source code gets compiled, or when XML or HTML is read by a program, it needs to be parsed. That is, the program has to go from thinking about the file as a sequence of bytes to thinking about it as some more complicated structure.
This “parsing” process is usually broken up into two phases; lexing then parsing.
Lexing turns a sequence of characters into a sequence of more manageable chunks. For example, this sequence of characters:
int i; // a float float j = 1.0;
gets turned into this sequence of tokens:
intan keywordian identifier/name;punctuationfloatanother keywordjan identifier/name=operator1.0numeric literal;punctuation
Notice how the “superfluous” whitespace and comments got discarded, and we are left with higher-level “chunks”; ’tis all good.
Next comes parsing. This is usually done via parsing rules. Rules describe how tokens should be assembled together; rules are often written in terms of other rules, but there are always tokens “at the leaves”. For example, the above tokens will match two applications of this “variable declaration” rule:
varDecl : (identifier | builtinType) IDENT ( "=" expr)? ;
identifier : IDENT ("." IDENT)* ;
builtinType :
"void" | "boolean" | "byte" | "char" | "short" |
"int" | "float" | "long" | "double" ;
expr : ( literal | ... // and so on
)
… // and so on
I have used the ANTLR syntax. ANTLR is a parser generator: it takes the above description of the grammar and generates Java code that will actually lex and parse a sequence of bytes for you. (ANTLR can also generate C++ and C#, as well as generate a HTML description of your grammar.)
Summary: understanding complicated file structures (like a Java source file) often involves lexing and parsing. But instead of writing lexers and parsers ourselves, we use parser generators.
The “problem” with the generics syntax
Obviously, the javac compiler needs to be able
to parse Java files in order to compile them to bytecode. But there
are other applications that might need to do this too; for example
Clover and
Checkstyle.
ANTLR comes bundled with an
example Java grammar
which puts you on a good footing for parsing Java code (and the parsers in
Clover and Checkstyle are both based on this grammar).
Now, as one of the developers of Clover, it fell to me to investigate upgrading our parser to support generics. (I also gave a presentation to my local JUG on generics recently, so I was all primed and ready for this task.)
And the problem, in a nut shell, with parsing generics is code like this:
// This code is valid Java under JSR14
// two lists of lists of String
List < List < String > > list1 = ...;
List<List<String>> list2 = …;
List1 is declared with a profusion of whitespace, and is
easy to parse naively. List2, on the other hand, will not parse
with a naive parser at all! Why? Consider this naive recursive rule for parsing
a generic type:
genericType : identifier (typeParams)? ; typeParams : "<" genericType ">" ;
This is a rule for parsing tokens; and list1 will parse
because it has two distinguishable “>” tokens (greater-than or GT tokens).
However, list2 does not have two distinguishable tokens; instead
it has just one SR (shift-right) token! You would get a parser exception like “expecting
GT but found SR”.
Oh boy! That really threw my naive parser out the lexical window!
>> instead of and,
|| instead of or“, C++-derived ethos; but it
sure adds complexity to the language, and makes a mess of Java’s parse-ability.
Also note that there is a similar problem with the unsigned right-shift operator >>>.
There are no problems with the relational >= or the
shift-assignment operators >>= >>>= simply because you cannot assign
to a type.
So how *do* you specify a parser for generic Java?
I posted that very question to the antlr-interest mailing list. I got some helpful hints from Monty (an ANTLR mailing-list stalwart). There are probably a few solutions, but the “quick win” for me looked like this:
// this works by allowing ">", ">>" or ">>>" to
// terminate a generic expression, and it need not match any
// at all.
// But "at the end of the day" we count to make sure there
// were the right amount.
genericType : identifier (typeParams)? ;
typeParams : "<" genericType (typeParamsCloser)?
// put a special "semantic check" to make sure we saw the
// right number of '>' at the end of the day.
;
typeParamsCloser : (">" | ">>" | ">>>")
// put a special "action" here to count how many '>' we saw
;
You can seem my completed parser on the list.