Home United States USA — software Parsing in Java (Part 1) : Structures, Trees, and Rules Parsing in...

Parsing in Java (Part 1) : Structures, Trees, and Rules Parsing in Java (Part 1) : Structures, Trees, and Rules

June 4, 2017

321

Join us as we embark on a mission to learn everything you ever wanted to know about Java parsers, including their architecture, ideal uses, and pitfalls.
If you need to parse a language, or document, from Java there are fundamentally three ways to solve the problem:
The first option is the best for well-known and supported languages, like XML or HTML. A good library usually also includes an API to programmatically build and modify documents in that language. This is typically more of what you get from a basic parser. The problem is that such libraries are not so common and they support only the most common languages. In other cases, you are out of luck.
You could need to go for the second option if you have particular needs. Both in the sense that the language you need to parse cannot be parsed with traditional parser generators, or you have specific requirements that you cannot satisfy using a typical parser generator. For instance, because you need the best possible performance or a deep integration between different components.
In all other cases, the third option should be the default one, because it is the one that is most flexible and has the shorter development time. That is why, in this article, we concentrate on the tools and libraries that correspond to this option.
We are going to see:
Parser generators (or parser combinators) are not trivial: You need some time to learn how to use them, and not all types of parser generators are suitable for all kinds of languages. That is why we have prepared a list of the best-known of them, with a short introduction for each of them. We are also concentrating on one target language: Java. This also means that (usually) the parser itself will be written in Java.
To list all possible tools and libraries parser for all languages would be kind of interesting, but not that useful. That is because there would be simply too many options, and we would all get lost in them. By concentrating on one programming language, we can provide an apples-to-apples comparison and help you choose one option for your project.
To make sure that this list is accessible to all programmers, we have prepared a short explanation of terms and concepts that you may encounter searching for a parser. We are not trying to give you formal explanations, but practical ones.
A lexer and a parser work in sequence: The lexer scans the input and produces the matching tokens, the parser scans the tokens and produces the parsing result.
Let’s look at the following example and imagine that we are trying to parse a mathematical operation.
The parser will typically combine the tokens produced by the lexer and group them.
It is now typical to find suites that can generate both a lexer and parser. In the past, it was instead more common to combine two different tools: One to produce the lexer and one to produce the parser. This was, for example, the case of the venerable lex & yacc couple: lex produced the lexer, while yacc produced the parser.
There are two terms that are related and sometimes they are used interchangeably: parse tree and Abstract SyntaxTree (AST) .
Conceptually they are very similar:
In the AST, some information is lost. For instance, comments and grouping symbols (parentheses) are not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.
A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementation of the parser. For instance, usually rules correspond to the type of a node. They are usually transformed in AST by the user, with some help from the parser generator.
A graphical representation of an AST looks like this.
Sometimes you may want to start producing a parse tree and then derive from it an AST. This can make sense because the parse tree is easier to produce for the parser (it is a direct representation of the parsing process) but the AST is simpler and easier to process via the following steps (and by the following steps, we mean all the operations that you may want to perform on the tree) : code validation, interpretation, compilation, etc..
A grammar is a formal description of a language that can be used to recognize its structure.
In simple terms, a grammar is a list of rules that define how each construct can be composed. For example, a rule for an if statement could specify that it must starts with the “if” keyword, followed by a left parenthesis, an expression, a right parenthesis, and a statement.
A rule could reference other rules or token types. In the example of the if statement, the keyword “if”, the left, and the right parenthesis were token types, while the expression and statement were references to other rules.
The is usually nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contain other nonterminal symbols or terminal ones. Terminal symbols are simply the ones that do not appear as a anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like “class”.
In the context of parsers, an important feature is support for left-recursive rules. This means that a rule could start with a reference to itself. This reference could be also indirect.
Consider for example arithmetic operations. An addition could be described as two expression (s) separated by the plus (+) symbol, but an expression could also contain other additions.
This description also matches multiple additions, like 5 + 4 + 3. That is because it can be interpreted as expression (5) ( ‘+’ ) expression (4+3) . And then 4 + 3 itself can be divided into its two components.
The problem is that these kinds of rules may not be used with some parser generators. The alternative is a long chain of expressions that takes care also of the precedence of operators.
Some parser generators support direct left-recursive rules, but not an indirect one.
A regular language can be defined by a series of regular expressions, while a context-free one needs something more. A simple rule of thumb is that if a grammar of a language has recursive elements it is not a regular language. For instance, as we said elsewhere, HTML is not a regular language. In fact, most programming languages are context-free languages.
Usually, there are regular grammars and context-free grammars that correspond respectively to regular and context-free languages.