Home United States USA — software Parsing in Python: Tools and Libraries (Part 2)

Parsing in Python: Tools and Libraries (Part 2)

December 31, 2017

147

Learn about tools and libraries for parsing in Python, useful things to know about parsers, and types of languages and grammars.
Check out Part 1 here!
In simple terms, a grammar is a list of rules that define how each construct can be composed. For example, a rule for an « if » statement could specify that it must start with the « if » keyword, followed by a left parenthesis, an expression, a right parenthesis, and a statement.
A rule could reference other rules or token types. In the example of the « if » statement, the keyword « if » and the left and the right parenthesis were token types, while the expression and the statement were references to other rules.
The most-used format to describe grammars is the Backus-Naur form (BNF), which also has many variants, including the extended Backus-Naur form. The extended variant has the advantage of including a simple way to denote repetitions. A typical rule in a Backus-Naur grammar looks like this:
The is usually nonterminal, which means that it can be replaced by the group of elements on the right, __expression__. The element __expression__ could contain other nonterminal symbols, or terminal ones. Terminal symbols are simply the ones that do not appear as a anywhere in the grammar. A typical example of a terminal symbol is a string of characters, like class.
In the context of parsers, an important feature is support for left-recursive rules. This means that a rule could start with a reference to itself. This reference could be also indirect.
Consider, for example, arithmetic operations. An addition could be described as two expression(s) separated by the plus (+) symbol, but an expression could also contain other additions.
This description also match multiple additions like 5 + 4 + 3. That is because it can be interpreted as expression (5) (‘+’) expression (4+3) and then 4 + 3 itself can be divided into its two components.
The problem is that this kind of rule may not be used with some parser generators. The alternative is a long chain of expressions that also take care of the precedence of operators.
Some parser generators support direct left-recursive rules, but not indirect ones.
A regular language can be defined by a series of regular expressions, while a context-free one needs something more. A simple rule of thumb is that if a grammar of a language has recursive elements, it is not a regular language. For instance, as we said elsewhere, HTML is not a regular language. In fact, most programming languages are context-free languages.
Usually, a language and grammar correspond to each other. That is to say, there are regular grammars and context-free grammars that correspond respectively to regular and context-free languages. But to complicate matters, there is a relatively new (created in 2004) kind of grammar called the parsing expression grammar (PEG). These grammars are as powerful as context-free grammars, but according to their authors, they describe programming languages more naturally.
The main difference between PEG and CFG is that the ordering of choices is meaningful in PEG, but not in CFG. If there are many possible valid ways to parse an input, a CFG will be ambiguous and thus wrong. Instead, with PEG, the first applicable choice will be chosen, which automatically solves some ambiguities.
Another difference is that PEG uses scannerless parsers; they do not need a separate lexer or lexical analysis phase.
Traditionally, both PEG and some CFG have been unable to deal with left-recursive rules, but some tools have found workarounds for this — either by modifying the basic parsing algorithm or by having the tool automatically rewrite a left-recursive rule in a nonrecursive way. Both of these workarounds have their own downsides: either making the generated parser less intelligible or worsening its performance. However, in practical terms, the advantages of easier and quicker development outweigh the drawbacks.
Stay tuned for Part 3, where we’ll talk about parser generators!