Lecture 005

Parsing

Note that the language $\{(^i)^i \mid i \in \mathbb{N}\}$ is not regular, so it cannot be described by a regular expression, yet this kind of nested structure is common in programming languages.

Parser: takes a string of tokens and produces a parse tree.

Note that in some compilers, the lexer and the parser are combined into one component.

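Context-Free Grammar: a set of rules for rewriting strings. Wherever a non-terminal appears, it may be replaced using any of its production rules, regardless of the surrounding context, and every string of terminals obtained this way is in the language. A grammar consists of: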

a set of terminals: $T$ (in the context of programming languages, terminals are tokens)
a set of non-terminals: $N$
a start symbol $S \in N$
a set of productions: $X \to Y_1 \dots Y_n$ for $X \in N$ and $Y_i \in N \cup T \cup \{\epsilon\}$

How a Context-Free Grammar Works:

begin with the string consisting of only the start symbol $S$
replace any non-terminal $X$ in the string by the right-hand side of one of its productions
repeat until no non-terminals remain

Example: in COOL, the lowercase words are terminals and the UPPERCASE words are non-terminals.

EXPR -> if EXPR then EXPR else EXPR fi

EXPR -> while EXPR loop EXPR pool

EXPR -> id

The last one is just an identifier.
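The derivation procedure can be sketched in Python using these three COOL productions. This is a minimal sketch: the names `GRAMMAR` and `derive`, and the idea of passing a list of production indices, are assumptions made for illustration and are not part of COOL or the lecture.

```python
# Non-terminal -> list of alternatives; each alternative is a list of symbols.
# Any symbol that is not a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "EXPR": [
        ["if", "EXPR", "then", "EXPR", "else", "EXPR", "fi"],
        ["while", "EXPR", "loop", "EXPR", "pool"],
        ["id"],
    ],
}


def derive(start, choices):
    """Rewrite the left-most non-terminal once per entry in `choices`.

    Each entry is the index of the production to apply (hypothetical interface).
    Raises StopIteration if a choice is given but no non-terminal is left.
    """
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        # Position of the left-most non-terminal in the current string.
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        # Replace that one symbol by the right-hand side of the chosen production.
        form[i:i + 1] = GRAMMAR[form[i]][choice]
        steps.append(" ".join(form))
    return steps


steps = derive("EXPR", [1, 2, 2])
for step in steps:
    print(step)
```

The last line printed contains only terminals, so the derivation is finished at that point.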

So the language of a context-free grammar $G$ with start symbol $S$ is defined as follows, where $\to^{*}$ means "derives in any number of steps":

$L(G) = \{a_1 \dots a_n \mid a_i \in T \text{ for all } i \text{ and } S \to^{*} a_1 \dots a_n\}$

Example: balanced parentheses can be expressed as:
- start symbol: $S$
- productions: $S \to (S)$, $S \to \epsilon$
- terminals: $\{(, )\}$
- non-terminals: $\{S\}$
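A membership test for this grammar can be sketched as a small recursive function, one per non-terminal. The function names `match_S` and `in_language` are hypothetical. Note that this only answers yes or no; it does not build a tree.

```python
def match_S(s, pos):
    """Match S starting at s[pos]; return the end position, or None on failure.

    S -> ( S )   is tried when the next character is "(",
    S -> epsilon is used otherwise (it consumes nothing).
    """
    if pos < len(s) and s[pos] == "(":
        end = match_S(s, pos + 1)
        if end is not None and end < len(s) and s[end] == ")":
            return end + 1
        return None
    return pos  # epsilon production


def in_language(s):
    # The whole input must be consumed by a single S.
    return match_S(s, 0) == len(s)


print(in_language("(())"))
print(in_language("(()"))
```

With these two productions the language is exactly $\{(^i)^i\}$, so a string such as `()()` is rejected.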

So far, a context-free grammar only answers a decision problem (is the string in the language or not), but we need to build a parse tree. We also need good error handling.

Unlike with regular languages, it is sometimes necessary to rewrite the grammar into a different form before a context-free parser will accept it.

Parse Tree: leaves are terminals and interior nodes are non-terminals. (For a binary operation, the node has three children, since the operator token is included as a child.)

Left-most derivation: we always replace the left-most non-terminal
Right-most derivation: we always replace the right-most non-terminal
Other derivations: derivations that are neither left-most nor right-most also exist
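To see that the two derivation orders differ only in the order of steps, the sketch below reads both derivations off one parse tree. The tree encoding (a non-terminal is a tuple of name and children, a terminal is a string), the helper names, and the small grammar $E \to E + E$, $E \to id$ are assumptions chosen for illustration.

```python
def show(form):
    # Render a sentential form: non-terminal nodes by name, terminals as-is.
    return " ".join(x[0] if isinstance(x, tuple) else x for x in form)


def derivation(tree, leftmost=True):
    """List the sentential forms of the left-most or right-most derivation."""
    form = [tree]
    steps = [show(form)]
    while any(isinstance(x, tuple) for x in form):
        positions = [i for i, x in enumerate(form) if isinstance(x, tuple)]
        i = positions[0] if leftmost else positions[-1]
        # Expand the chosen non-terminal into its children.
        form[i:i + 1] = form[i][1]
        steps.append(show(form))
    return steps


# Parse tree of "id + id" under E -> E + E and E -> id.
tree = ("E", [("E", ["id"]), "+", ("E", ["id"])])

print(derivation(tree, leftmost=True))
print(derivation(tree, leftmost=False))
```

Both lists start at `E` and end at `id + id`; only the intermediate forms differ, and both come from the same tree.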

Ambiguous: a grammar is ambiguous if some string in its language has more than one parse tree

Example: id * id + id has more than one parse tree with respect to the following rules

$E \to E * E$
$E \to E + E$
$E \to id$
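The ambiguity can be checked by counting parse trees. The sketch below assumes the grammar also contains $E \to id$ (needed to derive any string of terminals) and counts the trees by trying every operator as the root of the span; `count_trees` is a hypothetical helper, not a standard parsing routine.

```python
from functools import lru_cache


def count_trees(tokens):
    """Count parse trees of `tokens` under E -> E * E | E + E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways E can derive tokens[i:j].
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        # Try each operator strictly inside the span as the top-level operator.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_trees(["id", "*", "id", "+", "id"]))
```

The two trees correspond to reading the string as (id * id) + id or as id * (id + id).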

When the grammar of a programming language is ambiguous, you are leaving it up to the compiler to pick one of many possible interpretations.

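To fix an ambiguous grammar, we can rewrite it to enforce an order between the operators by introducing an extra non-terminal $E'$ and using it in place of $E$ in some productions.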
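One possible rewrite, assumed here for illustration, is $E \to E' + E \mid E'$ and $E' \to id * E' \mid id$, which forces every $*$ to sit below every $+$. A minimal recursive-descent sketch of this grammar follows; the function names and the tuple-shaped result (an abstract tree rather than a full parse tree) are hypothetical choices.

```python
def parse_E(tokens, pos):
    # E -> E' + E | E'
    left, pos = parse_Eprime(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_E(tokens, pos + 1)
        return ("+", left, right), pos
    return left, pos


def parse_Eprime(tokens, pos):
    # E' -> id * E' | id
    if pos >= len(tokens) or tokens[pos] != "id":
        raise SyntaxError(f"expected id at position {pos}")
    if pos + 1 < len(tokens) and tokens[pos + 1] == "*":
        right, pos = parse_Eprime(tokens, pos + 2)
        return ("*", "id", right), pos
    return "id", pos + 1


def parse(tokens):
    tree, pos = parse_E(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree


print(parse(["id", "*", "id", "+", "id"]))
print(parse(["id", "+", "id", "*", "id"]))
```

In both results the product is a child of the sum, so each string now has a single interpretation. This particular rewrite also makes both operators group to the right.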

Example: we want

if
  E
then
  if E then E else E

where the else is associated with the closest unmatched then. The following rules are used:

E -> MIF // matched then
   | UIF // unmatched then
MIF -> if E then MIF else MIF
     | OTHER
UIF -> if E then E
     | if E then MIF else UIF

There is no general algorithm that automatically converts an ambiguous grammar into an unambiguous one. Usually, we instead keep the ambiguous grammar and declare the operators as left- or right-associative, e.g. %left +, %left *.

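However, the parser does not actually understand associativity. The declarations only tell it which choice to make in certain situations, so the result does not always behave like true associativity. Be cautious!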
