Lecture 007

Predictive Parsing

Predictive Parsing: recursive descent, but the parser can "predict" which production rule to use (no backtracking needed)

LL(k) grammar: left-to-right scan of the input, left-most derivation, looking ahead k tokens.

Predictive parsers accept LL(k) grammars. In practice, we only use k = 1.

Predictive parsing just needs a grammar rewrite (left-factoring) so that we always know which rule to apply:

from:
  E -> T + E | T
  T -> int | int * T | (E)

to:
  E -> TE'
  E' -> +E | epsilon
  T -> int T' | (E)
  T' -> epsilon | * T
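
To make the "predict" step concrete, here is a minimal sketch (in Python; the helper names eat and peek and the "$" end-of-input marker are assumptions, not from the lecture) of a recursive-descent predictive parser for the rewritten grammar, where each function inspects one token of lookahead to pick its production:

# Minimal sketch of a predictive recursive-descent parser for the rewritten
# grammar. 'eat'/'peek' are hypothetical helpers; '$' marks end of input.

class ParseError(Exception):
    pass

def parse(tokens):
    tokens = tokens + ["$"]
    pos = 0

    def peek():
        return tokens[pos]

    def eat(expected):
        nonlocal pos
        if tokens[pos] != expected:
            raise ParseError(f"expected {expected}, got {tokens[pos]}")
        pos += 1

    def E():                 # E -> T E'
        T()
        E_prime()

    def E_prime():           # E' -> + E | epsilon   (predict on '+')
        if peek() == "+":
            eat("+")
            E()

    def T():                 # T -> int T' | ( E )   (predict on 'int' vs '(')
        if peek() == "int":
            eat("int")
            T_prime()
        elif peek() == "(":
            eat("(")
            E()
            eat(")")
        else:
            raise ParseError(f"unexpected token {peek()}")

    def T_prime():           # T' -> epsilon | * T   (predict on '*')
        if peek() == "*":
            eat("*")
            T()

    E()
    eat("$")                 # the whole input must be consumed

parse(["int", "*", "(", "int", "+", "int", ")"])   # accepts int * (int + int)

Each function corresponds to one non-terminal, and a single token of lookahead is enough to choose between its alternatives, which is exactly the LL(1) property.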

Parse Table

We would like to have a lookup table that, given the non-terminal we are currently expanding and the next input token, tells us which production rule to apply.

For example, for the grammar above (writing X for E' and Y for T'), we would like the following table, where rows are non-terminals and columns are the next input token ($ marks end of input):

        | int    | *    | +    | (      | )    | $
  ------+--------+------+------+--------+------+-----
  E     | TX     |      |      | TX     |      |
  X     |        |      | +E   |        | eps  | eps
  T     | int Y  |      |      | (E)    |      |
  Y     |        | * T  | eps  |        | eps  | eps

Example Parsing Using Parse Table

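As a sketch of how the table above drives a parse (assumptions: the tokens already come from the lexer, '$' marks end of input, and an empty right-hand side stands for an epsilon production), the parser keeps a stack of expected symbols: a terminal on top must match the next token, a non-terminal on top is replaced using the table:

# Minimal sketch: driving a parse of "int * int" with the table above.

table = {
    ("E", "int"): ["T", "X"],    ("E", "("): ["T", "X"],
    ("X", "+"):   ["+", "E"],    ("X", ")"): [],              ("X", "$"): [],
    ("T", "int"): ["int", "Y"],  ("T", "("): ["(", "E", ")"],
    ("Y", "*"):   ["*", "T"],    ("Y", "+"): [],  ("Y", ")"): [],  ("Y", "$"): [],
}
terminals = {"int", "*", "+", "(", ")", "$"}

def parse(tokens):
    tokens = tokens + ["$"]
    stack = ["$", "E"]                       # start symbol on top, $ at the bottom
    i = 0
    while stack:
        top = stack.pop()
        if top in terminals:                 # match: the expected terminal must be next
            if top != tokens[i]:
                raise SyntaxError(f"expected {top}, got {tokens[i]}")
            i += 1
        else:                                # predict: look up (non-terminal, lookahead)
            rhs = table.get((top, tokens[i]))
            if rhs is None:
                raise SyntaxError(f"no rule for ({top}, {tokens[i]})")
            stack.extend(reversed(rhs))      # push RHS so its leftmost symbol is on top

parse(["int", "*", "int"])                   # succeeds: int * int is in the language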

But how can we generate the table?

We populate T[A, t] = alpha (meaning: when expanding A with lookahead token t, use the production A -> alpha) in two cases: when t is in First(alpha), or when alpha can derive eps (i.e. eps is in First(alpha)) and t is in Follow(A).

First Set

First Set: given a string X (a mix of terminals and non-terminals), the first set (which contains terminals and possibly epsilon) is:

\text{First}(X) = \{t | X \to^* t\alpha\} \cup \{\epsilon | X \to^* \epsilon\}

Intuitively, the first set tells you which tokens can start a string derived from X, so looking at the next token tells us whether the code in front of us can be parsed from X.

Algorithm Sketch: the first set of a terminal t is just {t}. For a production X -> Y1 Y2 ... Yn, add First(Y1) (without eps) to First(X); if Y1 can derive eps, also add First(Y2), and so on; if every Yi can derive eps (or the production is X -> eps), add eps to First(X). Repeat over all productions until no first set changes.

Example: give the first set for the grammar below:

E -> TX
T -> (E) |  int Y
X -> + E | eps
Y -> * T | eps

First, the first set of each terminal is its singleton set:

First(+) = {+}
First(*) = {*}
First(() = {(}
First()) = {)}
First(int) = {int}

The first set of a non-terminal contains the first set of the first symbol of each of its productions:

First(E) contains First(T)
First(T) contains First(() and First(int), so First(T) = {(, int}

Now since First(T) does not have eps, we should not add First(X) to First(E).

So First(E) = First(T) = {(, int}

First(X) = {+, eps}
First(Y) = {*, eps}
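
The computation above can be written as a small fixpoint loop. Below is a minimal sketch (in Python; the dictionary representation of the grammar and the string "eps" for epsilon are assumptions for illustration):

# Minimal sketch: fixpoint computation of First sets for the example grammar.

EPS = "eps"
grammar = {
    "E": [["T", "X"]],
    "X": [["+", "E"], [EPS]],
    "T": [["(", "E", ")"], ["int", "Y"]],
    "Y": [["*", "T"], [EPS]],
}
terminals = {"int", "*", "+", "(", ")"}

first = {t: {t} for t in terminals}          # terminals: singleton sets
first.update({nt: set() for nt in grammar})  # non-terminals: start empty

changed = True
while changed:                               # iterate until no set grows
    changed = False
    for nt, productions in grammar.items():
        for rhs in productions:
            add = set()
            if rhs == [EPS]:
                add.add(EPS)
            else:
                for sym in rhs:              # scan the RHS left to right
                    add |= first[sym] - {EPS}
                    if EPS not in first[sym]:
                        break                # this symbol cannot vanish: stop
                else:
                    add.add(EPS)             # every symbol on the RHS can vanish
            if not add <= first[nt]:
                first[nt] |= add
                changed = True

# first["E"] == {"(", "int"}, first["X"] == {"+", "eps"}, etc.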

Follow Set

Follow Set: which tokens t can appear immediately after X in some derivation from the start symbol S:

\text{Follow}(X) = \{t | S \to^* \beta X t \delta\}

Observation: a follow set never contains eps. If there is a production X -> A B, then everything in First(B) (except eps) can follow A; everything that can follow X can also follow B, and can follow A as well if B can derive eps.

Algorithm Sketch: start with $ (end of input) in Follow(S) for the start symbol S. For each production A -> alpha X beta, add First(beta) (without eps) to Follow(X); if beta can derive eps (or is empty), also add Follow(A) to Follow(X). Repeat until no follow set changes.

Follow Set Example

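Below is a minimal sketch of the same style of fixpoint computation for the follow sets of the example grammar, reusing the First sets worked out above and assuming '$' as the end-of-input marker:

# Minimal sketch: fixpoint computation of Follow sets for the example grammar,
# reusing the First sets computed earlier. '$' marks end of input.

EPS = "eps"
grammar = {
    "E": [["T", "X"]],
    "X": [["+", "E"], [EPS]],
    "T": [["(", "E", ")"], ["int", "Y"]],
    "Y": [["*", "T"], [EPS]],
}
first = {
    "E": {"(", "int"}, "T": {"(", "int"}, "X": {"+", EPS}, "Y": {"*", EPS},
    "int": {"int"}, "*": {"*"}, "+": {"+"}, "(": {"("}, ")": {")"},
}

follow = {nt: set() for nt in grammar}
follow["E"].add("$")                         # $ can follow the start symbol

changed = True
while changed:
    changed = False
    for lhs, productions in grammar.items():
        for rhs in productions:
            for i, sym in enumerate(rhs):
                if sym not in grammar:       # only non-terminals get Follow sets
                    continue
                trailer = set()              # what can start the rest of the RHS
                rest_can_vanish = True
                for nxt in rhs[i + 1:]:
                    trailer |= first[nxt] - {EPS}
                    if EPS not in first[nxt]:
                        rest_can_vanish = False
                        break
                if rest_can_vanish:          # rest can vanish: Follow(lhs) applies too
                    trailer |= follow[lhs]
                if not trailer <= follow[sym]:
                    follow[sym] |= trailer
                    changed = True

# follow["E"] == {")", "$"}, follow["T"] == {"+", ")", "$"}, etc.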

Parsing Table Construction

For each production rule A \to \alpha, do: for every terminal t in First(alpha), add alpha to T[A, t]; if eps is in First(alpha), also add alpha to T[A, t] for every t in Follow(A) (including $ when $ is in Follow(A)).

Parsing Table Building Example

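Below is a minimal sketch of building the table for the example grammar from the First and Follow sets worked out above (with "eps" for epsilon and "$" for end of input, both assumptions for illustration). The asserts fire exactly when two productions compete for the same cell:

# Minimal sketch: building the LL(1) table from the First/Follow sets above.

EPS = "eps"
productions = [
    ("E", ["T", "X"]),
    ("X", ["+", "E"]), ("X", [EPS]),
    ("T", ["(", "E", ")"]), ("T", ["int", "Y"]),
    ("Y", ["*", "T"]), ("Y", [EPS]),
]
first = {
    "E": {"(", "int"}, "T": {"(", "int"}, "X": {"+", EPS}, "Y": {"*", EPS},
    "int": {"int"}, "*": {"*"}, "+": {"+"}, "(": {"("}, ")": {")"}, EPS: {EPS},
}
follow = {"E": {")", "$"}, "X": {")", "$"}, "T": {"+", ")", "$"}, "Y": {"+", ")", "$"}}

def first_of(alpha):
    out = set()
    for sym in alpha:
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    out.add(EPS)                             # the whole string can derive eps
    return out

table = {}
for lhs, rhs in productions:
    f = first_of(rhs)
    for t in f - {EPS}:                      # case 1: t can start alpha
        assert (lhs, t) not in table, "conflict: not an LL(1) grammar"
        table[(lhs, t)] = rhs
    if EPS in f:                             # case 2: alpha can vanish, t follows lhs
        for t in follow[lhs]:
            assert (lhs, t) not in table, "conflict: not an LL(1) grammar"
            table[(lhs, t)] = rhs

# table[("E", "int")] == ["T", "X"], table[("Y", "+")] == ["eps"], etc.

Running the same construction on the original, non-left-factored grammar E -> T + E | T would put both productions into T[E, int] and T[E, (], tripping the conflict check; that is what the invalid-table example below illustrates.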

Note that an LL(1) parsing table can only be built for an LL(1) grammar: otherwise some table cell would end up holding more than one production.

Example of non-LL(1) invalid Parsing Table


The only mechanical way to check whether a grammar is LL(1) is to build the parsing table (although necessary quick checks include: unambiguous, not left-recursive, left-factored, and more).

LL(1) grammars are too weak to describe modern languages.

Bottom-Up Parsing

Bottom-up parsing is the preferred method in practice and can be just as efficient. It is more general than (deterministic) top-down parsing.

Bottom-up Parsing: reduce the input string to the start symbol by applying production rules in reverse; each such backwards step is called a reduction.

Bottom-up parser traces a rightmost derivation in reverse.

Bottom-up Parse Tree


Note that the derivation being traced is rightmost (e.g. in T + E, the non-terminal E is expanded first).
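
For example, with the original grammar E -> T + E | T and T -> int | int * T | (E), one rightmost derivation of int * int + int is:

  E -> T + E
    -> T + T
    -> T + int
    -> int * T + int
    -> int * int + int

A bottom-up parser produces exactly these sentential forms in the reverse order, one reduction per step.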

Shift-Reduce Parsing

Consequence of tracing a rightmost derivation: if the working string is \alpha \beta \omega and the next production rule to apply in reverse is X \to \beta, then \omega must be a string of terminals (everything to the right of the rightmost non-terminal is already terminal).

So we can let \omega, being all terminals, simply be the unread part of the input stream of tokens!

Shift: move the next token from the token stream onto the right end of our working string.

Reduce: apply a production rule in reverse, replacing its right-hand side at the right end of the working string with its left-hand side.

Example of Shift-Reduce Parsing

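As a concrete sketch, here is one plausible sequence of shift-reduce moves for int * int + int with the original grammar (E -> T + E | T, T -> int | int * T | (E)). The action sequence is hard-coded, since deciding it automatically is exactly the question below:

# Minimal sketch: replaying shift-reduce moves for "int * int + int".
# The left of the printed "|" is the stack, the right is the unread input.

tokens = ["int", "*", "int", "+", "int"]

actions = [                       # ("shift",) or ("reduce", lhs, length of rhs)
    ("shift",), ("shift",), ("shift",),
    ("reduce", "T", 1),           # T -> int
    ("reduce", "T", 3),           # T -> int * T
    ("shift",), ("shift",),
    ("reduce", "T", 1),           # T -> int
    ("reduce", "E", 1),           # E -> T
    ("reduce", "E", 3),           # E -> T + E
]

stack, rest = [], list(tokens)
for act in actions:
    if act[0] == "shift":
        stack.append(rest.pop(0)) # move the next input token onto the stack
    else:
        _, lhs, n = act
        stack[-n:] = [lhs]        # pop the RHS symbols, push the LHS
    print(" ".join(stack).ljust(20), "|", " ".join(rest))

assert stack == ["E"] and not rest  # the whole input reduced to the start symbol

Concatenating the stack and the remaining input at each step gives exactly the sentential forms of a rightmost derivation, in reverse.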

How do we know when and where to shift and reduce?

Shift-reduce Conflict: when the parser is free to choose either to shift or to reduce in the next step. (almost expected)

Reduce-reduce Conflict: when the parser could perform more than one possible reduction, usually indicating the grammar is bad.

Shift pushes a terminal onto the stack. Reduce pops 0 or more symbols (a production's right-hand side) off the stack and pushes the produced non-terminal onto the stack.
