Predictive Parsing: recursive-descent parsing in which the parser can "predict" which production rule to use by looking ahead at the upcoming tokens. Because the lookahead decides the rule, the parser
never guesses wrong
never backtracks
LL(k) grammar: left-to-right scan of the input, left-most derivation, k tokens of lookahead.
Predictive parsers accept LL(k) grammars. In practice, we only use k = 1.
To make a grammar predictive, we left-factor it so that the next token always tells us which rule to apply:
from:
E -> T + E | T
T -> int | int * T | (E)
to:
E -> TE'
E' -> +E | epsilon
T -> int T' | (E)
T' -> epsilon | * T
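The left-factored grammar above can be turned into a recursive-descent parser almost mechanically, with one function per non-terminal. Below is a minimal sketch, assuming the input is already tokenized into a list of strings such as "int", "+", "*", "(", ")". The names PredictiveParser, ParseError, peek and expect are made up for this illustration.

```python
class ParseError(Exception):
    """Raised when the lookahead token matches no production."""


class PredictiveParser:
    """Hypothetical predictive parser for:
    E -> T E'     E' -> + E | epsilon
    T -> int T' | ( E )     T' -> * T | epsilon
    """

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks end of input
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, tok):
        if self.peek() != tok:
            raise ParseError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def parse(self):
        self.E()
        self.expect("$")
        return True

    def E(self):  # E -> T E'
        self.T()
        self.E_prime()

    def E_prime(self):  # E' -> + E | epsilon
        if self.peek() == "+":
            self.expect("+")
            self.E()
        # any other lookahead: take the epsilon production

    def T(self):  # T -> int T' | ( E )
        if self.peek() == "int":
            self.expect("int")
            self.T_prime()
        elif self.peek() == "(":
            self.expect("(")
            self.E()
            self.expect(")")
        else:
            raise ParseError(f"unexpected token {self.peek()!r}")

    def T_prime(self):  # T' -> * T | epsilon
        if self.peek() == "*":
            self.expect("*")
            self.T()
        # any other lookahead: take the epsilon production
```

Every function picks its production by inspecting one token only, so no choice ever has to be undone.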
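We would like a lookup table that takes the non-terminal being expanded and the next input token, and tells us which rule to apply.
For example (writing X for E' and Y for T'), we would like the following table, where $ marks the end of input and an empty cell means a parse error: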
| | int | * | + | ( | ) | $ |
|---|---|---|---|---|---|---|
| E | TX | | | TX | | |
| X | | | +E | | eps | eps |
| T | int Y | | | (E) | | |
| Y | | * T | eps | | eps | eps |
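Once such a table exists, parsing is a loop over an explicit stack. Below is a minimal sketch, assuming the table for the E/X/T/Y grammar is written out by hand as a Python dict (the empty list stands for eps). The names TABLE, NONTERMINALS and ll1_parse are made up for this illustration.

```python
# Hand-written parsing table: (non-terminal, lookahead) -> right-hand side.
TABLE = {
    ("E", "int"): ["T", "X"],
    ("E", "("): ["T", "X"],
    ("X", "+"): ["+", "E"],
    ("X", ")"): [],
    ("X", "$"): [],
    ("T", "int"): ["int", "Y"],
    ("T", "("): ["(", "E", ")"],
    ("Y", "*"): ["*", "T"],
    ("Y", "+"): [],
    ("Y", ")"): [],
    ("Y", "$"): [],
}
NONTERMINALS = {"E", "X", "T", "Y"}


def ll1_parse(tokens, start="E"):
    """Return True if the token list is accepted, False otherwise."""
    tokens = list(tokens) + ["$"]  # "$" marks end of input
    stack = ["$", start]           # top of the stack is the end of the list
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:        # empty cell: parse error
                return False
            # Push the right-hand side so its first symbol ends up on top.
            stack.extend(reversed(rhs))
        elif top == look:          # terminal on the stack matches the input
            pos += 1
        else:
            return False
    return pos == len(tokens)
```

For instance, ll1_parse(["int", "*", "int"]) walks E, T, Y, T, Y, X through the table and accepts, while ll1_parse(["int", "+"]) hits the empty cell T[E, $] and rejects.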
But how can we generate the table?
For a production A \to \alpha, we populate T[A, t] = \alpha
in two cases:
First Set (t \in \text{First}(\alpha)): populate \iff \alpha \to^* t \beta
Follow Set (t \in \text{Follow}(A)): populate \iff S \to^* \beta A t \sigma \land A \to \alpha \land \alpha \to^* \epsilon
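First Set: given a string X (a mix of terminals and non-terminals), the first set of X (which contains terminals and possibly epsilon) is:
\text{First}(X) = \{t \mid X \to^* t\beta\} \cup \{\epsilon \mid X \to^* \epsilon\}
Intuitively, the first set tells you which terminals a string derived from X can begin with.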
Algorithm Sketch:
\text{First}(t) = \{t\} for a terminal t
\epsilon \in \text{First}(X) if X \to \epsilon \lor (X \to A_1 ... A_n \land (\forall 1 \leq i \leq n)(\epsilon \in \text{First}(A_i)))
\text{First}(\alpha) \subseteq \text{First}(X) if (X \to A_1 ... A_n\alpha \land (\forall 1 \leq i \leq n)(\epsilon \in \text{First}(A_i)))
Example: give the first set for the grammar below:
E -> TX
T -> (E) | int Y
X -> + E | eps
Y -> * T | eps
Firstly, the first set of terminals is their singleton set:
First(+) = {+}
First(*) = {*}
First(() = {(}
First()) = {)}
First(int) = {int}
The first set of a non-terminal contains the first set of the first symbol of each of its productions:
First(E) contains First(T)
First(T) contains First(() and First(int), so First(T) = {(, int}
Now since First(T) does not have eps, we should not add First(X) to First(E).
So First(E) = First(T) = {(, int}
First(X) = {+, eps}
First(Y) = {*, eps}
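The worked example can be checked with a small fixed-point computation: keep applying the First rules until no set grows. Below is a minimal sketch, assuming the grammar is stored as a dict from non-terminal to a list of right-hand sides, with the empty list standing for eps. The names GRAMMAR, first_of_string and compute_first are made up for this illustration.

```python
EPS = "eps"

# E -> TX, T -> (E) | int Y, X -> + E | eps, Y -> * T | eps
GRAMMAR = {
    "E": [["T", "X"]],
    "T": [["(", "E", ")"], ["int", "Y"]],
    "X": [["+", "E"], []],
    "Y": [["*", "T"], []],
}


def first_of_string(symbols, first):
    """First set of a sequence of symbols, using the sets computed so far."""
    result = set()
    for sym in symbols:
        # A symbol with no entry in `first` is a terminal: First(t) = {t}.
        sym_first = first.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result  # this symbol cannot vanish, so stop here
    result.add(EPS)        # every symbol can derive eps (or the string is empty)
    return result


def compute_first(grammar):
    """Iterate the First rules until nothing changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                new = first_of_string(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


FIRST = compute_first(GRAMMAR)
```

For this grammar FIRST["E"] and FIRST["T"] both come out as {"(", "int"}, FIRST["X"] as {"+", "eps"} and FIRST["Y"] as {"*", "eps"}, matching the sets derived by hand.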
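Follow Set: the terminals that can appear immediately after a non-terminal A in some derivation, i.e. \text{Follow}(A) = \{t \mid S \to^* \beta A t \sigma\}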
Observation:
\text{First}(B) - \{\epsilon\} \subseteq \text{Follow}(A) \land \text{Follow}(X) \subseteq \text{Follow}(B) if X \to AB
\text{Follow}(X) \subseteq \text{Follow}(A) if X \to AB \land B \to^* \epsilon
\$ \in \text{Follow}(S) if S is the start symbol.
Algorithm Sketch
\$ \in \text{Follow}(S)
\text{First}(\beta) - \{\epsilon\} \subseteq \text{Follow}(X) for each production A \to \alpha X \beta
\text{Follow}(A) \subseteq \text{Follow}(X) for each production A \to \alpha X \beta where \epsilon \in \text{First}(\beta)
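The three Follow rules above can also be run to a fixed point. Below is a minimal sketch for the E/X/T/Y grammar, assuming the same dict representation of the grammar (empty list for eps) and including the First computation so that the block stands on its own. All function and variable names are made up for this illustration.

```python
EPS, END = "eps", "$"

GRAMMAR = {
    "E": [["T", "X"]],
    "T": [["(", "E", ")"], ["int", "Y"]],
    "X": [["+", "E"], []],
    "Y": [["*", "T"], []],
}


def first_of_string(symbols, first):
    """First set of a sequence of symbols (terminals have First(t) = {t})."""
    result = set()
    for sym in symbols:
        sym_first = first.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                new = first_of_string(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, start):
    first = compute_first(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)                      # rule 1: $ in Follow(S)
    changed = True
    while changed:
        changed = False
        for lhs, productions in grammar.items():
            for rhs in productions:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:      # only non-terminals have Follow sets
                        continue
                    beta_first = first_of_string(rhs[i + 1:], first)
                    new = beta_first - {EPS}    # rule 2: First(beta) - {eps}
                    if EPS in beta_first:       # rule 3: beta can vanish
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FOLLOW = compute_follow(GRAMMAR, "E")
```

For this grammar the result is Follow(E) = Follow(X) = {")", "$"} and Follow(T) = Follow(Y) = {"+", ")", "$"}, which are exactly the columns where the table holds eps entries for X and Y.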
Constructing the parsing table: for each production rule A \to \alpha, do:
for each terminal t \in \text{First}(\alpha), T[A, t] = \alpha
if \epsilon \in \text{First}(\alpha), for each t \in \text{Follow}(A), T[A, t] = \alpha
if \epsilon \in \text{First}(\alpha) \land \$ \in \text{Follow}(A), T[A, \$] = \alpha
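The table-filling rules can be sketched directly in code. To keep the example short, the First and Follow sets of the E/X/T/Y grammar are written in by hand here (an assumption of this sketch; they are the sets the rules above produce for that grammar). The names build_table, FIRST, FOLLOW and TABLE are made up for this illustration.

```python
EPS, END = "eps", "$"

GRAMMAR = {
    "E": [["T", "X"]],
    "T": [["(", "E", ")"], ["int", "Y"]],
    "X": [["+", "E"], []],
    "Y": [["*", "T"], []],
}
# Hand-computed sets for the grammar above (assumed, not computed here).
FIRST = {"E": {"(", "int"}, "T": {"(", "int"}, "X": {"+", EPS}, "Y": {"*", EPS}}
FOLLOW = {"E": {")", END}, "X": {")", END},
          "T": {"+", ")", END}, "Y": {"+", ")", END}}


def first_of_string(symbols, first):
    """First set of a right-hand side (terminals have First(t) = {t})."""
    result = set()
    for sym in symbols:
        sym_first = first.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def build_table(grammar, first, follow):
    """Fill T[A, t] for every production A -> alpha; fail on a double entry."""
    table = {}
    for lhs, productions in grammar.items():
        for rhs in productions:
            rhs_first = first_of_string(rhs, first)
            targets = rhs_first - {EPS}       # t in First(alpha)
            if EPS in rhs_first:              # alpha can vanish:
                targets |= follow[lhs]        # use Follow(A), which may hold $
            for t in targets:
                if (lhs, t) in table:
                    raise ValueError(f"two entries for T[{lhs}, {t}]")
                table[(lhs, t)] = rhs
    return table


TABLE = build_table(GRAMMAR, FIRST, FOLLOW)
```

Here TABLE has 11 filled cells, the same ones as in the table shown earlier. Because Follow(A) already contains $ where appropriate, the separate $ rule is covered by the Follow case in this sketch.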
Note that an LL(1) parsing table can only be built for an LL(1) grammar: if any cell T[A, t] receives more than one entry, the grammar is not LL(1).
The only mechanical way to check whether a grammar is LL(1) is to build the parsing table. (Quick necessary checks include: unambiguous, not left-recursive, left-factored, and more.)
LL(1) grammars are too weak to describe many constructs of modern languages.
Bottom-up parsing is the preferred method and can be just as efficient. It is more general than (deterministic) top-down parsing.
Bottom-up Parsing: reduces a string to the start symbol by applying production rules in reverse (reduction).
A bottom-up parser traces a rightmost derivation in reverse.
Note that the derivation being traced always expands the right-most non-terminal first (e.g. in T + E we choose to expand E first).
Consequence of right-most derivation: when the working string is \alpha \beta \omega and the next production to apply in reverse is X \to \beta, then \omega must be a string of terminals.
So we can let \omega be the part of the input token stream that has not been read yet!
Shift: move the next token from the token stream onto the right end of our working string.
Reduce: apply a production rule in reverse at the right end of the working string, replacing its right-hand side with its left-hand-side non-terminal.
How do we know when and where to shift and reduce?
Shift-reduce Conflict: when the parser is free to either shift or reduce in the next step. (almost expected)
Reduce-reduce Conflict: when the parser could reduce by more than one production rule, indicating the grammar is bad.
Shift pushes a terminal onto the stack. Reduce pops 0 or more symbols off the stack (the right-hand side of a production) and pushes a non-terminal (its left-hand side) onto the stack.
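The stack mechanics of shift and reduce can be replayed on a small example. The sketch below does not decide when to shift or reduce; it assumes the action sequence is supplied by hand, for the grammar E -> T + E | T, T -> int | int * T | (E) and the input int * int + int. The names run_shift_reduce and ACTIONS are made up for this illustration.

```python
def run_shift_reduce(tokens, actions):
    """Replay a hand-written list of shift/reduce actions.

    Each action is either the string "shift" or a pair (lhs, rhs) meaning
    "reduce by lhs -> rhs". Returns the final stack and the unread input.
    """
    stack, pos = [], 0
    for action in actions:
        if action == "shift":
            stack.append(tokens[pos])      # push the next terminal
            pos += 1
        else:
            lhs, rhs = action
            cut = len(stack) - len(rhs)    # also correct when rhs is empty
            if cut < 0 or stack[cut:] != list(rhs):
                raise ValueError(f"cannot reduce by {lhs} -> {rhs}")
            del stack[cut:]                # pop the right-hand side
            stack.append(lhs)              # push the left-hand side
    return stack, tokens[pos:]


TOKENS = ["int", "*", "int", "+", "int"]
ACTIONS = [
    "shift", "shift", "shift",             # int * int
    ("T", ["int"]),                        # int * T
    ("T", ["int", "*", "T"]),              # T
    "shift", "shift",                      # T + int
    ("T", ["int"]),                        # T + T
    ("E", ["T"]),                          # T + E
    ("E", ["T", "+", "E"]),                # E
]
FINAL_STACK, REST = run_shift_reduce(TOKENS, ACTIONS)
```

After the last action FINAL_STACK is ["E"] and REST is empty. Reading the stack contents from the last reduction back to the first gives the rightmost derivation E, T + E, T + T, T + int, int * T + int, int * int + int.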