Lecture 006

Error Handling

Error kind	Example	Detected by
Lexical	... $ ...	Lexer
Syntax	... x * % ...	Parser
Semantic	... int x; y = x(3); ...	Type Checker

An error handler should:

report errors accurately and clearly
recover quickly from an error
not slow down the compilation of valid code

Panic Mode

Panic Mode: when an error is detected, skip tokens and continue parsing from the next token that has a clear meaning (a synchronizing token, such as a statement terminator).

Bison: use the special terminal error in a production to describe how much input to skip, for example

E -> int | E + E | (E) | error int | (error)
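
A hand-written parser can apply the same idea without Bison. The sketch below is only an illustration under assumptions, not Bison's mechanism: the token kinds, synchronize, and parse_program are hypothetical names, and the toy grammar is a sequence of statements of the form ID = INT ;. On an error it counts one report, skips past the next ; and resumes parsing there.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds for a toy language of statements: ID = INT ; */
typedef enum { T_ID, T_ASSIGN, T_INT, T_SEMI, T_BAD, T_EOF } Token;

/* Panic-mode recovery: discard tokens up to and including the next ';'
   (the synchronizing token), or stop at end of input. */
static const Token *synchronize(const Token *p) {
    while (*p != T_EOF && *p != T_SEMI) p++;
    if (*p == T_SEMI) p++;
    return p;
}

/* Parse a T_EOF-terminated token array and return the number of errors.
   After each error the parser resynchronizes instead of giving up. */
int parse_program(const Token *p) {
    int errors = 0;
    assert(p != NULL);
    while (*p != T_EOF) {
        /* Short-circuit evaluation keeps the lookahead inside the array. */
        if (p[0] == T_ID && p[1] == T_ASSIGN && p[2] == T_INT && p[3] == T_SEMI) {
            p += 4;              /* well-formed statement */
        } else {
            errors++;            /* report once, then skip to a safe point */
            p = synchronize(p);
        }
    }
    return errors;
}
```

For the token sequence of x = 1 ; y 2 ; z = 3 ; this counts one error for the middle statement and still checks the third one.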

Error Productions

Error Production: add productions that match common errors, so that they are parsed just like regular syntax and can be reported precisely.

Example: to detect the error 5x written instead of 5 * x, add the production E -> ... | E E

Error productions complicate the grammar, but they are used in practice: in C++, for example, they let the compiler issue a warning and still produce an executable.
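
As a rough sketch of the 5x example, the function below parses a product of factors and treats a missing * as an error production: it is accepted, but counted as a warning. The names Kind, K_INT, K_ID, K_TIMES, K_EOF, and parse_product are made up for this illustration, and the grammar is reduced to products only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds. */
typedef enum { K_INT, K_ID, K_TIMES, K_EOF } Kind;

static int is_factor(Kind k) { return k == K_INT || k == K_ID; }

/* Regular syntax:    P -> F | P * F
   Error production:  P -> P F      (missing '*', accepted with a warning)
   Returns the number of warnings, or -1 on an unrecoverable syntax error. */
int parse_product(const Kind *p) {
    int warnings = 0;
    assert(p != NULL);
    if (!is_factor(*p)) return -1;
    p++;
    while (*p != K_EOF) {
        if (*p == K_TIMES) {
            p++;                 /* regular production */
        } else {
            warnings++;          /* error production: read "5 x" as "5 * x" */
        }
        if (!is_factor(*p)) return -1;
        p++;
    }
    return warnings;
}
```

With this sketch, the tokens of 5 x yield one warning, 5 * x yields none, and 5 * is still rejected.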

Automatic Local or Global Correction

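The compiler can try to fix the error itself. This is not commonly used.

Error Correction:

try inserting or deleting tokens, preferring repairs with a small edit distance
exhaustive search over the candidate repairs

It is hard to implement, slows down parsing of correct programs, and can produce a program that behaves differently from what the programmer intended.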
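
To make the search concrete, here is a minimal sketch that exhaustively tries every single-token deletion and then every single-token insertion (edit distance 1) until the input parses. Everything in it is an assumption for illustration: the toy grammar E -> INT { PLUS INT }, and the names Tok, MAX_TOKENS, accepts, and repair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { C_INT, C_PLUS, C_EOF } Tok;   /* hypothetical token kinds */
enum { MAX_TOKENS = 32 };                    /* capacity of the out buffer */

/* Toy grammar: E -> INT { PLUS INT }, input terminated by C_EOF. */
static int accepts(const Tok *p) {
    if (*p++ != C_INT) return 0;
    while (*p == C_PLUS) {
        p++;
        if (*p++ != C_INT) return 0;
    }
    return *p == C_EOF;
}

/* in holds n tokens followed by C_EOF; out must hold MAX_TOKENS tokens.
   Returns the edit distance of the first repair found (0 or 1) and leaves
   the repaired tokens in out, or returns -1 if no single edit works. */
int repair(const Tok *in, size_t n, Tok *out) {
    assert(in != NULL && out != NULL);
    if (n + 2 > MAX_TOKENS) return -1;
    memcpy(out, in, (n + 1) * sizeof *in);
    if (accepts(out)) return 0;
    for (size_t i = 0; i < n; i++) {              /* delete token i */
        memcpy(out, in, i * sizeof *in);
        memcpy(out + i, in + i + 1, (n - i) * sizeof *in);
        if (accepts(out)) return 1;
    }
    for (size_t i = 0; i <= n; i++) {             /* insert t before token i */
        for (int t = C_INT; t <= C_PLUS; t++) {
            memcpy(out, in, i * sizeof *in);
            out[i] = (Tok)t;
            memcpy(out + i + 1, in + i, (n - i + 1) * sizeof *in);
            if (accepts(out)) return 1;
        }
    }
    return -1;
}
```

For the input INT INT, the first repair this search finds is to delete one INT, although the programmer may well have meant INT PLUS INT. This is a small instance of the unintended behavior mentioned above, and every candidate costs a full re-parse.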

In the past, recompilation was slow, so compilers needed to detect as many errors as possible in one run, and perhaps help fix them. Now compilation is fast and interactive, so users tend to correct one error per cycle.

Abstract Syntax Tree

Top Down Parsing: recursive descent algorithm (RDA)

read the token stream from left to right
for each non-terminal, try its production rules in order, from the first to the last
if a rule does not match, backtrack and try the next one

Implementing Production Rule

Terminal Checker

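TOKEN *next;  /* next unread token in the input */

/* Consume one token and report whether it is tok. */
bool term(TOKEN tok) {
  return *next++ == tok;
}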

To implement production rules:

E -> T | T + E
T -> int | int * T | (E)

We write:

bool E1() {return T();}
bool E2() {return T() && term(PLUS) && E();}

bool E() {
  TOKEN *save = next;
  return (E1()) || (next = save, E2());
}

bool T1() {return term(INT);}
bool T2() {return term(INT) && term(TIMES) && T();}
bool T3() {return term(OPEN) && E() && term(CLOSE);}

bool T() {
  TOKEN *save = next;
  return T1()
     || (next = save, T2())
     || (next = save, T3());
}

The above is incorrect: for the input int * int, T1 matches the first int, so T and then E return true with * int still unconsumed, and the input is rejected. Once T1 has succeeded, T has returned, so there is no way to backtrack into T and try T2 instead.

The above algorithm is sufficient only for grammars where, for any non-terminal, at most one production can succeed. (In the case above, both int and int * T can succeed for T.)

// QUESTION: what to do?
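
One standard answer to the question above is left factoring: rewrite the grammar so that alternatives sharing a prefix are merged and the choice is made after the prefix has been consumed. The sketch below assumes the factored grammar E -> T E', E' -> + E | epsilon, T -> int T' | (E), T' -> * T | epsilon. The names E_rest, T_rest, and parse, and the END token, are introduced here for illustration. It is one possible fix, not necessarily the one the lecture had in mind.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { INT, PLUS, TIMES, OPEN, CLOSE, END } TOKEN;

static const TOKEN *next;            /* next unread token */

static bool E(void);
static bool T(void);

static bool term(TOKEN tok) { return *next++ == tok; }

/* E' -> + E | epsilon: try the longer alternative first, else match nothing. */
static bool E_rest(void) {
    const TOKEN *save = next;
    return (term(PLUS) && E()) || (next = save, true);
}

/* T' -> * T | epsilon */
static bool T_rest(void) {
    const TOKEN *save = next;
    return (term(TIMES) && T()) || (next = save, true);
}

/* E -> T E' */
static bool E(void) { return T() && E_rest(); }

/* T -> int T' | ( E ) */
static bool T(void) {
    const TOKEN *save = next;
    return (term(INT) && T_rest())
        || (next = save, term(OPEN) && E() && term(CLOSE));
}

/* Accept only if E matches and every token up to END was consumed. */
bool parse(const TOKEN *input) {
    assert(input != NULL);
    next = input;
    return E() && *next == END;
}
```

With this version the tokens of int * int are accepted, because the parser no longer has to decide between int and int * T before it has seen the token after int.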

Left Recursion

Consider this production rule: S -> Sa | b. This is a valid rule.

However, when we try to write the code, we get infinite recursion: S calls S1, which calls S again without consuming any input.

bool S1() {return S() && term(a);}
bool S2() {return term(b);}
bool S() {
  TOKEN *save = next;
  return S1()
     || (next = save, S2());
}

Left recursion happens whenever a non-terminal can derive a string that begins with that same non-terminal, as in S -> Sa above.

To solve this issue, we can rewrite the rule using right-recursion:

rewrite: S -> Sa | b
to:
  S -> bS'
  S' -> aS' | epsilon
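
As a sketch, the rewritten rules translate directly into terminating code, because S now consumes a b before any recursive call. The token names TOK_A, TOK_B, TOK_END and the helpers S_rest and parse are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOK_A, TOK_B, TOK_END } TOKEN;

static const TOKEN *next;            /* next unread token */

static bool term(TOKEN tok) { return *next++ == tok; }

/* S' -> a S' | epsilon: each recursive call happens after consuming an a. */
static bool S_rest(void) {
    const TOKEN *save = next;
    return (term(TOK_A) && S_rest()) || (next = save, true);
}

/* S -> b S' */
static bool S(void) { return term(TOK_B) && S_rest(); }

/* Accept only if S matches and every token up to TOK_END was consumed. */
bool parse(const TOKEN *input) {
    assert(input != NULL);
    next = input;
    return S() && *next == TOK_END;
}
```

This accepts b followed by any number of a tokens, the same language as the original S -> Sa | b.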

A more general form (where no bi begins with S):

rewrite: S -> Sa1 | ... | San | b1 | ... | bm
to:
  S -> b1S' | ... | bmS'
  S' -> a1S' | ... | anS' | epsilon

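Other forms of left recursion, such as the indirect left recursion below, can also be eliminated (see the Dragon Book for the general algorithm):

S -> Aa | c
A -> Sb

Here substituting A into the first rule gives S -> Sba | c, which is immediate left recursion and can be rewritten as above.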

Recursive descent is a simple and general parsing strategy, but left recursion must be eliminated first. In principle, left recursion can be eliminated automatically. In practice, people eliminate it by hand.
