Error Kind | Example | Detected By |
---|---|---|
Lexical | ... $ ... | Lexer |
Syntax | ... x * % ... | Parser |
Semantic | ... int x; y = x(3); ... | Type Checker |
Error Handler: a good error handler should
report errors accurately and clearly
recover from an error quickly
not slow down compilation of valid code
Panic Mode: when an error is detected, skip tokens and continue parsing from the next token that has a clear meaning (a synchronizing token).
Bison: use the special terminal `error` to describe how much input to skip, for example:
E -> int | E + E | (E) | error int | (error)
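Panic mode can also be sketched in a hand-written parser. The following is a minimal illustration and not Bison's actual mechanism: the toy grammar (a list of `int ID ;` declarations), the `Tok` names, the `parseDecls` function, and the choice of `;` as the synchronizing token are all assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical token kinds for a toy language of declarations: int x ;
enum class Tok { Int, Id, Semi, Bad };

// Parses a sequence of "Int Id Semi" declarations.
// On an error it records the error, discards tokens up to and including
// the next Semi (panic mode), and resumes with the next declaration.
// Returns the number of errors reported.
int parseDecls(const std::vector<Tok>& toks) {
    const Tok expected[] = {Tok::Int, Tok::Id, Tok::Semi};
    std::size_t pos = 0;
    int errors = 0;
    while (pos < toks.size()) {
        bool ok = true;
        for (Tok want : expected) {
            if (pos < toks.size() && toks[pos] == want) {
                ++pos;
            } else {
                ok = false;
                break;
            }
        }
        if (!ok) {
            ++errors;
            // Panic mode: skip everything before the synchronizing token.
            while (pos < toks.size() && toks[pos] != Tok::Semi) ++pos;
            if (pos < toks.size()) ++pos;  // consume the ';' itself
        }
    }
    return errors;
}
```

For instance, `parseDecls({Tok::Int, Tok::Bad, Tok::Semi, Tok::Int, Tok::Id, Tok::Semi})` reports one error and still checks the second declaration, instead of stopping at the first bad token.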
Error Production: add productions that match common errors, just like regular syntax.
Example: to detect the error `5 x` written instead of `5 * x`, add the production `E -> ... | E E`.
Error productions complicate the grammar. They are still used in practice: C++ compilers use them to emit a warning while still parsing the program and producing an executable.
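Error Correction: the compiler tries to fix the error itself. Not commonly used.
try inserting or deleting tokens to find a valid program within a small edit distance
exhaustive search over the candidate edits
It is hard to implement, slows down parsing of correct programs, and can produce a program that behaves differently from what the programmer intended.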
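The edit-distance idea can be illustrated on a toy language. This is only a sketch under stated assumptions: the "language" is strings of balanced parentheses, `balanced` stands in for "the parser accepts this input", and `repair` (a hypothetical helper) searches exhaustively at edit distance 1 only.

```cpp
#include <cstddef>
#include <string>

// Toy validity check standing in for a real parser:
// accepts exactly the strings of balanced parentheses.
bool balanced(const std::string& s) {
    int depth = 0;
    for (char c : s) {
        if (c == '(') {
            ++depth;
        } else if (c == ')') {
            if (--depth < 0) return false;
        } else {
            return false;
        }
    }
    return depth == 0;
}

// Exhaustively tries every single-token deletion, then every single-token
// insertion, and stores the first valid candidate in `out`.
// Returns false if no repair exists within edit distance 1.
bool repair(const std::string& input, std::string& out) {
    if (balanced(input)) { out = input; return true; }  // nothing to fix
    for (std::size_t i = 0; i < input.size(); ++i) {
        std::string cand = input;
        cand.erase(i, 1);
        if (balanced(cand)) { out = cand; return true; }
    }
    const std::string alphabet = "()";
    for (std::size_t i = 0; i <= input.size(); ++i) {
        for (char c : alphabet) {
            std::string cand = input;
            cand.insert(i, 1, c);
            if (balanced(cand)) { out = cand; return true; }
        }
    }
    return false;
}
```

With this sketch, `(()` is repaired to `()`, but a lone `(` is "repaired" by deleting it, leaving an empty program. The result is syntactically valid yet probably not what the programmer meant, which is the kind of unintended behavior noted above.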
In the past, recompilation was slow, so compilers needed to detect as many errors as possible in one run, and perhaps help fix them. Now compilation is fast and interactive, so users tend to correct one error per cycle.
Top Down Parsing: recursive descent algorithm (RDA)
Terminal Checker
// Match one terminal: compare the current token with tok, then advance.
bool term(TOKEN tok) {
    return *next++ == tok;
}
To implement production rules:
E -> T | T + E
T -> int | int * T | (E)
We write:
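// One function per production; each returns true if it matched.
bool E1() { return T(); }                        // E -> T
bool E2() { return T() && term(PLUS) && E(); }   // E -> T + E
bool E() {
    TOKEN *save = next;                          // remember the start position
    return E1() || (next = save, E2());          // on failure, rewind and try the next rule
}
bool T1() { return term(INT); }                          // T -> int
bool T2() { return term(INT) && term(TIMES) && T(); }    // T -> int * T
bool T3() { return term(OPEN) && E() && term(CLOSE); }   // T -> (E)
bool T() {
    TOKEN *save = next;
    return T1()
        || (next = save, T2())
        || (next = save, T3());
}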
The above is incorrect. For the input `int * int`, T() greedily matches the first `int` with T1 and returns, so parsing stops with `* int` unconsumed and the input is rejected. This is because once T1 has accepted the first `int` and T() has returned, we can't backtrack into T() to try the other rules. The above algorithm is only sufficient for grammars where, for any non-terminal, at most one production can succeed. (In the above case, both `INT` and `INT * INT` can succeed.)
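The failure can be reproduced with a small self-contained program. This sketch mirrors the functions above; the `END` sentinel token and the `accepts` helper are additions made for the example and are not part of the original code.

```cpp
#include <vector>

enum TOKEN { INT, PLUS, TIMES, OPEN, CLOSE, END };

static const TOKEN* next = nullptr;  // cursor into the token array

bool term(TOKEN tok) { return *next++ == tok; }

bool E();
bool T();

bool E1() { return T(); }                        // E -> T
bool E2() { return T() && term(PLUS) && E(); }   // E -> T + E
bool E() {
    const TOKEN* save = next;
    return E1() || (next = save, E2());
}

bool T1() { return term(INT); }                          // T -> int
bool T2() { return term(INT) && term(TIMES) && T(); }    // T -> int * T
bool T3() { return term(OPEN) && E() && term(CLOSE); }   // T -> (E)
bool T() {
    const TOKEN* save = next;
    return T1()
        || (next = save, T2())
        || (next = save, T3());
}

// Assumed helper: the input is accepted only if E() succeeds
// and every token up to the END sentinel was consumed.
bool accepts(std::vector<TOKEN> toks) {
    toks.push_back(END);
    next = toks.data();
    return E() && *next == END;
}
```

Under these assumptions, `accepts({INT})` and `accepts({OPEN, INT, CLOSE})` return true, while `accepts({INT, TIMES, INT})` returns false. `accepts({INT, PLUS, INT})` is rejected for the same reason: E1 succeeds on the first `int`, so E2 is never tried.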
// QUESTION: what to do?
Consider this production rule: `S -> Sa | b`. This is a valid rule.
However, when we try to write the code, we get an infinite loop:
// S() calls S1(), which calls S() again before consuming any token.
bool S1() { return S() && term(a); }   // S -> Sa
bool S2() { return term(b); }          // S -> b
bool S() {
    TOKEN *save = next;
    return S1()
        || (next = save, S2());
}
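Left recursion happens whenever a non-terminal can derive a string that begins with that same non-terminal (S derives S followed by something, in one or more steps). The function for such a non-terminal calls itself before consuming any input, so it never terminates.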
To solve this issue, we can rewrite the rule using right-recursion:
rewrite: S -> Sa | b
to:
S -> bS'
S' -> aS' | epsilon
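The rewritten grammar can be parsed by recursive descent without looping. A minimal sketch in the same style as the code above, where `Sp` stands for S', and the `END` sentinel and the `accepts` helper are assumptions added for the example:

```cpp
#include <vector>

enum TOKEN { A, B, END };

static const TOKEN* next = nullptr;  // cursor into the token array

bool term(TOKEN tok) { return *next++ == tok; }

bool Sp();                                // S'

bool Sp1() { return term(A) && Sp(); }    // S' -> a S'
bool Sp2() { return true; }               // S' -> epsilon (consumes nothing)
bool Sp() {
    const TOKEN* save = next;
    return Sp1() || (next = save, Sp2());
}

// S -> b S': a token is consumed before any recursive call,
// so every recursion makes progress through the input.
bool S() { return term(B) && Sp(); }

// Assumed helper: accept only if S() succeeds and all input was consumed.
bool accepts(std::vector<TOKEN> toks) {
    toks.push_back(END);
    next = toks.data();
    return S() && *next == END;
}
```

With this sketch, `accepts({B})` and `accepts({B, A, A})` return true, and `accepts({A, B})` returns false, which matches the language of the original rule: one `b` followed by any number of `a`.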
A more general form:
rewrite: S -> Sa1 | ... | San | b1 | ... | bm
to:
S -> b1S' | ... | bmS'
S' -> a1S' | ... | anS' | epsilon
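Other left recursion: indirect left recursion can also be eliminated (see the Dragon Book for the general algorithm). For example, in
S -> Aa | c
A -> Sb
S is left recursive through A, since S -> Aa -> Sba. Substituting A into the first rule gives S -> Sba | c, which is directly left recursive and can be rewritten as above.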
Recursive descent is a simple and general parsing strategy, but left recursion must be eliminated first. In principle, left recursion can be eliminated automatically. In practice, people eliminate it by hand.