Lecture 004

Using Regular Expressions

A regular expression gives us a tool to decide whether a given string belongs to a language (a set of strings).

However, this is not enough. Our goal is to split a long input string into tokens, each belonging to one of several token classes (different languages).

Steps to do Lexical Analysis:

  1. Define a regular expression for each token class, with a priority order: for example, "else" matches both Keyword and Identifier, and Keyword wins because it has higher priority.
  2. Construct R = Keyword + Identifier + ... as the union of all token classes.
  3. For the input x_1 ... x_n, check for each 1 \leq i \leq n whether x_1 ... x_i \in L(R), and take the maximum such i (the maximal match rule reflects how a human reads == as a comparison rather than two assignments).
  4. If this succeeds, then x_1 ... x_i belongs to at least one token class; choose the matching class with the highest priority.
  5. Remove x_1 ... x_i from the input and repeat.
  6. If no rule matches, we need to handle the error with a helpful message. We do this by adding a regular expression for error strings with the lowest priority, which guarantees that x_1 ... x_i \in L(R) always holds for some i.
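
The steps above can be sketched in a few lines of Python. This is only an illustration, not a production lexer: the token classes, their patterns, and the names TOKEN_SPECS and tokenize are assumptions made for the example, and Python's re module stands in for the automata discussed later. Note that Python tries alternatives left to right rather than picking the longest one, so "==" is listed before "=" inside the OPERATOR pattern.

```python
import re

# Hypothetical token classes, listed from highest to lowest priority.
# ERROR matches any single character, so some prefix always matches.
TOKEN_SPECS = [
    ("KEYWORD", r"if|else|while"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER", r"[0-9]+"),
    ("OPERATOR", r"==|="),
    ("WHITESPACE", r"[ \t\n]+"),
    ("ERROR", r"."),
]
# DOTALL lets the ERROR pattern match a newline-like character as well.
COMPILED = [(name, re.compile(pattern, re.DOTALL)) for name, pattern in TOKEN_SPECS]


def tokenize(text):
    """Split text into (token_class, lexeme) pairs using maximal match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            # Strict '>' keeps the earlier (higher-priority) class on ties.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len  # step 5: remove the matched prefix and repeat
    return tokens
```

With these assumed classes, tokenize("else") yields a KEYWORD (the tie with IDENTIFIER is broken by priority), tokenize("else1") yields a single IDENTIFIER (the longer match wins), and an unexpected character such as "@" comes back as an ERROR token instead of stopping the lexer.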

Regular expressions are implemented using finite automata.

Small Example of Finite Automata

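Deterministic Finite Automata (DFA): no epsilon moves, and exactly one transition per input symbol in each state (the input determines the path, so execution is faster)

Nondeterministic Finite Automata (NFA): may have epsilon moves, and may have multiple transitions for one input symbol in a given state (the input does not determine the path, but the automaton is often smaller)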
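
To make the two definitions concrete, here is a small sketch that checks whether a transition relation satisfies the DFA conditions. The representation is an assumption made for this example: delta maps (state, symbol) to a set of next states, and the empty string stands for an epsilon move. Both sample automata are meant to accept the strings over {0, 1} that end in 1.

```python
EPSILON = ""  # assumption: the empty string labels an epsilon move


def is_deterministic(states, alphabet, delta):
    """Return True if delta has no epsilon moves and exactly one
    target for every (state, symbol) pair."""
    for (state, symbol), targets in delta.items():
        if symbol == EPSILON and targets:
            return False
    for state in states:
        for symbol in alphabet:
            if len(delta.get((state, symbol), set())) != 1:
                return False
    return True


# Deterministic version: every state has one move per symbol.
DFA_DELTA = {
    ("A", "0"): {"A"}, ("A", "1"): {"B"},
    ("B", "0"): {"A"}, ("B", "1"): {"B"},
}
# Nondeterministic version: on "1" the automaton may stay in A or guess
# that this is the last symbol and move to B; B has no outgoing moves.
NFA_DELTA = {
    ("A", "0"): {"A"},
    ("A", "1"): {"A", "B"},
}
```

Here is_deterministic({"A", "B"}, "01", DFA_DELTA) holds, while the same call with NFA_DELTA does not.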

From Regular Expression to NFA

Our conversion proceeds as follows:

  1. Lexical Specification
  2. Regular Expression
  3. NFA
  4. DFA

A Recursive

A star

Transition to NFA:
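
One common way to carry out this step is a Thompson-style construction: build a tiny NFA for each symbol, then combine NFAs recursively for union, concatenation, and star, gluing them together with epsilon moves. The sketch below assumes this is the construction intended here; the class NFA and the helpers symbol, concat, union, and star are names invented for the example.

```python
from itertools import count

EPSILON = ""       # assumption: the empty string labels an epsilon move
_fresh = count()   # generator of fresh integer state names


class NFA:
    """An NFA fragment with one start state and one accepting state."""
    def __init__(self, start, accept, delta):
        self.start = start
        self.accept = accept
        self.delta = delta  # dict: (state, symbol) -> set of states


def _merge(*deltas):
    """Union several transition dicts into a new one."""
    merged = {}
    for d in deltas:
        for key, targets in d.items():
            merged.setdefault(key, set()).update(targets)
    return merged


def symbol(c):
    """NFA for the single symbol c (or for epsilon if c is EPSILON)."""
    s, f = next(_fresh), next(_fresh)
    return NFA(s, f, {(s, c): {f}})


def concat(a, b):
    """NFA for AB: a's accepting state moves by epsilon to b's start."""
    delta = _merge(a.delta, b.delta, {(a.accept, EPSILON): {b.start}})
    return NFA(a.start, b.accept, delta)


def union(a, b):
    """NFA for A + B: a new start branches by epsilon into a and b."""
    s, f = next(_fresh), next(_fresh)
    delta = _merge(a.delta, b.delta, {
        (s, EPSILON): {a.start, b.start},
        (a.accept, EPSILON): {f},
        (b.accept, EPSILON): {f},
    })
    return NFA(s, f, delta)


def star(a):
    """NFA for A*: skip a entirely, or run it and optionally loop back."""
    s, f = next(_fresh), next(_fresh)
    delta = _merge(a.delta, {
        (s, EPSILON): {a.start, f},
        (a.accept, EPSILON): {a.start, f},
    })
    return NFA(s, f, delta)


# Example: the regular expression (a + b)* c
example = concat(star(union(symbol("a"), symbol("b"))), symbol("c"))
```

Each operator adds at most two states, so the NFA stays linear in the size of the regular expression; the price is the large number of epsilon moves.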

Convert NFA to DFA
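
A standard way to do this conversion is the subset construction: each DFA state is the set of NFA states the NFA could be in, starting from the epsilon closure of the NFA start state. The sketch below assumes the same representation as before (delta maps (state, symbol) to a set of states, the empty string is epsilon); the function names are made up for the example.

```python
from collections import deque

EPSILON = ""  # assumption: the empty string labels an epsilon move


def epsilon_closure(states, delta):
    """All states reachable from `states` using only epsilon moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in delta.get((state, EPSILON), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def nfa_to_dfa(start, accepting, delta, alphabet):
    """Return (dfa_start, dfa_accepting, dfa_delta).
    DFA states are frozensets of NFA states."""
    dfa_start = epsilon_closure({start}, delta)
    dfa_delta = {}
    dfa_accepting = set()
    seen = {dfa_start}
    queue = deque([dfa_start])
    while queue:
        current = queue.popleft()
        # A DFA state accepts if it contains any accepting NFA state.
        if current & accepting:
            dfa_accepting.add(current)
        for symbol in alphabet:
            moved = set()
            for state in current:
                moved |= delta.get((state, symbol), set())
            target = epsilon_closure(moved, delta)
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_start, dfa_accepting, dfa_delta
```

An NFA with n states has 2^n subsets, so the resulting DFA can in the worst case be exponentially larger, although only the reachable subsets are built here. The empty frozenset, when reachable, plays the role of a dead state.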

We implement a DFA using an adjacency matrix or adjacency lists (the same representations used in graph problems).
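
As a minimal sketch of the matrix form, the table below has one row per state and one column per input symbol, and running the DFA is a single table lookup per character. The concrete automaton (strings over {0, 1} that end in 1) and all names are assumptions chosen for the example.

```python
# Assumed example DFA over {0, 1}: accepts strings that end in 1.
SYMBOL_INDEX = {"0": 0, "1": 1}

# TABLE[state][symbol_index] is the next state.
TABLE = [
    [0, 1],  # state 0: the last symbol read was not 1
    [0, 1],  # state 1: the last symbol read was 1
]
START = 0
ACCEPTING = {1}


def dfa_accepts(text):
    """Run the table-driven DFA over text and report acceptance."""
    state = START
    for ch in text:
        if ch not in SYMBOL_INDEX:
            return False  # symbol outside the alphabet: reject
        state = TABLE[state][SYMBOL_INDEX[ch]]
    return state in ACCEPTING
```

The matrix costs (number of states) x (alphabet size) entries regardless of how many transitions are interesting. An adjacency-list variant would store, for each state, only the symbols that lead somewhere useful, trading lookup speed for space.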

Or if you want to save space:

Implement DFA using a transition to a set of states
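
One reading of this idea is to skip building the full DFA table and instead keep the NFA's table, whose entries are sets of states, tracking the current set of states while reading the input. The sketch below follows that reading; it is an assumption about what the figure showed, and the function names are invented for the example.

```python
EPSILON = ""  # assumption: the empty string labels an epsilon move


def epsilon_closure(states, delta):
    """All states reachable from `states` using only epsilon moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in delta.get((state, EPSILON), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def nfa_accepts(start, accepting, delta, text):
    """Simulate the NFA by tracking the set of states it could be in."""
    current = epsilon_closure({start}, delta)
    for ch in text:
        moved = set()
        for state in current:
            moved |= delta.get((state, ch), set())
        current = epsilon_closure(moved, delta)
    return bool(current & accepting)


# Assumed example NFA: strings over {0, 1} that end in 1.
ENDS_IN_ONE = {
    ("A", "0"): {"A"},
    ("A", "1"): {"A", "B"},
}
```

The table stays as small as the NFA, but each input character now costs work proportional to the size of the current state set instead of a single lookup, which is the space versus time trade-off between the two approaches.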
