A regular expression gives us a tool to answer whether a given string belongs to a language set.

However, this is not enough. Our goal is to split a long string into different token classes (different language sets).

Steps to do Lexical Analysis:

Define a regular expression for each token class, with priorities: for example, the string "else" matches both the keyword class and the identifier class, and the keyword class has higher priority.

Construct R = Keyword + Identifier + ... by union

For 1 \leq i \leq n, check whether x_1 ... x_i \in L(R), and take the maximum such i (maximal munch: we prefer the longest match because that is how a human reads == as a comparison rather than as two assignments).

If this succeeds, then x_1 ... x_i belongs to some token class; decide which one by priority.

Remove x_1 ... x_i from the input and repeat.

If no rule matches, we need to handle the error with a nice message. We do this by adding a regular expression for error strings with the lowest priority. (This ensures x_1 ... x_i \in L(R) always holds for some i.)
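The steps above can be sketched as a maximal-munch tokenizer. This is a minimal illustration, not the standard implementation: the token classes, their regular expressions, and the function names below are all assumed for the example. Priority is encoded by list order, longest match wins, and a lowest-priority ERROR class guarantees that some rule always matches.

```python
import re

# Hypothetical token classes, listed in priority order: a keyword beats an
# identifier when both match the same lexeme. The ERROR class has the lowest
# priority and matches any single character, so some match always exists.
TOKEN_SPECS = [
    ("KEYWORD",    r"if|else|while"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"[0-9]+"),
    ("OP",         r"==|=|\+|-"),
    ("WHITESPACE", r"[ \t\n]+"),
    ("ERROR",      r"."),
]

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Maximal munch: among all classes, take the longest match at i;
        # ties are broken by priority (earlier in TOKEN_SPECS wins, because
        # only a strictly longer match replaces the current best).
        best = None
        for name, pattern in TOKEN_SPECS:
            m = re.match(pattern, text[i:])
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        name, lexeme = best
        if name != "WHITESPACE":
            tokens.append((name, lexeme))
        i += len(lexeme)
    return tokens
```

Note how "else" is tokenized as a keyword (priority) while "elsewhere" is an identifier (maximal munch): the longer identifier match beats the shorter keyword match.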

Regular expressions are implemented using finite automata.

Deterministic Finite Automata (DFA): no epsilon moves, and at most one transition per input symbol per state (the input determines the path, so execution is faster).

Nondeterministic Finite Automata (NFA): epsilon moves allowed, and a state can have multiple transitions on one input symbol (the input does not determine the path, but the automaton is smaller).

From Regular Expression to NFA

Our conversion pipeline looks as follows:

Lexical Specification -> Regular Expression -> NFA -> DFA

Translation to NFA (one small machine per regular expression construct):

\epsilon: a start state with an \epsilon transition to the accepting state

a: a start state with a transition on a to the accepting state

AB: given a machine for A and a machine for B, to build AB we make A's accepting state non-accepting and add an \epsilon transition from it to B's start state

A+B: a new start state with \epsilon transitions to the start states of A and B; the accepting states of A and B become non-accepting and get \epsilon transitions to a new accepting state

A*: a new start state with \epsilon transitions to A's start state and to a new accepting state (accepting the empty string); A's accepting state becomes non-accepting and gets \epsilon transitions back to A's start state and to the new accepting state
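These constructions can be sketched in code: each builder returns an NFA fragment with one start state and one accepting state, exactly as the cases above describe. The class and function names here are assumptions for illustration; the simulator at the end computes \epsilon-closures so the fragments can be tested.

```python
from collections import defaultdict

class NFA:
    counter = 0                               # global supply of fresh state ids
    def __init__(self):
        self.trans = defaultdict(set)         # (state, symbol) -> set of states; "" is epsilon
        self.start = self.new_state()
        self.accept = self.new_state()
    def new_state(self):
        NFA.counter += 1
        return NFA.counter

def char(c):
    n = NFA()
    n.trans[(n.start, c)].add(n.accept)       # start --c--> accept
    return n

def concat(a, b):
    # AB: A's accepting state becomes non-accepting and eps-moves into B.
    a.trans.update(b.trans)
    a.trans[(a.accept, "")].add(b.start)
    a.accept = b.accept
    return a

def union(a, b):
    # A+B: new start eps-branches into both; both old accepts eps-join a new accept.
    n = NFA()
    n.trans.update(a.trans); n.trans.update(b.trans)
    n.trans[(n.start, "")] |= {a.start, b.start}
    n.trans[(a.accept, "")].add(n.accept)
    n.trans[(b.accept, "")].add(n.accept)
    return n

def star(a):
    # A*: eps bypass for the empty string, plus an eps loop back into A.
    n = NFA()
    n.trans.update(a.trans)
    n.trans[(n.start, "")] |= {a.start, n.accept}
    n.trans[(a.accept, "")] |= {a.start, n.accept}
    return n

def eps_closure(nfa, states):
    # All states reachable from `states` by epsilon moves alone.
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in nfa.trans[(q, "")] - seen:
            seen.add(r); stack.append(r)
    return seen

def accepts(nfa, s):
    cur = eps_closure(nfa, {nfa.start})
    for c in s:
        moved = set()
        for q in cur:
            moved |= nfa.trans[(q, c)]
        cur = eps_closure(nfa, moved)
    return nfa.accept in cur
```

For example, star(union(char("a"), char("b"))) builds an NFA for (a+b)*.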

We implement the DFA's transition table using an adjacency matrix or adjacency lists (same representations as in graph problems).
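A minimal sketch of the adjacency-matrix representation, for an assumed example DFA accepting binary strings that end in "1". Rows are states, columns are input symbols, so each step is a single table lookup: this is why a DFA runs faster than an NFA simulation.

```python
# Transition table of an assumed DFA for binary strings ending in "1".
SYMBOLS = {"0": 0, "1": 1}        # map each input symbol to a column index
TABLE = [
    # on "0"  on "1"
    [0,       1],                 # state 0: last char was not "1"
    [0,       1],                 # state 1: last char was "1" (accepting)
]
ACCEPTING = {1}

def dfa_accepts(s):
    state = 0
    for c in s:
        state = TABLE[state][SYMBOLS[c]]   # exactly one transition per input
    return state in ACCEPTING
```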