A regular expression gives us a tool to decide whether a given string belongs to a language (a set of strings).
However, this is not enough. Our goal is to split a long string into tokens, each belonging to one of several token classes (different languages).
Steps to do Lexical Analysis:
Define a regular expression for each token class, with a priority order among the classes (for example, else matches both the keyword and the identifier patterns; it is classified as a keyword because keywords have higher priority)
Construct R = Keyword + Identifier + ... by union
For 1 \leq i \leq n we check whether x_1 ... x_i \in L(R), and take the maximum such i (we take the maximal match because a human reads == as one comparison rather than two assignments)
If this succeeds, then x_1 ... x_i belongs to at least one token class; if it matches several, pick the one with the highest priority
Remove x_1 ... x_i and repeat
If no rule matches, we need to handle the error with a helpful message. We do this by adding a regular expression for error strings with the lowest priority. (This ensures that x_1 ... x_i \in L(R) always holds for some i.)
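The steps above can be sketched as a small maximal-munch tokenizer. This is a minimal sketch, not a production lexer: the token classes, their patterns, and the names `TOKEN_RULES` and `tokenize` are assumptions made up for illustration. It relies on Python's `re` module instead of a hand-built automaton, and tries every rule at the current position rather than building the single union R.

```python
import re

# Hypothetical token classes, listed from highest to lowest priority.
# Python's `re` picks the first alternative that matches, not the longest,
# so inside each pattern the longer alternative comes first ("==" before "=").
TOKEN_RULES = [
    ("KEYWORD", r"if|else|while"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER", r"[0-9]+"),
    ("OPERATOR", r"==|="),
    ("WHITESPACE", r"[ \t\n]+"),
    # Lowest-priority error rule: matches any single character,
    # so some prefix of the remaining input always matches.
    ("ERROR", r"."),
]
COMPILED = [(name, re.compile(pattern, re.DOTALL)) for name, pattern in TOKEN_RULES]


def tokenize(text):
    """Split text into (token_class, lexeme) pairs using maximal munch."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            # Strict '>' keeps the earlier (higher-priority) rule on ties.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len  # remove the matched prefix and repeat
    return tokens
```

With these rules, `tokenize("elsewhere")` yields a single IDENTIFIER because the longer match wins, while `tokenize("else")` yields a KEYWORD because the tie is broken by priority.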
Regular expressions are implemented using finite automata
Small Example of Finite Automata
Deterministic Finite Automata (DFA): no epsilon moves, and exactly one transition per input symbol from each state (the input determines the path, so execution is faster)
Nondeterministic Finite Automata (NFA): epsilon moves are allowed, and a state can have multiple transitions for one input (the input does not determine the path, but the automaton can be smaller)
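The difference between the two kinds of automata shows up clearly when simulating them. The sketch below is illustrative only: the dictionary encoding, the `EPSILON` marker, and the two example machines (both accepting binary strings that end in 1) are assumptions, not something fixed by these notes. The DFA tracks one current state, while the NFA must track a set of possible states.

```python
EPSILON = ""  # hypothetical marker for an epsilon move


def run_dfa(transitions, start, accepting, text):
    """DFA: one next state per (state, symbol), so we track a single state."""
    state = start
    for symbol in text:
        if (state, symbol) not in transitions:
            return False  # a missing entry is treated as a dead state
        state = transitions[(state, symbol)]
    return state in accepting


def epsilon_closure(transitions, states):
    """All states reachable from `states` using only epsilon moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in transitions.get((state, EPSILON), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def run_nfa(transitions, start, accepting, text):
    """NFA: the input does not determine the path, so we track a set of states."""
    current = epsilon_closure(transitions, {start})
    for symbol in text:
        moved = set()
        for state in current:
            moved |= transitions.get((state, symbol), set())
        current = epsilon_closure(transitions, moved)
    return bool(current & accepting)


# Example machines for "binary strings ending in 1".
DFA = {(0, "0"): 0, (0, "1"): 1, (1, "0"): 0, (1, "1"): 1}
NFA = {("q0", "0"): {"q0"}, ("q0", "1"): {"q0", "q1"}, ("q1", EPSILON): {"q2"}}

print(run_dfa(DFA, 0, {1}, "0101"), run_nfa(NFA, "q0", {"q2"}, "0101"))
```

The DFA loop does constant work per character, whereas the NFA loop does work proportional to the number of active states, which is the speed versus size trade-off noted above.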
From Regular Expression to NFA
Our conversion proceeds as follows:
Lexical Specification
Regular Expression
NFA
DFA
A Recursive
A star
Transition to NFA:
\epsilon: a start state with an \epsilon transition to an accepting state
a: a start state with a transition on a to an accepting state
AB: given a machine for A and a machine for B, we build AB by making the final state of A non-accepting and adding an \epsilon transition from it to the start state of B
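A+B: add a new start state with \epsilon transitions to the start states of A and B, and \epsilon transitions from their final states to a new accepting state (see image above)
A*: add \epsilon transitions that let the machine either skip A entirely or loop back after A finishes to run A again (see image above)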
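The five rules above can be sketched as small functions that build NFA fragments. This is a minimal sketch in the style of Thompson's construction; the `Fragment` type and the function names are hypothetical, and the exact epsilon wiring for A* may differ from the one drawn in the image. Each fragment has exactly one start state and one accepting state.

```python
from collections import namedtuple
from itertools import count

EPSILON = None  # label used for epsilon moves (an assumption of this sketch)
# edges is a list of (source_state, label, destination_state) triples
Fragment = namedtuple("Fragment", "start accept edges")
_fresh = count()  # source of unique state ids


def eps():
    s, f = next(_fresh), next(_fresh)
    return Fragment(s, f, [(s, EPSILON, f)])


def sym(a):
    s, f = next(_fresh), next(_fresh)
    return Fragment(s, f, [(s, a, f)])


def concat(a, b):
    # A's final state stops being accepting and moves by epsilon into B.
    return Fragment(a.start, b.accept, a.edges + b.edges + [(a.accept, EPSILON, b.start)])


def union(a, b):
    s, f = next(_fresh), next(_fresh)
    extra = [(s, EPSILON, a.start), (s, EPSILON, b.start),
             (a.accept, EPSILON, f), (b.accept, EPSILON, f)]
    return Fragment(s, f, a.edges + b.edges + extra)


def star(a):
    s, f = next(_fresh), next(_fresh)
    extra = [(s, EPSILON, a.start), (s, EPSILON, f),            # enter A or skip it
             (a.accept, EPSILON, a.start), (a.accept, EPSILON, f)]  # repeat A or leave
    return Fragment(s, f, a.edges + extra)


def accepts(nfa, text):
    """Simulate the fragment by tracking the set of reachable states."""
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            state = stack.pop()
            for src, label, dst in nfa.edges:
                if src == state and label is EPSILON and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({nfa.start})
    for ch in text:
        current = closure({dst for src, label, dst in nfa.edges
                           if src in current and label == ch})
    return nfa.accept in current


# (a+b)*c, built bottom-up; each fragment should be used only once.
nfa = concat(star(union(sym("a"), sym("b"))), sym("c"))
```

Because the construction is recursive on the structure of the regular expression, the NFA for a larger expression is assembled from the NFAs of its subexpressions without modifying their internals.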
Convert NFA to DFA
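One standard way to perform this conversion is the subset construction, in which each DFA state is a set of NFA states closed under epsilon moves. The sketch below assumes a dictionary encoding of the NFA; `EPSILON`, `nfa_to_dfa`, and the example machine are hypothetical names chosen for illustration.

```python
EPSILON = ""  # hypothetical marker for an epsilon move


def epsilon_closure(nfa, states):
    """Frozen set of all states reachable from `states` by epsilon moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in nfa.get((state, EPSILON), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def nfa_to_dfa(nfa, start, accepting, alphabet):
    """Return (dfa_transitions, dfa_start, dfa_accepting)."""
    dfa_start = epsilon_closure(nfa, {start})
    dfa = {}
    seen = {dfa_start}
    worklist = [dfa_start]
    while worklist:
        current = worklist.pop()
        for symbol in alphabet:
            moved = set()
            for state in current:
                moved |= nfa.get((state, symbol), set())
            target = epsilon_closure(nfa, moved)  # may be the empty (dead) state
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    # A DFA state accepts if it contains at least one accepting NFA state.
    dfa_accepting = {state for state in seen if state & accepting}
    return dfa, dfa_start, dfa_accepting


# Example NFA for "binary strings ending in 1".
NFA = {("q0", "0"): {"q0"}, ("q0", "1"): {"q0", "q1"}}
dfa, dfa_start, dfa_accepting = nfa_to_dfa(NFA, "q0", {"q1"}, ["0", "1"])
```

An NFA with n states can produce up to 2^n DFA states in the worst case, although only the sets actually reachable from the start state are generated here.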
We implement the DFA using an adjacency matrix or an adjacency list (the same representations used in graph problems).
Or if you want to save space:
Implement the DFA using transitions to a set of states
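The adjacency matrix and adjacency list layouts can be compared on a tiny example. This is only a sketch under assumed names: the DFA (accepting ab* over the alphabet {a, b}), the `DEAD` marker, and the function names are made up for illustration.

```python
# Hypothetical DFA for the language a b* with states 0 (start) and 1 (accepting).
SYMBOLS = {"a": 0, "b": 1}  # symbol -> column index
DEAD = -1                   # marker for "no transition"
ACCEPTING = {1}

# Adjacency matrix: MATRIX[state][column] is the next state.
# Every (state, symbol) pair takes a slot, even when there is no transition.
MATRIX = [
    [1, DEAD],  # state 0: a -> 1
    [DEAD, 1],  # state 1: b -> 1
]

# Adjacency list: each state stores only the transitions that exist.
ADJACENCY = {
    0: {"a": 1},
    1: {"b": 1},
}


def run_matrix(text, start=0):
    state = start
    for ch in text:
        if ch not in SYMBOLS:
            return False
        state = MATRIX[state][SYMBOLS[ch]]
        if state == DEAD:
            return False
    return state in ACCEPTING


def run_adjacency(text, start=0):
    state = start
    for ch in text:
        edges = ADJACENCY[state]
        if ch not in edges:
            return False
        state = edges[ch]
    return state in ACCEPTING
```

The matrix gives a direct index lookup per character but stores an entry for every state and symbol; the adjacency list stores only existing edges, which saves space when most entries would be dead.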