# Lecture 004

## Using Regular Expressions

A regular expression gives us a tool to answer whether a given string belongs to a language set.
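For example, Python's `re` module answers exactly this membership question (a minimal illustration; the "integer literal" language is a made-up example):

```python
import re

# A regular expression for the "integer literal" language: one or more digits.
integer = re.compile(r"[0-9]+")

# fullmatch asks whether the *entire* string belongs to the language.
print(integer.fullmatch("123") is not None)   # True: "123" is in the language
print(integer.fullmatch("12a") is not None)   # False: "12a" is not
```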

However, this is not enough. Our goal is to partition a long string into substrings belonging to different token classes (different language sets).

Steps to do Lexical Analysis:

1. Define a regular expression for each token class, with priorities: `else` matches both the keyword and the identifier expressions, so keywords get higher priority.
2. Construct $R = \text{Keyword} + \text{Identifier} + \ldots$ by union.
3. For $1 \leq i \leq n$, check whether $x_1 \ldots x_i \in L(R)$ and take the maximum such $i$ (maximal munch: a human reads `==` as one comparison rather than two assignments).
4. If this succeeds, $x_1 \ldots x_i$ is in some token class; decide which one by priority.
5. Remove $x_1 \ldots x_i$ and repeat.
6. If no rule matches, we need to report an error with a nice message. We do this by adding a regular expression for error strings with the lowest priority. (This guarantees that $x_1 \ldots x_i \in L(R)$ always holds for some $i$.)
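The steps above can be sketched as a maximal-munch loop (a minimal sketch; the specific token classes, their priority order, and the single-character `ERROR` catch-all are illustrative assumptions):

```python
import re

# Token classes in priority order: KEYWORD beats IDENT on ties, and the
# one-character ERROR class comes last so some prefix always matches (step 6).
TOKEN_CLASSES = [
    ("KEYWORD", re.compile(r"if|else|while")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OP", re.compile(r"==|=|\+")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("WHITESPACE", re.compile(r"\s+")),
    ("ERROR", re.compile(r".")),  # low-priority catch-all for error handling
]

def lex(s):
    tokens = []
    while s:
        # Step 3: find the maximum i such that x_1...x_i is in L(R).
        best_len, best_class = 0, None
        for name, regex in TOKEN_CLASSES:
            m = regex.match(s)
            # A strictly longer match wins (maximal munch); on a tie,
            # the earlier, higher-priority class wins (step 4).
            if m and len(m.group()) > best_len:
                best_len, best_class = len(m.group()), name
        tokens.append((best_class, s[:best_len]))
        s = s[best_len:]  # Step 5: remove the matched prefix and repeat.
    return tokens

print(lex("if x == 1"))
```

Note how maximal munch and priority interact: on `elsewhere`, the identifier match (9 characters) beats the keyword match `else` (4 characters), while on `else` alone the tie goes to the keyword.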

Regular expressions are implemented using finite automata.

Deterministic Finite Automaton (DFA): no $\epsilon$-moves, and at most one transition per input symbol per state (the input determines the path; faster to execute).

Nondeterministic Finite Automaton (NFA): allows $\epsilon$-moves and multiple transitions per input symbol from a state (the input does not determine a unique path; smaller to represent).
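To make the difference concrete, an NFA can be simulated by tracking the *set* of states reachable so far, including $\epsilon$-moves (a minimal sketch; the machine below, which accepts strings over $\{a, b\}$ ending in `ab`, is a made-up example):

```python
# NFA as a dict: (state, symbol) -> set of next states; symbol None is epsilon.
# This hypothetical NFA accepts strings over {a, b} that end in "ab".
TRANS = {
    (0, "a"): {0, 1},   # nondeterminism: two transitions on the same input
    (0, "b"): {0},
    (1, "b"): {2},
}
START, ACCEPT = 0, 2

def eps_closure(states):
    # Follow epsilon-moves until no new state is reachable.
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in TRANS.get((s, None), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def accepts(s):
    current = eps_closure({START})
    for c in s:
        current = eps_closure({t for st in current
                                 for t in TRANS.get((st, c), set())})
    return ACCEPT in current

print(accepts("aab"))  # True
print(accepts("aba"))  # False
```

This set-of-states simulation is exactly what the subset construction precomputes when converting an NFA to a DFA.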

### From Regular Expression to NFA

Our conversion pipeline looks like the following:

1. Lexical Specification
2. Regular Expression
3. NFA
4. DFA

Translation to NFA:

• $\epsilon$: a start state with an $\epsilon$-transition to the accepting state

• $a$: a start state with an $a$-transition to the accepting state

• $AB$: given machines for $A$ and $B$, make the final state of $A$ non-accepting and add an $\epsilon$-transition from it to the start state of $B$

• $A + B$: add a new start state with $\epsilon$-transitions to the start states of $A$ and $B$, and $\epsilon$-transitions from both final states to a new accepting state

• $A^*$: add a new start state and a new accepting state; the new start has $\epsilon$-transitions to the start of $A$ and to the new accepting state (to allow zero repetitions), and the final state of $A$ has $\epsilon$-transitions back to the start of $A$ and to the new accepting state
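These cases can be sketched as a Thompson-style construction, where each fragment has one start state and one accepting state (a minimal sketch; the fragment representation and state numbering are my own assumptions):

```python
# Each NFA fragment is (start, accept, transitions), where transitions maps
# (state, symbol) -> set of next states, and symbol None is an epsilon-move.
counter = 0

def new_state():
    global counter
    counter += 1
    return counter

def merge(*tables):
    out = {}
    for t in tables:
        for k, v in t.items():
            out.setdefault(k, set()).update(v)
    return out

def symbol(a):
    # a: start --a--> accept
    s, f = new_state(), new_state()
    return (s, f, {(s, a): {f}})

def concat(A, B):
    # AB: A's final state becomes non-accepting, with an
    # epsilon-transition into B's start state.
    sA, fA, tA = A
    sB, fB, tB = B
    return (sA, fB, merge(tA, tB, {(fA, None): {sB}}))

def union(A, B):
    # A+B: new start with epsilon-moves into both machines, and
    # epsilon-moves from both final states to a new accepting state.
    sA, fA, tA = A
    sB, fB, tB = B
    s, f = new_state(), new_state()
    return (s, f, merge(tA, tB,
                        {(s, None): {sA, sB}, (fA, None): {f}, (fB, None): {f}}))

def star(A):
    # A*: new start/accept with epsilon-moves allowing zero or more passes.
    sA, fA, tA = A
    s, f = new_state(), new_state()
    return (s, f, merge(tA, {(s, None): {sA, f}, (fA, None): {sA, f}}))

def accepts(nfa, string):
    # Simulate by tracking the epsilon-closure of the reachable state set.
    start, accept, trans = nfa
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for r in trans.get((q, None), set()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen
    cur = closure({start})
    for c in string:
        cur = closure({r for q in cur for r in trans.get((q, c), set())})
    return accept in cur

# (a+b)* b : all strings over {a, b} ending in b
nfa = concat(star(union(symbol("a"), symbol("b"))), symbol("b"))
print(accepts(nfa, "aab"))  # True
print(accepts(nfa, "aba"))  # False
```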

We implement the DFA transition table as an adjacency matrix or adjacency list (as in graph problems).
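The adjacency-matrix form is a 2D table indexed by state and input symbol, so executing the DFA is one array lookup per character (a minimal sketch; the DFA below, accepting strings over $\{a, b\}$ ending in `b`, is a made-up example):

```python
# DFA transition table: rows are states, columns are input symbols.
# State 0: last char was not 'b'; state 1: last char was 'b' (accepting).
SYMBOLS = {"a": 0, "b": 1}
TABLE = [
    [0, 1],  # from state 0: 'a' -> 0, 'b' -> 1
    [0, 1],  # from state 1: 'a' -> 0, 'b' -> 1
]
ACCEPTING = {1}

def run(s):
    state = 0
    for c in s:
        state = TABLE[state][SYMBOLS[c]]  # one table lookup per character
    return state in ACCEPTING

print(run("aab"))   # True
print(run("aba"))   # False
```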

Or, to save space, use a sparser representation (for example, sharing identical rows of the table) at the cost of slower lookups.
