Token Class (or Class)
English: noun, verb, adjective, ...
Programming language: Identifier, Keyword, `(`, `)`, `=` (assignment), `==`, `>=`, ...
(A class such as `(` contains only a single one-character string; operators such as `==` and `>=` are longer.)
Job of Lexical Analysis:
input: `foo = 42`
output: `<Id, "foo">`, `<Op, "=">`, `<Int, "42">`
(the format is `<token class, lexeme>`, and each such pair is called a token)
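The job described above can be sketched as a small hand-written scanner. This is only an illustration: the function name `tokenize` and the decision to drop whitespace instead of emitting it are assumptions, and only the three classes from the example are handled.

```python
def tokenize(source):
    """Split source into (token class, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch in " \t\n":                 # whitespace separates tokens
            pos += 1
            continue
        start = pos
        if ch.isalpha():                  # identifier: a letter, then letters or digits
            while pos < len(source) and source[pos].isalnum():
                pos += 1
            tokens.append(("Id", source[start:pos]))
        elif ch.isdigit():                # integer: one or more digits
            while pos < len(source) and source[pos].isdigit():
                pos += 1
            tokens.append(("Int", source[start:pos]))
        elif ch == "=":                   # the only operator in this sketch
            pos += 1
            tokens.append(("Op", "="))
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    return tokens


print(tokenize("foo = 42"))  # [('Id', 'foo'), ('Op', '='), ('Int', '42')]
```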
Example: `x=0;\n\twhile (x < 10) {\n\tx++;\n}`
has
9 whitespace characters
1 keyword
3 identifiers
2 numbers
9 other tokens
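The counts can be checked mechanically. The sketch below assumes the example string is written with no extra spaces around the `\n` and `\t` escapes, counts each whitespace character separately, and uses made-up class names; it is a counting aid, not a real lexer for any language.

```python
import re
from collections import Counter

SOURCE = "x=0;\n\twhile (x < 10) {\n\tx++;\n}"

# Hypothetical class names. Order matters: the keyword must be tried
# before the identifier pattern, which would also match "while".
SPEC = [
    ("whitespace", r"[ \t\n]"),            # one whitespace character at a time
    ("keyword", r"while\b"),
    ("identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("number", r"[0-9]+"),
    ("other", r"\+\+|[=;(){}<]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in SPEC))


def count_classes(source):
    """Count how many tokens of each class appear in source."""
    counts = Counter()
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r}")
        counts[match.lastgroup] += 1
        pos = match.end()
    return counts


print(dict(count_classes(SOURCE)))
```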
In Fortran, whitespace is completely ignored:
`DO 5 I = 1,25` is a loop that ends at the statement labeled 5, with `I` running from 1 to 25.
`DO 5 I = 1.25` is an assignment of 1.25 to the variable `DO5I`.
Lookahead (in a left-to-right scan): in Fortran, to decide whether `DO` is a token on its own, we need to look ahead and see whether a `,` or a `.` comes later in the statement.
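The decision can be sketched in a few lines. This is a toy classifier, not a Fortran lexer: the function name `classify_do_statement` and the two simplified statement shapes are assumptions for illustration, and cases such as commas nested inside parentheses are ignored.

```python
import re


def classify_do_statement(line):
    """Decide whether a line beginning with DO is a loop header or an assignment."""
    text = line.replace(" ", "")  # blanks carry no meaning, so drop them first

    # Loop shape: DO <label> <variable> = <start> , <stop>
    loop = re.fullmatch(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)", text)
    if loop:
        label, variable, start, stop = loop.groups()
        return ("loop", label, variable, start, stop)

    # Otherwise treat everything before "=" as one variable name.
    assignment = re.fullmatch(r"([A-Z][A-Z0-9]*)=(.+)", text)
    if assignment:
        return ("assignment", assignment.group(1), assignment.group(2))

    raise ValueError(f"unrecognized statement: {line!r}")


print(classify_do_statement("DO 5 I = 1,25"))  # ('loop', '5', 'I', '1', '25')
print(classify_do_statement("DO 5 I = 1.25"))  # ('assignment', 'DO5I', '1.25')
```

Note that the classifier has to look at the whole statement before it can say what the first two characters mean.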
A good design keeps the amount of lookahead as small as possible.
Fortran was designed this way because it was easy to accidentally insert blanks when typing punch cards.
It is impossible to avoid lookahead completely: when looking at `=`, you don't know yet whether it will turn out to be `==`.
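A minimal sketch of this one-character lookahead, assuming hypothetical class names `Assign` and `Eq` and a hypothetical helper `scan_equals`:

```python
def scan_equals(source, pos):
    """Scan the operator starting at source[pos], which must be '='.

    Returns (token class, lexeme, position after the lexeme).
    """
    if source[pos] != "=":
        raise ValueError("scan_equals must start on '='")
    # Peek at the next character without consuming it.
    if pos + 1 < len(source) and source[pos + 1] == "=":
        return ("Eq", "==", pos + 2)
    return ("Assign", "=", pos + 1)


print(scan_equals("a==b", 1))  # ('Eq', '==', 3)
print(scan_equals("a=b", 1))   # ('Assign', '=', 2)
```

Here the lookahead is bounded: one extra character is always enough to decide.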
PL/I (Programming Language One): a programming language designed to have no reserved keywords. Consider the following program:
`IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN`
It is hard to distinguish keywords from identifiers.
`DECLARE(ARG1, ..., ARGN)`: `DECLARE` can be either a keyword or an array reference. We don't have sufficient information at this point; we need to read past the closing parenthesis to decide. Since the number of `ARG`s is unbounded, this requires unbounded lookahead.
C++ templates are also confusing: in `Foo<Bar<Bazz>>`, is `>>` a right shift or the end of two template argument lists?
Unbounded lookahead: no way to bound the length of lookahead.
// QUESTION: why not just use white space??? Avoid lookahead?
We use regular languages to define token classes. We use regular expressions to match a lexeme to a token class.
The following is the grammar of regular expressions:
R = \epsilon
| `c` in Sigma
| R + R
| RR
| R*
Meaning function: L : syntax -> semantics
Semantics: a language, which is a set of strings
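For the grammar above, L is usually defined case by case. The block below is the standard textbook definition, written out here as a reminder; A and B stand for arbitrary regular expressions, and A^i means A concatenated with itself i times (with A^0 = \epsilon).

```latex
\begin{align*}
L(\epsilon)   &= \{\, \texttt{""} \,\} \\
L(\texttt{c}) &= \{\, \texttt{"c"} \,\} \quad \text{for } c \in \Sigma \\
L(A + B)      &= L(A) \cup L(B) \\
L(AB)         &= \{\, ab \mid a \in L(A),\ b \in L(B) \,\} \\
L(A^*)        &= \bigcup_{i \ge 0} L(A^i)
\end{align*}
```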
There are many ways to arrange a sentence to express the same idea.
Why we need a meaning function:
it makes clear what is syntax and what is semantics
it allows us to consider notation as a separate issue (e.g. numbers can be written in Roman or in Arabic numerals)
`L` is a many-to-one function: this allows us to optimize our code (what we write) without changing its behavior (what we mean)
Note that `L` is never one-to-many. Otherwise, our programming language would be ill-defined.
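Writing L as code makes the many-to-one property concrete. In the sketch below, the tuple-based syntax tree, the function name `language`, and the length bound `max_len` (needed because the language of a starred expression is infinite) are all assumptions made for illustration.

```python
def language(regex, max_len):
    """Return the strings of length <= max_len in L(regex).

    regex is a nested tuple: ("eps",), ("char", c), ("union", r1, r2),
    ("concat", r1, r2) or ("star", r).
    """
    kind = regex[0]
    if kind == "eps":
        return {""}
    if kind == "char":
        return {regex[1]} if max_len >= 1 else set()
    if kind == "union":
        return language(regex[1], max_len) | language(regex[2], max_len)
    if kind == "concat":
        left = language(regex[1], max_len)
        right = language(regex[2], max_len)
        return {a + b for a in left for b in right if len(a + b) <= max_len}
    if kind == "star":
        base = language(regex[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:  # keep appending pieces until nothing new fits
            frontier = {a + b for a in frontier for b in base
                        if len(a + b) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node kind: {kind!r}")


zero, one = ("char", "0"), ("char", "1")
star_of_union = ("star", ("union", zero, one))                      # (0 + 1)*
star_of_concat = ("star", ("concat", ("star", zero), ("star", one)))  # (0*1*)*

# Two different expressions (syntax) with the same language (semantics).
print(language(star_of_union, 3) == language(star_of_concat, 3))  # True
```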
Regular expressions for token classes:
Keyword: write each keyword out literally and take the union of them
Digits: `[0-9][0-9]*`
Identifier: `[a-zA-Z]([a-zA-Z] + [0-9])*`
Whitespace: `(' ' + '\n' + '\t')+`
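These classes translate almost directly into Python's `re` module, which writes union as `|` instead of `+`. The sketch assumes an identifier is a letter followed by letters or digits, uses a made-up keyword set, and checks keywords first so that they are not classified as identifiers; `token_class` is a hypothetical helper name.

```python
import re

KEYWORDS = {"if", "else", "while"}  # hypothetical keyword set for this sketch

DIGITS = re.compile(r"[0-9][0-9]*")
IDENTIFIER = re.compile(r"[a-zA-Z]([a-zA-Z]|[0-9])*")
WHITESPACE = re.compile(r"( |\n|\t)+")


def token_class(lexeme):
    """Return the class whose language contains the whole lexeme, or None."""
    if lexeme in KEYWORDS:  # assumption: keywords win over identifiers
        return "Keyword"
    if DIGITS.fullmatch(lexeme):
        return "Digits"
    if IDENTIFIER.fullmatch(lexeme):
        return "Identifier"
    if WHITESPACE.fullmatch(lexeme):
        return "Whitespace"
    return None


for lexeme in ["while", "x1", "42", " \n\t", "1x"]:
    print(repr(lexeme), token_class(lexeme))
```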
Some other regular expression syntax:
At least one: `A+` is equivalent to `AA*`
Union: `A|B` is equivalent to `A+B`
Option: `A?` is equivalent to `A+\epsilon`
Range: `[a-z]` is equivalent to `a+b+...+z`
Excluded range: `[^a-z]` is the complement of `[a-z]`
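The equivalences can be spot-checked by brute force in Python, whose `re` module supports these shorthands (with `|` for union). The helper `same_language`, the small alphabet, and the length bound are assumptions of this sketch; agreeing on all short strings is evidence of equivalence, not a proof.

```python
import re
from itertools import product


def same_language(pattern_a, pattern_b, alphabet="abcd", max_len=2):
    """Check that two patterns accept the same strings up to max_len."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            text = "".join(chars)
            matches_a = re.fullmatch(pattern_a, text) is not None
            matches_b = re.fullmatch(pattern_b, text) is not None
            if matches_a != matches_b:
                return False
    return True


print(same_language("a+", "aa*"))        # at least one
print(same_language("a?", "(a|)"))       # option: a or the empty string
print(same_language("[a-c]", "a|b|c"))   # range
print(same_language("[^a-c]", "d"))      # excluded range, within the alphabet "abcd"
print(same_language("a+", "a*"))         # not equivalent: a* also accepts ""
```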