Lecture 003

Lexical Analysis

Token Class (or Class)

English: noun, verb, adjective, ...
Programming Language: Identifier, Keyword, (, ), =(assignment), ...
- Identifier: string of letters or digits, starting with a letter (variables)
- Integer: a non-empty string of digits (numbers)
- Keyword: "else", "if", "begin", ...
- Whitespace: a non-empty sequence of blanks, newlines, and tabs
- Some token classes, such as (, contain only a single one-character lexeme.
- Operator: ==, >=

Job of Lexical Analysis:

input: foo = 42
output: <Id, "foo">, <Op, "=">, <Int, "42"> (each pair has the form <token class, lexeme>, and the whole pair is called a token)
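The input/output pair above can be reproduced with a small regex-driven scanner. This is only a sketch: the class names, the keyword list, and the helper name tokenize are illustrative choices, not something the lecture prescribes.

```python
import re

# Token classes, tried in order. Names and patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("Keyword", r"(?:if|else|begin)\b"),
    ("Id", r"[A-Za-z][A-Za-z0-9]*"),
    ("Int", r"[0-9]+"),
    ("Op", r"==|>=|="),
    ("Whitespace", r"[ \n\t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token class, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "Whitespace":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("foo = 42"))  # [('Id', 'foo'), ('Op', '='), ('Int', '42')]
```

Listing Keyword before Id is what lets "if" win over the identifier pattern, while the \b keeps a name such as "iffy" an identifier.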

Example: the fragment x = 0;\n\twhile (x < 10) {\n\tx++;\n} contains

9 whitespace tokens
1 keyword
3 identifiers
2 numbers
9 other tokens
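The counts can be checked mechanically. The sketch below assumes the fragment is written with blanks around = (as in FRAGMENT) and that a maximal run of blanks, newlines, and tabs counts as one whitespace token; the class names and patterns are illustrative.

```python
import re
from collections import Counter

# Assumption: blanks around "=", and "\n\t" scanned as a single whitespace token.
FRAGMENT = "x = 0;\n\twhile (x < 10) {\n\tx++;\n}"

CLASSES = [
    ("whitespace", r"[ \n\t]+"),
    ("keyword", r"while\b"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("number", r"[0-9]+"),
    ("other", r"\+\+|[=;(){}<]"),
]
SCANNER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in CLASSES))


def count_classes(text):
    """Count how many tokens of each class appear in text.

    finditer silently skips characters no pattern matches; every
    character of FRAGMENT is covered by one of the patterns above.
    """
    return Counter(match.lastgroup for match in SCANNER.finditer(text))


counts = count_classes(FRAGMENT)
print(dict(counts))
```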

Fortran and Lookahead

In Fortran, white space is completely ignored:

DO 5 I = 1,25 is a loop: the body extends to the statement labeled 5, and I runs from 1 to 25.
DO 5 I = 1.25 is an assignment of 1.25 to the variable DO5I.

Lookahead (left-to-right scan): in Fortran, to decide whether DO is a token on its own, we have to look ahead past the = and see whether a , or a . follows.
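A rough sketch of that decision follows. The function name classify_do_statement is hypothetical, and the rule used (a comma outside parentheses after the = means a loop) is a simplification of real Fortran, which a production lexer would refine.

```python
def classify_do_statement(line):
    """Guess whether a statement beginning with DO is a loop or an assignment.

    Simplified rule (an assumption for illustration): after removing blanks,
    a comma outside parentheses to the right of "=" means loop bounds.
    """
    text = line.replace(" ", "").upper()  # blanks are insignificant
    if not text.startswith("DO") or "=" not in text:
        return "other"
    right_hand_side = text.split("=", 1)[1]
    depth = 0
    for ch in right_hand_side:  # this scan is the lookahead
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return "loop"
    return "assignment"


print(classify_do_statement("DO 5 I = 1,25"))  # loop
print(classify_do_statement("DO 5 I = 1.25"))  # assignment
```

The scanner cannot commit to DO as a keyword until it has read well past it, which is exactly the cost the notes describe.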

A good design keeps the amount of lookahead to a minimum.

Fortran was designed this way because it was easy to insert blanks by accident on punch cards.

It is impossible to avoid lookahead completely: on reading =, the lexer does not yet know whether the token will turn out to be = or ==.
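One character of lookahead is enough for this case. A minimal sketch, with a hypothetical helper scan_operator that handles only operators starting with = or >:

```python
def scan_operator(text, pos):
    """Return (lexeme, next position) for the operator starting at text[pos].

    Only "=", "==", ">" and ">=" are handled in this sketch.
    """
    first = text[pos]
    if first not in ("=", ">"):
        raise ValueError(f"not an operator start: {first!r}")
    # Peek one character ahead before committing to the shorter token.
    if pos + 1 < len(text) and text[pos + 1] == "=":
        return first + "=", pos + 2
    return first, pos + 1


print(scan_operator("x == 1", 2))  # ('==', 4)
print(scan_operator("x = 1", 2))   # ('=', 3)
```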

PL/I (Programming Language One): a programming language designed with no reserved keywords. Consider the following program:

IF ELSE THEN THEN THEN = ELSE; ELSE ELSE = THEN: it is hard to tell keywords from identifiers.
DECLARE(ARG1, ..., ARGN): DECLARE can be either a keyword or an array reference. At this point we do not have enough information to decide; we have to read past the closing parenthesis. Since the number of arguments is unbounded, this requires unbounded lookahead.
C++ templates are similarly confusing: in Foo<Bar<Bazz>>, is >> a right shift or the end of two template argument lists?

Unbounded lookahead: no way to bound the length of lookahead.

// QUESTION: why not just use white space??? Avoid lookahead?

Regular Language

We use regular languages to define token classes. We use a regular expression to match a lexeme to a token class.

The following is the grammar of regular expressions:

R = \epsilon
  | `c` in Sigma
  | R + R
  | RR
  | R*

Meaning Function

Meaning Function: L : syntax -> semantics

Semantics: a language, which is a set of strings

There are many ways to arrange a sentence to express the same idea.

Why we need meaning function:

It makes clear what is syntax and what is semantics
It allows us to consider notation as a separate issue (e.g. numbers can be written as Roman numerals or as Arabic numerals)
L is a many-to-one function: this allows us to optimize our code (what we write) without changing its behavior (what we mean)

Note that L is never one-to-many. Otherwise a single expression would have more than one meaning, and the language would be ill-defined.
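To make the meaning function concrete, here is a sketch that represents the five grammar cases as Python classes and computes L restricted to strings of at most max_len characters (the full language may be infinite). The class names and the max_len cut-off are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Epsilon:
    pass


@dataclass(frozen=True)
class Char:
    c: str


@dataclass(frozen=True)
class Union:  # R + R
    left: object
    right: object


@dataclass(frozen=True)
class Concat:  # RR
    left: object
    right: object


@dataclass(frozen=True)
class Star:  # R*
    inner: object


def L(regex, max_len):
    """Map syntax (a regex tree) to semantics (a set of strings), truncated at max_len."""
    if isinstance(regex, Epsilon):
        return {""}
    if isinstance(regex, Char):
        return {regex.c} if max_len >= 1 else set()
    if isinstance(regex, Union):
        return L(regex.left, max_len) | L(regex.right, max_len)
    if isinstance(regex, Concat):
        return {a + b
                for a in L(regex.left, max_len)
                for b in L(regex.right, max_len)
                if len(a + b) <= max_len}
    if isinstance(regex, Star):
        base = L(regex.inner, max_len)
        result = {""}
        frontier = {""}
        while frontier:  # keep appending until no new short string appears
            frontier = {a + b for a in frontier for b in base
                        if len(a + b) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {regex!r}")


a = Char("a")
star = Star(a)                                   # a*
unrolled = Union(Epsilon(), Concat(a, Star(a)))  # epsilon + aa*
print(sorted(L(star, 3)))                # ['', 'a', 'aa', 'aaa']
print(L(star, 3) == L(unrolled, 3))      # True
```

The two trees star and unrolled are different syntax with the same meaning, which is the many-to-one property in miniature.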

Regular Expression for Token Classes

Regular Expression for Token Class:

Keyword: write each keyword out literally and take the union: 'if' + 'else' + 'begin' + ...
Digits: [0-9][0-9]*
Identifier: [a-zA-Z]([a-zA-Z] + [0-9])*
Whitespace: (' ' + '\n' + '\t')+
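These definitions carry over to Python's re module almost directly, with | in place of + for union. The sketch below writes the identifier pattern as a letter followed by any number of letters or digits and the whitespace pattern as one or more blanks, newlines, or tabs; the keyword set and the helper token_class are illustrative.

```python
import re

KEYWORDS = {"if", "else", "begin"}  # illustrative subset
DIGITS = re.compile(r"[0-9][0-9]*")
IDENTIFIER = re.compile(r"[a-zA-Z]([a-zA-Z]|[0-9])*")
WHITESPACE = re.compile(r"( |\n|\t)+")


def token_class(lexeme):
    """Return the token class whose regular expression matches the whole lexeme."""
    if lexeme in KEYWORDS:  # keywords take priority over identifiers
        return "Keyword"
    if DIGITS.fullmatch(lexeme):
        return "Digits"
    if IDENTIFIER.fullmatch(lexeme):
        return "Identifier"
    if WHITESPACE.fullmatch(lexeme):
        return "Whitespace"
    return None


for lexeme in ["else", "42", "x1", " \n\t", "1x"]:
    print(repr(lexeme), token_class(lexeme))
```

The last lexeme, "1x", belongs to no class because an identifier must start with a letter.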

Some other regular expression syntax:

At least one: A+ equiv AA*
Union: A|B equiv A+B
Option: A? equiv A+\epsilon
Range: [a-z] equiv 'a'+'b'+...+'z'
Excluded range: complement of [a-z] equiv [^a-z]
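The equivalences can be spot-checked by comparing two patterns on every short string over a small alphabet. This is a bounded test rather than a proof, and the helper same_language is a hypothetical name; Python's re writes union as |.

```python
import re
from itertools import product


def same_language(pattern_a, pattern_b, alphabet="ab", max_len=4):
    """Check that two patterns accept the same strings up to max_len over alphabet."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            if bool(re.fullmatch(pattern_a, s)) != bool(re.fullmatch(pattern_b, s)):
                return False
    return True


print(same_language(r"a+", r"aa*"))                           # True
print(same_language(r"a?", r"(a|)"))                          # True
print(same_language(r"[a-c]", r"a|b|c", alphabet="abcd"))     # True
print(same_language(r"[^a-b]", r"c|d", alphabet="abcd"))      # True
print(same_language(r"a+", r"a*"))                            # False
```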
