Content

Overview
Scanner Implementations

Regular Expressions

                      1. RE Ambiguities
                      2. Finite Automata
                                                   1. RE -> NFA
                                                   2. NFA -> DFA
                                                                                1. Error States
                                                   3. DFA -> RE
                                                                                1. Transitive Closure Method
                                                                                2. State Removal Method
                                                                                3. Algebraic Method (Brzozowski)
                      4. Limitations

Overview

source language --> |lexical analyzer| --> token stream
                             |
                           error

Scanner Implementations

Loop & switch (handwritten)

                      - Efficient
                      - Hard to change

Regular expressions + finite automata

                      - _flex_ can generate scanner from RE

Regular Expressions

Epsilon ε: {""}
Symbol a: {"a"}
Concatenation AB: {rs|r∈A, s∈B}
Alternation A|B: {A∪B}
Repetition A*: {ε|A|(AA)|...}

RE Ambiguities

Disambiguation rules

                       - __Longest match__ wins
                       - Otherwise, __rule listed first__ wins

1. for - keyword = for
2. letter = [a-zA-Z]
3. digit = [0-9]
4. identifier = letter(letter|digit)*
5. sign = +|-|ε
6. integer = sign(0|[1-9]digit*)
7. decimal = integer.digit*
8. real = (integer|decimal)E sign digit+

Finite Automata

Formal language theory

                       - `Language Spec <-> RE <-> NFA <-> ε-NFA <-> DFA <-> mDFA`

Non-deterministic Finite Automata (NFA): more than 1 next-state for a given symbol
```
                       - __ε-NFA__
```

Deterministic Finite Automata (DFA): at most 1 transition from 1 state to another for a given symbol

                       - __Minimal DFA__
                       - __Transition table__: `T[s;c]` for a state `s` and a character `c`

RE -> NFA

Thompson's Procedure

Linear transition (not the other way round)
NFAs & REs difficult to program

NFA -> DFA

Closure: determining a subset of states in NFA forming a state in DFA
Any DFA state containing a final NFA state is a final state
Many non-deterministic paths -> more combined states

Every RE has a unique minimal DFA -> compare and see if REs are equivalent

                       1. __Group accepting states__ together, others together.
                       2. __Pick unmarked group__ `G` and check consistency. 
                                                    - Consistent: mark
                                                    - Inconsistent: __split__ into maximal consistent subgroups and replace `G` by these. Unmark all groups.
                       3. Do until no unmarked groups.

Error States

If not all possible state transitions are given, then incomplete -> can use error state

DFA -> RE

Language spec to RE not always easy, sometimes to DFA easier, but programming languages only accept REs.

Examples

Transitive Closure Method

Clear & simple
Tedious; creates long REs

Q = {q1, q2, ...} be set of automation states
R_ij = set of all strings transit from qi to qj
RE = union of all R_sf, qs = starting state, qf = 1 of final states

R_k_ij: set of all strings that transit from qi to qj without passing though any state higher than qk

R_1_ij,R_2_ij,...,R_|Q|_ij = R_ij

R_k_ij = R_(k-1)_ij 
       + R_(k-1)_ik (R_(k-1)_kk)* R_(k-1)_kj

R_0_ij = r     if i!=j and r transitions from qi to qj
         r + ε if i==j and r transitions from qi to qj
         ∅     otherwise

State Removal Method

Intuitive, useful
Not straightforward to implement

Unify all final states into a single one using ε-trans
Unify all multi-transitions into a single one that contains union of inputs
Remove states & change transitions accordingly, until only starting & final states
Get resulting RE by direct calculation

Algebraic Method (Brzozowski)

Elegant
Creates compact REs

Create a system of REs with 1 unknown for each state and solve for R1

                       - For each state `qi`, `Ri` is a __union of terms__
                       - `aRj`: transition `a` from `qi` to `qj`
                       - `ε`: if `Ri` is final state

Arden’s Lemma

                       - `E = α + Eβ = α(β)*`, where `α`, `β` are REs

Limitations

RE describes only regular languages
RE <-> finite automaton, which only has a finite number of states
Nested structures (e.g. parentheses, comments, etc.) need a stack of sufficiently large size for processing
Start states? -> flex/lex provides

Lexical Analysis

Content

Overview

Scanner Implementations

Regular Expressions

RE Ambiguities

Finite Automata

RE -> NFA

NFA -> DFA

Error States

DFA -> RE

Transitive Closure Method

State Removal Method

Algebraic Method (Brzozowski)

Limitations

results matching ""

No results matching ""