Content

Grammar
Parse
1. Parse Representation
  1. Top-down Derivation & Bottom-up Reduction
  2. Parse Tree (Concrete Syntax Tree)
2. Parse Notation
  1. Backus-Naur Form (BNF)
  2. Extended BNF (EBNF)
Context-Free Grammar (CFG)
1. Definition
Shift-Reduce Parsing
Ambiguous Grammar
Recursive Productions & Factoring
Left or Right Recursion?
Generalized LR

Grammar

sentence -> <subject> <verb> <object>
subject  -> This | computer | I
...

Each line: rule || derivation || production

Parse

Parse Representation

Top-down Derivation & Bottom-up Reduction

Generally, top-down parsers easier to write than bottom-up.

Parse Tree (Concrete Syntax Tree)

No order of applying rules.

Parse Notation

Backus-Naur Form (BNF)

S := '-'FN | FN
FN := DL | DL '.' DL
DL := D | D DL
D :='0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9

Terminal: in quote Symbol (S: start symbol)

Token: terminals (quoted literals || UPPERCASE identifiers)

Extended BNF (EBNF)

S := '-'? D+ ('.' D+)?
D := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Can be written in BNF.

Context-Free Grammar (CFG)

CFG ≡ corresponding pushdown automata (PDA)
PDA = FSA + stack (memory) RE = FSA (no memory) Turing Machine = FSA + 2 stacks

Difficult to parse -> easy to parse

Definition

Alphabet (V)
- Terminal alphabet (T) (input alphabet)
- Non-terminal alphabet (N) (stack alphabet)
Grammar: 4-tuple {S, P, N, T}
- S: starting symbol
- P: production rules
  - R -> E
    - R can be replaced by E
    - R can derive E
    - R can produce E
    - E can reduce to R
    - E can be derived from R
- Sentence: a string of symbols from T derived from S using P
Equivalence
- Language (L(G)): set of sentences derivable using G
- G is equivalent to G' if L(G) = L(G')

Shift-Reduce Parsing

Ambiguous Grammar

Operator precedence

Tell compiler your precedence choice
```
  %left '+' '-'
  %left '*' '/'
```

Rewrite grammar (introduce term and factor)

  E -> E t_op E | T
  t_op -> + | -
  T -> T f_op T | F
  f_op -> * | /
  F -> int | (E)

Operator associativity

Tell compiler

Rewrite grammar (left recursion)

  E -> E t_op T | T
  t_op -> + | -
  T -> T f_op F | F
  f_op -> * | /
  F -> int | (E)

Dangling else: if C1 then E1 if C2 then E2 else E3

Tell compiler

  %nonassoc IFX
  %nonassoc ELSE
  ...stmt: ... | IF C stmt %prec IFX | IF C stmt ELSE stmt ...

Rewrite grammar

  IFE -> IFm | IFu
  IFm -> if C then IFm else IFm | E
  IFu -> if C then E | if C then IFm else IFu | E

Recursive Productions & Factoring

# left recursion:
X -> Xa | Xb | AB | C | DEF

# to right recursion:
X -> ABX' | CX' | DEFX'
X' -> aX' | bX' | ε

# top-down parsers can't decide which production to use:
Stmt -> if C then Stmt else Stmt | if C then Stmt | Other | ...

# factor out common prefix:
Stmt -> if C then Stmt OptElse | Other | ...
OptElse -> else S | ε

Left or Right Recursion?

Left recursion: bad for LL(1), good for LALR(1)
- Keep reducing as we go along (keep the stack space small)
Right recursion
- All items pushed onto stack, then start reducing

But same amount of work!

Generalized LR

S -> aSa | bSb | ε

LR parser can't tell where is the midpoint.

A GLR parser will process all possible interpretations of the given input in BFS; when there's conflict, fork 2 threads and try both.
Can be > 1 accepted path (ambiguous).

Parsing