z, ? | toggle help (this) |

space, → | next slide |

shift-space, ← | previous slide |

d | toggle debug mode |

## <ret> | go to slide # |

c, t | table of contents (vi) |

f | toggle footer |

g | toggle follow |

r | reload slides |

n | toggle notes |

p | run preshow |

P | toggle pause |

s | choose style |

- Emmanuel Bastien
- Riviera.rb
- February 4, 2014

- Most of your programs take some kind of input.
- Let's assume you don't blindly start processing any input.
- How do you make sure the input is valid?

```
You must validate with a computational mechanism
at least as strong as the language being validated
-- Meredith L. Patterson
```

- So you build a parser,
- And now you are safe ...

- How long does your parser take to accept or reject your input?

- Type-0: All formal grammars
- Type-1: Context-sensitive grammars
- Type-2: Context-free grammars
- Type-3: Regular grammars

- Type-0/1: Turing machine
- Type-2: Pushdown automaton
- Type-3: Finite state automaton

- How long might a Turing machine take to accept or reject your input?

- At most deterministic context-free
- Or even better, regular

Subset of Lucene query syntax:

```
<query> ::= [ <field> ':' ] <term>
<term> ::= <word> | '"' <phrase> '"'
Examples:
title:"regular grammar"
keyword:parser
rivierarb
"finite \"state\" automaton"
```

- Let's use Ruby built-in regular expressions!

```
# <query> ::= [ <field> ':' ] <term>
# <term> ::= <word> | '"' <phrase> '"'
query = /^(\w+:)?(\w+|\"([^\"]|\[\"])*\")$/
```

`"aaaaaaaaaaaaaaaaaaaaaaaaaaaaab" =~ /^(a+)+$/`

```
Today, regular expressions have also become a
shining example of how ignoring good theory
leads to bad programs.
-- Russ Cox
```

- Extensions like
*backreferences*built into common standard regular expression implementations lead to non-context-free grammars. - In this case there is no better algorithm than exponential!
- So be careful when writting regular expressions,
- and
**NEVER**take regular expressions from untrusted input!

```
title:"regular grammar" AND
( keyword:parser OR rivierarb ) AND
"finite \"state\" automaton"
```

- LR(k) -- rightmost derivation with
*k*lookahead - LL(k) -- leftmost derivarion with
*k*lookahead - Packrat/PEG
- Reminder: ad-hoc recursive-descent with naive backtracking is exponential-time!

- GNU Bison (LALR)
- ANTLR (LL & more)
- Treetop (PEG)

- Recognition-based (vs. generation-based)
- Prioritized choices
- Greedy rules
- Syntactic predicates
- No separate lexer (expression of mixed languages!)

- finite set of terminals
- finite set of nonterminals N
- finite set of rules of the form A <- e where A ∈ N, e is a parsing expression
- a parsing expression called the start expression

- empty string
- terminal
- nonterminal
- sequence of parsing expressions
- prioritized choice between alternatives
- optional, zero-or-more, one-or-more (all greedy)
- syntactic predicates (does not consume input)

```
grammar Arithmetic
rule additive
multitive ( '+' multitive )*
end
rule multitive
primary ( '*' primary )*
end
rule primary
'(' additive ')' / number
end
rule number
'-'? [0-9]+
end
end
```

- Packrat parsers are not minimal in space compared to LL/LR parsers.
- Not good
*as is*for*stateful*syntax (e.g. C and C++). - Equality of languages generated by CFGs is undecidable.
- No proof that PEGs can generate all context-free languages.
- Left recursion is not available with PEGs.
- No built-in operator precedence support.

PAUSED