peg

Homepage: https://www.gnu.org/software/emacs

Author: Helmut Eller

Summary

Parsing Expression Grammars in Emacs Lisp

Commentary

This package implements Parsing Expression Grammars for Emacs Lisp.

Parsing Expression Grammars (PEG) are a formalism in the spirit of
Context Free Grammars (CFG) with some simplifications which makes
the implementation of PEGs as recursive descent parsers particularly
simple and easy to understand [Ford, Baker].
PEGs are more expressive than regexps and potentially easier to use.

This file implements the macros `define-peg-rule', `with-peg-rules', and
`peg-parse' which parses the current buffer according to a PEG.
E.g. we can match integers with:

    (with-peg-rules
        ((number sign digit (* digit))
         (sign   (or "+" "-" ""))
         (digit  [0-9]))
      (peg-run (peg number)))
or
    (define-peg-rule digit ()
      [0-9])
    (peg-parse (number sign digit (* digit))
               (sign   (or "+" "-" "")))

In contrast to regexps, PEGs allow us to define recursive "rules".
A "grammar" is a set of rules.  A rule is written as (NAME PEX...)
E.g. (sign (or "+" "-" "")) is a rule with the name "sign".
The syntax for PEX (Parsing Expression) is a follows:

    Description		Lisp		Traditional, as in Ford's paper
    ===========		====		===========
    Sequence			(and E1 E2)	e1 e2
    Prioritized Choice	(or E1 E2)	e1 / e2
    Not-predicate		(not E)		!e
    And-predicate		(if E)		&e
    Any character		(any)		.
    Literal string		"abc"		"abc"
    Character C		(char C)	'c'
    Zero-or-more		(* E)		e*
    One-or-more		(+ E)		e+
    Optional			(opt E)		e?
    Non-terminal             SYMBOL		A
    Character range		(range A B)	[a-b]
    Character set		[a-b "+*" ?x]	[a-b+*x]   ;Note: it's a vector
    Character classes	[ascii cntrl]
    Boolean-guard		(guard EXP)
    Syntax-Class		(syntax-class NAME)
    Local definitions	(with RULES PEX...)
    Indirect call            (funcall EXP ARGS...)
and
    Empty-string		(null)		ε
    Beginning-of-Buffer	(bob)
    End-of-Buffer		(eob)
    Beginning-of-Line	(bol)
    End-of-Line		(eol)
    Beginning-of-Word	(bow)
    End-of-Word		(eow)
    Beginning-of-Symbol	(bos)
    End-of-Symbol		(eos)

Rules can refer to other rules, and a grammar is often structured
as a tree, with a root rule referring to one or more "branch
rules", all the way down to the "leaf rules" that deal with actual
buffer text.  Rules can be recursive or mutually referential,
though care must be taken not to create infinite loops.

Named rulesets:

You can define a set of rules for later use with:

    (define-peg-ruleset myrules
      (sign  () (or "+" "-" ""))
      (digit () [0-9])
      (nat   () digit (* digit))
      (int   () sign digit (* digit))
      (float () int "." nat))

and later refer to it:

    (with-peg-rules
        (myrules
         (complex float "+i" float))
      ... (peg-parse nat "," nat "," complex) ...)

Parsing actions:

PEXs also support parsing actions, i.e. Lisp snippets which are
executed when a pex matches.  This can be used to construct syntax
trees or for similar tasks.  The most basic form of action is
written as:

    (action FORM)          ; evaluate FORM for its side-effects

Actions don't consume input, but are executed at the point of
match.  Another kind of action is called a "stack action", and
looks like this:

    `(VAR... -- FORM...)   ; stack action

A stack action takes VARs from the "value stack" and pushes the
results of evaluating FORMs to that stack.

The value stack is created during the course of parsing.  Certain
operators (see below) that match buffer text can push values onto
this stack.  "Upstream" rules can then draw values from the stack,
and optionally push new ones back.  For instance, consider this
very simple grammar:

(with-peg-rules
    ((query (+ term) (eol))
     (term key ":" value (opt (+ [space]))
	   `(k v -- (cons (intern k) v)))
     (key (substring (and (not ":") (+ [word]))))
     (value (or string-value number-value))
     (string-value (substring (+ [alpha])))
     (number-value (substring (+ [digit]))
		   `(val -- (string-to-number val))))
  (peg-run (peg query)))

This invocation of `peg-run' would parse this buffer text:

name:Jane age:30

And return this Elisp sexp:

((age . 30) (name . "Jane"))

Note that, in complex grammars, some care must be taken to make
sure that the number and type of values drawn from the stack always
match those pushed.  In the example above, both `string-value' and
`number-value' push a single value to the stack.  Since the `value'
rule only includes these two sub-rules, any upstream rule that
makes use of `value' can be confident it will always and only push
a single value to the stack.

Stack action forms are in a sense analogous to lambda forms: the
symbols before the "--" are the equivalent of lambda arguments,
while the forms after the "--" are return values.  The difference
being that a lambda form can only return a single value, while a
stack action can push multiple values onto the stack.  It's also
perfectly valid to use `(-- FORM...)' or `(VAR... --)': the former
pushes values to the stack without consuming any, and the latter
pops values from the stack and discards them.

Derived Operators:

The following operators are implemented as combinations of
primitive expressions:

    (substring E)  ; Match E and push the substring for the matched region.
    (region E)     ; Match E and push the start and end positions.
    (replace E RPL); Match E and replace the matched region with RPL.
    (list E)       ; Match E and push a list of the items that E produced.

See `peg-ex-parse-int' in `peg-tests.el' for further examples.

Regexp equivalents:

Here a some examples for regexps and how those could be written as pex.
[Most are taken from rx.el]

    "^[a-z]*"
    (and (bol) (* [a-z]))

    "\n[^ \t]"
    (and "\n" (not [" \t"]) (any))

    "\\*\\*\\* EOOH \\*\\*\\*\n"
    "*** EOOH ***\n"

    "\\<\\(catch\\|finally\\)\\>[^_]"
    (and (bow) (or "catch" "finally") (eow) (not "_") (any))

    "[ \t\n]*:\\([^:]+\\|$\\)"
    (and (* [" \t\n"]) ":" (or (+ (not ":") (any)) (eol)))

    "^content-transfer-encoding:\\(\n?[\t ]\\)*quoted-printable\\(\n?[\t ]\\)*"
    (and (bol)
         "content-transfer-encoding:"
         (* (opt "\n") ["\t "])
         "quoted-printable"
         (* (opt "\n") ["\t "]))

    "\\$[I]d: [^ ]+ \\([^ ]+\\) "
    (and "$Id: " (+ (not " ") (any)) " " (+ (not " ") (any)) " ")

    "^;;\\s-*\n\\|^\n"
    (or (and (bol) ";;" (* (syntax-class whitespace)) "\n")
        (and (bol) "\n"))

    "\\\\\\\\\\[\\w+"
    (and "\\\\[" (+ (syntax-class word)))

See ";;; Examples" in `peg-tests.el' for other examples.

Rule argument and indirect calls:

Rules can take arguments and those arguments can themselves be PEGs.
For example:

    (define-peg-rule 2-or-more (peg)
      (funcall peg)
      (funcall peg)
      (* (funcall peg)))

    ... (peg-parse
         ...
         (2-or-more (peg foo))
         ...
         (2-or-more (peg bar))
         ...)

References:

[Ford] Bryan Ford. Parsing Expression Grammars: a Recognition-Based
Syntactic Foundation. In POPL'04: Proceedings of the 31st ACM
SIGPLAN-SIGACT symposium on Principles of Programming Languages,
pages 111-122, New York, NY, USA, 2004. ACM Press.
http://pdos.csail.mit.edu/~baford/packrat/

[Baker] Baker, Henry G. "Pragmatic Parsing in Common Lisp".  ACM Lisp
Pointers 4(2), April--June 1991, pp. 3--15.
http://home.pipeline.com/~hbaker1/Prag-Parse.html

Roman Redziejowski does good PEG related research
http://www.romanredz.se/pubs.htm

Todo:

- Fix the exponential blowup in `peg-translate-exp'.
- Add a proper debug-spec for PEXs.

Dependencies

Reverse dependencies