smie

Homepage: https://www.gnu.org/software/emacs

Author: Stefan Monnier

Summary

Simple Minded Indentation Engine

Commentary

While working on the SML indentation code, the idea grew that maybe
I could write something generic to do the same thing, and at the
end of working on the SML code, I had a pretty good idea of what it
could look like.  That idea grew stronger after working on
LaTeX indentation.

So at some point I decided to try it out, by writing a new
indentation code for Coq while trying to keep most of the code
"table driven", where only the tables are Coq-specific.  The result
(which was used for Beluga-mode as well) turned out to be based on
something pretty close to an operator precedence parser.

So here is another rewrite, this time following the actual principles of
operator precedence grammars.  Why OPG?  Even though they're among the
weakest kinds of parsers, these parsers have some very desirable properties
for Emacs:
- most importantly for indentation, they work equally well in either
  direction, so you can use them to parse backward from the indentation
  point to learn the syntactic context;
- they work locally, so there's no need to keep a cache of
  the parser's state;
- because of that locality, indentation also works just fine when earlier
  parts of the buffer are syntactically incorrect since the indentation
  looks at "as little as possible" of the buffer to make an indentation
  decision.
- they typically have no error handling and can't even detect a parsing
  error, so we don't have to worry about what to do in case of a syntax
  error because the parser just automatically does something.  Better yet,
  we can afford to use a sloppy grammar.

The benefits of this approach were presented in the following article,
which includes a kind of tutorial to get started with SMIE:

    SMIE: Weakness is Power!  Auto-indentation with incomplete information
    Stefan Monnier,  Journal 2021, volume 5, issue 1.
    doi: https://doi.org/10.22152/programming-journal.org/2021/5/1

A good background to understand the development (especially the parts
building the 2D precedence tables and then computing the precedence levels
from it) can be found in pages 187-194 of "Parsing techniques" by Dick Grune
and Ceriel Jacobs (BookBody.pdf available at
https://dickgrune.com/Books/PTAPG_1st_Edition/).

OTOH we had to kill many chickens, read many coffee grounds, and practice
untold numbers of black magic spells, to come up with the indentation code.
Since then, some of that code has been beaten into submission, but the
`smie-indent-keyword' function is still pretty obscure.


Conflict resolution:

- One source of conflicts is when you have:
    (exp ("IF" exp "ELSE" exp "END") ("CASE" cases "END"))
    (cases (cases "ELSE" insts) ...)
  The IF-rule implies ELSE=END and the CASE-rule implies ELSE>END.
  This can be resolved simply with:
    (exp ("IF" expelseexp "END") ("CASE" cases "END"))
    (expelseexp (exp) (exp "ELSE" exp))
    (cases (cases "ELSE" insts) ...)
- Another source of conflict is when a terminator/separator is used to
  terminate elements at different levels, as in:
    (decls ("VAR" vars) (decls "," decls))
    (vars (id) (vars "," vars))
  often these can be resolved by making the lexer distinguish the two
  kinds of commas, e.g. based on the following token.

TODO & BUGS:

- We could try to resolve conflicts such as the IFexpELSEexpEND -vs-
  CASE(casesELSEexp)END automatically by changing the way BNF rules such as
  the IF-rule is handled.  I.e. rather than IF=ELSE and ELSE=END, we could
  turn them into IFEND and IF=END.
- Using the structural information SMIE gives us, it should be possible to
  implement a `smie-align' command that would automatically figure out what
  there is to align and how to do it (something like: align the token of
  lowest precedence that appears the same number of times on all lines,
  and then do the same on each side of that token).
- Maybe accept two juxtaposed non-terminals in the BNF under the condition
  that the first always ends with a terminal, or that the second always
  starts with a terminal.
- Permit EBNF-style notation.
- If the grammar has conflicts, the only way is to make the lexer return
  different tokens for the different cases.  This extra work performed by
  the lexer can be costly and unnecessary: we perform this extra work every
  time we find the conflicting token, regardless of whether or not the
  difference between the various situations is relevant to the current
  situation.  E.g. we may try to determine whether a ";" is a ";-operator"
  or a ";-separator" in a case where we're skipping over a "begin..end" pair
  where the difference doesn't matter.  For frequently occurring tokens and
  rarely occurring conflicts, this can be a significant performance problem.
  We could try and let the lexer return a "set of possible tokens
  plus a refinement function" and then let parser call the refinement
  function if needed.
- Make it possible to better specify the behavior in the face of
  syntax errors.  IOW provide some control over the choice of precedence
  levels within the limits of the constraints.  E.g. make it possible for
  the grammar to specify that "begin..end" has lower precedence than
  "Module..EndModule", so that if a "begin" is missing, scanning from the
  "end" will stop at "Module" rather than going past it (and similarly,
  scanning from "Module" should not stop at a spurious "end").

Dependencies

Reverse dependencies