Homepage: https://www.gnu.org/software/emacs
Author: Stefan Monnier
Simple Minded Indentation Engine
While working on the SML indentation code, the idea grew that maybe I could write something generic to do the same thing, and at the end of working on the SML code, I had a pretty good idea of what it could look like. That idea grew stronger after working on LaTeX indentation. So at some point I decided to try it out, by writing a new indentation code for Coq while trying to keep most of the code "table driven", where only the tables are Coq-specific. The result (which was used for Beluga-mode as well) turned out to be based on something pretty close to an operator precedence parser. So here is another rewrite, this time following the actual principles of operator precedence grammars. Why OPG? Even though they're among the weakest kinds of parsers, these parsers have some very desirable properties for Emacs: - most importantly for indentation, they work equally well in either direction, so you can use them to parse backward from the indentation point to learn the syntactic context; - they work locally, so there's no need to keep a cache of the parser's state; - because of that locality, indentation also works just fine when earlier parts of the buffer are syntactically incorrect since the indentation looks at "as little as possible" of the buffer to make an indentation decision. - they typically have no error handling and can't even detect a parsing error, so we don't have to worry about what to do in case of a syntax error because the parser just automatically does something. Better yet, we can afford to use a sloppy grammar. The benefits of this approach were presented in the following article, which includes a kind of tutorial to get started with SMIE: SMIE: Weakness is Power! Auto-indentation with incomplete information Stefan Monnier,Journal 2021, volume 5, issue 1. doi: https://doi.org/10.22152/programming-journal.org/2021/5/1 A good background to understand the development (especially the parts building the 2D precedence tables and then computing the precedence levels from it) can be found in pages 187-194 of "Parsing techniques" by Dick Grune and Ceriel Jacobs (BookBody.pdf available at https://dickgrune.com/Books/PTAPG_1st_Edition/). OTOH we had to kill many chickens, read many coffee grounds, and practice untold numbers of black magic spells, to come up with the indentation code. Since then, some of that code has been beaten into submission, but the `smie-indent-keyword' function is still pretty obscure. Conflict resolution: - One source of conflicts is when you have: (exp ("IF" exp "ELSE" exp "END") ("CASE" cases "END")) (cases (cases "ELSE" insts) ...) The IF-rule implies ELSE=END and the CASE-rule implies ELSE>END. This can be resolved simply with: (exp ("IF" expelseexp "END") ("CASE" cases "END")) (expelseexp (exp) (exp "ELSE" exp)) (cases (cases "ELSE" insts) ...) - Another source of conflict is when a terminator/separator is used to terminate elements at different levels, as in: (decls ("VAR" vars) (decls "," decls)) (vars (id) (vars "," vars)) often these can be resolved by making the lexer distinguish the two kinds of commas, e.g. based on the following token. TODO & BUGS: - We could try to resolve conflicts such as the IFexpELSEexpEND -vs- CASE(casesELSEexp)END automatically by changing the way BNF rules such as the IF-rule is handled. I.e. rather than IF=ELSE and ELSE=END, we could turn them into IF END and IF=END. - Using the structural information SMIE gives us, it should be possible to implement a `smie-align' command that would automatically figure out what there is to align and how to do it (something like: align the token of lowest precedence that appears the same number of times on all lines, and then do the same on each side of that token). - Maybe accept two juxtaposed non-terminals in the BNF under the condition that the first always ends with a terminal, or that the second always starts with a terminal. - Permit EBNF-style notation. - If the grammar has conflicts, the only way is to make the lexer return different tokens for the different cases. This extra work performed by the lexer can be costly and unnecessary: we perform this extra work every time we find the conflicting token, regardless of whether or not the difference between the various situations is relevant to the current situation. E.g. we may try to determine whether a ";" is a ";-operator" or a ";-separator" in a case where we're skipping over a "begin..end" pair where the difference doesn't matter. For frequently occurring tokens and rarely occurring conflicts, this can be a significant performance problem. We could try and let the lexer return a "set of possible tokens plus a refinement function" and then let parser call the refinement function if needed. - Make it possible to better specify the behavior in the face of syntax errors. IOW provide some control over the choice of precedence levels within the limits of the constraints. E.g. make it possible for the grammar to specify that "begin..end" has lower precedence than "Module..EndModule", so that if a "begin" is missing, scanning from the "end" will stop at "Module" rather than going past it (and similarly, scanning from "Module" should not stop at a spurious "end").