Homepage: https://www.gnu.org/software/emacs
Author: Simon Marshall
Generate efficient regexps to match strings
The "opt" in "regexp-opt" stands for "optim\\(al\\|i[sz]e\\)".
This package generates a regexp that matches any one of a given list of
strings, so that the regexp generated by:
  (regexp-opt strings)
is equivalent to, but more efficient than, the regexp generated by:
  (mapconcat 'regexp-quote strings "\\|")
For example:
(let ((strings '("cond" "if" "when" "unless" "while"
                 "let" "let*" "progn" "prog1" "prog2"
                 "save-restriction" "save-excursion" "save-window-excursion"
                 "save-current-buffer" "save-match-data"
                 "catch" "throw" "unwind-protect" "condition-case")))
  (concat "(" (regexp-opt strings t) "\\>"))
=> "(\\(c\\(atch\\|ond\\(ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\|save-\\(current-buffer\\|excursion\\|match-data\\|restriction\\|window-excursion\\)\\|throw\\|un\\(less\\|wind-protect\\)\\|wh\\(en\\|ile\\)\\)\\>"
Searching using the above example `regexp-opt' regexp takes approximately
two-thirds of the time taken using the equivalent `mapconcat' regexp.
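The equivalence claimed above can be checked with a minimal sketch like the
following.  The names `my-demo-strings', `my-demo-anchor' and
`my-demo-same-matches-p' are hypothetical and exist only for this
illustration.  Both regexps are anchored to the whole string, on the
assumption that the two forms may pick different alternatives when matching
only a prefix (for example "let" versus "let*").  The sketch checks which
strings are accepted; it makes no claim about timing.

```elisp
(require 'regexp-opt)

;; Hypothetical sample keywords, used only for this illustration.
(defvar my-demo-strings '("cond" "if" "when" "unless" "while" "let" "let*"))

(defun my-demo-anchor (regexp)
  "Wrap REGEXP in a shy group that must match an entire string."
  (concat "\\`\\(?:" regexp "\\)\\'"))

(defun my-demo-same-matches-p (strings candidates)
  "Return non-nil if both regexps for STRINGS agree on all CANDIDATES.
One regexp comes from `regexp-opt', the other from `mapconcat' with
`regexp-quote'."
  (let ((optimized (my-demo-anchor (regexp-opt strings)))
        (naive (my-demo-anchor (mapconcat #'regexp-quote strings "\\|")))
        (case-fold-search nil)
        (same t))
    (dolist (candidate candidates same)
      ;; Normalize each result to t or nil before comparing.
      (unless (eq (and (string-match-p optimized candidate) t)
                  (and (string-match-p naive candidate) t))
        (setq same nil)))))
```

Calling (my-demo-same-matches-p my-demo-strings '("let" "let*" "lets"))
compares the two regexps on strings both inside and outside the list.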
Since this package was written to produce efficient regexps, not regexps
efficiently, it is probably not a good idea to in-line too many calls in
your code, unless you use the following trick with `eval-when-compile':
(defvar definition-regexp
  (eval-when-compile
    (concat "^("
            (regexp-opt '("defun" "defsubst" "defmacro" "defalias"
                          "defvar" "defconst") t)
            "\\>")))
The byte-compiled code will be as if you had defined the variable thus:
(defvar definition-regexp
  "^(\\(def\\(alias\\|const\\|macro\\|subst\\|un\\|var\\)\\)\\>")
Note that if you use this trick for all instances of `regexp-opt' and
`regexp-opt-depth' in your code, regexp-opt.el only has to be loaded at
compile time.  Note also that if regexp-opt.el later changes, perhaps to fix
a bug or to improve the efficiency of `regexp-opt' regexps, you have to
recompile your code for those changes to take effect.
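As a sketch of how `regexp-opt' and `regexp-opt-depth' might be used together
under `eval-when-compile', the following computes both a regexp and the
number of the group that follows the keyword group.  The names
`my-demo-definition-matcher' and `my-demo-definition-name', the shortened
keyword list, and the pattern used for the defined name are assumptions made
for this illustration only.

```elisp
(eval-when-compile (require 'regexp-opt))

(defconst my-demo-definition-matcher
  (eval-when-compile
    (let ((keywords (regexp-opt '("defun" "defvar" "defconst") t)))
      ;; The name group comes after every group inside KEYWORDS, so its
      ;; number is one more than the depth reported by `regexp-opt-depth'.
      (cons (concat "^(" keywords "\\>[ \t]+\\([^ \t\n()]+\\)")
            (1+ (regexp-opt-depth keywords)))))
  "Cons (REGEXP . NAME-GROUP), computed when this code is compiled.")

(defun my-demo-definition-name (line)
  "Return the name defined on LINE, or nil if LINE is not a definition."
  (let ((case-fold-search nil))
    (when (string-match (car my-demo-definition-matcher) line)
      (match-string (cdr my-demo-definition-matcher) line))))
```

Using `regexp-opt-depth' here avoids hard-coding the group number, which
depends on how many groups the generated regexp happens to contain.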
Originally written for font-lock.el, from an idea from Stig's hl319.el, with
thanks for ideas also to Michael Ernst, Bob Glickstein, Dan Nicolaescu and
Stefan Monnier.
No doubt `regexp-opt' doesn't always produce optimal regexps, so code, ideas
or any other information to improve things are welcome.
One possible improvement would be to compile '("aa" "ab" "ba" "bb")
into "[ab][ab]" rather than "a[ab]\\|b[ab]". I'm not sure it's worth
it but if someone knows how to do it without going through too many
contortions, I'm all ears.