emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: prior work on non-backtracking regex engine?


From: Danny McClanahan
Subject: Re: prior work on non-backtracking regex engine?
Date: Sun, 07 Apr 2024 04:42:13 +0000

Thanks so much for this helpful feedback, Ihor! This makes me much more 
confident about packaging this as a dependency!

I've written out a summary of the packaging story for this proposed new 
dependency, along with a tentative work outline at the bottom.

On Wednesday, March 13th, 2024 at 09:19, Ihor Radchenko <yantar92@posteo.net> 
wrote:
> Danny McClanahan dmcc2@hypnicjerk.ai writes:
> 
> > (4) What is the best way to package third-party code for emacs?
> > 
> > I was thinking of architecting this regex engine as a third-party
> > codebase exposing a C ABI shared library, which the emacs build system
> > could detect as an optional dependency (like libjansson). I was hoping
> > to use rust to implement this regex engine, but I know that a cargo
> > package alone isn't enough for non-rust code to depend on: I was also
> > planning to maintain package recipes for this regex engine for
> > multiple linux distros, so that emacs could easily add it to the build
> > system (like librsvg). Are there any further constraints I should know
> > about for optional dependencies in the emacs build system?
> 
> 
> AFAIR, previous discussions raised some concerns about using Rust in
> particular:
> https://yhetil.org/emacs-devel/E1pKAM0-0001Ss-W6@fencepost.gnu.org/
> Because of Rust volatility and some license issues.
> 
> That said, the linked discussion was not about linking to rust libraries
> via C ABI, but about contributing Rust code to Emacs core directly.

Thanks so much for this context! It sounds like an optional C ABI library 
dependency is absolutely the ideal way to package something for emacs to depend 
on, both for legal and maintainability reasons. I absolutely agree with the 
quote from that linked discussion that "Rust changes too fast" for emacs to 
reliably expect to build its own rust code. The rust language makes it quite 
difficult to avoid involving cargo when packaging rust code for distribution, 
but the emacs build system itself should not need to be aware of cargo or rust 
at all, as per "we should not publicly refer to its existence" (to avoid hidden 
dependencies on its unstable behaviors). Each *distro packager* for this rust 
regex library would need to invoke cargo to compile/link the C ABI library + 
generated C headers, but those would just be installed into somewhere like 
/usr/include and /usr/lib like any other C library, so emacs should not need to 
be made aware of any rust- or cargo-specific behavior even if I choose to use 
rust as an implementation language. I am extremely enthusiastic about achieving 
a rust packaging solution for this work that can serve as a role model for 
other portable rust code exposing a C ABI library interface.

I'm particularly hoping to package any rust output as a shared and/or static C 
ABI library using rust's no_std configuration, which removes a large portion of 
rust stdlib code as well as avoiding any implicit libc dependency. Rust 
actually has a libc crate (https://github.com/rust-lang/libc) which I believe 
can be used in no_std environments to invoke libc if needed, but since I am 
hoping to produce a C ABI library interface which delegates to the caller for 
alloc/free callbacks (instead of performing any dynamic allocation in the regex 
engine library code), it may be feasible to avoid even a libc dep.

If I employ any exotic instructions, it would be via rust's core::simd API (see 
e.g. https://mcyoung.xyz/2023/11/27/simd-base64/) for platform-specific 
vectorized instruction selection. This thoughtful and simple interface falls 
back to portable SWAR techniques for generic platforms (see 
https://doc.rust-lang.org/stable/core/simd/index.html#portable-simd-works-on-every-target),
 so this feature of the rust language would actually *improve* our portability. 
However, since llvm (and therefore rust) still supports fewer platforms than 
gcc, I also intend to incorporate rustc_codegen_gcc (which compiles rust with 
libgccjit) and/or the gccrs frontend (not yet functional but improving rapidly) 
into the packaging scripts for this regex engine library, to ensure that emacs 
does not begin to implicitly disadvantage non-llvm platforms.

> Another somewhat relevant discussion is
> https://yhetil.org/emacs-devel/95980ffc-86e7-ad54-4a20-539d8c6ea5d0@mailo.com/

Wow, this is super neat!! A lot of my programming experience has actually been 
on developing build tools or integrating them with editors and this 
discussion/library looks super thoughtful! I will have to chew on this 
discussion more.

> P.S. Better regexp engine would be very welcome.

^_^ !!! This makes me so happy to hear! I took the time to learn how 
regex-emacs.c as well as other emacs C code interacts with the gap buffer last 
week, and was super glad to see how simple the gap buffer data structure is. I 
am not an expert on string search performance yet, but I believe the gap buffer 
(which is always allocated within a single block, and has at most two 
non-adjacent data sections) is likely much more amenable to high-performance 
search techniques than a more complex data structure such as a rope. And I was 
also *super* pleased to see that regex-emacs.h itself doesn't expose any 
dependency on the gap buffer or other internal emacs representations (except 
regarding multibyte encoding). So in my amateur evaluation, emacs actually 
seems very well-placed to take advantage of high-performance regex engine 
techniques without any big structural changes.

As a concrete example, I currently maintain helm-rg on MELPA 
(https://github.com/cosmicexplorer/helm-rg), which itself performs quite a few 
regex matches to extract process output from ripgrep, and have previously 
contributed to helm-ag and helm-swoop (the latter of which searches within 
emacs buffers instead of hitting the filesystem). While ripgrep is extremely 
fast (even over very large corporate monorepos) and emacs makes it simple 
enough to communicate efficiently with an asynchronous subprocess, I realized 
that regex matching over buffer contents (which emacs has already mapped into 
its virtual memory space) might be an improvement for multiple reasons:
(a) it avoids an unnecessary round-trip through the filesystem (also friendlier 
to TRAMP),
(b) it allows the emacs user to narrow the search via their preferred method of 
buffer list selection (as opposed to maintaining a separate interface to 
configure ripgrep's fs traversal with CLI flags and ignore files).

So practically, I'm hoping this work allows me to effectively deprecate helm-rg 
and replace a lot of finicky lisp code (which currently performs asynchronous 
IPC, globbing files, and writing back to the filesystem when editing match 
results inline) with a simple loop over a provided buffer list that performs a 
very fast zero-copy in-memory search and batched pattern replacement.

I would *really like* to eventually have emacs depend on an existing regex 
engine like re2 or rust regex to take advantage of their bugfixes and 
optimizations, but both of those engines (and most others) require utf-8 input, 
and (I'm pretty sure) can't easily be made to support emacs's multibyte 
functionality. So I think there is a strong case for a new engine here, 
especially one licensed with the GPLv3 (or any later version) as opposed to the 
LGPL or other more permissive license.

-------
Informal sketch of work outline:
(1) simple, robust non-backtracking regex matcher with a backwards-compatible 
API for regex-emacs.h (no support for backrefs, maybe just a single NFA matcher 
with the PikeVM to test the API)
(2) basic benchmark harness + integration into emacs build system (so we can 
get a concrete performance comparison and %improvement, and start creating 
distro packages)
(3) implement a lazy DFA / other specialized automata to get performance 
generally on par with re2/rust regex
(4) implement a SIMD prefilter for literal substrings (the prototype should now 
be significantly faster than regex-emacs.c on all supported inputs)
(5) separate out a regex pattern parsing API, and expose regex objects to lisp 
code
(6) try implementing backrefs
(7) batched search-replace API for replacement of pattern strings in buffer 
contents

In general, I've decided to put off these features at first:
(a) backrefs
(b) per-mode customizable char classes & word/symbol boundaries
(c) regex matching against syntax properties (not the raw string data)

- I will probably propose deprecating (c) entirely, especially if tree-sitter 
ends up being able to take over syntax propertization in general.
- (b) is non-standard, and if we deprecate it we may be able to more easily 
lean on an existing regex pattern parser like regex-syntax 
(https://docs.rs/regex-syntax/), although regex-syntax also does not support 
(a) backrefs and it probably makes sense to make our own pattern parsing 
library anyway as part of (5).
- I personally would like to avoid entirely deprecating (a) backrefs even if by 
definition they require some level of backtracking, because I personally 
consider them very handy for small bounded inputs. If I can't figure out a 
clean way to support backrefs in this new engine, I would like to simply retain 
the regex-emacs.c engine for backref patterns (and maybe add some defvar that 
warns/errors if a backref pattern is compiled when set).



Thanks so much for providing such thoughtful and patient feedback!
--Danny



reply via email to

[Prev in Thread] Current Thread [Next in Thread]