[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[ELPA-diffs] /srv/bzr/emacs/elpa r334: New package lex.el.
From: |
Stefan Monnier |
Subject: |
[ELPA-diffs] /srv/bzr/emacs/elpa r334: New package lex.el. |
Date: |
Wed, 09 Jan 2013 15:19:02 -0500 |
User-agent: |
Bazaar (2.5.0) |
------------------------------------------------------------
revno: 334
committer: Stefan Monnier <address@hidden>
branch nick: elpa
timestamp: Wed 2013-01-09 15:19:02 -0500
message:
New package lex.el.
added:
packages/lex/
packages/lex/lex.el
=== added directory 'packages/lex'
=== added file 'packages/lex/lex.el'
--- a/packages/lex/lex.el 1970-01-01 00:00:00 +0000
+++ b/packages/lex/lex.el 2013-01-09 20:19:02 +0000
@@ -0,0 +1,1463 @@
+;;; lex.el --- Lexical analyser construction
+
+;; Copyright (C) 2008,2013 Free Software Foundation, Inc.
+
+;; Author: Stefan Monnier <address@hidden>
+;; Keywords:
+;; Version: 1.0
+
+;; This program is free software; you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; This program is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+;;; Commentary:
+
+;; Format of regexps is the same as used for `rx' and `sregex'.
+;; Additions:
+;; - (ere RE) specify regexps using the ERE syntax.
+;; - (inter REs...) (aka `&') make a regexp that only matches
+;; if all its branches match. E.g. (inter (ere ".*a.*") (ere ".*b.*"))
+;; match any string that contain both an "a" and a "b", in any order.
+;; - (case-fold REs...) and (case-sensitive REs...) make a regexp that
+;; is case sensitive or not, regardless of case-fold-search.
+
+;; Input format of lexers:
+;;
+;; ALIST of the form ((RE . VAL) ...)
+
+;; Format of compiled DFA lexers:
+;;
+;; nil ; The trivial lexer that fails
+;; (CHAR . LEXER)
+;; (table . CHAR-TABLE)
+;; (stop VAL . LEXER) ; Match the empty string at point or LEXER.
+;; (check (PREDICATE . ARG) SUCCESS-LEXER . FAILURE-LEXER)
+
+;; Intermediate NFA nodes may additionally look like:
+;; (or LEXERs...)
+;; (orelse LEXERs...)
+;; (and LEXERs...)
+;; (join CONT . EXIT)
+;; Note: we call those things "NFA"s but they're not really NFAs.
+
+;;; Bugs:
+
+;; - `inter' doesn't work right. Matching `join' to the corresponding `and'
+;; is done incorrectly in some cases.
+;; - since `negate' uses intersections, it doesn't work right either.
+;; - "(\<)*" leads to a DFA that gets stuck in a cycle.
+
+;;; Todo:
+
+;; - dfa "no-fail" simplifier
+;; - dfa minimization
+;; - dfa compaction (different representation)
+;; - submatches
+;; - backrefs?
+;; - search rather than just match
+;; - extensions:
+;; - repeated submatches
+;; - negation
+;; - lookbehind and lookahead
+;; - match(&search?) backward
+;; - agrep
+
+;;; Notes
+
+
+
+;; Search
+;; ------
+
+;; To turn a match into a search, the basic idea is to use ".*RE" to get
+;; a search-DFA as opposed to the match-DFA generated from "RE".
+
+;; Search in Plan9's regexp library is done as follows: match ".*RE" until
+;; reaching the first match and then continue with only "RE". The first
+;; ".*RE" match corresponds to a search success for the leftmost shortest
+;; match. If we want the longest match, we need to continue. But if we
+;; continue with ".*RE" then we have no idea when to stop, so we should only
+;; continue with "RE".
+;; Downside: we may still match things after the "leftmost longest" match,
+;; but hopefully will stop soon after. I.e. we may look at chars past the
+;; end of the leftmost longest match, but hopefully not too many.
+
+;; Alternatives:
+;; - Like emacs/src/regexp.c, we can just start a match at every buffer
+;; position. Advantage: no need for submatch info in order to find
+;; (match-beginning 0), no need for a separate search-DFA.
+;; Downsize: O(N^2) rather than O(N). But it's no worse than what we live
+;; with for decades in src/regexp.c.
+;;
+;; - After the shortest-search, stop the search and do a longest-match
+;; starting at position (match-beginning 0). The good thing is that we
+;; will not look at any char further than needed. Also we don't need to
+;; figure out how to switch from ".*RE" to "RE" in the middle of the search.
+;; The downside is that we end up looking twice at the chars common to the
+;; shortest and longest matches. Also this doesn't work: the shortest
+;; match may not be the leftmost match, so we can't just start the match
+;; at (match-beginning 0).
+;;
+;; - Generate a specialized search&match-DFA which encodes the job done by
+;; Plan9's regexp library. I.e. do a specialized merge on
+;; (or LEXER (anything . LEXER)) where whenever we get a `stop' we don't
+;; merge any more. After matching such a lexer, we still have to figure
+;; which of the matches we had is the leftmost longest match, of course.
+;; Actually, it's not that easy: the tail of a `stop' in the match-DFA can
+;; only match things whose (match-beginning 0) may be the same as the one
+;; of the `stop', whereas we also want to accept longer matches that start
+;; before (match-beginning 0). So we want to keep merging on the tail of
+;; `stop' nodes, but only "partially" (whatever that means).
+
+;; - Better yet, do what TRE does: after the shortest-search, use the
+;; submatch data to figure out the NFA states (corresponding to the
+;; current search-DFA state) which are only reachable from later starting
+;; positions than (match-beginning 0), remove them and figure out from
+;; that the match-DFA state to which to switch. Problem is: there might
+;; not be any such state in the match-DFA.
+;;
+;; - In the end I do a mix of the last 2: .*?RE
+;; This uses the `orelse' merge operator, which contrary to `or' only
+;; matches the righthand side when the lefthand side fails to match.
+;; It turns out to be fairly simple to implement, and is optimal.
+;;
+;; Lookahead
+;; ---------
+
+;; I suspect that the (?=<RE>) lookahead can be encoded using something like
+;; `andalso'. Of course, it can also trivially be encoded as a predicate,
+;; but then we get an O(N^2) complexity.
+
+;; Merging operators.
+;; ------------------
+
+;; The NFA merging operators (or, and, orelse) seem to work fine on their own,
+;; but I'm not convinced they always DTRT when combined. It's not even
+;; clear that the NFA->DFA conversion terminates in all such cases.
+
+;; Intersection
+;; ------------
+
+;; Implementing the `inter' regexp operator turns out to be more difficult
+;; than it seemed. The problem is basically in the `join'. Each `and' has
+;; to have its own matching `join', but preserving this invariant is
+;; tricky. Among other things, we cannot flatten nested `and's like we do
+;; for `or's and `orelse's.
+
+;; Submatch info
+;; -------------
+
+;; Keeping track of submatch info with a DFA is tricky business and can slow
+;; down the matcher or make it use algorithmically more memory
+;; (e.g. O(textsize)). Here are some approaches:
+
+;; - Reproduce what an NFA matcher would do: when compiling the DFA, keep
+;; track of the NFA nodes corresponding to each DFA node, and for every
+;; transition, check the mapping between "incoming NFA nodes" and
+;; "outgoing NFA nodes" to maintain the list of submatch-info (one element
+;; per NFA node).
+
+;; - Keep a log of the states traversed during matching, so at the end it
+;; can be used to reproduce the parse tree or submatch info, based on
+;; auxiliary tables constructed during the DFA construction.
+
+;; - Some submatch info can be maintained cheaply: basically a submatch
+;; position can be represented by a single global variable in the case
+;; where we have the following property: every ε transition in the NFA
+;; which corresponds to this submatch point has the following property:
+;; no other ε transition for this same submatch can be traversed between
+;; the text position where this transition is traversed and the position
+;; where the target NFA subgraph fails to match.
+
+;;
+
+;;; Code:
+
+(eval-when-compile (require 'cl-lib))
+
+(defun copy-char-table (ct1)
+ (let* ((subtype (char-table-subtype ct1))
+ (ct2 (make-char-table subtype)))
+ (map-char-table (lambda (c v) (set-char-table-range ct2 c v)) ct1)
+ (dotimes (i (or (get subtype 'char-table-extra-slots) 0))
+ (set-char-table-extra-slot ct2 i (char-table-extra-slot ct1 i)))
+ ct2))
+
+(defun lex--char-table->alist (ct)
+ (let ((res ()))
+ (map-char-table (lambda (k v)
+ (push (cons (if (consp k)
+ ;; If k is a cons cell, we have to
+ ;; copy it because map-char-table
+ ;; reuses it.
+ (cons (car k) (cdr k))
+ ;; Otherwise, create a trivial cons-cell
+ ;; so we have fewer cases to handle.
+ (cons k k))
+ v)
+ res))
+ ct)
+ res))
+
+(defun lex--merge-into (op al1 al2 ct)
+ (cl-assert (memq op '(and or orelse)))
+ ;; We assume that map-char-table calls its function with increasing
+ ;; `key' arguments.
+ (while (and al1 al2)
+ (let ((k1 (caar al1)) (k2 (caar al2)))
+ (cond
+ ;; Perfect overlap.
+ ((equal k1 k2)
+ (set-char-table-range ct k1
+ (lex--merge op (cdr (pop al1)) (cdr (pop al2)))))
+ ;; k1 strictly greater than k2.
+ ((and (consp k1) (consp k2) (> (car k1) (cdr k2)))
+ (let ((v (cdr (pop al1))))
+ (if (not (eq op 'and)) (set-char-table-range ct k1 v))))
+ ;; k2 strictly greater than k1.
+ ((and (consp k1) (consp k2) (> (car k2) (cdr k1)))
+ (let ((v (cdr (pop al2))))
+ (if (not (eq op 'and)) (set-char-table-range ct k2 v))))
+ ;; There's partial overlap.
+ ((and (consp k1) (consp k2) (> (cdr k1) (cdr k2)))
+ (if (not (eq op 'and))
+ (set-char-table-range ct (cons (1+ (cdr k2)) (cdr k1)) (cdar al1)))
+ (setcdr k1 (cdr k2)))
+ ((and (consp k1) (consp k2) (< (cdr k1) (cdr k2)))
+ (if (not (eq op 'and))
+ (set-char-table-range ct (cons (1+ (cdr k1)) (cdr k2)) (cdar al2)))
+ (setcdr k2 (cdr k1)))
+ ;; Now the tails are equal.
+ ((and (consp k1) (consp k2) (> (car k1) (car k2)))
+ (set-char-table-range ct k1 (lex--merge op (cdr (pop al1)) (cdar al2)))
+ (setcdr k2 (1- (car k1))))
+ ((and (consp k1) (consp k2) (< (car k1) (car k2)))
+ (set-char-table-range ct k2 (lex--merge op (cdar al1) (cdr (pop al2))))
+ (setcdr k1 (1- (car k2))))
+ (t (cl-assert nil)))))
+ (if (not (eq op 'and))
+ (dolist (x (or al1 al2))
+ (set-char-table-range ct (car x) (cdr x))))
+ ct)
+
+(defvar lex--states)
+(defvar lex--memoize)
+
+(defun lex--set-eq (l1 l2)
+ (let ((len (length l2)))
+ (setq l2 (copy-sequence l2))
+ (while (consp l1)
+ (cl-assert (= len (length l2)))
+ (unless (> len
+ (setq len (length (setq l2 (delq (pop l1) l2)))))
+ (setq l1 t)))
+ (not l1)))
+
+(define-hash-table-test 'lex--set-eq 'lex--set-eq
+ (lambda (l)
+ (let ((hash 0))
+ (while l
+ (let ((x (pop l)))
+ (if (memq x l) (progn (debug) nil)
+ (setq hash (+ hash (sxhash x))))))
+ hash)))
+
+
+(defun lex--flatten-state (state)
+ (cl-assert (memq (car state) '(and or orelse)))
+ (let ((op (car state))
+ (todo (cdr state))
+ (done (list state))
+ (res nil))
+ (while todo
+ (setq state (pop todo))
+ (cond
+ ((null state) (if (eq op 'and) (setq res nil todo nil)))
+ ((memq state done) nil)
+ ((eq (car-safe state) op)
+ (push state done)
+ (setq todo (append (cdr state) todo)))
+ (t (unless (memq state res) (push state res)))))
+ (cons op (nreverse res))))
+
+(defun lex--merge-2 (op lex1 lex2)
+ (cl-assert (memq op '(and or orelse)))
+ ;; The order between lex1 and lex2 matters: preference is given to lex1.
+ (cond
+ ;; `lex1' and `lex2' might actually be the same when we use this code to
+ ;; cancel out the `and' and the `join' from lex--merge-and-join.
+ ;; ((eq lex1 lex2) (debug) lex1) ;CHECK: ruled out by `lex--flatten-state'?
+ ;; ((equal lex1 lex2) lex1) ;Stack overflow :-(
+
+ ;; Handle the 2 possible nil cases.
+ ;; CHECK: ruled out by `lex--flatten-state'?
+ ((null lex1) (debug) (if (eq op 'and) nil lex2))
+ ((null lex2) (debug) (if (eq op 'and) nil lex1))
+
+ ;; Do the predicate cases before the `stop' because the stop should
+ ;; always come after the checks.
+ ;; TODO: add optimizations for pairs of `checks' which are redundant,
+ ;; or mutually exclusive, ... although we can also do it in lex-optimize.
+ ((and (eq (car lex1) 'check) (eq (car lex2) 'check)
+ (equal (nth 1 lex1) (nth 1 lex2))) ; Same predicate.
+ (cl-list* 'check (nth 1 lex1)
+ (lex--merge op (nth 2 lex1) (nth 2 lex2))
+ (lex--merge op (nthcdr 3 lex1) (nthcdr 3 lex2))))
+ ((eq (car lex1) 'check)
+ (cl-list* 'check (nth 1 lex1)
+ (lex--merge op (nth 2 lex1) lex2)
+ (lex--merge op (nthcdr 3 lex1) lex2)))
+ ((eq (car lex2) 'check)
+ (cl-list* 'check (nth 1 lex2)
+ (lex--merge op lex1 (nth 2 lex2))
+ (lex--merge op lex1 (nthcdr 3 lex2))))
+
+ ;; Joins have the form (join CONT . EXIT) where EXIT is a lexer
+ ;; corresponding to the rest of the regexp after the `and' sub-regexp.
+ ;; All the joins corresponding to the same `and' have the same EXIT.
+ ;; CONT is a lexer that contains another join inside, it corresponds to
+ ;; the decision to not yet leave the `and'.
+ ((and (eq (car lex1) 'join) (eq (car lex2) 'join))
+ (cl-assert (eq (cddr lex1) (cddr lex2))) ;Check they're the same join.
+ (let ((in (lex--merge op (cadr lex1) (cadr lex2))))
+ (if (eq op 'and)
+ ;; Eliminate the join once it was all merged.
+ ;; FIXME: This arbitrarily chooses `or' instead of `orelse',
+ ;; and it arbitrarily gives CONT precedence over EXIT.
+ (lex--merge 'or in (cddr lex1))
+ `(join ,in ,@(cddr lex1)))))
+ ;; If one the two lex's is a join but the other not, the other must
+ ;; contain a corresponding join somewhere inside.
+ ((eq (car lex1) 'join)
+ (let ((next (lex--merge op (nth 1 lex1) lex2)))
+ ;; lex1 is a valid exit point but lex2 isn't.
+ (if (eq op 'and)
+ next
+ ;; FIXME: lex1 is implicitly an `or(else)' between (cadr lex1) and
+ ;; (cddr lex1). Here we construct an `or(else)' between `next' and
+ ;; (cddr lex1). I.e. we lose the `op' and we do not preserve the
+ ;; ordering between lex2 and (cddr lex1).
+ `(join ,next ,@(cddr lex1)))))
+ ((eq (car lex2) 'join)
+ (let ((next (lex--merge op lex1 (nth 1 lex2))))
+ (if (eq op 'and) next `(join ,next ,@(cddr lex2)))))
+
+ ;; The three `stop' cases.
+ ((and (eq (car lex1) 'stop) (eq (car lex2) 'stop))
+ ;; Here is where we give precedence to `lex1'.
+ (if (eq op 'orelse) lex1
+ (cl-list* 'stop (cadr lex1) (lex--merge op (cddr lex1) (cddr lex2)))))
+ ((eq (car lex1) 'stop)
+ (let ((next (lex--merge op (cddr lex1) lex2)))
+ (pcase op
+ (`or (cl-list* 'stop (cadr lex1) next))
+ (`orelse lex1)
+ ;; CHECK: We should have hit a `join' before reaching a `stop'.
+ (`and (debug) next)
+ (_ (error "lex.el: got %S but expected one of or/and/orelse"
+ op)))))
+ ((eq (car lex2) 'stop)
+ (let ((next (lex--merge op lex1 (cddr lex2))))
+ ;; For `orelse', we want here to delay the `stop' until the point
+ ;; where we know that lex1 doesn't match. Sadly, I don't know how to
+ ;; do it.
+ (pcase op
+ ;; FIXME: One thing we can do is to mark the value attached to the
+ ;; `stop' so as to indicate that an earlier match may finish later.
+ ;; This way, if the match is not `earlystop', we know it's one of
+ ;; the leftmost ones, and maybe the search loop can avoid some work
+ ;; when determining which is the leftmost longest match.
+ (`orelse (cl-list* 'stop `(earlystop ,(cadr lex2)) next))
+ ((or `or `orelse) (cl-list* 'stop (cadr lex2) next))
+ ;; CHECK: We should have hit a `join' before reaching a `stop'.
+ (`and (debug) next)
+ (_ (error "lex.el: got %S but expected one of or/and/orelse"
+ op)))))
+
+ ;; The most general case.
+ ((and (eq (car lex1) 'table) (eq (car lex2) 'table))
+ (let ((al1 (lex--char-table->alist (cdr lex1)))
+ (al2 (lex--char-table->alist (cdr lex2)))
+ (ct (make-char-table 'lexer)))
+ (lex--merge-into op al1 al2 ct)
+ (cons 'table ct)))
+
+ ((and (characterp (car lex1)) (characterp (car lex2))
+ (eq (car lex1) (car lex2)))
+ (cons (car lex1) (lex--merge op (cdr lex1) (cdr lex2))))
+ ((and (characterp (car lex1)) (characterp (car lex2)))
+ (unless (eq op 'and)
+ (let ((ct (make-char-table 'lexer)))
+ (aset ct (car lex1) (cdr lex1))
+ (aset ct (car lex2) (cdr lex2))
+ (cons 'table ct))))
+ ((and (characterp (car lex1)) (eq (car lex2) 'table))
+ (let ((next (lex--merge op (cdr lex1) (aref (cdr lex2) (car lex1)))))
+ (if (eq op 'and)
+ (if next (cons (car lex1) next))
+ (let ((ct (copy-sequence (cdr lex2))))
+ (aset ct (car lex1) next)
+ (cons 'table ct)))))
+ ((and (eq (car lex1) 'table) (characterp (car lex2)))
+ (let ((next (lex--merge op (aref (cdr lex1) (car lex2)) (cdr lex2))))
+ (if (eq op 'and)
+ (if next (cons (car lex2) next))
+ (let ((ct (copy-sequence (cdr lex1))))
+ (aset ct (car lex2) next)
+ (cons 'table ct)))))
+
+ ((or (memq (car lex1) '(or orelse and)) ;state
+ (memq (car lex2) '(or orelse and))) ;state
+ ;; `state' nodes are nodes whose content is not known yet, so we
+ ;; have to delay the merge via the memoization table.
+ ;; `or' and `and' nodes should only happen when the other `op' is being
+ ;; performed, in which case we can't do the merge either before lex1
+ ;; and lex2 have both been merged.
+ (lex--merge op lex1 lex2))
+ (t (cl-assert nil))))
+
+(defun lex--merge-now (&rest state)
+ (cl-assert (memq (car state) '(and or orelse)))
+ ;; Re-flatten, in case one of the sub-states was changed.
+ (setq state (lex--flatten-state state))
+ (if (<= (length state) 2)
+ (if (eq (car state) 'and)
+ ;; Need to strip out the `join's.
+ (lex--merge-and-join (cadr state))
+ (cadr state))
+ (let ((op (pop state))
+ (res (pop state)))
+ (dolist (lex state)
+ ;; CHECK: we fold the lexers using left-associativity.
+ ;; For `orelse', that means that `earlystop' never accumulates,
+ ;; whereas if we folded in a right-associative way, we could get
+ ;; some (earlystop (earlystop (earlystop V))). Not sure which one's
+ ;; preferable, so let's stick with what we have for now.
+ (setq res (lex--merge-2 op res lex)))
+ res)))
+
+(defun lex--merge-and-join (lex)
+ (lex--merge-2 'and lex lex))
+
+
+(defun lex--merge (&rest state)
+ (cl-assert (memq (car state) '(and or orelse)))
+ (setq state (lex--flatten-state state))
+ (if (and (<= (length state) 2)
+ (not (eq (car state) 'and)))
+ (cadr state)
+ (or (gethash state lex--memoize)
+ (progn
+ ;; (debug)
+ (cl-assert (memq (car state) '(and or orelse)))
+ (push state lex--states)
+ ;; The `state' node will be later on modified via setcar/setcdr,
+ ;; se be careful to use a copy of it for the key.
+ (puthash (cons (car state) (cdr state)) state lex--memoize)
+ state))))
+
+(defun lex--compile-category (category)
+ (if (and (integerp category) (< category 128))
+ category
+ (if (symbolp category)
+ (if (= 1 (length (symbol-name category)))
+ (aref (symbol-name category) 0)
+ (require 'rx)
+ (defvar rx-categories)
+ (cdr (assq category rx-categories))))))
+
+(defun lex--compile-syntax (&rest syntaxes)
+ (mapcar (lambda (x)
+ (if (and (integerp x) (< x 32)) x
+ (if (symbolp x)
+ (setq x (if (= 1 (length (symbol-name x)))
+ (symbol-name x)
+ (require 'rx)
+ (defvar rx-syntax)
+ (cdr (assq x rx-syntax)))))
+ (if (characterp x) (setq x (string x)))
+ (car (string-to-syntax x))))
+ syntaxes))
+
+(defconst lex--char-classes
+ `((alnum alpha digit)
+ (alpha word (?a . ?z) (?A . ?Z))
+ (blank ?\s ?\t)
+ (cntrl (?\0 . ?\C-_))
+ (digit (?0 . ?9))
+ ;; Include all multibyte chars, plus all the bytes except 128-159.
+ (graph (?! . ?~) multibyte (#x3fffa0 . #x3fffff))
+ ;; src/regexp.c handles case-folding inconsistently: lower and upper
+ ;; match both lower- and uppercase ascii chars, but lower also matches
+ ;; uppercase non-ascii chars whereas upper does not match lowercase
+ ;; nonascii chars. Here I simply ignore case-fold for [:lower:] and
+ ;; [:upper:] because it's simpler and doesn't seem worse.
+ (lower (check (lex--match-lower)))
+ (upper (check (lex--match-upper)))
+ (print graph ?\s)
+ (punct (check (not (lex--match-syntax . ,(lex--compile-syntax "w"))))
+ (?! . ?/) (?: . ?@) (?\[ . ?`) (?\{ . ?~))
+ (space (check (lex--match-syntax . ,(lex--compile-syntax " "))))
+ (xdigit digit (?a . ?f) (?A . ?F))
+ (ascii (?\0 . ?\177))
+ (nonascii (?\200 . #x3fffff))
+ (unibyte ascii (#x3fff00 . #x3fffff))
+ (multibyte (#x100 . #x3ffeff))
+ (word (check (lex--match-syntax . ,(lex--compile-syntax "w"))))
+ ;; `rx' alternative names.
+ (numeric digit)
+ (num digit)
+ (control cntrl)
+ (hex-digit xdigit)
+ (hex xdigit)
+ (graphic graph)
+ (printing print)
+ (alphanumeric alnum)
+ (letter alpha)
+ (alphabetic alpha)
+ (lower-case lower)
+ (upper-case upper)
+ (punctuation punct)
+ (whitespace space)
+ (white space))
+ "Definition of char classes.
+Each element has the form (CLASS . DEFINITION) where definition
+is a list of elements that can be either CHAR or (CHAR . CHAR),
+or CLASS (another char class) or (check (PREDICATE . ARG))
+or (check (not (PREDICATE . ARG))).")
+
+(defvar lex--char-equiv-table nil
+ "Equiv-case table to use to compile case-insensitive regexps.")
+
+(defun lex--char-equiv (char)
+ (when lex--char-equiv-table
+ (let ((chars ())
+ (tmp char))
+ (while (and (setq tmp (aref lex--char-equiv-table tmp))
+ (not (eq tmp char)))
+ (push tmp chars))
+ (if chars (cons char chars)))))
+
+
+(defun lex--nfa (re state)
+ (cl-assert state) ;If `state' is nil we can't match anyway.
+ (cond
+ ((characterp re)
+ (let ((chars (lex--char-equiv re)))
+ (if (null chars)
+ (cons re state)
+ (let ((ct (make-char-table 'lexer)))
+ (dolist (char chars) (aset ct char state))
+ (cons 'table ct)))))
+ ((stringp re)
+ (if (null lex--char-equiv-table)
+ ;; (Very) minor optimization.
+ (nconc (mapcar 'identity re) state)
+ (lex--nfa `(seq ,@(mapcar 'identity re)) state)))
+ (t
+ (pcase (or (car-safe re) re)
+ ((or `: `seq `sequence
+ ;; Hack!
+ `group)
+ (dolist (elem (reverse (cdr re)))
+ (setq state (lex--nfa elem state)))
+ state)
+ ((or `char `in `not-char)
+ (let ((chars (cdr re))
+ (checks nil)
+ (fail nil)
+ (char nil) ;The char seen, or nil if none, or t if more than one.
+ (ct (make-char-table 'lexer)))
+ (when (or (eq 'not (car chars)) (eq 'not-char (car re)))
+ (setq chars (cdr chars))
+ (set-char-table-range ct t state)
+ (setq fail state)
+ (setq state nil))
+ (while chars
+ (let ((range (pop chars)))
+ (cond
+ ((stringp range)
+ (setq chars (append (cdr (lex--parse-charset range)) chars)))
+ ((symbolp range)
+ (setq range (or (cdr (assq range lex--char-classes))
+ (error "Uknown char class `%s'" range)))
+ (setq chars (append range chars)))
+ ((and (consp range) (eq 'check (car range)))
+ (push (cadr range) checks))
+ (t
+ (setq char (if (or char (not (characterp range))
+ (and lex--char-equiv-table
+ (lex--char-equiv range)))
+ t range))
+ ;; Set the range, first, regardless of case-folding. This is
+ ;; important because case-tables like to be set with few
+ ;; large ranges rather than many small ones, as is done in
+ ;; the case-fold loop.
+ (set-char-table-range ct range state)
+ (when (and lex--char-equiv-table
+ ;; Avoid looping over all characters.
+ (not (equal range '(#x100 . #x3ffeff))))
+ ;; Add all the case-equiv chars.
+ (let ((i (if (consp range) (car range) range))
+ (max (if (consp range) (cdr range) range))
+ char)
+ (while (<= i max)
+ (setq char i)
+ (while (and (setq char (aref lex--char-equiv-table char))
+ (not (eq char i)))
+ (aset ct char state))
+ (setq i (1+ i)))))))))
+
+ (let ((res (if (or (eq char t) fail)
+ (cons 'table ct)
+ (if char (cons char state)))))
+ (if (and (not fail) checks)
+ (setq state (lex--nfa 'anything state)))
+ (dolist (check checks)
+ (setq res
+ (if fail
+ ;; We do an `and' of the negation of the check and res.
+ (if (eq (car-safe check) 'not)
+ (list 'check (cadr check) res)
+ (cl-list* 'check check nil res))
+ ;; An `or' of the check and res.
+ (if (eq (car-safe check) 'not)
+ (list 'check (cadr check) res state)
+ (cl-list* 'check check state res)))))
+ res)))
+
+ ((or `union `or `| `orelse)
+ (let ((newstate
+ (cons (if (eq (car re) 'orelse) 'orelse 'or)
+ (mapcar (lambda (re) (lex--nfa re state)) (cdr re)))))
+ (push newstate lex--states)
+ newstate))
+
+ ((or `inter `intersection `&)
+ (if (<= (length re) 2)
+ ;; Avoid constructing degenerate `and' nodes.
+ (lex--nfa (cadr re) state)
+ ;; Just using `and' is not enough because we have to enforce that the
+ ;; sub-regexps (rather than the whole regexp) match the same string.
+ ;; So we need to mark the juncture point.
+ (let* ((join `(join nil ,@state))
+ (newstate
+ `(and ,@(mapcar (lambda (re) (lex--nfa re join)) (cdr re)))))
+ (push newstate lex--states)
+ newstate)))
+
+ ((or `0+ `zero-or-more `* `*\?)
+ (let ((newstate (list 'state)))
+ (let ((lexer (lex--nfa (cons 'seq (cdr re)) newstate)))
+ (setcdr newstate (if (memq (car re) '(*\?))
+ (list state lexer)
+ (list lexer state))))
+ (setcar newstate (if (memq (car re) '(*\?)) 'orelse 'or))
+ (push newstate lex--states)
+ newstate))
+
+ ((or `string-end `eos `eot `buffer-end `eob)
+ `(check (lex--match-eobp) ,state))
+ ((or `string-start `bos `bot `buffer-start `bob)
+ `(check (lex--match-bobp) ,state))
+ ((or `line-end `eol) `(check (lex--match-eolp) ,state))
+ ((or `line-start `bol) `(check (lex--match-bolp) ,state))
+ ((or `word-start `bow) `(check (lex--match-bowp) ,state))
+ ((or `word-end `eow) `(check (lex--match-eowp) ,state))
+ (`symbol-start `(check (lex--match-bosp) ,state))
+ (`symbol-end `(check (lex--match-eosp) ,state))
+ (`not-word-boundary `(check (lex--match-not-word-boundary) ,state))
+ (`word-boundary `(check (lex--match-not-word-boundary) nil . ,state))
+ (`syntax `(check (lex--match-syntax
+ . ,(apply 'lex--compile-syntax (cdr re)))
+ ,(lex--nfa 'anything state)))
+ (`not-syntax `(check (lex--match-syntax
+ . ,(apply 'lex--compile-syntax (cdr re)))
+ nil . ,(lex--nfa 'anything state)))
+ (`category `(check (lex--match-category
+ . ,(lex--compile-category (cadr re)))
+ ,(lex--nfa 'anything state)))
+ (`not-category `(check (lex--match-category
+ . ,(lex--compile-category (cadr re)))
+ nil . ,(lex--nfa 'anything state)))
+
+ ;; `rx' accepts char-classes directly as regexps. Let's reluctantly
+ ;; do the same.
+ ((or `digit `numeric `num `control `cntrl `hex-digit `hex `xdigit `blank
+ `graphic `graph `printing `print `alphanumeric `alnum `letter
+ `alphabetic `alpha `ascii `nonascii `lower `lower-case `upper
+ `upper-case `punctuation `punct `space `whitespace `white)
+ (lex--nfa `(char ,re) state))
+
+ (`case-sensitive
+ (let ((lex--char-equiv-table nil))
+ (lex--nfa `(seq ,@(cdr re)) state)))
+
+ (`case-fold
+ (let ((lex--char-equiv-table (get-eqvcase-table (current-case-table))))
+ (lex--nfa `(seq ,@(cdr re)) state)))
+
+ ((or `point
+ ;; Sub groups!
+ `submatch `group `backref
+ ;; Greediness control
+ `minimal-match `maximal-match)
+ (error "`%s' Not implemented" (or (car-safe re) re)))
+
+ ((or `not-newline `nonl `dot) (lex--nfa '(char not ?\n) state))
+ (`anything (lex--nfa '(char not) state))
+ ((or `word `wordchar) (lex--nfa '(syntax w) state))
+ (`not-wordchar (lex--nfa '(not-syntax w) state))
+
+ (`any
+ ;; `rx' uses it for (char ...) sets, and sregex uses it for `dot'.
+ (lex--nfa (if (consp re) (cons 'char (cdr re)) '(char not ?\n)) state))
+
+ (`negate
+ ;; We could define negation directly on regexps, but it's easier to
+ ;; do it on NFAs since those have fewer cases to deal with.
+ (let ((posnfa
+ ;; Trow away the mergable states generated while computing the
+ ;; posnfa, since it's only an intermediate datastructure.
+ (let (lex--states)
+ (lex--nfa `(seq ,@(cdr re)) '(stop negate)))))
+ (lex-negate posnfa state)))
+
+ (`not
+ ;; The `not' as used in `rx' should be deprecated so we can make it
+ ;; an alias for `negate', whose semantics is different. E.g.
+ ;; (negate (char ...)) matches the empty string and 2-char strings.
+ (setq re (cadr re))
+ (pcase (or (car-safe re) re)
+ (`word-boundary
+ (message "`not' deprecated: use not-word-boundary")
+ (lex--nfa 'not-word-boundary state))
+ ((or `any `in `char)
+ (message "`not' deprecated: use (%s not ...)" (or (car-safe re) re))
+ (lex--nfa (cl-list* (car re) 'not (cdr re)) state))
+ ((or `category `syntax)
+ (message "`not' deprecated: use not-%s" (car re))
+ (lex--nfa (cons (intern (format "not-%s" (car re))) (cdr re)) state))
+ (elem (error "lex.el: unexpected argument `%S' to `not'." elem))))
+
+ (`and
+ ;; `rx' defined `and' as `sequence', but we may want to define it
+ ;; as intersection instead.
+ (error "`and' is deprecated, use `seq', `:', or `sequence' instead"))
+
+ ((or `1+ `one-or-more `+ `+\?)
+ (lex--nfa `(seq (seq ,@(cdr re))
+ (,(if (memq (car re) '(+\?)) '*\? '0+) ,@(cdr re)))
+ state))
+ ((or `opt `zero-or-one `optional `\?)
+ (lex--nfa `(or (seq ,@(cdr re)) "") state))
+ (`\?\?
+ (lex--nfa `(orelse "" (seq ,@(cdr re))) state))
+ ((or `repeat `** `=)
+ (let ((min (nth 1 re))
+ (max (nth 2 re))
+ (res (nthcdr 3 re)))
+ (unless res
+ (setq res (list max)) (setq max min))
+ (lex--nfa `(seq ,@(append (make-list (or min 0)
+ (if (eq (length res) 1)
+ (car res)
+ (cons 'seq res)))
+ (if (null max)
+ `((0+ ,@res))
+ (make-list (- max (or min 0))
+ `(opt ,@res)))))
+ state)))
+ (`>= (lex--nfa `(repeat ,(nth 1 re) nil ,@(nthcdr 2 re)) state))
+
+ ((or `bre `re `ere)
+ (lex--nfa (lex-parse-re (nth 1 re) (car re)) state))
+ (elem (error "lex.el: unknown RE element %S" elem))))))
+
+(defun lex--negate-inftail (state howmany)
+ ;; We hashcons the infinite tails and store them in the memoize table.
+ ;; This is an abuse, but saves us from passing it around as an
+ ;; extra argument.
+ (let ((inftail-1+ (gethash state lex--memoize)))
+ (unless inftail-1+
+ ;; Precompute the final infinitely repeating tail.
+ (setq inftail-1+ `(table . ,(make-char-table 'lexer)))
+ (set-char-table-range (cdr inftail-1+) t `(or ,state ,inftail-1+))
+ (push (aref (cdr inftail-1+) 0) lex--states)
+ (puthash state inftail-1+ lex--memoize))
+ (pcase howmany
+ (`1+ inftail-1+)
+ (`0+ (aref (cdr inftail-1+) 0))
+ (_ (error "lex.el: howmany is `%S' instead of one of 1+/0+" howmany)))))
+
+(defun lex--negate-now (nfa state)
+ (pcase (car nfa)
+ (`nil (lex--negate-inftail state '0+))
+ (`check
+ `(check ,(nth 1 nfa) ,(lex--negate-memo (nth 2 nfa) state)
+ ,@(lex--negate-memo (nthcdr 3 nfa) state)))
+ (`stop
+ (if (cddr nfa)
+ ;; This is valid but should normally not happen.
+ (lex--negate-now `(or (stop ,(cadr nfa)) ,(cddr nfa)) state)
+ (lex--negate-inftail state '1+)))
+
+ ((or `or `orelse)
+ (let ((join `(join nil . ,state)))
+ `(and ,@(mapcar (lambda (nfa) (lex--negate-memo nfa join)) (cdr nfa)))))
+
+ (`and
+ `(or ,@(mapcar (lambda (nfa) (lex--negate-memo nfa state)) (cdr nfa))))
+
+ (`join
+ ;; The join says: either exit the `and' because we matched all branches,
+ ;; or keep matching further. Negation makes the synchrony between
+ ;; `and' branches irrelevant, so we can consider it as an `or(else)'.
+ (if (cadr nfa)
+ ;; This is valid but should normally not happen.
+ (lex--negate-now `(or ,(cadr nfa) ,(cddr nfa)) state)
+ (lex-negate (cddr nfa) state)))
+ (_
+ (let ((ct (make-char-table 'lexer)))
+ ;; Get inftail-0+ from the hashtable.
+ (set-char-table-range ct t (lex--negate-inftail state '0+))
+ (if (characterp (car nfa))
+ (aset ct (car nfa) (lex--negate-memo (cdr nfa) state))
+ (cl-assert (eq 'table (car nfa)))
+ (map-char-table (lambda (range nfa)
+ (set-char-table-range ct range
+ (lex--negate-memo nfa state)))
+ (cdr nfa)))
+ `(or ,state (table ,@ct))))))
+
+(defun lex--negate-memo (nfa state)
+ ;; Make sure our `inftail' abuse of the hastable doesn't break anything.
+ (cl-assert (not (eq nfa state)))
+ (or (gethash nfa lex--memoize)
+ (let ((newstate (cons 'state nil)))
+ (puthash nfa newstate lex--memoize)
+ (let ((res (lex--negate-now nfa state)))
+ (when (memq (car res) '(or and orelse))
+ (push newstate lex--states))
+ (if (null res)
+ (setq res '(?a))
+ (setcar newstate (car res))
+ (setcdr newstate (cdr res))
+ newstate)))))
+
+(defun lex-negate (nfa state)
+ "Concatenate the negation of NFA with STATE.
+Returns a new NFA."
+ (let ((lex--memoize (make-hash-table :test 'eq)))
+ (lex--negate-memo nfa state)))
+
+(defun lex--dfa-wrapper (f)
+ (let* ((lex--states ())
+ (res (funcall f))
+ (postponed ())
+ (lex--memoize (make-hash-table :test 'lex--set-eq))
+ (states-dfa (make-hash-table :test 'eq)))
+
+ (while lex--states
+ (dolist (state (prog1 lex--states (setq lex--states nil)))
+ (let ((merged (apply 'lex--merge-now state)))
+ (if (memq (car merged) '(and or orelse))
+ ;; The merge could not be performed for some reason:
+ ;; let's re-schedule it.
+ (push state postponed)
+ (puthash state merged states-dfa))))
+
+ (unless lex--states
+ ;; If states-dfa is empty it means we haven't made any progress,
+ ;; so we're stuck in an infinite loop. Hopefully this cannot happen?
+ (cl-assert (not (zerop (hash-table-count states-dfa))))
+ (maphash (lambda (k v)
+ (unless v
+ ;; With `intersection', lex--merge may end up returning
+ ;; nil if the intersection is empty, so `v' can be
+ ;; nil here. In since `k' is necessarily a cons cell,
+ ;; we can't turn it into nil, so we turn it into
+ ;; a more costly lexer that also fails for all inputs.
+ (setq v '(?a)))
+ (setcar k (car v))
+ (setcdr k (cdr v)))
+ states-dfa)
+ (clrhash states-dfa)
+ (setq lex--states postponed)
+ (setq postponed nil)))
+
+ res))
+
+(defun lex-compile (alist)
+ (lex--dfa-wrapper
+ (lambda ()
+ (let* ((lex--char-equiv-table
+ (if case-fold-search (get-eqvcase-table (current-case-table))))
+ (newstate
+ `(or
+ ,@(mapcar (lambda (x) (lex--nfa (car x) (list 'stop (cdr x))))
+ alist))))
+ (push newstate lex--states)
+ newstate))))
+
+(defun lex-search-dfa (match-dfa)
+ ;; This constructs a search-DFA whose last match should be the leftmost
+ ;; longest match.
+ (lex--dfa-wrapper
+ (lambda ()
+ (lex--nfa '(*\? (char not)) match-dfa))))
+
+
+(defun lex--terminate-if (new old)
+ (cond
+ ((eq new t) t)
+ ((eq old t) t)
+ (t (while new (let ((x (pop new))) (if (not (memq x old)) (push x old))))
+ old)))
+
+(defun lex--optimize-1 (lexer)
+ (let ((terminate nil))
+ (cons
+ (pcase (car lexer)
+ (`table
+ (let ((ct (cdr lexer))
+ (char nil))
+ ;; Optimize each entry.
+ (map-char-table
+ (lambda (range v)
+ (let ((cell (lex--optimize v)))
+ (setq terminate (lex--terminate-if (cdr cell) terminate))
+ (set-char-table-range ct range (car cell))))
+ ct)
+ ;; Optimize the internal representation of the table.
+ (optimize-char-table (cdr lexer) 'eq)
+ ;; Eliminate the table if possible.
+ (map-char-table
+ (lambda (range v)
+ (setq char
+ (if (and (characterp range) (null char))
+ range t)))
+ ct)
+ (pcase char
+ (`nil nil)
+ (`t lexer)
+ (_ (setcar lexer 'char) (setcdr lexer (aref ct char)) lexer))))
+ (`stop
+ (let ((cell (lex--optimize (cddr lexer))))
+ (setq terminate t)
+ (setf (cddr lexer) (car cell)))
+ lexer)
+ (`check
+ (let* ((test (nth 1 lexer))
+ (cellf (lex--optimize (nthcdr 3 lexer)))
+ (fail (setf (nthcdr 3 lexer) (car cellf)))
+ (cells (lex--optimize (nth 2 lexer)))
+ (succ (setf (nth 2 lexer) (car cells))))
+ (setq terminate (lex--terminate-if (cdr cellf) terminate))
+ (setq terminate (lex--terminate-if (cdr cells) terminate))
+ ;; TODO: the check-optimizations below only work on consecutive
+ ;; pairs of checks. We need to be more agressive and make sure
+ ;; the optimized DFA never does twice the same test at the same
+ ;; position. Most importantly: don't do the same test in
+ ;; a tight loop as in "(^\<)*".
+ (when (eq 'check (car succ))
+ (cond
+ ((equal test (nth 1 succ)) ;Same successful test.
+ (setf (nth 2 lexer) (setq succ (nth 2 succ))))
+ ;; TODO: we can add rules such as bobp -> eolp,
+ ;; bosp -> bowp, (syntax X) -> (syntax Y X), ...
+ ))
+ (when (eq 'check (car fail))
+ (cond
+ ((equal test (nth 1 fail)) ;Same failing test.
+ (setf (nthcdr 3 lexer) (setq fail (nthcdr 3 succ))))
+ ;; TODO: we can add rules such as !eolp -> !bobp,
+ ;; !bowp -> !bosp, !(syntax Y X) -> !(syntax X), ...
+ ))
+ (if (or succ fail) lexer)))
+ (_
+ (cl-assert (characterp (car lexer)))
+ (let ((cell (lex--optimize (cdr lexer))))
+ (setq terminate (lex--terminate-if (cdr cell) terminate))
+ (if (setf (cdr lexer) (car cell))
+ lexer))))
+ (if (consp terminate)
+ (delq lexer terminate)
+ terminate))))
+
+(defun lex--optimize (lexer)
+ (when lexer
+ ;; The lex--memoize cache maps lexer states to (LEXER . TERMINATE) where
+ ;; TERMINATE is either t to say that LEXER can terminate or a list of
+ ;; lexers which means that LEXER terminates only if one of the lexers in
+ ;; the list terminates.
+ (let ((cache (gethash lexer lex--memoize)))
+ (if cache
+ ;; Optimize (char C) to nil.
+ (if (and (characterp (caar cache)) (null (cdar cache))) nil cache)
+ ;; Store a value indicating that we're in the process of computing it,
+ ;; so when we encounter a loop, we don't recurse indefinitely.
+ ;; Not knowing any better, we start by stating the tautology that
+ ;; `lexer' terminates if and only if `lexer' terminates.
+ (let ((cell (cons lexer (list lexer))))
+ (puthash lexer cell lex--memoize)
+ (let ((res (lex--optimize-1 lexer)))
+ (if (and (car res) (cdr res))
+ res
+ (setcar lexer ?a)
+ (setcdr lexer nil)
+ (puthash lexer '(nil) lex--memoize)
+ nil)))))))
+
+(defun lex-optimize (lexer)
+ (let ((lex--memoize (make-hash-table :test 'eq)))
+ (prog1 (car (lex--optimize lexer))
+ (message "Visited %d states" (hash-table-count lex--memoize)))))
+
+(defmacro lex-case (object posvar &rest cases)
+ (declare (indent 2))
+ (let* ((i -1)
+ (alist (mapcar (lambda (case) (cons (car case) (cl-incf i))) cases))
+ (lex (lex-compile alist))
+ (tmpsym (make-symbol "tmp")))
+ (setq i -1)
+ `(let ((,tmpsym (lex-match-string ',lex ,object ,posvar)))
+ (pcase (car ,tmpsym)
+ ,@(mapcar (lambda (case)
+ `(,(cl-incf i)
+ (set-match-data
+ (list ,posvar (setq ,posvar (cadr ,tmpsym))))
+ ,@(cdr case)))
+ cases)))))
+
+;;; Matching engine
+
+(defun lex--match-bobp (arg pos &optional string)
+ (= pos (if string 0 (point-min))))
+
+(defun lex--match-eobp (arg pos &optional string)
+ (= pos (if string (length string) (point-max))))
+
+(defun lex--match-bolp (arg pos &optional string)
+ (if string (or (= pos 0) (eq ?\n (aref string (1- pos))))
+ (memq (char-before pos) '(nil ?\n))))
+
+(defun lex--match-eolp (arg pos &optional string)
+ (if string (or (= pos (length string)) (eq ?\n (aref string pos)))
+ (memq (char-after pos) '(nil ?\n))))
+
+(defun lex--match-bowp (arg pos &optional string)
+ (and (not (if string (and (> pos 0)
+ (eq ?w (char-syntax (aref string (1- pos)))))
+ (and (> pos (point-min)) (eq 2 (car (syntax-after (1- pos)))))))
+ (if string (and (< pos (length string))
+ (eq ?w (char-syntax (aref string pos))))
+ (eq 2 (car (syntax-after pos))))))
+
+(defun lex--match-eowp (arg pos &optional string)
+ (and (if string (and (> pos 0)
+ (eq ?w (char-syntax (aref string (1- pos)))))
+ (and (> pos (point-min)) (eq 2 (car (syntax-after (1- pos))))))
+ (not (if string (and (< pos (length string))
+ (eq ?w (char-syntax (aref string pos))))
+ (eq 2 (car (syntax-after pos)))))))
+
+(defun lex--match-bosp (arg pos &optional string)
+ (and (not (if string
+ (and (> pos 0)
+ (memq (char-syntax (aref string (1- pos))) '(?w ?_)))
+ (and (> pos (point-min))
+ (memq (car (syntax-after (1- pos))) '(2 3)))))
+ (if string (and (< pos (length string))
+ (memq (char-syntax (aref string pos)) '(?w ?_)))
+ (memq (car (syntax-after pos)) '(2 3)))))
+
+(defun lex--match-eosp (arg pos &optional string)
+ (and (if string (and (> pos 0)
+ (memq (char-syntax (aref string (1- pos))) '(?w ?_)))
+ (and (> pos (point-min)) (memq (car (syntax-after (1- pos))) '(2 3))))
+ (not (if string (and (< pos (length string))
+ (memq (char-syntax (aref string pos)) '(?w ?_)))
+ (memq (car (syntax-after pos)) '(2 3))))))
+
+(defun lex--match-not-word-boundary (arg pos &optional string)
+ (eq (if string (and (> pos 0)
+ (eq ?w (char-syntax (aref string (1- pos)))))
+ (and (> pos (point-min)) (eq 2 (car (syntax-after (1- pos))))))
+ (if string (and (< pos (length string))
+ (eq ?w (char-syntax (aref string pos))))
+ (eq 2 (car (syntax-after pos))))))
+
+(defun lex--match-upper (arg pos &optional string)
+ (when (< pos (if string (length string) (point-max)))
+ (let ((char (if string (aref string pos) (char-after pos))))
+ (not (eq (downcase char) char)))))
+
+(defun lex--match-lower (arg pos &optional string)
+ (when (< pos (if string (length string) (point-max)))
+ (let ((char (if string (aref string pos) (char-after pos))))
+ (not (eq (upcase char) char)))))
+
+
+(defun lex--match-category (category pos &optional string)
+ (when (< pos (if string (length string) (point-max)))
+ (aref (char-category-set (if string (aref string pos)
+ (char-after pos)))
+ category)))
+
+(defun lex--match-syntax (syntaxes pos &optional string)
+ (when (< pos (if string (length string) (point-max)))
+ (memq (car (if string (aref (syntax-table) (aref string pos))
+ (syntax-after pos)))
+ syntaxes)))
+
+
+(defun lex-match-string (lex string &optional start stop)
+ "Match LEX against STRING between START and STOP.
+Return a triplet (VALUE ENDPOS . LEXER) where VALUE is the
+value of returned by the lexer for the match found (or nil), ENDPOS
+is the end position of the match found (or nil), and LEXER is the
+state of the engine at STOP, which can be passed back to
+`lex-match-string' to continue the match elsewhere."
+ ;; FIXME: Move this to C.
+ (unless start (setq start 0))
+ (unless stop (setq stop (length string)))
+ (let ((match (list nil nil))
+ (lastlex lex))
+ (while
+ (progn
+ (while (eq (car lex) 'check)
+ (setq lex (if (funcall (car (nth 1 lex)) (cdr (nth 1 lex))
+ start string)
+ (nth 2 lex) (nthcdr 3 lex))))
+ (when (eq (car lex) 'stop)
+ ;; Don't stop yet, we're looking for the longest match.
+ (setq match (list (cadr lex) start))
+ (message "Found match: %s" match)
+ (setq lex (cddr lex)))
+ (cl-assert (not (eq (car lex) 'stop)))
+ (and lex (< start stop)))
+ (let ((c (aref string start)))
+ (setq start (1+ start))
+ (setq lex (cond
+ ((eq (car lex) 'table) (aref (cdr lex) c))
+ ((integerp (car lex)) (if (eq c (car lex)) (cdr lex)))))
+ (setq lastlex lex)))
+ (message "Final search pos considered: %s" start)
+ ;; The difference between `lex' and `lastlex' is basically that `lex'
+ ;; may depend on data after `stop' (if there was an `end-of-file' or
+ ;; `word-boundary' or basically any `check'). So let's return `lastlex'
+ ;; so it can be correctly used to continue the match with a different
+ ;; content than what's after `stop'.
+ (nconc match lastlex)))
+
+(defun lex-match-string-first (lex string &optional start stop)
+ "Match LEX against STRING between START and STOP.
+Return a triplet (VALUE ENDPOS . LEXER) where VALUE is the
+value of returned by the lexer for the match found (or nil), ENDPOS
+is the end position of the match found (or nil), and LEXER is the
+state of the engine at STOP, which can be passed back to
+`lex-match-string' to continue the match elsewhere."
+ ;; FIXME: Move this to C.
+ (unless start (setq start 0))
+ (unless stop (setq stop (length string)))
+ (let ((match (list nil nil))
+ (lastlex lex))
+ (catch 'found
+ (while
+ (progn
+ (while (eq (car lex) 'check)
+ (setq lex (if (funcall (car (nth 1 lex)) (cdr (nth 1 lex))
+ start string)
+ (nth 2 lex) (nthcdr 3 lex))))
+ (when (eq (car lex) 'stop)
+ (throw 'found (cl-list* (cadr lex) start (cddr lex))))
+ (cl-assert (not (eq (car lex) 'stop)))
+ (and (not match) lex (< start stop)))
+ (let ((c (aref string start)))
+ (setq start (1+ start))
+ (setq lex (cond
+ ((eq (car lex) 'table) (aref (cdr lex) c))
+ ((integerp (car lex)) (if (eq c (car lex)) (cdr lex)))))
+ (setq lastlex lex)))
+ ;; The difference between `lex' and `lastlex' is basically that `lex'
+ ;; may depend on data after `stop' (if there was an `end-of-file' or
+ ;; `word-boundary' or basically any `check'). So let's return `lastlex'
+ ;; so it can be correctly used to continue the match with a different
+ ;; content than what's after `stop'.
+ (cl-list* nil start lastlex))))
+
+(defun lex-match-buffer (lex &optional stop)
+ "Match LEX against buffer between point and STOP.
+Return a triplet (VALUE ENDPOS . LEXER) where VALUE is the
+value of returned by the lexer for the match found (or nil), ENDPOS
+is the end position of the match found (or nil), and LEXER is the
+state of the engine at STOP, which can be passed back to
+continue the match elsewhere."
+ ;; FIXME: Move this to C.
+ (unless stop (setq stop (point-max)))
+ (let ((start (point))
+ (match (list nil nil))
+ (lastlex lex))
+ (while
+ (progn
+ (while (eq (car lex) 'check)
+ (setq lex (if (funcall (car (nth 1 lex)) (cdr (nth 1 lex))
+ start)
+ (nth 2 lex) (nthcdr 3 lex))))
+ (when (eq (car lex) 'stop)
+ ;; Don't stop yet, we're looking for the longest match.
+ (setq match (list (cadr lex) start))
+ (message "Found match: %s" match)
+ (setq lex (cddr lex)))
+ (cl-assert (not (eq (car lex) 'stop)))
+ (and lex (< start stop)))
+ (let ((c (char-after start)))
+ (setq start (1+ start))
+ (setq lex (cond
+ ((eq (car lex) 'table) (aref (cdr lex) c))
+ ((integerp (car lex)) (if (eq c (car lex)) (cdr lex)))))
+ (setq lastlex lex)))
+ (message "Final search pos considered: %s" start)
+ ;; The difference between `lex' and `lastlex' is basically that `lex'
+ ;; may depend on data after `stop' (if there was an `end-of-file' or
+ ;; `word-boundary' or basically any `check'). So let's return `lastlex'
+ ;; so it can be correctly used to continue the match with a different
+ ;; content than what's after `stop'.
+ (nconc match lastlex)))
+
+;;; Regexp parsers.
+
+(defun lex--tokenizer (lex string)
+ (let ((tokens ())
+ (i 0)
+ tmp)
+ (while (and (< i (length string))
+ (setq tmp (lex-match-string lex string i)))
+ (push (cons (car tmp) (substring string i (setq i (cadr tmp)))) tokens))
+ (nreverse tokens)))
+
+(eval-when-compile
+ (unless (fboundp 'lex-compile) (load "lex" 'noerror 'nomessage)))
+
+(defun lex--parse-charset (string)
+ (let ((i 0)
+ (ranges ()))
+ (when (eq (aref string i) ?^)
+ (push 'not ranges)
+ (setq i (1+ i)))
+ (let ((op nil)
+ (case-fold-search nil))
+ (while (not (eq op 'stop))
+ (lex-case string i
+ ((seq "[:" (0+ (char (?a . ?z) (?A . ?Z))) ":]")
+ (push (intern (substring string (+ 2 (match-beginning 0))
+ (- (match-end 0) 2)))
+ ranges))
+ ((seq anything "-" anything)
+ (push (cons (aref string (match-beginning 0))
+ (aref string (1- (match-end 0))))
+ ranges))
+ (anything (push (aref string (1- (match-end 0))) ranges))
+ (eob (setq op 'stop))))
+
+ `(char ,@(nreverse ranges)))))
+
+(defconst lex--parse-re-lexspec
+ '(((or "*" "+" "?" "*?" "+?" "??") . suffix)
+ ((seq "[" (opt "^") (opt "]")
+ (0+ (or (seq (char not ?\]) "-" (char not ?\]))
+ (seq "[:" (1+ (char (?a . ?z) (?A . ?Z))) ":]")
+ (char not ?\]))) "]") . charset)
+ ((seq "\\c" anything) . category)
+ ((seq "\\C" anything) . not-category)
+ ((seq "\\s" anything) . syntax)
+ ((seq "\\S" anything) . not-syntax)
+ ((seq "\\" (char (?1 . ?9))) . backref)
+ ("\\'" . eob)
+ ("\\`" . bob)
+ ("." . dot)
+ ("^" . bol)
+ ("$" . eol)
+ ("." . dot)
+ ("\\<" . bow)
+ ("\\>" . eow)
+ ("\\_<" . symbol-start)
+ ("\\_>" . symbol-end)
+ ("\\w" . wordchar)
+ ("\\W" . not-wordchar)
+ ("\\b" . word-boundary)
+ ("\\B" . not-word-boundary)
+ ("\\=" . point)
+ ((or (seq ?\\ anything) anything) . char)))
+
+
+(defconst lex--parse-ere-lexer
+ (let ((case-fold-search nil))
+ (lex-compile
+ (append '(("(?:" . shy-group)
+ ("|" . or)
+ ((seq "{" (0+ (char (?0 . ?9)))
+ (opt (seq "," (0+ (char (?0 . ?9))))) "}") . repeat)
+ ((or ")" eob) . stop)
+ ("(" . group))
+ lex--parse-re-lexspec))))
+
+(defconst lex--parse-bre-lexer
+ (let ((case-fold-search nil))
+ (lex-compile
+ (append '(("\\(?:" . shy-group)
+ ("\\|" . or)
+ ((seq "\\{" (0+ (char (?0 . ?9)))
+ (opt (seq "," (0+ (char (?0 . ?9))))) "\\}") . repeat)
+ ((or "\\)" eob) . stop)
+ ("\\(" . group))
+ lex--parse-re-lexspec))))
+
+(defun lex--parse-re (string i lexer)
+ (let ((stack ())
+ (op nil)
+ (res nil)
+ tmp)
+ (while (and (not (eq op 'stop))
+ (setq tmp (lex-match-string lexer string i)))
+ (pcase (car tmp)
+ (`shy-group
+ (setq tmp (lex--parse-re string (cadr tmp) lexer))
+ (unless (eq (aref string (1- (cadr tmp))) ?\))
+ (error "Unclosed shy-group"))
+ (push (car tmp) res))
+ (`group
+ (setq tmp (lex--parse-re string (cadr tmp) lexer))
+ (unless (eq (aref string (1- (cadr tmp))) ?\))
+ (error "Unclosed group"))
+ (push (list 'group (car tmp)) res))
+ (`suffix
+ (if (null res) (error "Non-prefixed suffix operator")
+ (setq res (cons (list (cdr (assoc (substring string i (cadr tmp))
+ '(("*" . 0+)
+ ("+" . 1+)
+ ("?" . opt)
+ ("*?" . *\?)
+ ("+?" . +\?)
+ ("??" . \?\?))))
+ (car res))
+ (cdr res)))))
+ (`or (push `(or (seq ,@(nreverse res))) stack)
+ (setq res nil))
+ (`charset
+ (push (lex--parse-charset (substring string (1+ i) (1- (cadr tmp))))
+ res))
+ (`repeat
+ ;; Here we would like to have sub-matches :-(
+ (let* ((min (string-to-number
+ (substring string (+ i (if (eq (aref string i) ?\\) 2 1))
+ (cadr tmp))))
+ (max (let ((comma (string-match "," string i)))
+ (if (not (and comma (< comma (cadr tmp))))
+ min
+ (if (= comma (- (cadr tmp) 2))
+ nil
+ (string-to-number (substring string (1+
comma))))))))
+ (if (null res) (error "Non-prefixed repeat operator")
+ (setq res (cons `(repeat ,min ,max ,(car res)) (cdr res))))))
+ (`stop (setq op 'stop))
+ ((or `syntax `category `not-syntax `not-category)
+ (push (list (car tmp) (aref string (1- (cadr tmp)))) res))
+ (`backref
+ (push (list (car tmp) (- (aref string (1- (cadr tmp))) ?0)) res))
+ (`char
+ (push (aref string (1- (cadr tmp))) res))
+ (_ (push (car tmp) res)))
+ (setq i (cadr tmp)))
+ (let ((re `(seq ,@(nreverse res))))
+ (while stack (setq re (nconc (pop stack) (list re))))
+ (list re i))))
+
+(defun lex-parse-re (string &optional lexer)
+ (setq lexer (cond ((eq lexer 'ere) lex--parse-ere-lexer)
+ ((memq lexer '(bre re nil)) lex--parse-bre-lexer)
+ (t lexer)))
+ (let ((res (lex--parse-re string 0 lexer)))
+ (if (< (cadr res) (length string))
+ (error "Regexp parsing failed around %d: ...%s..."
+ (cadr res) (substring string (1- (cadr res)) (1+ (cadr res))))
+ (car res))))
+
+
+;; (defun lex--parse-re (string i)
+;; (let ((stack ())
+;; (op nil)
+;; (res nil))
+;; (while (and (not (eq op 'stop)))
+;; (lex-case string i
+;; ("(?:" ;shy-group.
+;; (let ((tmp (lex--parse-re string i)))
+;; (setq i (car tmp))
+;; (unless (eq (aref string (1- i)) ?\)) (error "Unclosed
shy-group"))
+;; (push (cdr tmp) res)))
+;; ((or "*?" "+?" "??")
+;; (error "Greediness control unsupported `%s'" (match-string 0
string)))
+;; ((or "*" "+" "?")
+;; (if (null res) (error "Non-prefixed suffix operator")
+;; (setq res (cons (list (cdr (assq (aref string (1- i))
+;; '((?* . 0+)
+;; (?+ . 1+)
+;; (?? . opt))))
+;; (car res))
+;; (cdr res)))))
+;; ("|" (push `(or (seq ,@(nreverse res))) stack)
+;; (setq res nil))
+;; ((seq "[" (opt "^") (opt "]")
+;; (0+ (or (seq (char not ?\]) "-" (char not ?\]))
+;; (seq "[:" (1+ (char (?a . ?z) (?A . ?Z))) ":]")
+;; (char not ?\]))) "]")
+;; (push (lex--parse-charset
+;; (substring string (1+ (match-beginning 0))
+;; (1- (match-end 0))))
+;; res))
+;; ((seq "{" (0+ (char (?0 . ?9)))
+;; (opt (seq "," (0+ (char (?0 . ?9))))) "}")
+;; ;; Here we would like to have sub-matches :-(
+;; (let* ((min (string-to-number (substring string
+;; (1+ (match-beginning 0))
+;; (match-end 0))))
+;; (max (let ((comma (string-match "," string (match-beginning
0))))
+;; (if (not (and comma (< comma (match-end 0))))
+;; min
+;; (if (= comma (- (match-end 0) 2))
+;; nil
+;; (string-to-number (substring string (1+
comma))))))))
+;; (if (null res) (error "Non-prefixed repeat operator")
+;; (setq res (cons `(repeat ,min ,max ,(car res)) (cdr res))))))
+;; ((or ")" eob) (setq op 'stop))
+;; ("\\'" (push 'eob res))
+;; ("\\`" (push 'bob res))
+;; ("^" (push 'bol res))
+;; ("$" (push 'eol res))
+;; ("." (push 'dot res))
+
+;; ((or "(" "\\<" "\\>" "\\_<" "\\_>" "\\c" "\\s" "\\C" "\\S" "\\w"
"\\W"
+;; "\\b" "\\B" "\\=" (seq "\\" (char (?1 . ?9))))
+;; (error "Unsupported construct `%s'" (match-string 0 string)))
+
+;; ((or (seq ?\\ anything) anything)
+;; (push (aref string (1- (match-end 0))) res))
+;; ("" (error "This should not be reachable"))))
+;; (let ((re `(seq ,@(nreverse res))))
+;; (while stack (setq re (nconc (pop stack) (list re))))
+;; (cons i re))))
+
+
+
+(provide 'lex)
+;;; lex.el ends here
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- [ELPA-diffs] /srv/bzr/emacs/elpa r334: New package lex.el.,
Stefan Monnier <=