emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Char-folding: how can we implement matching multiple characters as a sin


From: Artur Malabarba
Subject: Char-folding: how can we implement matching multiple characters as a single "thing"?
Date: Mon, 30 Nov 2015 15:54:50 +0000

I pushed this feature a couple of days ago, but I had to disable it
now (the code is in `character-fold-to-regexp').

For those who don't know, the way char-folding works is by replacing
each character in the search string with a regexp of the characters it
can match.
For instance, a character-folding search for "fi" is actually a regexp
search for 
"\\(?:ḟ\\|[fᶠḟⓕf𝐟𝑓𝒇𝒻𝓯𝔣𝕗𝖋𝖿𝗳𝘧𝙛𝚏]\\)\\(?:i[̀-̨̣̰̄̆̈̉̌̏̑]\\|[iì-ïĩīĭįǐȉȋᵢḭỉịⁱℹⅈⅰⓘi𝐢𝑖𝒊𝒾𝓲𝔦𝕚𝖎𝗂𝗶𝘪𝙞𝚒]\\)".

As you can see, this multiplies the length of the string by a factor
of about 45. But that's still acceptable.

However, this only includes foldings for the 'f' character and the 'i'
character independently. As it happens, the string "fi" can also match
the ligature 'fi'.
Now, for the sake of conciseness, let me denote the f and i regexps
above as [f] and [i].

The code I pushed a couple of days ago added support for multi-char
matches (such as "fi" matching that ligature), but returning a regexp
like this: "\\([f][i]\\|fi\\)"
This might not look like much. It only adds a few chars to a regexp
that already has about 90. But the problem is that it introduces a new
branch.

Let's say we want to search for the word "fix". The letters 'i' and
'x' could also represent the character 'ⅸ' (roman numeral nine). So
the only way to write that regexp is:
"\\([f]\\([i][x]\\|ⅸ\\)\\|fi[x]\\)".
Now the pattern [x] (which I remind you has ~ 40 chars) appears twice
in the regexp, and we've gained another branch. If the next character
the search string can also combine with 'x', then we go from 3 to 5
branches.

As you can see. The number of branches will scale badly with the
number of input characters. Which means the number of output
characters will be much more than 40 times the input.
And even if the next character does NOT combine with 'x', it's hard to
write an algorithm that can identify that and close all branches
before continuing with the conversion.

And that is why I disabled this code for now. With a short string like
"endless-visit-notifications" it was return regexps longer than 10k
characters. Which is more than Emacs handle.

Does anyone have alternative ideas?



reply via email to

[Prev in Thread] Current Thread [Next in Thread]