bug#33205: 26.1; unibyte/multibyte missing in rx.el
From: Eli Zaretskii
Subject: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Mon, 05 Nov 2018 18:49:07 +0200

> Date: Wed, 31 Oct 2018 17:55:08 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> Cc: 33205@debbugs.gnu.org
>
> > From: Mattias Engdegård <mattiase@acm.org>
> > Cc: 33205@debbugs.gnu.org
> > Date: Wed, 31 Oct 2018 16:27:53 +0100
> >
> > On Tue, 2018-10-30 at 19:27 +0200, Eli Zaretskii wrote:
> > > I think it's a documentation bug: [:unibyte:] matches only ASCII
> > > characters. IOW, it tests "unibyteness" in the internal
> > > representation (which might be surprising, I know).
> > >
> > > And [:nonascii:] is only defined for multibyte characters.
> >
> > Thus [:ascii:]/[:nonascii:] cannot be distinguished from
> > [:unibyte:]/[:multibyte:]. Surely this cannot have been the intention?
>
> I actually looked into this some more, and I think my original
> conclusion was wrong. Let me dwell on that a bit more, and I will
> report what I found. We can then revisit the questions you ask above.

After looking into this, my conclusion is that what I wrote above was
not too wrong. Indeed, currently [:ascii:]/[:nonascii:] cannot be
distinguished from [:unibyte:]/[:multibyte:]. In a nutshell, it turns
out [:unibyte:] is not what one might think it is; you can see that in
re_wctype_to_bit, for example.

Thinking about this and looking at the code, I'd say that support of
named character classes is heavily biased in favor of multibyte text,
not to say supports _only_ multibyte text. So searching unibyte
strings and unibyte buffers for the likes of [:unibyte:] will only
find ASCII characters.

In multibyte buffers and strings, unibyte characters are stored in
their multibyte representation, so it is no longer trivial to define
what [:unibyte:] means. However, I discovered that there's a
workaround for what you are trying to do: use ^[:multibyte:] instead
of [:unibyte:]. Observe:

  (setq s "A\310")                        => "A\310"
  (string-match-p "A[[:ascii:]]" s)       => nil
  (string-match-p "A[[:nonascii:]]" s)    => nil
  (string-match-p "A[^[:ascii:]]" s)      => 0 ;; !!!
  (string-match-p "A[[:unibyte:]]" s)     => nil
  (string-match-p "A[^[:multibyte:]]" s)  => 0 ;; !!!

That ^[:ascii:] is not the same as [:nonascii:], and the same with
[:unibyte:] vs ^[:multibyte:], is arguably a bug. The reason for that
becomes clear if you look at how we generate the fastmap in each of
these cases and how we set the bits in the work-area of the range
table, but I don't know enough to say how easy it would be to fix
that.

An alternative is to use an explicit character class, as in \000-\377,
which works as you'd expect.
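
A minimal sketch of that alternative, reusing the string s from the
snippet above. The exact spelling of the pattern inside a Lisp string
is an assumption here, and no results are listed:

```elisp
;; Sketch only: `s' is the same unibyte string as in the example above.
(setq s "A\310")

;; Match "A" followed by any byte in an explicit range, written out
;; with octal escapes, instead of relying on a named class such as
;; [:unibyte:].  The bracket expression below is an assumed spelling
;; of the \000-\377 range mentioned in the text.
(string-match-p "A[\000-\377]" s)
```

This can be evaluated in *scratch* or with M-: to compare against the
named-class results shown earlier.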

> > Taking a step back: Do you agree that the missing unibyte/multibyte
> > should be added to rx
>
> I think it depends on what we find regarding the functionality. It's
> possible that it makes no real sense in the context of rx, for example
> (although it indeed sounds like an omission).
>
> > If there is a useful interpretation of [:unibyte:]/[:multibyte:] today,
> > perhaps we could make them behave that way.
>
> Right. Stay tuned, and thanks for pointing out this surprising
> behavior.

Well, what do you think now? Is it worth adding those to rx.el? I'm
not sure. How important is it to find unibyte characters in a string,
anyway?