bug-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#24206: 25.1; Curly quotes generate invalid strings, leading to a seg


From: Paul Eggert
Subject: bug#24206: 25.1; Curly quotes generate invalid strings, leading to a segfault
Date: Thu, 18 Aug 2016 11:33:36 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

Eli Zaretskii wrote:

The code that got removed was the easy and intuitive part: it dealt
with processing single-byte strings one byte at a time.  The
hard-to-read part of the code is still with us.  We have less 'if'
conditionals, but that's hardly the main complication in the original
code.

Sure, but removing unnecessary easy stuff lets the reader see the hard stuff more clearly.

You are missing my point: the code on master now processes a string,
that could be either unibyte or multibyte, using only multibyte
methods.  With the flag in place, each kind of string would have used
the method that's natural with it.  The way things are now, one has to
think hard about what the code does to convince oneself it's valid.

The way things were before it was even harder, because one had to worry not only about processing Emacs-encoded text, one also had to worry about processing unibyte text containing non-ASCII bytes. The code is simpler now, because it needs only to process Emacs-encoded text.

The old code might have flown despite its problems, if all the input data were consistent (i.e., either all unibyte, or all with Emacs-encoded text). But inputs need not be consistent, so the old approach simply did not work.

As I take it, your principal objection to the new code is not to its internals: it's that substitute-command-keys can now return a multibyte string even when all the input data is unibyte. I don't think that's a big deal, but if this is the primary reason for our lengthy conversation, I can move things forward by changing the code so that it instead returns a unibyte string when all the input data are unibyte. Would that suffice?

    I don't see why it is tricky, we do that in Emacs in other places.

Really? A call to STRING_CHAR_AND_LENGTH followed by a length test followed by a 
call to memcpy for length > 1 and a special case inline copy for length == 1? 
When copying multibyte data? Where else does Emacs do that?

What exactly confuses you in that snippet?

Nothing confuses me in that snippet. I know what the snippet does, now that I've read and understood it. It is a longwinded and unnecessarily tricky way of doing something simple.

The call to
STRING_CHAR_AND_LENGTH itself? we have that in umpteen other places.
The single-byte optimization of not calling memcpy?  That's standard
practice in C.

I'm not talking about each individual line of code in that snippet. I am talking about the entire construction. A reader must look at all 14 lines to deduce what it does, deduce that it's unnecessarily complicated, and deduce that the unnecessary complication is not a sign that something unusual is going on. No place else in Emacs has this construction.

If you need an example for using
STRING_CHAR_AND_LENGTH while copying text, you can find it in
copy_text, for example.

copy_text does something quite different. When copying multibyte text, it does a single memcpy for the entire string. copy_text does not call memcpy for each multibyte character, and nothing in copy_text is particularly close to the snippet in question.

you replaced it with a
fall-through, which is harder to follow and easier to introduce subtle
bugs.

True, but the fall-through is clearly marked in comments. The new way is a bit more efficent (smaller code space, more likely to fit in cache); the old way duplicated character-copying code, a practice that also introduces subtle bugs -- in addition to temporarily mystifying the reader because the old code duplicates were not identical (itself a sign of cruft). With all that in mind, it was reasonable to switch from the old way to the new, despite the disadvantage you mention.

        Alan wanted something that he could put into his .emacs that would cause
        (message PERCENTLESS) to output the string PERCENTLESS as-is, assuming
        PERCENTLESS lacks %. This was the point of his original bug report; his 
original
        example involved ` and ' but he wanted the same behavior for ‘ and ’, a 
point
        that became clear during the discussion of Bug#23425.

    Then why not for '..' as well?  How is that different from ‘..’?

It's not different. Alan wanted the same behavior for '..', and he got that too.

But the behavior is not the same:

I was referring to Alan's desire to treat all quotes the same (i.e., to not substitute for any of them), which is now supported by setting text-quoting-style to grave.

  (let ((text-quoting-style 'curve))
    (substitute-command-keys "'foo'"))
      => ’foo’

but

  (let ((text-quoting-style 'grave))
    (substitute-command-keys "‘foo’"))
      => ‘foo’

I would have expected the first example to yield 'foo'

No, substitute-command-keys works on each grave accent and apostrophe separately, without looking at the others. As I recall it's worked that way and has been documented that way, in both master and emacs-25, ever since the feature was installed. One could posit a "smarter" form of substitution, which leaves 'foo' alone but which translates `foo'. Although we considered that possibility during design, we rejected it because it is more complicated and has more problems and quirks that are a pain to document and would surprise users in other ways.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]