bug-gnu-emacs

From: Eli Zaretskii
Subject: bug#55777: [PATCH] Improve documentation of `string-to-multibyte', `string-to-unibyte'
Date: Sun, 05 Jun 2022 08:37:16 +0300

> Date: Sat, 4 Jun 2022 20:16:47 -0400
> Cc: 55777@debbugs.gnu.org
> From: Richard Hansen <rhansen@rhansen.org>
> 
> > You are digging into low-level details of how Emacs keeps strings in
> > memory, and the higher-level context of _why_ you need to understand
> > these details is left untold.
> 
> Readers either think the documentation is confusing or they don't; why
> they need to understand the documentation is mostly irrelevant. I
> find the documentation to be confusing, and I suspect I am not the
> only one.

I said "understand the details", not "understand the documentation".
The latter is a no-brainer: documentation should be understandable,
and I think what we have now is.  See below regarding the parts you
say confused you.

> > In general, Lisp programs are well advised to stay away from
> > manipulating unibyte strings, and definitely to refrain from comparing
> > unibyte and multibyte strings -- because these are supposed to be
> > never needed in Lisp applications, and because doing TRT with those
> > requires non-trivial knowledge of the Emacs internals.
> 
> I disagree with "well advised". The documentation in 34.1 and 34.3
> make it sound like the representation is merely an internal elisp
> implementation detail that programmers don't need to worry about,
> unless they are doing something unusually low-level.

That is exactly the intent.

The recommendation not to deal with non-text data directly (as opposed
to via, say, packages like bindat.el) is based on experience, both mine
and that of others.

> I consider binary data processing to be somewhat common, not
> "unusually low-level". Yet manipulating byte values 128-255 in unibyte
> strings, and characters with Unicode codepoints 128-255 in multibyte
> strings, is fraught with peril. For example, it is risky to use `aref'
> to read a character or `aset' to write a character unless you either
> know the string representation or know that the character is not in
> #x80-#xff or #x3fff80-#x3fffff.

You are describing some of the known difficulties that arise when
manipulating binary data in Emacs strings and buffers, which are the
reasons for the above recommendation.  Emacs can do all this, but not
easily, since it isn't its main design goal.  For comparison, some
other text-processing environments simply reject any non-character
data in strings.
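
As a rough illustration of one such difficulty, here is a minimal
sketch showing that `aref' on the same raw byte yields different
values depending on the string representation.  The function name
`my-demo-aref-by-representation' is hypothetical, made up for this
example only.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-demo-aref-by-representation ()
  "Return what `aref' sees for raw byte #xe9 in both representations."
  (let* ((uni (unibyte-string #xe9))        ; unibyte string, one raw byte
         (multi (string-to-multibyte uni))) ; same byte, multibyte string
    (list (multibyte-string-p uni)          ; nil
          (aref uni 0)                      ; #xe9
          (multibyte-string-p multi)        ; t
          (aref multi 0))))                 ; #x3fffe9
```

Evaluating (my-demo-aref-by-representation) should return a list
equal to (nil #xe9 t #x3fffe9).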

> > I see no reason to complicate the documentation for the very rare
> > occasions where these issues unfortunately leak to
> > higher-than-expected levels.
> 
> I don't think the occasions are all that rare.  But even if they are,
> the precise behavior should be documented somewhere so that
> programmers who need low-level string manipulation can do so
> correctly.

Documenting every aspect of the Emacs behavior for the rare chance
that someone some day will find it useful would make our documentation
too large.  The Emacs Lisp Reference manual already prints in 2 very
thick volumes.  So our policy is not to document the aspects that are
too obscure to be useful to many.

> I would argue that programmers using `string-to-unibyte'
> or `string-to-multibyte' fall into that category.

I disagree.  First, these functions should be used very rarely, and we
generally try to avoid them entirely.  And if they do need to be used,
the current documentation is IMO adequate.  It still has to be
understandable, of course, but it doesn't need to describe every
possible detail of how Emacs handles raw bytes and conversions between
them and readable text.

> I still find the current wording to be confusing. To me, all bytes
> have 8 bits so "raw 8-bit bytes" sounds bizarrely redundant. Also,
> ASCII characters are encoded to bytes, yet "raw 8-bit bytes" is meant
> to refer only to non-ASCII values.

What "raw bytes" are is explained in one of the previous sections of
this chapter.

> I have attached another revision that I think is complete, correct,
> and easier to understand.

I think it muddies the water by talking about numerical values 128 to
255, which also match some Latin characters.  It also removes the
reference to the codepoints Emacs uses to represent these bytes, which
is important in some situations.  So I think your proposal would
change this text for the worse.

Could you please state what is confusing in the current wording?  If
it's only the "raw 8-bit bytes" thing, it is explained earlier in the
manual; if needed, we could add a cross-reference there to that
section.  If it's something else, please tell.  But mentioning the
single-byte numerical values here actually increases the confusion,
IME, due to overlap with valid Unicode codepoints, which is why we
should and do deliberately refrain from doing that.
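
To make the overlap concrete, here is a small sketch, not a proposed
addition to the manual.  The value #xe9 is both the codepoint of a
Latin character and a possible raw byte, and the two behave
differently under `string-to-unibyte'.  The function name
`my-demo-latin-vs-raw-byte' and the symbol `not-a-raw-byte' are
hypothetical, invented for this example.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-demo-latin-vs-raw-byte ()
  "Contrast the character U+00E9 with the raw byte #xe9."
  (let ((latin (string #xe9))   ; multibyte string holding U+00E9
        (raw (string-to-multibyte (unibyte-string #xe9))))
    (list (aref latin 0)        ; the Unicode codepoint, #xe9
          (aref raw 0)          ; the raw-byte codepoint, #x3fffe9
          ;; U+00E9 is neither ASCII nor a raw byte, so the
          ;; conversion signals an error, which we catch here.
          (condition-case nil
              (string-to-unibyte latin)
            (error 'not-a-raw-byte))
          ;; The raw byte converts back to its single-byte value.
          (aref (string-to-unibyte raw) 0))))
```

Evaluating (my-demo-latin-vs-raw-byte) should return a list equal to
(#xe9 #x3fffe9 not-a-raw-byte #xe9).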

Thanks.




