bug#74922: Fwd: bug#74922: 29.4; copy_string

bug-gnu-emacs

From:	Eli Zaretskii
Subject:	bug#74922: Fwd: bug#74922: 29.4; copy_string_contents doesn't always produce a valid utf-8
Date:	Sat, 21 Dec 2024 14:09:24 +0200

> Cc: 74922@debbugs.gnu.org
> Date: Tue, 17 Dec 2024 17:10:36 +0200
> From: Eli Zaretskii <eliz@gnu.org>
> 
> > From: Evgeny Kurnevsky <kurnevsky@gmail.com>
> > Date: Tue, 17 Dec 2024 14:46:28 +0000
> > Cc: 74922@debbugs.gnu.org
> > 
> > It can definitely do it, but I guess in emacs-module-rs it's not done by
> > default because of performance implications - it might be quite costly
> > to check every string in some cases, and it wasn't really clear if emacs
> > can pass an invalid string. So currently this case causes undefined
> > behavior there which results in emacs crash.
> 
> What do Rust programs do when they are told to read random files?
> This is the same situation, basically.
> 
> And what would the module do if copy_string_contents *did* signal an
> error?

I think I know what happened: you called copy_string_contents with a
unibyte string.  In that case, copy_string_contents returns the
original bytes unchanged.  The code in copy_string_contents that
signals an error relies on the fact that encoding the input string
yields nil if the input includes non-Unicode characters.  But that
cannot be established for unibyte strings, because a unibyte string
doesn't hold characters, it holds raw bytes.

What you should do is make sure the string passed to
copy_string_contents is a multibyte string.  If I do that, i.e.

  (switch-to-buffer "foo")
  (set-buffer-multibyte t)
  (insert-file-contents "/path/to/wg-private-pc.age")
  (setq str1 (buffer-string))

and then call copy_string_contents with the resulting string str1, I
get the result you expected.

You need to realize that copy_string_contents is a variant of the
text-encoding routines: it encodes the input multibyte string in
UTF-8.  When given a unibyte string, the encoding routines in Emacs
always return it unchanged, because a unibyte string is already
encoded, or at least is supposed to be encoded.

And before you ask: no, copy_string_contents cannot by itself signal
an error if passed a unibyte string, because a unibyte string can
legitimately be a valid UTF-8 string. So in this case,
copy_string_contents relies on the caller to make sure the input is
valid UTF-8.
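
For illustration only, here is a minimal sketch of what a checked
conversion on the module side could look like in Rust.  The helper
names checked_string and lossy_string are hypothetical and are not
part of emacs-module-rs; the sketch assumes the buffer holds exactly
the bytes written by copy_string_contents, including the terminating
null byte.

```rust
use std::str::Utf8Error;

/// Hypothetical helper: drop the terminating null byte that
/// copy_string_contents writes at the end of the buffer, if present.
fn strip_trailing_nul(buf: &mut Vec<u8>) {
    if buf.last() == Some(&0) {
        buf.pop();
    }
}

/// Hypothetical helper: convert the copied bytes into a String,
/// validating them first.  Invalid UTF-8 (for example raw bytes that
/// came from a unibyte string) yields an error instead of undefined
/// behavior.
pub fn checked_string(mut buf: Vec<u8>) -> Result<String, Utf8Error> {
    strip_trailing_nul(&mut buf);
    String::from_utf8(buf).map_err(|e| e.utf8_error())
}

/// Hypothetical alternative: never fail, and replace each invalid
/// sequence with U+FFFD.
pub fn lossy_string(mut buf: Vec<u8>) -> String {
    strip_trailing_nul(&mut buf);
    String::from_utf8_lossy(&buf).into_owned()
}
```

The validation costs one pass over the bytes, which is the
performance trade-off mentioned earlier in this thread; whether that
is acceptable is for the module author to decide.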
