[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 1/2] util: Refactor json-writer's string sanitizer to be publ
From: |
Eric Blake |
Subject: |
Re: [PATCH 1/2] util: Refactor json-writer's string sanitizer to be public |
Date: |
Thu, 8 Aug 2024 09:02:23 -0500 |
User-agent: |
NeoMutt/20240425 |
On Thu, Aug 08, 2024 at 09:54:26AM GMT, Markus Armbruster wrote:
> Eric Blake <eblake@redhat.com> writes:
>
> > My next patch needs to convert text from an untrusted input into an
> > output representation that is suitable for display on a terminal is
> > useful to more than just the json-writer; the text should normally be
> > UTF-8, but blindly allowing all Unicode code points (including ASCII
> > ESC) through to a terminal risks remote-code-execution attacks on some
> > terminals. Extract the existing body of json-writer's quoted_strinto
> > a new helper routine mod_utf8_sanitize, and generalize it to also work
> > on data that is length-limited rather than NUL-terminated.
>
> This is two changes in one patch: factor out, and generalize.
>
> You don't actually use the generalized variant. Please leave that for
> later, and keep this patch simple.
Makes sense. Will simplify in v2.
>
> > [I was
> > actually surprised that glib does not have such a sanitizer already -
> > Google turns up lots of examples of rolling your own string
> > sanitizer.]
See https://gitlab.gnome.org/GNOME/glib/-/issues/3426 where I asked if
glib should provide a more generic sanitizer. In turn, I found glib
does have:
char*
g_uri_escape_string (
const char* unescaped,
const char* reserved_chars_allowed,
gboolean allow_utf8
)
which is a bit more powerful (you have some fine-tuning on what gets
escaped), but a different escaping mechanism (%XX instead of \uXXXX
escapes, where % rather than \ is special). I'm not sure whether it
is nicer to have a malloc'd result or to append into a larger g_string
as the base operation (and where you could write an easy wrapper to
provide the other operation).
> > +/**
> > + * mod_utf8_sanitize:
> > + * @buf: Destination buffer
> > + * @str: Modified UTF-8 string to sanitize
> > + *
> > + * Append the contents of the NUL-terminated Modified UTF-8 string
> > + * @str into @buf, with escape sequences used for non-printable ASCII
>
> "Append into" or "append to"?
"append to" works, will simplify.
>
> > + * and for quote characters. The result is therefore safe for output
> > + * to a terminal.
>
> Actually, we escape double quote, backslash, ASCII control characters,
> and non-ASCII characters, i.e. we leave just printable ASCII characters
> other than '"' and '\\' unescaped. See the code quoted below.
>
> Escaping '\\' is of course necessary.
>
> Escaping '"' is necessary only for callers that want to enclose the
> result in double quotes without ambiguity. Which is what the existing
> caller wants. Future callers may prefer not to escape '"'. But we can
> worry about this when we run into such callers.
If we encounter more users, I could see this morphing into a backend
that takes a flag argument on knobs like whether to escape " or ',
whether to preserve printing Unicode, ...; coupled with frontends with
fewer arguments that pass the right flags intot the backend.
>
> > + *
> > + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> > + * "\xC0\x80".
> > + */
>
> Missing: behavior for invalid sequences. Each such sequence is replaced
> by a replacement character U+FFFD. For the definition of "invalid
> sequence", see mod_utf8_codepoint().
Indeed, more documentation here wouldn't hurt (both on \ and " being
escaped, and on U+FFFD codepoints being placed into the output
stream).
>
> On the function name. "Sanitize" could be many things. This function
> actually does two things: (1) replace invalid sequences, and (2) escape
> to printable ASCII. What about append_mod_utf8_as_printable_ascii()?
> Admittedly a mouthful.
Naming is always tough. Your suggestion is longer, but indeed
accurate. Maybe a shorter compromise of append_escaped_mod_utf8()?
>
> > +void mod_utf8_sanitize(GString *buf, const char *str)
> > +{
> > + mod_utf8_sanitize_len(buf, str, -1);
> > +}
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org