[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH] check-qjson: More thorough testing of UTF-8 in
From: |
Blue Swirl |
Subject: |
Re: [Qemu-devel] [PATCH] check-qjson: More thorough testing of UTF-8 in strings |
Date: |
Thu, 28 Feb 2013 20:06:01 +0000 |
On Thu, Feb 28, 2013 at 7:42 PM, Markus Armbruster <address@hidden> wrote:
> Blue Swirl <address@hidden> writes:
>
>> On Thu, Feb 28, 2013 at 8:14 AM, Markus Armbruster <address@hidden> wrote:
>>> Blue Swirl <address@hidden> writes:
>>>
>>>> On Mon, Feb 4, 2013 at 5:19 PM, Markus Armbruster <address@hidden> wrote:
>>>>> Test cases are scraped from Markus Kuhn's UTF-8 decoder capability and
>>>>> stress test at
>>>>> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>>>>
>>>>> Unfortunately, both JSON parser and formatter misbehave right now.
>>>>> This test expects current, incorrect results. They're all clearly
>>>>> marked, and are to be replaced by correct ones as the bugs get fixed.
>>>>> See comments in new utf8_string() for details.
>>>>>
>>>>> Signed-off-by: Markus Armbruster <address@hidden>
>>>>> ---
>>>>> tests/check-qjson.c | 625
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> 1 file changed, 625 insertions(+)
>>>>>
>>>>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>>>>> index 32ffb43..4590b3a 100644
>>>>> --- a/tests/check-qjson.c
>>>>> +++ b/tests/check-qjson.c
>>>>> @@ -1,8 +1,10 @@
>>>>> /*
>>>>> * Copyright IBM, Corp. 2009
>>>>> + * Copyright (c) 2013 Red Hat Inc.
>>>>> *
>>>>> * Authors:
>>>>> * Anthony Liguori <address@hidden>
>>>>> + * Markus Armbruster <address@hidden>,
>>>>> *
>>>>> * This work is licensed under the terms of the GNU LGPL, version 2.1 or
>>>>> later.
>>>>> * See the COPYING.LIB file in the top-level directory.
>>>>> @@ -131,6 +133,628 @@ static void single_quote_string(void)
>>>>> }
>>>>> }
>>>>>
>>>>> +static void utf8_string(void)
>>>>> +{
>>>>> + /*
>>>>> + * FIXME Current behavior for invalid UTF-8 sequences is
>>>>> + * incorrect. This test expects current, incorrect results.
>>>>> + * They're all marked "bug:" below, and are to be replaced by
>>>>> + * correct ones as the bugs get fixed.
>>>>> + *
>>>>> + * The JSON parser rejects some invalid sequences, but accepts
>>>>> + * others without correcting the problem.
>>>>> + *
>>>>> + * The JSON formatter replaces some invalid sequences by U+FFFF (a
>>>>> + * noncharacter), and goes wonky for others.
>>>>> + *
>>>>> + * For both directions, we should either reject all invalid
>>>>> + * sequences, or minimize overlong sequences and replace all other
>>>>> + * invalid sequences by a suitable replacement character. A
>>>>> + * common choice for replacement is U+FFFD.
>>>>> + *
>>>>> + * Problem: we can't easily deal with embedded U+0000. Parsing
>>>>> + * the JSON string "this \\u0000" is fun" yields "this \0 is fun",
>>>>> + * which gets misinterpreted as NUL-terminated "this ". We should
>>>>> + * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>>>>> + * UTF-8").
>>>>> + *
>>>>> + * Tests are scraped from Markus Kuhn's UTF-8 decoder capability
>>>>> + * and stress test at
>>>>> + * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>>>> + */
>>>>> + static const struct {
>>>>> + const char *json_in;
>>>>> + const char *utf8_out;
>>>>> + const char *json_out; /* defaults to @json_in */
>>>>> + const char *utf8_in; /* defaults to @utf8_out */
>>>>> + } test_cases[] = {
>>>>> + /*
>>>>> + * Bug markers used here:
>>>>> + * - bug: not corrected
>>>>> + * JSON parser fails to correct invalid sequence(s)
>>>>> + * - bug: rejected
>>>>> + * JSON parser rejects invalid sequence(s)
>>>>> + * We may choose to define this as feature
>>>>> + * - bug: want "\"...\""
>>>>> + * JSON formatter produces incorrect result, this is the
>>>>> + * correct one, assuming replacement character U+FFFF
>>>>> + * - bug: want "..." (no \")
>>>>> + * JSON parser produces incorrect result, this is the
>>>>> + * correct one.
>>>>> + * Not marked explicitly, but trivial to find:
>>>>> + * - JSON formatter replacing invalid sequence by \\uFFFF is a
>>>>> + * bug if we want it to fail for invalid sequences.
>>>>> + */
>>> [...]
>>>>> + /* 2.1.4 4 bytes U+10000 */
>>>>> + {
>>>>> + "\"\xF0\x90\x80\x80\"",
>>>>> + "\xF0\x90\x80\x80",
>>>>> + "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */
>>>>> + },
>>> [...]
>>>>> + {}
>>>>> + };
>>>>> + int i;
>>>>> + QObject *obj;
>>>>> + QString *str;
>>>>> + const char *json_in, *utf8_out, *utf8_in, *json_out;
>>>>> +
>>>>> + for (i = 0; test_cases[i].json_in; i++) {
>>>>> + json_in = test_cases[i].json_in;
>>>>> + utf8_out = test_cases[i].utf8_out;
>>>>> + utf8_in = test_cases[i].utf8_in ?: test_cases[i].utf8_out;
>>>>> + json_out = test_cases[i].json_out ?: test_cases[i].json_in;
>>>>> +
>>>>> + obj = qobject_from_json(json_in);
>>>>> + if (utf8_out) {
>>>>> + g_assert(obj);
>>>>> + g_assert(qobject_type(obj) == QTYPE_QSTRING);
>>>>> + str = qobject_to_qstring(obj);
>>>>> + g_assert_cmpstr(qstring_get_str(str), ==, utf8_out);
>>>>> + } else {
>>>>> + g_assert(!obj);
>>>>> + }
>>>>> + qobject_decref(obj);
>>>>> +
>>>>> + obj = QOBJECT(qstring_from_str(utf8_in));
>>>>> + str = qobject_to_json(obj);
>>>>> + if (json_out) {
>>>>> + g_assert(str);
>>>>> + g_assert_cmpstr(qstring_get_str(str), ==, json_out);
>>>>
>>>> This assertion trips on an ARM host (Debian stable):
>>>>
>>>> GTESTER tests/check-qjson
>>>> **
>>>> ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed
>>>> (qstring_get_str(str) == json_out): ("\"\\u0400\200\"" ==
>>>> "\"\\u0400\\uFFFF\"")
>>>> GTester: last random seed: R02S88b76f755e809e9024832d2ab6660afd
>>>
>>> Must be case 2.1.4, because that's where json_out is
>>> "\"\\u0400\\uFFFF\"".
>>>
>>> We start by passing C string "\"\xF0\x90\x80\x80\"" to the JSON parser
>>> qobject_from_json(). Yields "\xF0\x90\x80\x80", as expected.
>>>
>>> We then pass that to qobject_to_json(). Should yield
>>> "\"\\uD800\\uDC00\"" (surrogate pair). Does yield "\"\\u0400\\uFFFF\""
>>> on my machine (known bug), and "\"\\u0400\200\"" on yours.
>>>
>>> Looks like the JSON formatter is not just broken (we knew that already),
>>> it's broken in machine-dependent ways. Good to know, thanks for
>>> reporting.
>>>
>>> Obvious ways to get "make check" pass for you again *now*:
>>>
>>> * Disable check-qjson. That's too big a hammer for me.
>>>
>>> * Disable test case 2.1.4 with a comment explaining why.
>>>
>>> * Suitable #ifdeffery around the expected value.
>>>
>>> Preferences?
>>
>> * Fix JSON formatter :-)
>
> I want that too, but I'm afraid we can't have it *now* :)
>
>> Disabling 2.1.4 only reveals the next problem:
>> GTESTER tests/check-qjson
>> GTester: last random seed: R02S6754f3523201dc81bb21de42e2ba843c
>> **
>> ERROR:/src/qemu/tests/check-qjson.c:777:utf8_string: assertion failed
>> (qstring_get_str(str) == json_out): ("\"\\u8200\200\200\"" ==
>> "\"\\u8200\\uFFFF\\uFFFF\"")
>
> All right, I give up. I can't fix to_json() tonight (I have maybe 30
> minutes of usable brain left), but I can make it portably wrong. Please
> try the appended patch.
GTESTER tests/check-qjson
GTester: last random seed: R02Scc9b8a0a880770aaee720c8f98cc953d
**
ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed
(qstring_get_str(str) == json_out): ("\"\\u0400\200\"" ==
"\"\\u0400\\uFFFF\"")
>
>
> diff --git a/qobject/qjson.c b/qobject/qjson.c
> index 83a6b4f..195da1f 100644
> --- a/qobject/qjson.c
> +++ b/qobject/qjson.c
> @@ -187,7 +187,21 @@ static void to_json(const QObject *obj, QString *str,
> int pretty, int indent)
> default: {
> if (ptr[0] <= 0x1F) {
> char escape[7];
> - snprintf(escape, sizeof(escape), "\\u%04X", ptr[0]);
> + /*
> + * Portability band-aid
> + *
> + * We used to print ptr[0] here, but when
> + * plain char is signed, that prints \uFFFF
> + * for negative values. The code here is crap
> + * (see utf8_string() in tests/check-qjson.c),
> + * and needs to be fixed. Can't do that right
> + * now, and don't want to go from wrong to
> + * differently wrong, so I make the wrong we
> + * now get on the most common machines the
> + * wrong we get on all machines.
> + */
> + snprintf(escape, sizeof(escape), "\\u%04X",
> + *(signed char *)ptr);
> qstring_append(str, escape);
> } else {
> char buf[2] = { ptr[0], 0 };