Re: JSON/YAML/TOML/etc. parsing performance

Eli Zaretskii <address@hidden> schrieb am Di., 19. Sep. 2017 um 21:10 Uhr:

> From: Philipp Stephani <address@hidden>
> Date: Tue, 19 Sep 2017 08:18:14 +0000
>
> Here's a newer version of the patch. The only significant difference is that now the Lisp values for JSON null
> and false are :null and :false, respectively. Using a dedicated symbol for :null reduces the mental overhead of
> the triple meaning of nil (null, false, empty list), and is more future-proof, should we ever want to support lists.

Thanks, a few comments below.

Thanks for the review. Most of the comments are about converting between C and Lisp strings, so let me summarize my questions here.

IIUC Jansson only accepts UTF-8 strings (i.e. it will generate an error some input is not an UTF-8 string), and will only return UTF-8 strings as well. Therefore I think that direct conversion between Lisp strings and C strings (using SDATA etc.) is always correct because the internal Emacs encoding is a superset of UTF-8. Also build_string should always be correct because it will generate a correct multibyte string for an UTF-8 string with non-ASCII characters, and a correct unibyte string for an ASCII string, right?

> +static _Noreturn void
> +json_parse_error (const json_error_t *error)
> +{
> + xsignal (Qjson_parse_error,
> + list5 (build_string (error->text), build_string (error->source),
> + make_natnum (error->line), make_natnum (error->column),
> + make_natnum (error->position)));
> +}

I think error->source could include non-ASCII characters, in which
case you need to use make_specified_string with its last argument
non-zero, not build_string, which has its own ideas about when to
produce a multibyte string.

> +static _GL_ARG_NONNULL ((2)) Lisp_Object
> +lisp_to_json_1 (Lisp_Object lisp, json_t **json)
> +{
> + if (VECTORP (lisp))
> + {
> + ptrdiff_t size = ASIZE (lisp);
> + eassert (size >= 0);
> + if (size > SIZE_MAX)
> + xsignal1 (Qoverflow_error, build_pure_c_string ("vector is too long"));

I don't think you can allocate pure storage at run time, only at dump
time. (There are more of this elsewhere in the patch.)

OK, will be fixed in the next version.

> + /* LISP now must be a vector or hashtable. */
> + if (++lisp_eval_depth > max_lisp_eval_depth)
> + xsignal0 (Qjson_object_too_deep);

This error could mislead: the problem could be in the nesting of
surrounding Lisp being too deep, and the JSON part could be just fine.

Agreed, but I think it's better to use lisp_eval_depth here because it's the total nesting depth that could cause stack overflows.

> + Lisp_Object string
> + = make_string (buffer_and_size->buffer, buffer_and_size->size);

This is arbitrary text, so I'm not sure make_string is appropriate.
Could the text be a byte stream, i.e. not human-readable text? If so,
do we want to create a unibyte string or a multibyte string here?

It should always be UTF-8.

> + insert_from_string (string, 0, 0, SCHARS (string), SBYTES (string), false);

Hmmm... if you want to insert the text into the buffer, you need to
make sure it has the right representation. What kind of text is this?
It probably should be decoded.

In any case, going through a string sounds gross. You should insert
the text directly into the gap, like we do in a couple of places
already. See insert_from_gap and its users, and maybe also
decode_coding_gap.

OK, I'll have to check that, but it sounds doable.

> +DEFUN ("json-parse-string", Fjson_parse_string, Sjson_parse_string, 1, 1, NULL,
> + doc: /* Parse the JSON STRING into a Lisp object.
> +This is essentially the reverse operation of `json-serialize', which
> +see. The returned object will be a vector or hashtable. Its elements
> +will be `:null', `:false', t, numbers, strings, or further vectors and
> +hashtables. If there are duplicate keys in an object, all but the
> +last one are ignored. If STRING doesn't contain a valid JSON object,
> +an error of type `json-parse-error' is signaled. */)
> + (Lisp_Object string)
> +{
> + ptrdiff_t count = SPECPDL_INDEX ();
> + check_string_without_embedded_nulls (string);
> +
> + json_error_t error;
> + json_t *object = json_loads (SSDATA (string), 0, &error);

Doesn't json_loads require the string to be encoded in some particular
encoding? If so, passing it our internal representation might not be
TRT.

> + /* First, parse from point to the gap or the end of the accessible
> + portion, whatever is closer. */
> + ptrdiff_t point = d->point;
> + ptrdiff_t end;
> + {
> + bool overflow = INT_ADD_WRAPV (BUFFER_CEILING_OF (point), 1, &end);
> + eassert (!overflow);
> + }
> + size_t count;
> + {
> + bool overflow = INT_SUBTRACT_WRAPV (end, point, &count);
> + eassert (!overflow);
> + }

Why did you need these blocks in braces?

To be able to reuse the "overflow" name/

> +(provide 'json-tests)
> +;;; json-tests.el ends here

IMO, it would be good to test also non-ASCII text in JSON objects.

Yes, once the patch is in acceptable shape, I plan to add many more tests.

From:	Philipp Stephani
Subject:	Re: JSON/YAML/TOML/etc. parsing performance
Date:	Thu, 28 Sep 2017 21:19:00 +0000