emacs-devel

Re: JSON/YAML/TOML/etc. parsing performance


From: Philipp Stephani
Subject: Re: JSON/YAML/TOML/etc. parsing performance
Date: Tue, 03 Oct 2017 15:52:03 +0000



Eli Zaretskii <address@hidden> wrote on Tue., Oct. 3, 2017 at 17:32:

> > > +static Lisp_Object
> > > +internal_catch_all_1 (Lisp_Object (*function) (void *), void *argument)
> >
> > Can you tell why you needed this (and the similar internal_catch_all)?
> > Is that only because the callbacks could signal an error, or is there
> > another reason?  If the former, I'd prefer to simplify the code and
> > its maintenance by treating the error condition in a less drastic
> > manner, and avoiding the call to xsignal.
>
> The callbacks (especially insert and before-/after-change-hook) can exit
> nonlocally, but these nonlocal exits may not escape the Jansson callback.
> Therefore all nonlocal exits must be caught here.

Why can't you use record_unwind_protect, as we normally do in these
situations?

How would that help? record_unwind_protect can't stop nonlocal exits.
 

> > > +static _GL_ARG_NONNULL ((2)) Lisp_Object
> > > +lisp_to_json_toplevel_1 (Lisp_Object lisp, json_t **json)
> > > +{
> > > +  if (VECTORP (lisp))
> > > +    {
> > > +      ptrdiff_t size = ASIZE (lisp);
> > > +      eassert (size >= 0);
> > > +      if (size > SIZE_MAX)
> > > +        xsignal1 (Qoverflow_error, build_string ("vector is too long"));
> >
> > I think this error text is too vague.  Can we come up with something
> > that describes the problem more accurately?
>
> Maybe, but it's probably not worth it because I don't think we have many
> architectures where PTRDIFF_MAX > SIZE_MAX.

Then why do we punish all the platforms with this runtime check?

If you think this cannot happen, we can turn it into a runtime or compile-time assertion.
 

> > And btw, how can size be greater than SIZE_MAX in this case?  This is
> > a valid Lisp object, isn't it?  (There are more such tests in the
> > patch, e.g. in lisp_to_json, and I think they, too, are redundant.)
>
> Depends on the range of ptrdiff_t and size_t. IIUC nothing in the C
> standard guarantees PTRDIFF_MAX <= SIZE_MAX.

I wasn't talking about PTRDIFF_MAX, I was talking about 'size', which
is the number of bytes in a Lisp string.  Since that Lisp string is a
valid Lisp object, how can its size be greater than SIZE_MAX?  I don't
think there's a way of creating such a Lisp string in Emacs, because
functions that allocate memory for strings will prevent that.

Then I think we should at least add an assertion to document this.
 

> > > +      *json = json_check (json_array ());
> > > +      ptrdiff_t count = SPECPDL_INDEX ();
> > > +      record_unwind_protect_ptr (json_release_object, json);
> > > +      for (ptrdiff_t i = 0; i < size; ++i)
> > > +        {
> > > +          int status
> > > +            = json_array_append_new (*json, lisp_to_json (AREF (lisp,
> > i)));
> > > +          if (status == -1)
> > > +            json_out_of_memory ();
> > > +          eassert (status == 0);
> > > +        }
> > > +      eassert (json_array_size (*json) == size);
> > > +      clear_unwind_protect (count);
> > > +      return unbind_to (count, Qnil);
> >
> > This, too, sounds more complex than it should: you record
> > unwind-protect just so lisp_to_json's subroutines could signal an
> > error due to insufficient memory, right?  Why can't we have the
> > out-of-memory check only inside this loop, which you already do, and
> > avoid the checks on lower levels (which undoubtedly cost us extra
> > cycles)?  What do those extra checks in json_check buy us? the errors
> > they signal are no more informative than the one in the loop, AFAICT.
>
> I don't understand what you mean. We need to check the return values of all
> functions if we want to use them later.

Yes, but what problems can cause these return values to be invalid?
AFAICT, only out-of-memory conditions, and that can be checked only
once, there's no need to check every single allocation, because once
an allocation fails, all the rest will too.

But if the first succeeds, the second can still fail, so we do need to check all of them.
 

> > > +static Lisp_Object
> > > +json_insert (void *data)
> > > +{
> > > +  const struct json_buffer_and_size *buffer_and_size = data;
> > > +  if (buffer_and_size->size > PTRDIFF_MAX)
> > > +    xsignal1 (Qoverflow_error, build_string ("buffer too large"));
> > > +  insert (buffer_and_size->buffer, buffer_and_size->size);
> >
> > I don't think we need this test here, as 'insert' already has the
> > equivalent test in one of its subroutines.
>
> It can't, because it takes the byte length as ptrdiff_t. We need to check
> before whether the size is actually in the valid range of ptrdiff_t.

I'm sorry, but I don't see why we should support such exotic
situations only for this one feature.  In all other cases we use
either ptrdiff_t type or EMACS_INT type, and these issues disappear
then.  Trying to support the SIZE_MAX > PTRDIFF_MAX situation causes
the code to be much more complicated, harder to maintain, and more
expensive at run time than it should be. 

We can't avoid these checks. The API returns size_t, so we can only assume that the numbers are in the valid range of size_t, which extends beyond the range of nonnegative ptrdiff_t values. There's no way around this.
 
I'm not even sure there are
such platforms out there that Emacs supports,

All platforms that I know of have SIZE_MAX > PTRDIFF_MAX.
 
but if there are, we
already have a gazillion problems like that all over our code. 

Just because other parts of the codebase are buggy doesn't mean we need to introduce more bugs in new code.
 
I
object to having such code just for this reason, sorry.

We can't avoid it.
 

> > > +    case JSON_INTEGER:
> > > +      {
> > > +        json_int_t value = json_integer_value (json);
> > > +        if (FIXNUM_OVERFLOW_P (value))
> > > +          xsignal1 (Qoverflow_error,
> > > +                    build_string ("JSON integer is too large"));
> > > +        return make_number (value);
> >
> > This overflow test is also redundant, as make_number already does it.
>
> It can't, because json_int_t can be larger than EMACS_INT.

OK, but then I think we should consider returning a float value, or a
cons of 2 integers.  If these situations are frequent enough, users
will thank us, and if they are very infrequent, they will never see
such values, and we gain code simplicity and less non-local exits.

Returning a float (using make_natnum_or_float) might work, but in the end I've decided against it because it could silently drop precision. I think that's worse than signaling an error.
 

> > > +    case JSON_STRING:
> > > +      {
> > > +        size_t size = json_string_length (json);
> > > +        if (FIXNUM_OVERFLOW_P (size))
> > > +          xsignal1 (Qoverflow_error, build_string ("JSON string is too
> > long"));
> > > +        return json_make_string (json_string_value (json), size);
> >
> > Once again, the overflow test is redundant, as make_specified_string
> > (called by json_make_string) already includes an equivalent test.
>
> And once again, we need to check at least whether the size fits into
> ptrdiff_t.

No, we don't, as we don't in other similar cases.

I don't understand why you think these checks aren't necessary. Converting between integral types when the number is out of range for the destination type results in an implementation-defined result, i.e. it's unportable. Even assuming the GCC convention, performing such conversions results in dangerously incorrect values.
 

> > > +    case JSON_ARRAY:
> > > +      {
> > > +        if (++lisp_eval_depth > max_lisp_eval_depth)
> > > +          xsignal0 (Qjson_object_too_deep);
> > > +        size_t size = json_array_size (json);
> > > +        if (FIXNUM_OVERFLOW_P (size))
> > > +          xsignal1 (Qoverflow_error, build_string ("JSON array is too
> > long"));
> > > +        Lisp_Object result = Fmake_vector (make_natnum (size),
> > Qunbound);
> >
> > Likewise here: Fmake_vector makes sure the size is not larger than
> > allowed.
>
> Same as above: It can't.

It can and it does.

No, it can't. make_natnum takes a ptrdiff_t argument, and when passing a value that's out of range for ptrdiff_t, it will receive an incorrect, implementation-defined value.
 

> > > +    case JSON_OBJECT:
> > > +      {
> > > +        if (++lisp_eval_depth > max_lisp_eval_depth)
> > > +          xsignal0 (Qjson_object_too_deep);
> > > +        size_t size = json_object_size (json);
> > > +        if (FIXNUM_OVERFLOW_P (size))
> > > +          xsignal1 (Qoverflow_error,
> > > +                    build_string ("JSON object has too many elements"));
> > > +        Lisp_Object result = CALLN (Fmake_hash_table, QCtest, Qequal,
> > > +                                    QCsize, make_natnum (size));
> >
> > Likewise here: make_natnum does the equivalent test.
>
> It doesn't and can't.

Yes, it does:

  INLINE Lisp_Object
  make_natnum (EMACS_INT n)
  {
    eassert (0 <= n && n <= MOST_POSITIVE_FIXNUM);  <<<<<<<<<<<<<<<
    EMACS_INT int0 = Lisp_Int0;

We're not talking about the same thing. What if make_natnum is called with a value that doesn't fit in EMACS_INT?
Also, an assertion is the wrong tool here, because the overflowing value comes from user data.
