Re: [Qemu-devel] [PATCH 1/8] cutils: unsigned int parsing functions

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 1/8] cutils: unsigned int parsing functions

From:	Markus Armbruster
Subject:	Re: [Qemu-devel] [PATCH 1/8] cutils: unsigned int parsing functions
Date:	Fri, 18 Jan 2013 19:06:28 +0100
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/24.1 (gnu/linux)

Eduardo Habkost <address@hidden> writes:

> On Fri, Jan 18, 2013 at 11:01:14AM +0100, Markus Armbruster wrote:
>> Eduardo Habkost <address@hidden> writes:
>> 
>> > There are lots of duplicate parsing code using strto*() in QEMU, and
>> > most of that code is broken in one way or another. Even the visitors
>> > code have duplicate integer parsing code[1]. This introduces functions
>> > to help parsing unsigned int values: parse_uint() and parse_uint_full().
>> >
>> > Parsing functions for signed ints and floats will be submitted later.
>> >
>> > parse_uint_full() has all the checks made by opts_type_uint64() at
>> > opts-visitor.c:
>> >
>> >  - Check for NULL (returns -EINVAL)
>> >  - Check for negative numbers (returns -ERANGE)
>> >  - Check for empty string (returns -EINVAL)
>> >  - Check for overflow or other errno values set by strtoll() (returns
>> >    -errno)
>> >  - Check for end of string (reject invalid characters after number)
>> >    (returns -EINVAL)
>> >
>> > parse_uint() does everything above except checking for the end of the
>> > string, so callers can continue parsing the remainder of string after
>> > the number.
>> >
>> > Unit tests included.
>> >
>> > [1] string-input-visitor.c:parse_int() could use the same parsing code
>> >     used by opts-visitor.c:opts_type_int(), instead of duplicating that
>> >     logic.
>> [...]
>> > diff --git a/util/cutils.c b/util/cutils.c
>> > index 80bb1dc..7f09251 100644
>> > --- a/util/cutils.c
>> > +++ b/util/cutils.c
>> > @@ -270,6 +270,85 @@ int64_t strtosz(const char *nptr, char **end)
>> >      return strtosz_suffix(nptr, end, STRTOSZ_DEFSUFFIX_MB);
>> >  }
>> >  
>> > +/* Try to parse an unsigned integer
>> > + *
>> > + * Error checks done by the function:
>> > + * - NULL pointer will return -EINVAL.
>> > + * - Empty strings will return -EINVAL.
>> 
>> Same for strings containing only whitespace.
>
> Oh, you're right.
>
>> 
>> > + * - Overflow errors or other errno values  set by strtoull() will
>> 
>> Extra space before 'set'.
>
> Oops.
>
>> 
>> > + *   return -errno (-ERANGE in case of overflow).
>> > + * - Differently from strtoull(), values starting with a minus sign are
>> > + *   rejected (returning -ERANGE).
>> > + *
>> > + * Sets endptr to point to the first invalid character.
>> 
>> Mention that unlike strtol() & friends, endptr must not be null?
>> Probably not necessary.
>
> I believe it is implicit if I document it as "Sets *endptr". But worth
> documenting as it is different from strtoull() behavior.
>
>> 
>> The two ERANGE cases treat endptr differently: it's either set to point
>> just behind the out-of-range number, or to point to the beginning of the
>> out-of-range number.  Ugh.
>
> This should be fixed in the newer version I sent.
>
>> 
>> > +                                                        Callers may rely
>> > + * on *value and *endptr to be always set by the function, even in case of
>> > + * errors.
>> 
>> You neglect to specify what gets assigned to *value in error cases.
>> 
>
> Undefined.  :-)
>
> Anyway, I agree it not very useful to document that "*value and *endptr
> will be always set" if it is undefined. I will try to reword it.

Yes, please.

I'm not fond of undefined behavior, unless there's a good reason.

If you don't want to assign a defined value, consider not assigning
anything.

>> Your list of error checks isn't quite complete.  Here's my try:
>> 
>> /**
>>  * Parse unsigned integer
>>  *
>>  * @s: String to parse
>>  * @value: Destination for parsed integer value
>>  * @endptr: Destination for pointer to first character not consumed
>>  * @base: integer base, between 2 and 36 inclusive, or 0
>>  *
>>  * Parsed syntax is just like strol()'s: arbitrary whitespace, a single
>>  * optional '+' or '-', an optional "0x" if @base is 0 or 16, one or
>>  * more digits.
>
> The newer version has '-' as _not_ part of the syntax.
>
>>  *
>>  * If @s is null, or @base is invalid, or @s doesn't start with an
>>  * integer in the syntax above, set address@hidden to 0, address@hidden to 
>> @s, and
>>  * return -EINVAL.
>
> This matches the behavior of the newer version. But I would like to keep
> *value documented as undefined in case of errors.
>
>>  *
>>  * Set @endptr to point right beyond the parsed integer.
>>  *
>>  * If the integer is negative, set address@hidden to 0, and return -ERANGE.
>
> This isn't the case in the newer version.
>
>>  * If the integer overflows unsigned long long, set address@hidden to
>>  * ULLONG_MAX, and return -ERANGE.
>>  * Else, set address@hidden to the parsed integer, and return 0.
>>  */
>
> Thanks for taking the time to write this. I hate having to look up
> what's the right syntax to use in docstrings, so I often just write
> plain comments before functions.  :-)

I don't particularly like GLib-style doc-strings, but it's what we use
when we use anything, which is rarely.

> I have a minor problem with the documentation above: as a developer, the
> most important question I have is: what's the difference between using
> parse_uint() and using strtoull() directly? But maybe it is a good
> thing: instead of referencing the strtoull() specification (and an
> implementation detail), now we have a well-defined and well-documented
> behavior.

I understand why you're asking for the difference to stroull().  Trouble
is strtoull() is complex, and often misunderstood.

>> > + *
>> > + * The 'base' parameter has the same meaning of 'base' on strtoull().
>> > + *
>> > + * Returns 0 on success, negative errno value on error.
>> > + */
>> > +int parse_uint(const char *s, unsigned long long *value, char **endptr,
>> > +               int base)
>> > +{
>> > +    int r = 0;
>> > +    char *endp = (char *)s;
>> > +    unsigned long long val = 0;
>> > +
>> > +    if (!s) {
>> > +        r = -EINVAL;
>> > +        goto out;
>> > +    }
>> > +
>> > +    /* make sure we reject negative numbers: */
>> > +    while (isspace((unsigned char)*s)) {
>> > +        ++s; endp++;
>> 
>> Mixing pre- and post-increment like that is ugly.  Recommend to stick to
>> post-increment.
>
> Agreed. I blame the copy & paste I made from opts-visitor. Later I added
> the postfix increment without noticing how ugly it looked like
>
> Anyway, this was changed in the newer patch version.
>
>> 
>> I'd set endp after the loop.  Right before goto.
>
> Fixed in the newer version.
>
>> 
>> Or simply change the specification to set *endptr to point beyond the
>> integer instead of to the '-'.  I took the liberty to do exactly that in
>> my proposed function comment.
>
> The newer version was changed to return -EINVAL and set endptr to the
> beginning of the string.
>
>> 
>> > +    }
>> > +    if (*s == '-') {
>> > +        r = -ERANGE;
>> > +        goto out;
>> > +    }
>> 
>> This creates the special case that made me go "ugh" above.  Suggest to
>> drop the if here, and check right after strtoull() instead, as shown
>> below.  That way, we get the same endptr behavior for all out-of-range
>> numbers.
>
> Is returning -EINVAL acceptable for you, as well? In your proposal we
> consider "-1234" part of the language but out-of-range. Returning
> -EINVAL, we consider "-1234" not part of the language, and invalid
> input. The newer version returns -EINVAL.

I prefer -ERANGE on underflow, for symmetry with parsing *signed*
integers.  A similar parsing function for signed integers would surely
assign LLONG_MIN and return -ERANGE then.

A secondary, weak argument: caller can more easily distinguish the
failure modes:

* -EINVAL: @s invalid, @base invalid (both programming errors), or @s
  doesn't start with an integer.  I use "integer" in the mathematical
  sense here.

* -ERANGE, *value == 0: integer underflows *value

* -ERANGE, *value == ULLONG_MAX: integer overflows *value.

If we lump underflow into the -EINVAL case, you have to check *endptr ==
'-' to find it.

The argument is weak, because I don't have a caller ready that actually
wants to find it.

>> > +
>> > +    errno = 0;
>> > +    val = strtoull(s, &endp, base);
>> 
>>        if (*s = '-') {
>>            r = -ERANGE;
>>            val = 0;
>>            goto out;
>>        }
>
> This has another bug: makes endptr point to the middle of the string in
> case we are parsing "   xxx" or "   -1234".

It makes endptr point right behind the integer, doesn't it?

When parsing "   xxx", it points to the first 'x' (we skipped the spaces
already).

When parsing "   -1234", it points behind '4'.

>> > +    if (errno) {
>> > +        r = -errno;
>> > +        goto out;
>> > +    }
>> > +
>> > +    if (endp == s) {
>> > +        r = -EINVAL;
>> > +        goto out;
>> > +    }
>> > +
>> > +out:
>> > +    *value = val;
>> > +    *endptr = endp;
>> > +    return r;
>> > +}
>> > +
>> > +/* Try to parse an unsigned integer, making sure the whole string is 
>> > parsed
>> > + *
>> > + * Have the same behavior of parse_uint(), but with an additional check
>> > + * for additional data after the parsed number (in that case, the function
>> > + * will return -EINVAL).
>> 
>> What's assigned to *value then?
>
> Undefined.  :-)
>
>
>> 
>> > + */
>> 
>> Alternatively, you could make into parse_uint() do that check when
>> endptr is null, and drop this function.  Matter of taste.
>
> I considered doing that, but I believe it would be too subtle.
>
>> 
>> > +int parse_uint_full(const char *s, unsigned long long *value, int base)
>> > +{
>> > +    char *endp;
>> > +    int r;
>> > +
>> > +    r = parse_uint(s, value, &endp, base);
>> > +    if (r < 0) {
>> > +        return r;
>> > +    }
>> > +    if (*endp) {
>> 
>> Answering my own question: the parsed integer is assigned to *value then.
>
> I prefer to document it as undefined. If you want to use the parsed
> integer even if there's additional data after it, you should use
> parse_int() instead.

Maybe.

What I want is a proper function contract.  That includes how *value is
changed.  *Especially* when it's an unspecified change.

>> > +        return -EINVAL;
>> > +    }
>> > +
>> > +    return 0;
>> > +}
>> > +
>> >  int qemu_parse_fd(const char *param)
>> >  {
>> >      int fd;

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [Qemu-devel] [PATCH 1/8 v4] cutils: unsigned int parsing functions, (continued)
- Re: [Qemu-devel] [PATCH 0/8] -numa option parsing fixes (v3), Eric Blake, 2013/01/16
- Re: [Qemu-devel] [PATCH for-1.4 0/8] -numa option parsing fixes (v3), Eduardo Habkost, 2013/01/28
  - Re: [Qemu-devel] [PATCH for-1.4 0/8] -numa option parsing fixes (v3), Eduardo Habkost, 2013/01/31

Prev by Date: Re: [Qemu-devel] bdrv_co_flush_to_disk() don't flush as expected
Next by Date: Re: [Qemu-devel] [PATCH v3] block: Monitor command commit neglects to report some errors
Previous by thread: Re: [Qemu-devel] [PATCH 1/8 v6] cutils: unsigned int parsing functions
Next by thread: Re: [Qemu-devel] [PATCH 0/8] -numa option parsing fixes (v3)
Index(es):
- Date
- Thread