Re: [Qemu-devel] [PATCH v2 01/11] qapi: add qapi2texi script

qemu-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH v2 01/11] qapi: add qapi2texi script

From:	Marc-André Lureau
Subject:	Re: [Qemu-devel] [PATCH v2 01/11] qapi: add qapi2texi script
Date:	Wed, 02 Nov 2016 09:25:24 +0000
Hi

On Fri, Oct 28, 2016 at 7:45 PM Markus Armbruster <address@hidden> wrote:

> Marc-André Lureau <address@hidden> writes:
>
> > As the name suggests, the qapi2texi script converts JSON QAPI
> > description into a standalone texi file suitable for different target
> > formats.
> >
> > It parses the following kind of blocks with some little variations:
> >
> >   ##
> >   # = Section
> >   # == Subsection
> >   #
> >   # Some text foo with *emphasis*
> >   # 1. with a list
> >   # 2. like that
> >   #
> >   # And some code:
> >   # | $ echo foo
> >   # | <- do this
> >   # | -> get that
> >   #
> >   ##
> >
> >   ##
> >   # @symbol
> >   #
> >   # Symbol body ditto ergo sum. Foo bar
> >   # baz ding.
> >   #
> >   # @arg: foo
> >   # @arg: #optional foo
> >   #
> >   # Returns: returns bla bla
> >   #
> >   #          Or bla blah
> >   #
> >   # Since: version
> >   # Notes: notes, comments can have
> >   #        - itemized list
> >   #        - like this
> >   #
> >   #        and continue...
> >   #
> >   # Example:
> >   #
> >   # <- { "execute": "quit" }
> >   # -> { "return": {} }
> >   #
> >   ##
> >
> > Thanks to the json declaration, it's able to give extra information
> > about the type of arguments and return value expected.
> >
> > Signed-off-by: Marc-André Lureau <address@hidden>
> > ---
> >  scripts/qapi.py        | 100 +++++++++++++++-
> >  scripts/qapi2texi.py   | 314
> +++++++++++++++++++++++++++++++++++++++++++++++++
> >  docs/qapi-code-gen.txt |  44 +++++--
> >  3 files changed, 446 insertions(+), 12 deletions(-)
> >  create mode 100755 scripts/qapi2texi.py
>
> My review will be easier to follow if you skip ahead qapi-code-gen.txt,
> then go to class QAPISchemaParser, then come back here.
>
> > diff --git a/scripts/qapi.py b/scripts/qapi.py
> > index 21bc32f..4efc7e7 100644
> > --- a/scripts/qapi.py
> > +++ b/scripts/qapi.py
> > @@ -122,6 +122,79 @@ class QAPIExprError(Exception):
> >              "%s:%d: %s" % (self.info['file'], self.info['line'],
> self.msg)
> >
> >
> > +class QAPIDoc:
>
> Old-style class.  Make this
>
>    class QAPIDoc(object):
>

ok


>
> > +    def __init__(self, comment):
> > +        self.symbol = None
> > +        self.comment = "" # the main symbol comment
>
> PEP8 wants at least two spaces before #, and I prefer the # in colum 32.
>

let's just remove that comment


>
> > +        self.args = OrderedDict()
> > +        # meta is for Since:, Notes:, Examples:, Returns:...
> > +        self.meta = OrderedDict()
> > +        # the current section to populate, array of [dict, key,
> comment...]
> > +        self.section = None
> > +
> > +        for line in comment.split('\n'):
>
> Makes me wonder whether the caller should pass a list of lines instead
> of a string that needs to be split into lines.
>
> > +            # remove multiple spaces
> > +            sline = ' '.join(line.split())
>
> Actually, this doesn't "remove multiple spaces", it squeezes sequences
> of whitespace into a single space character.  Rather inefficiently, I
> suspect, but let's not worry about that now.
>
> > +            # take the first word out
> > +            split = sline.split(' ', 1)
> > +            key = split[0]
> > +
> > +            if key.startswith("@"):
> > +                key = key[1:].rstrip(':')
>
> Treats the colon as optional.  I understand you're trying to be
> unintrusive, but since we're parsing comments already, we can make the
> parser do double duty and enforce consistent comment formatting.  But
> this isn't the place to discuss the format we want, review of
> qapi-code-gen.txt is.
>

Indeed, I was being permissive: my initial plan was rather to have
something quickly last year. It's simply comments, it does not have to be
rigorous as code to me. Now that we are taking >1.5y to get it in, we may
as well enforce more stuff from the start. Obviously that kind of feedback
would have been nice earlier.  (needless to say my motivation is lower than
when I started)

>
> > +                sline = split[1] if len(split) > 1 else ""
> > +                if self.symbol is None:
> > +                    # the first is the section symbol
> > +                    self.symbol = key
>
> Let's not say "the section symbol".  It's the symbol this APIDoc object
> documents.  The APIDoc object has sections.
>

ok


>
> > +                else:
> > +                    # else an arg
> > +                    self.start_section(self.args, key)
> > +            elif self.symbol and \
> > +                    key in ("Returns:",
> > +                            # special case for Since often used without:
> > +                            "Since:", "Since",
> > +                            # those are often singular or plural
> > +                            "Note:", "Notes:",
> > +                            "Example:", "Examples:"):
> > +                sline = split[1] if len(split) > 1 else ""
> > +                line = None
> > +                # new "meta" section
> > +                self.start_section(self.meta, key.rstrip(':'))
> > +
> > +            if self.section and self.section[1].startswith("Example"):
> > +                # example is verbatim
> > +                self.append_comment(line)
> > +            else:
> > +                self.append_comment(sline)
>
> Sections other than "Example" and "Examples" have whitespace squeezed.
> I believe this is the true reason for the squeezing.  There are a few
> other places that rely on it, but they could do just as well without.
>

Overall, it should help produce a more homogeneous output. Any issue with
that, or should I try to remove the whitespace squeeze?

>
> > +
> > +        self.end_section()
> > +
> > +    def append_comment(self, line):
> > +        """Adds a comment to the current section, or the symbol
> comment"""
> > +        if line is None:
> > +            return
> > +        if self.section is not None:
> > +            if self.section[-1] == "" and line == "":
> > +                self.end_section()
> > +            else:
> > +                self.section.append(line)
> > +        elif self.comment == "":
> > +            self.comment = line
> > +        else:
> > +            self.comment += "\n" + line
>
> Accumulating a list and joining at the end might be easier.
>

Yes, I am not trying to be the fastest python doc parser ;) I'll try to
improve it.


>
> > +
> > +    def end_section(self):
> > +        if self.section is not None:
> > +            dic = self.section[0]
>
> Calling a dictionary "dic" could be considered a dick move ;-P
>

 dict is built-in, let's call it "target"


> > +            key = self.section[1]
> > +            doc = "\n".join(self.section[2:])
> > +            dic[key] = doc
>
> Silently trashes previous value of dic[key].
>

As this is a loosy parser, it was by design ;). But if we want to enforce
things, it could as well throw an error.


>
> > +            self.section = None
> > +
> > +    def start_section(self, dic, key):
> > +        self.end_section()
> > +        self.section = [dic, key]  # .. remaining elems will be the doc
>
> Taken together:
>
> * The QAPI parser is hacked to accumulate and exfiltrate some comments
>   to the QAPIDoc parser.
>
> * A QAPIDoc can have a symbol, a symbol comment, and sections.
>
> * Certain patterns in the comment text start and end sections.
>
>   Text within a section belongs to the section.  Text outside sections
>   belongs to the symbol comment.  Unlikely to work unless the symbol
>   comment is entirely before the first section.
>
> * Two kinds of sections, argument and meta:
>
>   - An argument section is uniquely identified by the argument's name
>
>     Multiple argument sections with the same name make no sense.  The
>     code silently ignores all but the last one.
>
>   - A meta section is uniqely identified by a keyword
>
>     The keywords are: "Returns", "Since", "Note", "Notes", "Example",
>     "Examples".
>
>     Multiple meta sections with the same key make sense for some keys
>     ("Note") but not for others ("Returns", "Since").  Nevertheless, the
>     code silently ignores all but the last one.  You can't have two
>     "Example" sections, but you can have an "Example" and an "Examples"
>     section.  I'm afraid the data structure isn't quite right here.
>

> * No comment is ever rejected.  Comments adhering to the format we have
>   in mind generate correct documentation.  Mistakes may produce correct
>   documentation, subtly incorrect documentation, or obvious garbage.
>   When you get garbage (obvious or not), you have to guess what's wrong
>   with your comment.  "Fortunately", developers generally won't have to
>   do that, because they generally don't read the generated
>   documentation.
>
> General remarks:
>
> * I feel the only way to keep the API documentation in order as we
>   develop is to have the parser reject malformed API comments.
>
> ok


> * The only way to create a robust and maintainable parser is to specify
>   the language *first* and *rigorously*.
>

I am not convinced rigorous is necessary here. After all, plenty of
successful doc/markup languauges are not that rigorous.

>
> * Stacking parsers is not a good idea.  Figuring out one parser's
>   accepted language is hard enough.
>

Does that mean you would prefer to have the doc parsing inside
QAPISchemaParser? I can go in that direction if that's correct, but I fear
it will interfere with the json parsing...


>
> * For me, the parsing part of this patch is a prototype.  Possibly even
>   good enough to commit, with some work.  But it'll have to be redone
>   from first principles regardless.
>

While I agree with you that having a strict parser would be an improvement,
I don't feel it's necessary as a first approach. Since qapi & doc is
internal to qemu, it's easy enough to evolve without risking to break
outside-qemu users.


>
> > +
> > +
> >  class QAPISchemaParser(object):
> >
> >      def __init__(self, fp, previously_included=[], incl_info=None):
> > @@ -137,11 +210,14 @@ class QAPISchemaParser(object):
> >          self.line = 1
> >          self.line_pos = 0
> >          self.exprs = []
> > +        self.comment = None
> > +        self.apidoc = incl_info['doc'] if incl_info else []
> >          self.accept()
> >
> >          while self.tok is not None:
> >              expr_info = {'file': fname, 'line': self.line,
> > -                         'parent': self.incl_info}
> > +                         'parent': self.incl_info, 'doc': self.apidoc}
> > +            self.apidoc = []
> >              expr = self.get_expr(False)
> >              if isinstance(expr, dict) and "include" in expr:
> >                  if len(expr) != 1:
>
> Before your patch, expr_info holds the (top-level) expression's location
> plus the list of include directive locations that led to the expression.
>
> You extend it by key 'doc'.  Let's see what it's good for.
>
> At every (sub-)expression, we capture the current state of self.apidoc
> in expr_info['doc'], then reset self.apidoc to [].  Odd; I would expect
> API documentation to attach only to top-level expressions.
>
>
Isn't  QAPISchemaParser __init__ loop only for top-level exprs?

Looks like expr_info['doc'] and self.apidoc are lists, and self.apidoc
> accumulates documentation until we move it to the next expression's
> expr_info['doc'].  In other words, self.apidoc accumulates documentation
> for the next expression.
>
> yes


> I don't quite understand why you initialize self.apidoc the way you do:
> if this file is included, self.apidoc is initialized to the include
> directive's expr_info['doc'], else to the empty list.  What are you
> trying to accomplish by that?
>

We have comments associated only to expressions. Changing that would be
another major change. It accumulates previous comments across files
boundaries, so you can have:

##
# = Introduction
# blah blah

include foo.json (with exprs)

##
# @sym:
#...
##


The "Introduction" would be attached to the following foo.json expr. That
way it keeps read order.

>
> > @@ -162,6 +238,8 @@ class QAPISchemaParser(object):
> >                      inf = inf['parent']
> >                  # skip multiple include of the same file
> >                  if incl_abs_fname in previously_included:
> > +                    expr_info['doc'].extend(self.apidoc)
> > +                    self.apidoc = expr_info['doc']
> >                      continue
> >                  try:
> >                      fobj = open(incl_abs_fname, 'r')
>
> Before your patch, we indeed skip, as the comment says.  After your
> patch, we do something that's not yet obvious to me, then skip.  The
> comment isn't quite right anymore.
>
> Let's see what we do when we include a file a second time:
>
> 1. We append self.apidoc to expr_info['doc'].  But self.apidoc is []
> here, isn't it?
>
> 2. We assign the result back to self.apidoc.
>
> I think this is effectively
>
>                        self.apidoc = expr_info['doc']
>
> What are you trying to accomplish?
>

It should restore self.apidoc as if nothing happened, so the apidoc is
accumulated for the next expr in the current file. self.apidoc could have
accumulated since we call self.get_expr() (there is a small output diff I
remove remove the extend)


> > @@ -176,6 +254,12 @@ class QAPISchemaParser(object):
> >                               'info': expr_info}
> >                  self.exprs.append(expr_elem)
> >
> > +    def append_doc(self):
> > +        if self.comment:
> > +            apidoc = QAPIDoc(self.comment)
> > +            self.apidoc.append(apidoc)
> > +            self.comment = None
> > +
>
> Too many things called apidoc.  I'd elide the local variable:
>
>                self.apidoc.append(QAPIDoc(self.comment))
>
> Neither the role of self.comment nor self.apidoc are obvious just yet.
>
> self.apidoc is the accumulated QAPIDoc list (made from self.comment on
append_doc())
self.comment is the accumulated comment string since last append_doc(),

>      def accept(self):
> >          while True:
> >              self.tok = self.src[self.cursor]
> > @@ -184,8 +268,20 @@ class QAPISchemaParser(object):
> >              self.val = None
> >
> >              if self.tok == '#':
> > -                self.cursor = self.src.find('\n', self.cursor)
> > +                end = self.src.find('\n', self.cursor)
> > +                line = self.src[self.cursor:end+1]
>
>                    self.cursor = self.src.find('\n', self.cursor)
>                    line = self.src[self.pos:self.cursor+1]
>
> and drop the self.cursor = end below.
>
> @line isn't actually the source line, it's the source line with the
> initial '#' chopped off.  Hmm.
>
> Note that .find() can't fail, because we made sure self.src ends with
> '\n'.
>
>
ok


> > +                # start a comment section after ##
> > +                if line[0] == "#":
> > +                    if self.comment is None:
> > +                        self.comment = ""
> > +                # skip modeline
> > +                elif line.find("-*") == -1 and self.comment is not None:
> > +                    self.comment += line
>
> Aha: self.comment is either None or a string.  When it's a string, we're
> accumulating a (multi-line) comment there.  We start accumulating when a
> comment starts with '##'.  We stop by calling self.append_doc(), which
> processes the accumulated comment, and appends the result to the list
> self.apidoc.
>
> If a comment starting with '##' occurs while we're already accumulating,
> we silently ignore it.  I don't think that's a good idea.
>
>
It allows to write things like:

##
# = Intro

##
# == Foo
#
# blah
#

##
# @expr:
##
expr

Acceptable, unacceptable? It's not my main concern.

Likewise, we silently ignore any comment lines containing '-*'.  That's
> definitely not a good idea; your pattern is far too loose.  Why do we
> need to ignore such lines?
>

we have modelines:
 # -*- Mode: Python -*-

Again we could be stricter here, that wasn't my goal


> > +                if self.src[end] == "\n" and self.src[end+1] == "\n":
>
> If this is the last line, self.src[end+1] is out of bounds, I'm afraid.
>

Right, that could eventually happen, it didn't reach it though.


> > +                    self.append_doc()
>
> We stop accumulating when a comment line is followed by a blank line,
> and ...
>
> > +                self.cursor = end
> >              elif self.tok in "{}:,[]":
> > +                self.append_doc()
>
> ... when we recognize one of the tokens '{', '}', ':', ',' '[', ']'.
>
> This isn't a parser, it's a hack :)


> In this contrieved example, your code joins the two comments into one:
>
>     { 'command: 'contrived-example-1', 'gen':
>     ## lorem ipsum
>       false
>     # dolor sit amet
>     }
>
> In this one, it doesn't:
>
>     { 'command: 'contrived-example-2', 'gen':
>     ## lorem ipsum
>       false }
>     # dolor sit amet
>
> In my opinion, this needs a healthy dose of rigor.  Make API comments a
> proper *syntactical* construct: starts with a comment token whose text
> is exactly '##', extends until the first non-comment token.
>
> Two ways to parse this:
>
> * Lexically, i.e. gobble up the whole comment block as a single token.
>   Drawback: tokens spanning lines defeat the elegant treatment of
>   newline in one place.  Meh.
>
> * Treat it as an expression: have a method get_api_comment() that keeps
>   calling self.accept() until it sees a non-comment token.  To ease
>   getting the comment text, we can make self.accept() store it in
>   self.val.
>
> This also reduces modifiable state: self.comment goes away.  Non-local
> modifiable state is evil.
>

 Ok I'll try to treat it as an expression.


> Now let me try to summarize how this works with parsing details ignored:
>
> 0. We accumulate the next expression's API documentation in self.apidoc,
> a list of QAPIDoc.
>
> 1. We accumulate the current API comment in self.comment.
>
> 2. When it's complete, we call self.append_doc().  append_doc()
> processes self.comment into a QAPIDoc, which it appends to self.apidoc,
> then resets self.comment.
>
> 3. When we recognize the start of an expression, we move self.apidoc to
> the expressions expr_info['doc'].
>
> Taken together: each (sub-)expression has a list of QAPIDoc objects that
> got recognized since the last start of a (sub-)expression.
>
> Now go back to class QAPIDoc.
>
> >                  return
> >              elif self.tok == "'":
> >                  string = ''
> > diff --git a/scripts/qapi2texi.py b/scripts/qapi2texi.py
> > new file mode 100755
> > index 0000000..2706b16
> > --- /dev/null
> > +++ b/scripts/qapi2texi.py
> [Skipping the generator for now]
> > diff --git a/docs/qapi-code-gen.txt b/docs/qapi-code-gen.txt
> > index 5d4c2cd..e51ae4c 100644
> > --- a/docs/qapi-code-gen.txt
> > +++ b/docs/qapi-code-gen.txt
> > @@ -45,16 +45,13 @@ QAPI parser does not).  At present, there is no
> place where a QAPI
> >  schema requires the use of JSON numbers or null.
> >
> >  Comments are allowed; anything between an unquoted # and the following
> > -newline is ignored.  Although there is not yet a documentation
> > -generator, a form of stylized comments has developed for consistently
> > -documenting details about an expression and when it was added to the
> > -schema.  The documentation is delimited between two lines of ##, then
> > -the first line names the expression, an optional overview is provided,
> > -then individual documentation about each member of 'data' is provided,
> > -and finally, a 'Since: x.y.z' tag lists the release that introduced
> > -the expression.  Optional members are tagged with the phrase
> > -'#optional', often with their default value; and extensions added
> > -after the expression was first released are also given a '(since
> > +newline is ignored.  The documentation is delimited between two lines
> > +of ##, then the first line names the expression, an optional overview
> > +is provided, then individual documentation about each member of 'data'
> > +is provided, and finally, a 'Since: x.y.z' tag lists the release that
> > +introduced the expression.  Optional members are tagged with the
> > +phrase '#optional', often with their default value; and extensions
> > +added after the expression was first released are also given a '(since
> >  x.y.z)' comment.  For example:
> >
> >      ##
> > @@ -73,12 +70,39 @@ x.y.z)' comment.  For example:
> >      #           (Since 2.0)
> >      #
> >      # Since: 0.14.0
> > +    #
> > +    # Notes: You can also make a list:
> > +    #        - with items
> > +    #        - like this
> > +    #
> > +    # Example:
> > +    #
> > +    # <- { "execute": ... }
> > +    # -> { "return": ... }
> > +    #
> >      ##
> >      { 'struct': 'BlockStats',
> >        'data': {'*device': 'str', 'stats': 'BlockDeviceStats',
> >                 '*parent': 'BlockStats',
> >                 '*backing': 'BlockStats'} }
> >
> > +It's also possible to create documentation sections, such as:
> > +
> > +    ##
> > +    # = Section
> > +    # == Subsection
> > +    #
> > +    # Some text foo with *emphasis*
> > +    # 1. with a list
> > +    # 2. like that
> > +    #
> > +    # And some code:
> > +    # | $ echo foo
> > +    # | <- do this
> > +    # | -> get that
> > +    #
> > +    ##
> > +
> >  The schema sets up a series of types, as well as commands and events
> >  that will use those types.  Forward references are allowed: the parser
> >  scans in two passes, where the first pass learns all type names, and
>
> This may be good enough to guide users, but it's not rigorous enough to
> actually implement a parser.  The API doc language needs to be specified
> as a grammar.  If not here, then elsewhere.
>
>
api_comment = "##\n" comment "##\n"
comment = freeform_comment | symbol_comment
freeform_comment = { "#" text "\n" }
symbol_comment = "#" "@" name ":\n" { freeform | member | meta }
member = "#" '@' name ':' [ text ] freeform_comment
meta = "#" ( "Returns:", "Since:", "Note:", "Notes:", "Example:",
"Examples:" ) [ text ] freeform_comment

text = free-text markdown-like, "#optional" for members

The grammar gives a lot of free-form, and it's hard to describe in
details.  It's already barely readable imho and it's missing many details.
Do you think people will actually care and read it? Or they would just look
at example, copy, fix, and iterate. I would simply make the doc parser a
bit more robust to report anomalies.

If we want to have a very strict grammar, there is even more work needed to
cleanup the documentation. I don't fancy doing that work, it gives only low
benefits.

Now go back to QAPISchemaParser.
>
> --
Marc-André Lureau
[Prev in Thread]
Current Thread
[Next in Thread]
Re: [Qemu-devel] [PATCH v2 01/11] qapi: add qapi2texi script, Marc-André Lureau <=
- Re: [Qemu-devel] [PATCH v2 01/11] qapi: add qapi2texi script, Markus Armbruster, 2016/11/03
  - Re: [Qemu-devel] [PATCH v2 01/11] qapi: add qapi2texi script, Marc-André Lureau, 2016/11/03
Prev by Date: Re: [Qemu-devel] [PULL 6/6] curses: Use cursesw instead of curses
Next by Date: Re: [Qemu-devel] [PATCH] vl.c: move pidfile creation up
Previous by thread: Re: [Qemu-devel] [PULL 6/6] curses: Use cursesw instead of curses
Next by thread: Re: [Qemu-devel] [PATCH v2 01/11] qapi: add qapi2texi script
Index(es):
- Date
- Thread