
[pdf-devel] Re: Initial API for the tokeniser module


From: jemarch
Subject: [pdf-devel] Re: Initial API for the tokeniser module
Date: Wed, 13 May 2009 20:43:16 +0200
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/23.0.92 (i686-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)

Hi Michael.

Many thanks for your comments.

   >   - pdf_token_t
   > 
   >     Data type representing a typed PDF token. Each pdf_token_t can
   >     have a number of attributes that may influence how the token is
   >     written (such as the use of the hex representation for strings).
   > 
   >     NOTE: an alternative would be to use flags to pdf_token_write().
   ...

   Are hex strings the only use case for attributes?  Generally
   there's not much choice about how to write a token, except for
   deciding which string (or name) characters to escape, when to use
   hex strings, etc.

   I'm thinking that the "hex" attribute would make more sense as a
   flag for pdf_token_write (if it's needed at all) -- it's not really
   an attribute of a string, just an arbitrary decision on how to
   write the string.

Agreed.

   In normal operation this should be decided automatically, maybe
   based on a policy configured for the stream writer (e.g., "avoid
   8-bit characters").

Since there are several ways to get the printed representation of a
token, 

   >   - Functions for reading and writing streams
   > 
   >     pdf_token_read (reader, &token)
   >     pdf_token_write (writer, token)

   These functions (and _new, _destroy) seem reasonable, but we should
   determine what happens when pdf_token_write writes a partial token
   to the stream (i.e., does the user have to call the function again
   with the same token?  And who keeps track of the position?).

If a token was written only partially because we got a PDF_EAGAIN
condition, I can think of several alternatives:

a) The token structure stores the position of the last character
   written to a stream. This is clearly inconvenient, since a single
   token can be written to several token writers; the token structure
   would need to store pointers to the token writers in order to tell
   them apart. A mess.

b) The textual representation of the token is calculated in memory and
   the token writer asks the stream whether it will be possible to
   write that much data. Since we will have an implementation limit on
   the length of PDF strings, this option may be feasible. But there
   is a problem with this approach: since several streams can operate
   on the same media (the same fsys open file), we can have a race
   condition between several token writers.

c) We implement a "transactional mode" in the streams, so if at some
   point PDF_EAGAIN is returned it is possible to roll back the
   operation. This would require a lot of work in the stream module,
   and we would face similar problems there.

I think that this is the right time to formalize the usage of
PDF_EAGAIN. The fsys module should be expanded to return that code and
the implementation of the streams should be modified to propagate that
value to the caller of pdf_stm_read/pdf_stm_write.

   >   - Functions to manipulate token variables
   > 
   >     type = pdf_token_get_type (token)
   >     pdf_token_set_type (token, type)
   >     pdf_token_set_attribute (token, attribute)
   >     ...

   In the example for pdf_token_get_type, the return value of
   pdf_token_read should be checked (if it failed, _get_type would get an
   uninitialised pointer and probably crash).

Agreed.

   _set_type doesn't make sense.

Agreed.

   Constructors should be defined for each
   type; based on the changes I made for pdf-obj.h previously, these could
   be used:
       pdf_status_t pdf_token_integer_new (int value,
                                           pdf_token_t *obj);

We want to use pdf_i32_t here instead of an int. Also, I would rename
the second parameter to "token".

       pdf_status_t pdf_token_comment_new (const pdf_char_t *value,
                                           pdf_size_t size,
                                           pdf_bool_t continuation,
                                           pdf_token_t *obj);

What is the purpose of the "continuation" flag?

       /* is _dup needed? */
       pdf_status_t pdf_token_dup (const pdf_token_t obj,
                                   pdf_token_t *new);

It is convenient to have a _dup function.

   Attribute accessors are also needed, and can be based on pdf-obj.h too.

   There were two additional tokeniser functions in my patch, intended for
   dealing with streams (and I think they'll still be needed):
       /* Advance to the first byte of a stream; see PDF32000 7.3.8.1
        * note 2 (call this after reading the "stream" keyword) */
       pdf_status_t pdf_tokeniser_end_at_stream(pdf_tokeniser_t tokr);

       /* Reset the state (e.g., after seeking past a stream) */
       pdf_status_t pdf_tokeniser_reset_state(pdf_tokeniser_t tokr);

   (I can document all these extra functions if no changes are
   needed.)

Please do. Could you provide a patch against doc/gnupdf.texi
incorporating the agreed points?

   > The idea of this module is to make it independent from the parser that
   > will be implemented in the object layer. Also, it will be used by the
   > type 4 functions implementation in pdf-fp-func.[ch]. It would be quite
   > useful for the user, also.
   > 
   > At this point it is critical to identify the needed token types.

   They're listed in the original patch (enum pdf_token_type_e):
     WSPACE      (not needed, but may be useful to someone)
     COMMENT     (for handling "%PDF-" headers, "%%EOF" footers, etc.)
     KEYWORD     (any alphanumeric string not matching another token type;
                  includes "null", content stream ops, etc.)
     INTEGER
     REAL
     NAME        (starts with "/")
     STRING
     DICT_START  ("<<")
     DICT_END    (">>")
     ARRAY_START ("[")
     ARRAY_END   ("]")
     PROC_START  ("{", for type 4 functions)
     PROC_END    ("}")

Ok.

   Your list included BOOLEAN and NULL types, but these should be tokenised
   as type KEYWORD.  The parser will convert them to the proper object
   types.

Ok.

   INDIRECT was also listed.  Should this be REF?  That would be tokenised
   as INTEGER, INTEGER, KEYWORD ("R").

Yes, it was for the "R". I agree that it should be tokenised as
KEYWORD.

-- 
Jose E. Marchesi
address@hidden

GNU Project
http://www.gnu.org
