
[pdf-devel] Re: Initial API for the tokeniser module


From: jemarch
Subject: [pdf-devel] Re: Initial API for the tokeniser module
Date: Fri, 15 May 2009 14:17:59 +0200
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/23.0.92 (i686-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)

Hi Michael.

   >    I'm thinking that the "hex" attribute would make more sense as a
   >    flag for pdf_token_write (if it's needed at all) -- it's not really
   >    an attribute of a string, just an arbitrary decision on how to
   >    write the string.
   > 
   > Agreed.
   > 
   >    In normal operation this should be decided automatically, maybe
   >    based on a policy configured for the stream writer (e.g., "avoid
   >    8-bit characters").
   > 
   > Since there are several ways to get the printed representation of a
   > token, 

   I guess you meant to write more here.

Heh, yes :)

I wanted to say that since there are several ways to get the printed
representation of a token, we should provide fine-grained control to
the client at this level. Upper layers of the library will take care
of, for example, PDF x.y compatibility.

The same applies to the interpretation of the textual forms for
tokens.
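
As a rough illustration of "hex" being a write-time decision rather
than an attribute of the string, the sketch below prints the same bytes
either as a literal string or as a hex string depending on a flag. The
names hx_string_print and HX_WRITE_HEX are hypothetical and not part of
the library; the sketch escapes only parentheses and backslashes and
does nothing special for end-of-line or non-printable bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical writer flag: request the hex form of a string. */
enum { HX_WRITE_HEX = 1 << 0 };

/* Print the n bytes at s into out (capacity cap) as a PDF string.
   Returns the printed length, or 0 if out is too small.  The form
   is chosen by the caller through flags, not stored in the string. */
static size_t
hx_string_print (const unsigned char *s, size_t n, int flags,
                 char *out, size_t cap)
{
  size_t i, o = 0;

  assert (out != NULL && (s != NULL || n == 0));

  /* Worst case for both forms: two characters per byte, two
     delimiters and the terminating NUL. */
  if (cap < 2 * n + 3)
    return 0;

  if (flags & HX_WRITE_HEX)
    {
      out[o++] = '<';
      for (i = 0; i < n; i++)
        o += (size_t) sprintf (out + o, "%02X", (unsigned int) s[i]);
      out[o++] = '>';
    }
  else
    {
      out[o++] = '(';
      for (i = 0; i < n; i++)
        {
          /* Escape the characters that delimit or escape literals. */
          if (s[i] == '(' || s[i] == ')' || s[i] == '\\')
            out[o++] = '\\';
          out[o++] = (char) s[i];
        }
      out[o++] = ')';
    }

  out[o] = '\0';
  return o;
}
```

A policy such as "avoid 8-bit characters" could then be layered on top
by scanning the bytes and setting the flag before calling the printer.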

   > If a token was written only partially because we got a PDF_EAGAIN
   > condition, I can think of several alternatives:
   > 
   > a) The token structure is used to store the position of the last
   >    character written in a stream. This is clearly inconvenient, since
   >    a single token can be written into several token writers. The token
   >    structure would need to store pointers to the token writers in
   >    order to identify them. A mess.
   > 
   > b) The textual representation of the token is calculated in memory and
   >    the token writer asks the stream if it will be possible to write
   >    that information. Since we will have an implementation limit for
   >    the length of the PDF strings, this option may be feasible. But
   >    there are some problems with this approach: since several streams
   >    can operate on the same media (the same fsys open file) we can have
   >    a race condition between several token writers.
   > 
   > c) We implement a "transactional mode" in the streams, so if at some
   >    point PDF_EAGAIN is returned it is possible to roll back the
   >    operation. This would require a lot of work in the stream module,
   >    and we would face similar problems there.
   > 
   > I think that this is the right time to formalize the usage of
   > PDF_EAGAIN. The fsys module should be expanded to return that code and
   > the implementation of the streams should be modified to propagate that
   > value to the caller of pdf_stm_read/pdf_stm_write.

   It looks to me like fsys-disk doesn't even provide a way to check for
   errors (there are no ferror calls).  The documentation for
   pdf_fsys_file_read says 0 will be returned on error, but 0 will also be
   returned on EOF.  Some way to get the error code should be added (apart
   from EAGAIN/EWOULDBLOCK, we may get errors like EIO, EFBIG, and ENOSPC;
   and we should figure out what to do on EINTR).

Yes, the pdf_fsys_file_read/write interface should be extended to
return a pdf_status_t value.

   Perhaps we should get rid of the elem_size parameter too, since it
   may cause difficulties for implementations that don't use fread
   (and the documentation for the return value is wrong if elem_size
   != 1).

After reading the emails in this regard, I have to agree. The existence
of elem_size and elem_count in pdf_fsys_file_read/write is inherited
from the libc file read/write calls, and we can have quite different
types of filesystems.
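
To make the two points above concrete, here is a minimal sketch of a
read call that returns a status and reports the byte count through an
output parameter, with no elem_size. The names rd_status_t and
rd_file_read are hypothetical stand-ins, not the actual pdf_fsys
interface, and the status values are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical status codes; the real pdf_status_t values may differ. */
typedef enum { RD_OK = 0, RD_EEOF, RD_ERROR } rd_status_t;

/* Read up to size bytes from fp into buf.  The number of bytes read
   goes to *read_bytes and the outcome is the return value, so a
   short read at end of file can be told apart from an I/O error. */
static rd_status_t
rd_file_read (FILE *fp, void *buf, size_t size, size_t *read_bytes)
{
  assert (fp != NULL && buf != NULL && read_bytes != NULL);

  /* Element size is fixed at 1, so the count is always in bytes. */
  *read_bytes = fread (buf, 1, size, fp);

  if (*read_bytes == size)
    return RD_OK;
  if (ferror (fp))
    return RD_ERROR;
  return RD_EEOF;   /* short read without an error: end of file */
}
```

A non-blocking backend would add a further status for the EAGAIN case;
this disk-style sketch has no way to produce it.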

   For now there's no implementation that can return EAGAIN since there's
   no way to set a non-blocking mode.  At some point a function to do that
   should be added.

Maybe it would be good to write a little read-only HTTP filesystem and
incorporate it into the library.  That would allow us to test the
non-blocking capabilities of pdf_fsys and also the read-in-advance
functions.  It would also work as an example of non-disk based
filesystems for people interested in writing their own fsys
implementations.

I think that Aleksander was interested in implementing it.

   The easiest way to handle EAGAIN in the token writer would be to have it
   keep track of the number of bytes written; the caller would then try to
   write the same token again (with the same flags) when the fd becomes
   writable.

The number of bytes written in the last write operation... that makes
sense.
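
The idea can be sketched as follows, assuming a writer that remembers
how many bytes of the current token have already gone out and a caller
that retries with the same printed form. All names here (wr_sink_t,
wr_writer_t, wr_token_write) are hypothetical, and the sink is only a
stand-in for a non-blocking stream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { WR_OK = 0, WR_EAGAIN } wr_status_t;

/* Hypothetical sink standing in for a non-blocking stream: it
   accepts at most room more bytes before it "would block". */
typedef struct
{
  char data[64];
  size_t len;
  size_t room;
} wr_sink_t;

static size_t
wr_sink_write (wr_sink_t *s, const char *buf, size_t n)
{
  size_t k = n < s->room ? n : s->room;

  if (k > sizeof s->data - s->len)
    k = sizeof s->data - s->len;
  memcpy (s->data + s->len, buf, k);
  s->len += k;
  s->room -= k;
  return k;
}

/* The writer, not the token, keeps the resume position. */
typedef struct
{
  wr_sink_t *sink;
  size_t written;   /* bytes of the current token already written */
} wr_writer_t;

/* Write the printed form text of a token.  On WR_EAGAIN the caller
   must call again later with the same text and length. */
static wr_status_t
wr_token_write (wr_writer_t *w, const char *text, size_t len)
{
  assert (w != NULL && w->written <= len);

  w->written += wr_sink_write (w->sink, text + w->written,
                               len - w->written);
  if (w->written < len)
    return WR_EAGAIN;

  w->written = 0;   /* token complete: ready for the next one */
  return WR_OK;
}
```

Because the position lives in the writer, the same token can still be
written through several writers without storing anything in the token.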

   >        pdf_status_t pdf_token_comment_new (const pdf_char_t *value,
   >                                            pdf_size_t size,
   >                                            pdf_bool_t continuation,
   >                                            pdf_token_t *obj);
   > 
   > What is the purpose of the "continuation" flag?

   This is used for comments too big to fit in the token reader's buffer --
   it will split the comment into pieces, and set the continuation flag on
   all pieces after the first.  Another way to handle this would be to
   truncate the comment, but then we'd probably need a "truncated" flag.

Ok. The continuation flag seems ok.

   (For other data types, we'll fail with an implementation limit when
   there's too much data -- but it seems silly to fail because of a long
   comment.)

Please don't forget to document any implementation limit you may
introduce in the "Implementation Limits" chapter in the reference
manual.

   >    Attribute accessors are also needed, and can be based on pdf-obj.h too.
   > 
   >    There were two additional tokeniser functions in my patch, intended for
   >    dealing with streams (and I think they'll still be needed):
   >        /* Advance to the first byte of a stream; see PDF32000 7.3.8.1
   >         * note 2 (call this after reading the "stream" keyword) */
   >        pdf_status_t pdf_tokeniser_end_at_stream(pdf_tokeniser_t tokr);
   > 
   >        /* Reset the state (e.g., after seeking past a stream) */
   >        pdf_status_t pdf_tokeniser_reset_state(pdf_tokeniser_t tokr);
   > 
   >    (I can document all these extra functions if no changes are
   >    needed.)
   > 
   > Please. Can you provide a patch against doc/gnupdf.texi incorporating
   > the agreed points?
   ...

   Yes, I'll work on that.

Great, thanks.

   My patch added 2 new error types: PDF_EBADFILE (used when we know that
   the file violates the PDF specs), and PDF_EIMPLLIMIT (the file may be
   valid, but we can't read it because of an implementation limit).  Are
   these OK?

I find them appropriate.

   Also, I had two flags for the token reader:
     _RET_COMMENTS  (return comments as tokens)
     _PDF11         (don't treat '#' as an escape character in names)
   I didn't have a public function to set them, but one would need to be
   added.  How about
     void pdf_token_reader_set_flags(int flags),
   where 'flags' is a bitmask?  We may also need a _PDF11 flag for the
   writer.

I would use _SHARP_ESCAPE instead of _PDF11. Upper layers will take
care of PDF x.y portability.

Also, I think that it would be good to use the PDF_ prefix for these
flags.
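
Something along these lines, as a sketch only. The full flag names and
the reader argument are assumptions (the thread settles just the PDF_
prefix and the _RET_COMMENTS/_SHARP_ESCAPE suffixes), and the polarity
of the second flag is assumed here to mean "treat '#' as an escape in
names", which is the inverse of the original _PDF11 flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical spellings of the reader flags, usable as a bitmask. */
enum
{
  PDF_TOKEN_RET_COMMENTS = 1 << 0,  /* return comments as tokens */
  PDF_TOKEN_SHARP_ESCAPE = 1 << 1   /* assumed: '#' escapes in names */
};

/* Stand-in for the token reader; only the flags field matters here. */
typedef struct
{
  int flags;
} fl_reader_t;

/* Replace the whole flag set of the reader. */
static void
fl_reader_set_flags (fl_reader_t *reader, int flags)
{
  assert (reader != NULL);
  reader->flags = flags;
}

/* Return nonzero if flag is set in the reader. */
static int
fl_reader_has_flag (const fl_reader_t *reader, int flag)
{
  assert (reader != NULL);
  return (reader->flags & flag) != 0;
}
```

An upper layer targeting PDF 1.1 would then simply leave the escape
flag out of the mask it passes down.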

-- 
Jose E. Marchesi
address@hidden

GNU Project
http://www.gnu.org



