pdf-devel

[pdf-devel] Re: Initial API for the tokeniser module


From: Michael Gold
Subject: [pdf-devel] Re: Initial API for the tokeniser module
Date: Thu, 14 May 2009 18:38:54 -0400
User-agent: Mutt/1.5.18 (2008-05-17)

On Wed, May 13, 2009 at 20:43:16 +0200, address@hidden wrote:
...
>    I'm thinking that the "hex" attribute would make more sense as a
>    flag for pdf_token_write (if it's needed at all) -- it's not really
>    an attribute of a string, just an arbitrary decision on how to
>    write the string.
> 
> Agreed.
> 
>    In normal operation this should be decided automatically, maybe
>    based on a policy configured for the stream writer (e.g., "avoid
>    8-bit characters").
> 
> Since there are several ways to get the printed representation of a
> token, 

I guess you meant to write more here.

...
> If a token was written only partially because we got a PDF_EAGAIN
> condition, I can think of several alternatives:
> 
> a) The token structure is used to store the position of the last
>    character written in a stream. This is clearly inconvenient, since
>    a single token can be written into several token writers. The token
>    structure would need to store pointers to the token writers in
>    order to identify them. A mess.
> 
> b) The textual representation of the token is calculated in memory and
>    the token writer asks the stream whether it will be possible to
>    write that information. Since we will have an implementation limit
>    for the length of PDF strings, this option may be feasible. But
>    there are some problems with this approach: since several streams
>    can operate on the same media (the same fsys open file), we can
>    have a race condition between several token writers.
> 
> c) We implement a "transactional mode" in the streams, so if at some
>    point PDF_EAGAIN is returned it is possible to roll back the
>    operation. This would require a lot of work in the stream module,
>    and we would face similar problems there.
> 
> I think that this is the right time to formalize the usage of
> PDF_EAGAIN. The fsys module should be expanded to return that code and
> the implementation of the streams should be modified to propagate that
> value to the caller of pdf_stm_read/pdf_stm_write.

It looks to me like fsys-disk doesn't even provide a way to check for
errors (there are no ferror calls).  The documentation for
pdf_fsys_file_read says 0 will be returned on error, but 0 will also be
returned on EOF.  Some way to get the error code should be added (apart
from EAGAIN/EWOULDBLOCK, we may get errors like EIO, EFBIG, and ENOSPC;
and we should figure out what to do on EINTR).  Perhaps we should get
rid of the elem_size parameter too, since it may cause difficulties for
implementations that don't use fread (and the documentation for the
return value is wrong if elem_size != 1).

For now there's no implementation that can return EAGAIN since there's
no way to set a non-blocking mode.  At some point a function to do that
should be added.

The easiest way to handle EAGAIN in the token writer would be to have it
keep track of the number of bytes written; the caller would then try to
write the same token again (with the same flags) when the fd becomes
writable.

...
>    Constructors should be defined for each
>    type; based on the changes I made for pdf-obj.h previously, these could
>    be used:
>        pdf_status_t pdf_token_integer_new (int value,
>                                            pdf_token_t *obj);
> 
> We want to use pdf_i32_t here instead of an int. Also, I would rename
> the second parameter to "token".

OK.

>        pdf_status_t pdf_token_comment_new (const pdf_char_t *value,
>                                            pdf_size_t size,
>                                            pdf_bool_t continuation,
>                                            pdf_token_t *obj);
> 
> What is the purpose of the "continuation" flag?

This is used for comments too big to fit in the token reader's buffer --
it will split the comment into pieces, and set the continuation flag on
all pieces after the first.  Another way to handle this would be to
truncate the comment, but then we'd probably need a "truncated" flag.

(For other data types, we'll fail with an implementation limit when
there's too much data -- but it seems silly to fail because of a long
comment.)

...
>    Attribute accessors are also needed, and can be based on pdf-obj.h too.
> 
>    There were two additional tokeniser functions in my patch, intended for
>    dealing with streams (and I think they'll still be needed):
>        /* Advance to the first byte of a stream; see PDF32000 7.3.8.1
>         * note 2 (call this after reading the "stream" keyword) */
>        pdf_status_t pdf_tokeniser_end_at_stream(pdf_tokeniser_t tokr);
> 
>        /* Reset the state (e.g., after seeking past a stream) */
>        pdf_status_t pdf_tokeniser_reset_state(pdf_tokeniser_t tokr);
> 
>    (I can document all these extra functions if no changes are
>    needed.)
> 
> Please. Can you provide a patch against doc/gnupdf.texi incorporating
> the agreed points?
...

Yes, I'll work on that.

My patch added 2 new error types: PDF_EBADFILE (used when we know that
the file violates the PDF specs), and PDF_EIMPLLIMIT (the file may be
valid, but we can't read it because of an implementation limit).  Are
these OK?

Also, I had two flags for the token reader:
  _RET_COMMENTS  (return comments as tokens)
  _PDF11         (don't treat '#' as an escape character in names)
I didn't have a public function to set them, but one would need to be
added.  How about
  void pdf_token_reader_set_flags(int flags),
where 'flags' is a bitmask?  We may also need a _PDF11 flag for the
writer.

-- Michael
