Re: [pdf-devel] Object layer API


From: Michael Gold
Subject: Re: [pdf-devel] Object layer API
Date: Tue, 17 Feb 2009 16:32:39 -0500
User-agent: Mutt/1.5.18 (2008-05-17)

On Tue, Feb 17, 2009 at 00:43:48 +0100, address@hidden wrote:
>    Can you post a version online that doesn't contain so many pages?
>    The object layer chapter alone contains over 20 pages, which makes
>    it very annoying to read -- a single downloadable file would be
>    nice.
> 
> I thought that the info format is quite convenient. You can generate an
> info file and navigate into it using Emacs or any info reader, and you
> can send suggested improvements as patches.

I don't have a problem with the info format, but I find documents with
lots of tiny pages annoying for online viewing (especially when the
connection latency is high).  Many GNU manuals are available in multiple
formats -- see the GNU tar manual for example:
  http://www.gnu.org/software/tar/manual/index.html

Something like that would be nice for GNU PDF.  I could generate other
forms of documentation myself, but I guess other people would also find
it useful to have a choice of formats for viewing/downloading.

I can look into autogenerating the other formats if you're willing to
put them online.

>    pdf_obj_doc_open
>     - Would it be better to take an open stream, rather than a filesystem
>       and path?
> 
> Why? PDF documents are stored in files, one document per file,
> regardless of the kind of file system providing the file.

I'm not sure exactly, but it seems more general.  Maybe there are
certain workflows for which the application would want to open a PDF
file from an FD instead of a filename (e.g. to avoid a race condition,
or to read from an FD passed over a Unix socket).

>     - What's the point of header_string?
>        - The client probably doesn't care about the header -- they just
>          want the library to open a valid PDF.
> 
> The user may want to open a non-PDF file such as an FDF file, which
> uses the "%FDF-" header. Also, we may want to introduce new headers
> such as "%GNUPDF-".

Why would we introduce a new header?

For the standard headers, the client could check using the
pdf_obj_doc_get_header you proposed, but that still seems like a
low-level detail they shouldn't have to deal with.

We could have separate functions for opening PDF or FDF, or a function
that returns the type (PDF or FDF).
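
As a rough sketch of the second option, the library could classify the
file from the first bytes of its header and hand back a type rather than
a raw header string.  The names below (doc_kind, doc_kind_from_header)
are hypothetical and not part of the proposed API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names: not part of the proposed API. */
enum doc_kind { DOC_KIND_UNKNOWN, DOC_KIND_PDF, DOC_KIND_FDF };

/* Classify a file by the first five bytes of its header line. */
enum doc_kind
doc_kind_from_header (const char *buf, size_t len)
{
  assert (buf != NULL);
  if (len >= 5 && memcmp (buf, "%PDF-", 5) == 0)
    return DOC_KIND_PDF;
  if (len >= 5 && memcmp (buf, "%FDF-", 5) == 0)
    return DOC_KIND_FDF;
  return DOC_KIND_UNKNOWN;
}
```

The client would then only ever see DOC_KIND_PDF or DOC_KIND_FDF and
never needs to parse the header itself.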

>    pdf_obj_doc_save
>     - Why wouldn't you want an xref table?  It should probably be written
>       by default.
> 
> The cross reference table in FDF files is optional, for instance.
> But we could write the xref table by default and provide a flag to
> suppress it. Does that sound good to you?

Yes.  But are there other important differences with FDF? (having to
specify a stream's length with a direct object could be significant).
Maybe the client should just specify PDF or FDF.

>     - If file==NULL, do we try to keep the same inode number?
> 
> Hm? The file associated with the document is a pdf_fsys open file
> implemented by some fsys filesystem, so the existence of inode numbers
> is a mere hypothesis at this level... I am not sure if I understand
> the question.

To put it another way, do we overwrite the file in place via the open FD
when file==NULL?  This would mean we'd keep the same permissions and
inode number, and hard links would still point to that file (it would
also mean that if the original file had been unlinked, it wouldn't be
recreated -- it could only be accessed through open FDs or hard links).
I'd expect this if no filename was given.

If file!=NULL and the original had been unlinked, I'd expect it to be
recreated (with a new inode number); otherwise, it may or may not keep
the same inode number, permissions, etc., depending on how it's written.

Having the application pass a stream would be more flexible, since it
could open the FD with whatever flags it wanted (and handle errors
appropriately).

>     - Should there be a separate GC function instead of a flag?
> 
> Hm, I like this suggestion.
> 
> The GC phase is usually invoked just before saving the document, but
> a separate call would be quite convenient. Anyway, at the same time I
> would keep the flag for doc_save.

The function you suggested is fine.  I'm not sure what the benefit of a
flag is, but it's not a problem.

>     - Why does the client need to care about the header string?
>        - If they specify an old version, do we ensure the output is
>          compatible with that version?  What do we do with unrecognized
>          versions?
>        - I think we should let the user specify some kind of compatibility
>          level, and let the library handle details like this.
> 
> I believe that the compatibility check should be done at the document
> layer level. The client may want to create PDF-like files to store a
> collection of objects (as a temporary storage) or something like that.

There's one syntactic difference that might need to be passed down to a
lower level: PDF 1.2 and later encode '#' as "#23" in a name, but
earlier versions just wrote "#" (and don't support any characters that
would require escaping).
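
To illustrate why the writer needs to know the target version, here is a
minimal sketch of a name encoder.  The function names and the
escape_hash parameter are assumptions made for this example; the set of
escaped bytes (whitespace, delimiters, '#', and anything outside
'!'..'~') is my reading of the PDF 1.2+ rules:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if byte c must be written as #XX in a
   PDF 1.2+ name. */
int
name_byte_needs_escape (unsigned char c)
{
  if (c < 0x21 || c > 0x7e)
    return 1;
  return strchr ("#()<>[]{}/%", c) != NULL;
}

/* Write the name (without the leading '/') into out, NUL-terminated.
   escape_hash is nonzero for PDF 1.2 and later.  Returns the number of
   bytes written (excluding the NUL), or -1 if out is too small or the
   name cannot be represented in the pre-1.2 syntax. */
long
encode_name (const unsigned char *in, size_t in_len, int escape_hash,
             char *out, size_t out_size)
{
  static const char hex[] = "0123456789ABCDEF";
  size_t i, o = 0;

  assert (in != NULL && out != NULL);
  for (i = 0; i < in_len; i++)
    {
      unsigned char c = in[i];

      if (escape_hash && name_byte_needs_escape (c))
        {
          if (o + 3 >= out_size)
            return -1;
          out[o++] = '#';
          out[o++] = hex[c >> 4];
          out[o++] = hex[c & 0x0f];
        }
      else
        {
          /* Pre-1.2: '#' is an ordinary character, and nothing else
             that would need escaping can be written at all. */
          if (!escape_hash && c != '#' && name_byte_needs_escape (c))
            return -1;
          if (o + 1 >= out_size)
            return -1;
          out[o++] = (char) c;
        }
    }
  if (o >= out_size)
    return -1;
  out[o] = '\0';
  return (long) o;
}
```

With escape_hash set, the name "A#B" is written as "A#23B"; without it,
the same name is written unchanged and a name containing a space is
rejected.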

>     - How do we track modifications when _SAVE_FULL is not used?
> 
> Hm, I think that a SAVE_FULL implies to discard any previous content
> in the PDF file.

Yes, I was asking about the other case (it's an implementation detail,
but it may help to consider this in the API design).

>    [3.2.3]
>    pdf_obj_doc_get_id
>     - Rather than pdf_char_t*/pdf_size_t pairs, we could use a pdf_obj_t
>       with string type.
> 
> But then that object would be associated with a document.

Is that a problem?  The IDs are actually string objects (though they may
be indirect).

The original version is OK too -- but if you keep that version, you
should specify whether it copies the strings.  The behaviour when no IDs
are present should also be specified (null pointers could be returned in
this case -- PDF_ERROR isn't good enough, since then it wouldn't be
possible to detect things like I/O errors).
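
Something along these lines is what I have in mind.  This is only a
sketch with made-up types (my_doc, my_status_t) standing in for the real
ones, and the io_failed field just simulates a read failure:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the library's types and status codes. */
typedef enum { MY_OK = 0, MY_ERROR = 1 } my_status_t;

struct my_doc
{
  const unsigned char *id[2];   /* NULL when the trailer has no /ID */
  size_t id_len[2];
  int io_failed;                /* simulates a failure reading the trailer */
};

/* Returns MY_ERROR only for real failures.  A document without IDs is
   reported as MY_OK with both pointers set to NULL.  The pointers refer
   to storage owned by the document; nothing is copied. */
my_status_t
my_doc_get_id (const struct my_doc *doc,
               const unsigned char **id1, size_t *len1,
               const unsigned char **id2, size_t *len2)
{
  assert (doc != NULL && id1 != NULL && len1 != NULL
          && id2 != NULL && len2 != NULL);
  if (doc->io_failed)
    return MY_ERROR;
  *id1 = doc->id[0];
  *len1 = doc->id_len[0];
  *id2 = doc->id[1];
  *len2 = doc->id_len[1];
  return MY_OK;
}
```

The caller can then tell "no IDs" (MY_OK, null pointers) apart from
"could not read the IDs" (MY_ERROR).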

>    pdf_obj_doc_set_dirty
>        "If the dirty flag of a document is set then its contents are saved
>         before the document is closed."
>     - Does this mean doc_close will write a file?  That doesn't seem right.
> 
> The dirty flag is useful to track if the document needs
> saving.

True, if we automatically set it when the document changes (although the
application should be able to implement it easily too).  Will we do
that?  It would require collections to set the document's dirty flag
whenever they are modified.

> What is wrong with implicitly saving the document in doc_close before
> actually closing the document?

It potentially makes error-handling difficult.  In particular, with
certain errors we'll want to keep the document handle valid so the
client can retry the operation (e.g. with ENOSPC or EWOULDBLOCK).
doc_close is also likely to be called from a destructor when working in
higher-level languages, and error handling is difficult in destructors.

It's not documented in doc_close either.

>    [3.3.2]
>    pdf_obj_copy
>     - If copy_indirect is false, what happens when an indirect object is
>       seen?
> 
> What about:
> 
> @item PDF_FALSE 
> The indirectly referenced objects in @var{source} are
> not copied to @var{dest_doc}. This means that the copied object will
> lack these entries.

I don't really like this.  Wouldn't it destroy the meaning of the
collection to remove those elements?  Can you give an example of why
this would be useful?

Maybe the copy should just fail if an indirect object is seen (assuming
we need to provide this option at all).

>    pdf_obj_destroy
>        "This function is a nop if obj is a direct scalar type or the Null
>         object."
>     - Why would it be a nop?  Scalar objects need to be freed too.
> 
> My idea is to implement a pdf_obj_t variable as a struct like:
> 
> struct pdf_obj_s
> {
>   int type;
> 
>   union {
>     pdf_i32_t number;
>     pdf_real_t real;
>     void *pointer; /* For direct containers and indirect objects */
>   } value;
> };
> 
> typedef struct pdf_obj_s pdf_obj_t;
> 
> Then, direct scalar types (and the null object) would be allocated on
> the stack. Thus pdf_obj_destroy would be a nop. Anyway, I agree that
> it may not be wise to make the client aware of that implementation
> detail.

OK.  Is there a particular reason to optimise numeric types at the
expense of other types? (if pdf_obj_t isn't a pointer, it will be more
expensive to pass around).

>    pdf_obj_enum
>     - Non-iterable types should return an error.
> 
> We could return PDF_FALSE in that case.

PDF_FALSE and PDF_OK are both 0.  I think EBADDATA or EINVOBJ would make
sense.

>     - Why is a callback used?  This will often be annoying to work with,
>       especially given C's lack of closures (users will need to define and
>       populate a struct, cast client_data to it, etc.).  It will also make
>       interactions with other languages more difficult, e.g. if a client
>       wants to throw an exception.
> 
> The alternative would be to use iterators. Callbacks tend
> to be somewhat inconvenient. A 'pdf_obj_iterator_t' would do, or
> we could even use a pdf_list_t to return a list of pointers to
> objects, and thus we could use pdf_list_iterator_t. What do you think?

Like pdf_list_t, I think it should be possible to iterate over a
collection without allocating heap memory.  We could use a similar
iterator structure, but I wouldn't want to allocate a new pdf_list_t.

The types will probably use pdf_list_t internally anyway, so the
iterators can just be wrappers around the pdf_list_iterator_t functions
(but the dictionary iterator could return a key and a value).
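
For illustration, a dictionary iterator of this kind could look roughly
like the following.  The my_* types are simplified stand-ins (a flat
array instead of a pdf_list_t, int values instead of objects), not the
proposed interface:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified dictionary: a flat array of entries. */
struct my_entry { const char *key; int value; };
struct my_dict { const struct my_entry *entries; size_t count; };

/* The iterator is a plain struct, so it can live on the caller's stack. */
struct my_dict_iter { const struct my_dict *dict; size_t pos; };

void
my_dict_iter_init (struct my_dict_iter *it, const struct my_dict *dict)
{
  assert (it != NULL && dict != NULL);
  it->dict = dict;
  it->pos = 0;
}

/* Stores the next key/value pair and returns 1, or returns 0 at the end. */
int
my_dict_iter_next (struct my_dict_iter *it, const char **key, int *value)
{
  assert (it != NULL && key != NULL && value != NULL);
  if (it->pos >= it->dict->count)
    return 0;
  *key = it->dict->entries[it->pos].key;
  *value = it->dict->entries[it->pos].value;
  it->pos++;
  return 1;
}

/* Example client: sums the values with no callback and no allocation. */
int
my_dict_sum (const struct my_dict *dict)
{
  struct my_dict_iter it;
  const char *key;
  int value, sum = 0;

  my_dict_iter_init (&it, dict);
  while (my_dict_iter_next (&it, &key, &value))
    sum += value;
  return sum;
}
```

The loop in my_dict_sum keeps its state in ordinary local variables,
which is exactly what the callback version makes awkward.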

>    pdf_obj_get_doc
>     - PDF_EBADDATA could be used instead of PDF_EINVOBJ (elsewhere too).
> 
> PDF_EINVOBJ is more explicit, but using PDF_EBADDATA would be more
> consistent with the rest of the API. I don't have strong feelings
> about it.

I think either way is fine too -- but if PDF_EINVOBJ isn't really needed
anywhere, removing it would simplify the interface slightly.

>    pdf_obj_get_id
>     - When is an object ID assigned?  If the user creates an indirect
>       object, but hasn't saved the file, do we already know the ID?
> 
> Yes. An object ID is allocated by the document at object-creation
> time.

Will it cause any problems assigning the numbers this early?  For
example, linearization has these requirements:
  "Objects stored within object streams shall be given the highest range
   of object numbers within the main and first-page cross-reference
   sections."
  "The first group ... shall be numbered sequentially, starting at the
   first object number after the last number of the second group."
  [PDF32000 F.3.1]

Either the client would need to create their objects sequentially, or
we'd need to renumber them when saving (which could confuse a client if
it had saved some old IDs).

>    [3.3.5]
>     - The document doesn't explain what object collections are, but you
>       stated in your email:
>    >   The object documents support the notion of "object collections",
>    >   that are eventually translated into one or more linked object
>    >   streams.
> 
>     - Why would the client need to manage them?
> 
> Because the underlying object streams are useful to store related
> objects that will need each other. The object layer does not have
> enough information to decide what objects are "related" and thus be
> able to maintain the locality principle.
> 
> For instance, the document layer may decide to store all the objects
> containing the contents of a single page in an "object collection",
> that will translate into one (or several linked) compressed object
> stream.

That's required for linearized documents:
  "The first group shall consist of the document catalogue, other
   document-level objects, and all objects belonging to the first page
   of the document." [PDF32000 F.3.1]

The garbage collector will need to track down object references -- I
wonder if we could do something similar to find the appropriate object
relationships.  Maybe rather than adding individual objects, we could
mark an array/dict as belonging to a certain set of objects, and have
all objects it references become part of that set.
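
A minimal sketch of that marking step, assuming a toy object table in
which each object simply lists the indices of the objects it references
(my_obj and my_mark_set are invented for this example):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REFS 4
#define NO_SET   0

/* Hypothetical object record: each object lists the indices of the
   objects it references and the set it has been assigned to. */
struct my_obj
{
  size_t refs[MAX_REFS];
  size_t nrefs;
  int set;                      /* NO_SET until claimed */
};

/* Assign root and everything reachable from it to set, skipping objects
   that already belong to some set.  Returns the number of objects
   claimed.  Marking before recursing makes reference cycles harmless;
   recursion depth is bounded by the number of objects. */
size_t
my_mark_set (struct my_obj *objs, size_t nobjs, size_t root, int set)
{
  size_t i, claimed = 1;

  assert (objs != NULL && root < nobjs && set != NO_SET);
  if (objs[root].set != NO_SET)
    return 0;
  objs[root].set = set;
  for (i = 0; i < objs[root].nrefs; i++)
    claimed += my_mark_set (objs, nobjs, objs[root].refs[i], set);
  return claimed;
}
```

Objects shared between two roots end up in whichever set claims them
first, so the order in which roots are marked would need to be defined.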

Another option would be to store every indirect object in a collection
(rather than giving a pointer to the document, you'd specify a pointer
to the collection when creating an indirect object).  The document would
then have a set of collections.

>    [3.3.6 -- 3.3.14]
> 
>     - Why does each _new method have an indirect flag?  Another option
>       would be to let the user create a direct object, and then create a
>       reference to it when necessary (and we could also turn direct objects
>       into indirect objects when writing, where the spec requires it).
>     - If an indirect object is created, will the pdf_obj_*_value methods
>       resolve the reference to return the value?
>       (if so, my comments on pdf_obj_get_type apply here too)
> 
> The indirect flag is an indication to pdf_obj_*_new to "register" the
> object into the document's xref.
> 
> The pdf_obj_*_value functions would not need to resolve the reference:
> the object structure should contain its value (scalar) or a pointer to
> it (non-scalar).

Right, I was confusing indirect objects with object references here.
But we do seem to be missing a function that resolves references
(i.e. gets the indirect object associated with a PDF_OBJ_REF object).
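
Roughly, the missing function might behave like this sketch.  The object
model here is deliberately tiny and hypothetical (a table indexed by
object number, no generation numbers); a reference to an undefined
object resolves to null, which is how the spec says readers should treat
it:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified object model. */
enum my_type { MY_NULL, MY_INT, MY_REF };

struct my_obj
{
  enum my_type type;
  long value;                   /* integer value, or the target object
                                   number when type is MY_REF */
};

struct my_doc
{
  struct my_obj *table;         /* indexed by object number */
  size_t size;
};

static const struct my_obj my_null = { MY_NULL, 0 };

/* Return the indirect object a reference points to.  Non-reference
   objects are returned unchanged, and a reference to an object that is
   not in the table resolves to the null object. */
const struct my_obj *
my_resolve (const struct my_doc *doc, const struct my_obj *obj)
{
  assert (doc != NULL && obj != NULL);
  if (obj->type != MY_REF)
    return obj;
  if (obj->value < 0 || (size_t) obj->value >= doc->size)
    return &my_null;
  return &doc->table[obj->value];
}
```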

>    pdf_obj_null_new
>     - Why is the null constructor different from the rest?
> 
> There is only one null object (conceptually speaking) and it cannot be
> indirect.
> 
>    pdf_obj_boolean_new
>     - It seems silly to associate a boolean with a document, or allow
>       indirect boolean objects if indirect nulls aren't possible.
> 
> It is perfectly possible to create an indirect boolean value in a
> document, such as:
...
> On the other hand, the null object cannot be indirect.

Why?  7.3.10 in PDF32000 states "Any object in a PDF file may be
labelled as an indirect object".

Conceptually speaking, we could also say that there's only one 'true'
object and one 'false' object (and perhaps one null string object, one
'0' object, etc.).  We could optimise the code so it returns a pointer
to a static object in these cases -- but associating state with each
object (e.g. directness, parent document, array/dict membership,
compressibility) may preclude this.

>    pdf_obj_name_value, _string_value
>     - It should be possible to get at the value without copying it.
> 
> Using a const pointer, you mean?

Yes.  Allocating memory means the client must deal with ENOMEM, and must
also free the memory; a pointer would be simpler.  (Though for some
cases, it may be convenient to have a method that returns a copy too.)

>    pdf_obj_string_hex_p, _hex_set
>     - I don't see any reason for this flag -- it's trivial to decide
>       between escaped or hex strings when writing, and there should be no
>       reason to care when reading.
> 
> How would you decide? Both representations have their advantages, and
> the client should decide which one to use.

Actually, I don't see where hex strings would ever be needed -- they're
always at least as long as non-hex strings, if we don't mind including
binary characters directly.  Should we have a mode that encodes
everything as ASCII?

If we want to escape non-ASCII characters, we can just see which would
be shorter (an octal escape in a non-hex string takes 4 bytes, and the
same character in a hex string would take 2).  Both types of strings
will be parsed into identical objects, and it's such a low-level detail
I doubt any client would change this flag.
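
The comparison is simple enough to sketch.  This assumes an ASCII-only
output mode, always uses 3-digit octal escapes, and escapes every
parenthesis and backslash (balanced parentheses could be left alone);
the function names are made up:

```c
#include <assert.h>
#include <stddef.h>

/* Length of s written as a literal string "(...)" in ASCII-only mode:
   '(', ')' and '\' take a backslash, and bytes outside the printable
   ASCII range become 3-digit octal escapes. */
size_t
literal_string_len (const unsigned char *s, size_t n)
{
  size_t i, len = 2;            /* the enclosing parentheses */

  assert (s != NULL || n == 0);
  for (i = 0; i < n; i++)
    {
      if (s[i] == '(' || s[i] == ')' || s[i] == '\\')
        len += 2;
      else if (s[i] < 0x20 || s[i] > 0x7e)
        len += 4;
      else
        len += 1;
    }
  return len;
}

/* Length of an n-byte string written as "<...>": two digits per byte. */
size_t
hex_string_len (size_t n)
{
  return 2 + 2 * n;
}

/* Nonzero if the hex form is strictly shorter than the literal form. */
int
prefer_hex (const unsigned char *s, size_t n)
{
  return hex_string_len (n) < literal_string_len (s, n);
}
```

Plain text stays literal, while a string made mostly of binary bytes
comes out shorter in hex.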

Also, if the object layer is responsible for encrypting the document,
the string type should be chosen based on the encrypted text, which the
client would have no way of knowing.  pdf_obj_doc_save_params_s defines
some encryption parameters, but decryption info seems to be missing.

>    pdf_obj_stream_new
...
>     - source_length is redundant since attrs_dict must contain the length
>       (if we have both, we'll have to explain which takes precedence and
>       whether attrs_dict will be modified by the library to add a /Length
>       element).
> 
> Note that the base stream @var{stm} contains unfiltered data, and that
> @var{source_length} is the length in octets of the unfiltered
> data. If the stream dictionary contains a filter chain, we usually
> do not know the length of the filtered data until we have actually
> filtered it. The Length entry in @var{attrs_dict} is the length of the
> filtered data.

That makes sense, but the pdf_obj_stream_length function could be
renamed to make it more clear that it's not the same thing.

Will the value always be known?  PDF32000 table 5 states for /DL, "for
some stream filters, it may not be possible to determine this value
precisely".  Why not just read until EOF?
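
In other words, something like the loop below, where a stdio FILE stands
in for the library's stream type (an assumption for the sake of a
self-contained example; drain_stream is not a proposed function):

```c
#include <assert.h>
#include <stdio.h>

/* Count the bytes available from stm by reading to EOF, instead of
   trusting a length supplied by the caller.  Returns -1 on a read
   error.  A real implementation would pass each chunk on to the filter
   chain rather than discard it. */
long
drain_stream (FILE *stm)
{
  unsigned char buf[4096];
  size_t n;
  long total = 0;

  assert (stm != NULL);
  while ((n = fread (buf, 1, sizeof buf, stm)) > 0)
    total += (long) n;
  return ferror (stm) ? -1 : total;
}
```

The unfiltered length then falls out of the read loop and the caller
never has to supply it.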

>     - Maybe it should be possible to pass a callback function that will
>       open and return a stream when necessary (since it may be impractical
>       to open all streams ahead of time).
>        - Or this type of callback could be handled at the stream layer, by
>          creating a dummy stream type that executes callbacks to do the
>          real work.
> 
> Since 'pdf_obj_stream_new' internally creates an intermediate
> representation of the stream on disk, @var{stm} is used just to get
> the contents. You can get a reading stream using
> 'pdf_obj_stream_open_stm'.

I don't really like the idea of the library creating temporary files on
its own.  Opening a file in a library can cause security issues, for
example:
  http://udrepper.livejournal.com/20407.html
(Linux 2.6.27 is needed to protect against this, and I'm sure there are
operating systems without this feature.)
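
Assuming the concern is descriptors leaking into child processes across
exec, the safe pattern on systems that support it is to set the
close-on-exec flag atomically at open time.  A minimal sketch
(open_cloexec is a made-up helper, and O_CLOEXEC is not available
everywhere):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open path read-only with the close-on-exec flag set atomically, so
   there is no window in which another thread could fork and exec with
   the descriptor still inheritable.  Returns -1 on failure. */
int
open_cloexec (const char *path)
{
  assert (path != NULL);
  return open (path, O_RDONLY | O_CLOEXEC);
}
```

If the application opens the file itself and passes a stream in, it can
make this kind of choice; a library opening files on its own cannot know
what the application needs.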

It will also waste time and space if the stream is already a file or
could be generated on the fly.

-- Michael
