Re: [pdf-devel] Object layer API


From: jemarch
Subject: Re: [pdf-devel] Object layer API
Date: Wed, 18 Feb 2009 23:21:41 +0100
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/23.0.60 (i686-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)

   >    Can you post a version online that doesn't contain so many pages?
   >    The object layer chapter alone contains over 20 pages, which makes
   >    it very annoying to read -- a single downloadable file would be
   >    nice.
   > 
   > I thought that the info format was quite convenient. You can generate an
   > info file and navigate into it using Emacs or any info reader, and you
   > can send suggested improvements as patches.

   I don't have a problem with the info format, but I find documents with
   lots of tiny pages annoying for online viewing (especially when the
   connection latency is high).  Many GNU manuals are available in multiple
   formats -- see the GNU tar manual for example:
     http://www.gnu.org/software/tar/manual/index.html

   Something like that would be nice for GNU PDF.  I could generate other
   forms of documentation myself, but I guess other people would also find
   it useful to have a choice of formats for viewing/downloading.

   I can look into autogenerating the other formats if you're willing to
   put them online.

I am working on it: it is possible to generate all the formats in an
automatic way.

   >    pdf_obj_doc_open
   >     - Would it be better to take an open stream, rather than a filesystem
   >       and path?
   > 
   > Why? PDF documents are stored in files, one document per file,
   > regardless of the kind of file system providing the file.

   I'm not sure exactly, but it seems more general.  Maybe there are
   certain workflows for which the application would want to open a PDF
   file from an FD instead of a filename (e.g. to avoid a race condition,
   or to read from an FD passed over a Unix socket).

The main reason for pdf_obj_doc_open to operate on pdf_fsys_file_t
objects and not directly on a pdf_stm_t is that the object document
implementation should be able to open more than one base stream
reading from the pdf_fsys_file.

This is quite convenient: the object document may be using an
unfiltered base stream to parse the objects at the top level (calling
the parser, that in turn uses the provided base stream) and then
create a new filtered base stream on the same document to read the
contents of a filtered stream object, for instance. Also, the object
document implementation has access to the properties and functions of
both the filesystem and the open file.

However, changing 'pdf_obj_doc_open' to accept a pdf_fsys_file_t
directly instead of a pdf_fsys_t+path pair seems OK to me, given that
we also provide a convenience function 'pdf_obj_doc_open_path' that
will cover most of the practical cases.

   >     - What's the point of header_string?
   >        - The client probably doesn't care about the header -- they just
   >          want the library to open a valid PDF.
   > 
   > The user may want to open non-pdf file such as a FDF file, that uses
   > the "%FDF-" header. Also, we may want to introduce new headers such as
   > "%GNUPDF-".

   Why would we introduce a new header?

Why not? We may want to introduce new pdf-like formats, complementary
to PDF (like FDF does).

   For the standard headers, the client could check using the
   pdf_obj_doc_get_header you proposed, but that still seems like a
   low-level detail they shouldn't have to deal with.

We may also introduce a pdf_obj_doc_get_type returning a value from an
enumeration of standard headers, including an "Unknown" value. In
the latter case the user can then use 'pdf_obj_doc_get_header' to get
the header string.

If the user does not care about the header she can pass NULL in the
expected extension parameter.
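
A rough sketch of how the header check and a type enumeration could
fit together. Everything here is hypothetical: the enum, its values
and both function names are stand-ins, not the GNU PDF API, and only
the "%PDF-" and "%FDF-" prefixes come from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enumeration of known header prefixes. */
enum doc_type { DOC_TYPE_PDF, DOC_TYPE_FDF, DOC_TYPE_UNKNOWN };

/* Classify the first line of a file by its header prefix. */
static enum doc_type
classify_header (const char *first_line)
{
  assert (first_line != NULL);
  if (strncmp (first_line, "%PDF-", 5) == 0)
    return DOC_TYPE_PDF;
  if (strncmp (first_line, "%FDF-", 5) == 0)
    return DOC_TYPE_FDF;
  return DOC_TYPE_UNKNOWN;
}

/* Return 1 if the file header is acceptable to the caller.
   A NULL expected header means the caller does not care. */
static int
header_matches (const char *first_line, const char *expected)
{
  assert (first_line != NULL);
  if (expected == NULL)
    return 1;
  return strncmp (first_line, expected, strlen (expected)) == 0;
}
```

Note that the version digits after the dash are never looked at, so
"%PDF-1.4" and "%PDF-1.7" are classified the same way.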

   >    pdf_obj_doc_save
   >     - Why wouldn't you want an xref table?  It should probably be written
   >       by default.
   > 
   > The cross reference table in FDF files is optional, for instance.
   > But we could write down the xref table by default and provide a flag
   > to not write it. Does that sound good to you?

   Yes.

I changed the documentation for pdf_obj_doc_save to create the xref
table by default, and PDF_OBJ_SAVE_CREATE_CR has been renamed to
PDF_OBJ_SAVE_DONT_CREATE_CR.
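
To illustrate why the flag is phrased negatively: with an opt-out bit,
passing no flags at all gives the default behaviour (xref table
written). The names and bit values below are made up for the example
and are not the real PDF_OBJ_SAVE_* definitions.

```c
#include <assert.h>

/* Hypothetical save flags; the values are assumptions. */
enum save_flags
{
  SAVE_DEFAULT        = 0,
  SAVE_DONT_CREATE_CR = 1 << 0,  /* skip the cross reference table */
  SAVE_DO_GC          = 1 << 1   /* collect garbage before saving */
};

/* Return 1 if the xref table should be written for FLAGS. */
static int
should_write_xref (unsigned flags)
{
  /* Reject bits this sketch does not know about. */
  assert ((flags & ~(unsigned) (SAVE_DONT_CREATE_CR | SAVE_DO_GC)) == 0);
  return (flags & SAVE_DONT_CREATE_CR) == 0;
}
```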

   >     - If file==NULL, do we try to keep the same inode number?
   > 
   > Hm? The file associated with the document is a pdf_fsys open file
   > implemented by some fsys filesystem, so the existence of inode numbers
   > is a mere hypothesis at this level... I am not sure if I understand
   > the question.

   To put it another way, do we overwrite the file in place via the open FD
   when file==NULL?  This would mean we'd keep the same permissions and
   inode number, and hard links would still point to that file (it would
   also mean that if the original file had been unlinked, it wouldn't be
   recreated -- it could only be accessed through open FDs or hard links).
   I'd expect this if no filename was given.

   If file!=NULL and the original had been unlinked, I'd expect it to be
   recreated (with a new inode number); otherwise, it may or may not keep
   the same inode number, permissions, etc., depending on how it's written.

It depends on the pdf_fsys_t used, but I believe that the answer to
both questions is _yes_ if using the default (disk) filesystem.

   >     - Should there be a separate GC function instead of a flag?
   > 
   > Hm, I like this suggestion.
   > 
   > The GC phase is usually invoked just before saving the document, but
   > a separate call would be quite convenient. Anyway, at the same time I
   > would keep the flag for doc_save.

   The function you suggested is fine.  I'm not sure what the benefit of a
   flag is, but it's not a problem.

The documentation for the new function is now in place in the
reference manual.

   >     - Why does the client need to care about the header string?
   >        - If they specify an old version, do we ensure the output is
   >          compatible with that version?  What do we do with unrecognized
   >          versions?
   >        - I think we should let the user specify some kind of compatibility
   >          level, and let the library handle details like this.
   > 
   > I believe that the compatibility check should be done at the document
   > layer level. The client may want to create PDF-like files to store a
   > collection of objects (as a temporary storage) or something like that.

   There's one syntactic difference that might need to be passed down to a
   lower level: PDF 1.2 and later encode '#' as "#23" in a name, but
   earlier versions just wrote "#" (and don't support any characters that
   would require escaping).
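
   The difference could be handled by a name writer along these lines.
   This is only a sketch: encode_name and its parameters are invented
   for the example, and it assumes the target is always some PDF 1.x
   version identified by its minor number.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write NAME (without the leading '/') into OUT as it would appear in
   a file targeting PDF 1.MINOR.  Returns the number of bytes written
   (excluding the terminating NUL), or -1 if OUT is too small or the
   name cannot be represented in the target version. */
static int
encode_name (const char *name, int minor, char *out, size_t out_size)
{
  size_t n = 0;

  assert (name != NULL && out != NULL);
  if (out_size == 0)
    return -1;

  for (; *name != '\0'; name++)
    {
      unsigned char c = (unsigned char) *name;
      /* '#', delimiters and anything outside '!'..'~' need escaping. */
      int escape = (c == '#' || c < 0x21 || c > 0x7e
                    || strchr ("()<>[]{}/%", c) != NULL);

      if (minor >= 2 && escape)
        {
          if (n + 3 >= out_size)
            return -1;
          snprintf (out + n, 4, "#%02X", (unsigned) c);
          n += 3;
        }
      else if (minor < 2 && escape && c != '#')
        return -1;              /* no escape mechanism before 1.2 */
      else
        {
          if (n + 1 >= out_size)
            return -1;
          out[n++] = (char) c;  /* plain byte, or a literal '#' pre-1.2 */
        }
    }

  out[n] = '\0';
  return (int) n;
}
```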

There has been a misunderstanding here, I think: the header string
does not cover the version numbers. Look at the example in the
documentation of pdf_obj_doc_open:

@item header_string
A string like @code{"%PDF-"}.

and in the Usage example of the same function:

   if (pdf_obj_doc_open (PDF_OBJ_DOC_OPEN_DO_REPAIR,
                         NULL, /* Use the default filesystem */
                         file_path,
                         "%PDF-",
                         &doc) != PDF_OK)

Now I realize that I wrote:

   save_params.header = "%PDF-1.7";

in the pdf_obj_doc_save example. Sorry about that. It should be:

   save_params.header = "%PDF-";

I fixed it in the reference manual.

   >    [3.2.3]
   >    pdf_obj_doc_get_id
   >     - Rather than pdf_char_t*/pdf_size_t pairs, we could use a pdf_obj_t
   >       with string type.
   > 
   > But then that object would be associated with a document.

   Is that a problem?  The IDs are actually string objects (though they may
   be indirect).

No, it is not a problem. I agree that it is convenient. In the case
of an indirect object, pdf_obj_doc_get_id should fetch the
referenced object and return it.

I changed the documentation of the function in the refman.

   >    pdf_obj_doc_set_dirty
   >        "If the dirty flag of a document is set then its contents are saved
   >         before the document is closed."
   >     - Does this mean doc_close will write a file?  That doesn't seem right.
   > 
   > The dirty flag is useful to track if the document needs
   > saving.

   True, if we automatically set it when the document changes (although the
   application should be able to implement it easily too).  Will we do
   that?  It would require collections to set the document's dirty flag
   whenever they are modified.

Yes. We should indicate which functions set the dirty flag of the
document in their documentation in the refman.

   > What is wrong with implicitly saving the document in doc_close before
   > actually closing the document?

   It potentially makes error-handling difficult.  In particular, with
   certain errors we'll want to keep the document handle valid so the
   client can retry the operation (e.g. with ENOSPC or EWOULDBLOCK).
   doc_close is also likely to be called from a destructor when working in
   higher-level languages, and error handling is difficult in destructors.

I agree.
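
A toy model of the behaviour we seem to agree on: saving is explicit
and may fail while leaving the handle usable, and closing never
writes anything. The struct and the doc_* functions are invented for
the illustration (the disk_full field just simulates ENOSPC); they
are not the pdf_obj_doc_* functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical document handle. */
struct doc
{
  int dirty;      /* set by any modifying operation */
  int open;       /* handle is still valid */
  int disk_full;  /* test knob: makes the next save fail */
};

/* Any function that changes the document sets the dirty flag. */
static void
doc_modify (struct doc *d)
{
  assert (d != NULL && d->open);
  d->dirty = 1;
}

/* Explicit save.  On failure the handle stays open and dirty, so the
   client can retry later. */
static int
doc_save (struct doc *d)
{
  assert (d != NULL && d->open);
  if (d->disk_full)
    return ENOSPC;
  d->dirty = 0;
  return 0;
}

/* Close never saves, so it has no I/O error to report and is safe to
   call from a destructor. */
static void
doc_close (struct doc *d)
{
  assert (d != NULL);
  d->open = 0;
}
```

The client decides what to do with a dirty document before closing
it, typically by checking the flag and calling the save function.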

   > struct pdf_obj_s
   > {
   >   int type;
   > 
   >   union {
   >     pdf_i32_t number;
   >     pdf_real_t real;
   >     void *pointer; /* For direct containers and indirect objects */
   >   } value;
   > };
   > 
   > typedef struct pdf_obj_s pdf_obj_t;
   > 
   > Then, direct scalar types (and the null object) would be allocated on
   > the stack. Thus pdf_obj_destroy would be a no-op. Anyway, I agree that
   > it may not be wise to make the client aware of that implementation
   > detail.

   OK.  Is there a particular reason to optimise numeric types at the
   expense of other types? (if pdf_obj_t isn't a pointer, it will be more
   expensive to pass around).

Not just numeric types, but also strings. Note that a pdf_obj_t
containing a direct object is just a copy of the value of an object
contained in some container in the document.
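
For reference, a compilable variation of the by-value idea. The type
and function names are placeholders, int32_t and float stand in for
pdf_i32_t and pdf_real_t, and the array case is only there to show
where destroy would have real work to do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum obj_type { OBJ_NULL, OBJ_INTEGER, OBJ_REAL, OBJ_ARRAY };

/* Tagged union passed around by value. */
typedef struct obj_s
{
  enum obj_type type;
  union
  {
    int32_t number;
    float real;
    void *pointer;  /* for direct containers and indirect objects */
  } value;
} obj_t;

/* Direct scalar: lives wherever the caller puts it, no heap use. */
static obj_t
obj_integer (int32_t n)
{
  obj_t o;
  o.type = OBJ_INTEGER;
  o.value.number = n;
  return o;
}

/* Container: owns heap storage through the pointer member. */
static obj_t
obj_array (size_t capacity)
{
  obj_t o;
  o.type = OBJ_ARRAY;
  o.value.pointer = calloc (capacity, sizeof (obj_t));
  return o;
}

/* Frees storage for containers; frees nothing for scalars. */
static void
obj_destroy (obj_t *o)
{
  assert (o != NULL);
  if (o->type == OBJ_ARRAY)
    {
      free (o->value.pointer);
      o->value.pointer = NULL;
    }
  o->type = OBJ_NULL;
}
```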

   >    pdf_obj_enum
   >     - Non-iterable types should return an error.
   > 
   > We could return PDF_FALSE in that case.

   PDF_FALSE and PDF_OK are both 0.  I think EBADDATA or EINVOBJ would make
   sense.

Sorry, I meant PDF_ERROR.

   >     - Why is a callback used?  This will often be annoying to work with,
   >       especially given C's lack of closures (users will need to define and
   >       populate a struct, cast client_data to it, etc.).  It will also make
   >       interactions with other languages more difficult, e.g. if a client
   >       wants to throw an exception.
   > 
   > The alternative would be to use iterators. Callbacks tend
   > to be somewhat inconvenient. A 'pdf_obj_iterator_t' would do, or
   > we could even use a pdf_list_t to return a list of pointers to
   > objects, and thus we could use pdf_list_iterator_t. What do you think?

   Like pdf_list_t, I think it should be possible to iterate over a
   collection without allocating heap memory.

Agreed.

   We could use a similar iterator structure, but I wouldn't want to
   allocate a new pdf_list_t.

   The types will probably use pdf_list_t internally anyway, so the
   iterators can just be wrappers around the pdf_list_iterator_t functions
   (but the dictionary iterator could return a key and a value).

Yes, good point. Then we can make the function return a
pdf_list_iterator_t directly. I don't see the need to introduce a new
wrapper type.
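
To make the "no heap memory" requirement concrete, here is a small
sketch of an iterator that lives on the caller's stack and yields a
key and a value per step. It deliberately does not use
pdf_list_iterator_t, whose interface is not shown in this thread; the
dict, entry and dict_iter types are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dictionary backed by a plain array. */
struct entry { const char *key; int value; };
struct dict { const struct entry *entries; size_t count; };

/* Iterator state: small enough to be returned by value and kept on
   the stack, so iterating allocates nothing. */
struct dict_iter { const struct dict *dict; size_t pos; };

static struct dict_iter
dict_iter_begin (const struct dict *d)
{
  struct dict_iter it;
  assert (d != NULL);
  it.dict = d;
  it.pos = 0;
  return it;
}

/* Store the next key/value pair and return 1, or return 0 when the
   dictionary is exhausted. */
static int
dict_iter_next (struct dict_iter *it, const char **key, int *value)
{
  assert (it != NULL && key != NULL && value != NULL);
  if (it->pos >= it->dict->count)
    return 0;
  *key = it->dict->entries[it->pos].key;
  *value = it->dict->entries[it->pos].value;
  it->pos++;
  return 1;
}
```

Compared with the callback version, the loop body stays in the
caller, so no client_data struct is needed and the caller can break
out or report an error in the usual way.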




