[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [pdf-devel] Object layer API
From: |
Michael Gold |
Subject: |
Re: [pdf-devel] Object layer API |
Date: |
Fri, 13 Feb 2009 03:09:28 -0500 |
User-agent: |
Mutt/1.5.18 (2008-05-17) |
On Fri, Feb 06, 2009 at 00:49:36 +0100, address@hidden wrote:
>
> Hi.
>
> I just committed in the trunk the first draft of the object layer
> public API. It is documented in the "Object Layer" chapter of the
> reference manual.
Can you post a version online that doesn't contain so many pages?
The object layer chapter alone contains over 20 pages, which makes it
very annoying to read -- a single downloadable file would be nice.
Comments on specific items follow.
[3.2.2]
pdf_obj_doc_new
- Why not always create an info dictionary by default?
- The "&doc" parameter should be "*doc".
pdf_obj_doc_open
- Would it be better to take an open stream, rather than a filesystem
and path?
- What's the point of header_string?
- The client probably doesn't care about the header -- they just
want the library to open a valid PDF.
- There may be more than one acceptable string, e.g.
"%PDF-" and "%!PS-Adobe-N.n PDF-"
- There's no mention of non-blocking I/O (this applies to other
functions too). If we'll allow people to open PDF files over HTTP
(or using other slow methods), we'll need some way to handle the case
where we need to wait for data -- we may need to return an incomplete
document object, since we might read the header and then get
EWOULDBLOCK when trying to read the trailer or xref table(s).
- Maybe we should have one function to create the pdf_obj_doc_t, and
another to initialize it from a stream. The initialization
function could return EWOULDBLOCK when necessary, and be called
again when more data is available.
pdf_obj_doc_close
- Wouldn't the handle become invalid on PDF_OK also?
pdf_obj_doc_save
- Why wouldn't you want an xref table? It should probably be written
by default.
- Why should _SAVE_FULL imply _SAVE_DO_GC? Both flags could be given
if desired.
- If file==NULL, do we try to keep the same inode number? Is the file
corrupted after an error?
- Should there be a separate GC function instead of a flag?
- Again, there should be some consideration for non-blocking I/O.
- Why does the client need to care about the header string?
- If they specify an old version, do we ensure the output is
compatible with that version? What do we do with unrecognized
versions?
- I think we should let the user specify some kind of compatibility
level, and let the library handle details like this.
- How do we track modifications when _SAVE_FULL is not used?
[3.2.3]
pdf_obj_doc_get_id
- Rather than pdf_char_t*/pdf_size_t pairs, we could use a pdf_obj_t
with string type.
pdf_obj_doc_set_dirty
"If the dirty flag of a document is set then its contents are saved
before the document is closed."
- Does this mean doc_close will write a file? That doesn't seem right.
[3.3.1]
pdf_obj_gen_t
- Should be described as "non-negative", not "positive".
[3.3.2]
pdf_obj_copy
- If copy_indirect is false, what happens when an indirect object is
seen?
- Where does the object go? I.e., if you throw away whatever's
returned in *dest, does dest_doc still have some reference to it?
- If an object didn't have a back-reference to a document, something
like pdf_obj_dup would be sufficient.
- If a stream is copied, does closing the original file invalidate it?
(if not, where's the data stored?)
pdf_obj_destroy
"This function is a nop if obj is a direct scalar type or the Null
object."
- Why would it be a nop? Scalar objects need to be freed too.
pdf_obj_enum
- Non-iterable types should return an error.
- Iterating over a stream doesn't make any sense. If the client wants
to iterate over its dictionary, they should provide the dictionary
object.
- Will iteration stop as soon as PDF_FALSE is returned?
- Why is a callback used? This will often be annoying to work with,
especially given C's lack of closures (users will need to define and
populate a struct, cast client_data to it, etc.). It will also make
interactions with other languages more difficult, e.g. if a client
wants to throw an exception.
pdf_obj_equal
- Should be "pdf_obj_equal_p".
"If they are indirect, they share the same generation number."
- We should also verify that they point to the same object.
"If they are non-scalar, they reference the same value. "
- What does "the same value" mean? Does obj1 have to be the same
pointer as obj2, or just point to a similar array/dict?
- Will we compare stream contents? This would be expensive.
pdf_obj_get_doc
- PDF_EBADDATA could be used instead of PDF_EINVOBJ (elsewhere too).
pdf_obj_get_type
"The returned value should be interpreted as a enum pdf_obj_type_e."
- Why not use that type as the return value?
- enum pdf_obj_type_e doesn't include an INDIRECT type. Do we return
the type of whatever it references? If so,
- What if we need to seek to find it, and get an error?
- What if the indirect object points to another indirect object?
How many levels do we go before giving up?
pdf_obj_get_id
- When is an object ID assigned? If the user creates an indirect
object, but hasn't saved the file, do we already know the ID?
pdf_obj_compressed_p
- Why would the client care about this?
- Why does the object need to know where it came from? I'd expect this
(i.e. the location of an object identified by an ID and generation)
to be a property of the document.
[3.3.3]
- Why would we need to create strong references? Normally the document
will hold a reference (directly or indirectly) to each object, and if
the client wants the library to write an unused object, it can avoid
garbage collection.
[3.3.4]
- Why can the client force incompressible objects? Can't the library
determine the appropriate way to write out each object?
- Why can't we compress an object with a generation number of zero? We
could change it to zero when writing the file if we assign a new
object ID.
[3.3.5]
- The document doesn't explain what object collections are, but you
stated in your email:
> The object documents support the notion of "object collections",
> that are eventually translated into one or more linked object
> streams.
- Why would the client need to manage them?
[3.3.6 -- 3.3.14]
pdf_obj_*_new
- I agree with gerel that it doesn't seem right to require every object
be associated with a document. I'm not sure there should be a
pointer from the object to the document at all -- what would it be
used for? The document can find its objects by following references
from its root.
- Why does each _new method have an indirect flag? Another option
would be to let the user create a direct object, and then create a
reference to it when necessary (and we could also turn direct objects
into indirect objects when writing, where the spec requires it).
- If an indirect object is created, will the pdf_obj_*_value methods
resolve the reference to return the value?
(if so, my comments on pdf_obj_get_type apply here too)
pdf_obj_null_new
- Why is the null constructor different from the rest?
pdf_obj_boolean_new
- It seems silly to associate a boolean with a document, or allow
indirect boolean objects if indirect nulls aren't possible.
pdf_obj_name_new
- Presumably the name is null-terminated, so this should be stated.
pdf_obj_name_value, _string_value
- It should be possible to get at the value without copying it.
pdf_obj_string_hex_p, _hex_set
- I don't see any reason for this flag -- it's trivial to decide
between escaped or hex strings when writing, and there should be no
reason to care when reading.
arrays/dicts
- What's a weak reference? Is it related to strong references (3.3.3)?
If so, it's redundant, and otherwise it's confusing.
- How is object ownership handled? Do the _remove methods delete the
removed objects? (does it depend on whether it's weak?) Do _get and
_put copy the object?
- When writing the parser, I found it was reasonable for containers
to own the objects inside them (i.e. delete them when an object is
removed, or when the container is destroyed). The only difficulty
was when transforming a collection, e.g. creating a dict based on
an array: each item is temporarily owned by both containers, but
they shouldn't be deleted when destroying the original container.
pdf_obj_array_insert
- Clarify that index==pdf_obj_array_size() inserts at the end.
- I'm not sure why an object would know whether it's in a container,
but PDF_EINVOBJ implies it does.
pdf_obj_array_size
- How are errors handled? If we're resolving indirect refs, they'd
need to be returned; otherwise errors are only possible when the
object isn't an array, and we can leave that case undefined or return
zero.
pdf_obj_array_remove
- Is this useful? (maybe adding _find and using _remove_at would be
better, but I'm not sure why you'd want to search within an array.)
pdf_obj_stream_new
- There's no reason for indirect_p if it has to be PDF_TRUE.
- source_length is redundant since attrs_dict must contain the length
(if we have both, we'll have to explain which takes precedence and
whether attrs_dict will be modified by the library to add a /Length
element).
- Maybe it should be possible to pass a callback function that will
open and return a stream when necessary (since it may be impractical
to open all streams ahead of time).
- Or this type of callback could be handled at the stream layer, by
creating a dummy stream type that executes callbacks to do the
real work.
pdf_obj_stream_length
- "length" should be a pointer.
[general]
The usage of "_get_" seems inconsistent, e.g. pdf_obj_get_type vs.
pdf_obj_integer_value.
The PDF spec says a reference to an object that doesn't exist is to be
treated as a reference to null. Can we avoid assigning an ID that is
unused, but is referenced in this way? (since the reference would then
be pointing at some random object)
-- Michael
signature.asc
Description: Digital signature
Re: [pdf-devel] Object layer API,
Michael Gold <=
- Re: [pdf-devel] Object layer API, jemarch, 2009/02/16
- Re: [pdf-devel] Object layer API, Michael Gold, 2009/02/17
- Re: [pdf-devel] Object layer API, gerel, 2009/02/17
- Re: [pdf-devel] Object layer API, Michael Gold, 2009/02/18
- Re: [pdf-devel] Object layer API, gerel, 2009/02/18
- Re: [pdf-devel] Object layer API, Michael Gold, 2009/02/18
Re: [pdf-devel] Object layer API, jemarch, 2009/02/18