
[pdf-devel] Comments to the Encoded Text API


From: Aleksander Morgado
Subject: [pdf-devel] Comments to the Encoded Text API
Date: Tue, 15 Jan 2008 12:09:35 +0100

Hi all,

I have a couple of comments after having checked the API for text
management (http://gnupdf.org/manuals/gnupdf.html#SEC19).

1. I don't think the pdf_text_utf16_val_t, pdf_text_utf32_val_t,
pdf_text_utf8_val_t and pdf_text_unicode_char_t data types are really
required, at least in the API. When I implemented conversions to/from
Unicode I defined types like those, but in the end I dropped them as
they were not really useful.

2. Country and language codes can appear anywhere in a UTF16BE
encoded PDF string, so a given text may contain more than one country
code or language code.
2.a) A first approach would be to define internally the country and
language code delimiters as end of text markers, so that every
pdf_text_t handles a single country code/language code. Functions
involving the creation of pdf_text_t variables from UTF16BE strings
would need a simple loop to convert the input string chunk by chunk,
and extra parameters in the API function, something like:

    pdf_text_t pdf_text_new_from_pdf(const char *str,
                                     const pdf_size_t length,
                                     char **remaining,
                                     pdf_size_t *remaining_length);

In this case, if (*remaining_length) is zero, the iteration will
conclude; if not, a second call to pdf_text_new_from_pdf would be
needed to create another pdf_text_t with the data starting in
(*remaining). Using the same function for UTF16BE encoded strings and
PDFDocEncoding encoded strings is not a problem: to decide whether an
input string is encoded in UTF16BE or PDFDocEncoding, both the Byte
Order Mark for UTF16BE (U+FEFF) and the country/language code
delimiter (U+001B) will be used (the first one will appear at the
start of every UTF16BE string, and the second one in any UTF16BE chunk
after the first one if country/language information is available).
PDFDocEncoded strings won't have any country/language code associated,
so there won't be any need to split the input data into different
pdf_text_t variables.
2.b) Another approach would be to store a list of country/language
codes within the pdf_text_t, not only a single pair. This would need
extra information for each country/language code, specifying the place
in the string where it starts. However, this second approach would
make access to the country/language information more difficult.
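
To make approach 2.a more concrete, here is a rough sketch of how the
chunk boundaries could be found. The helper names (next_escape,
first_chunk_length) are hypothetical and not part of the API, and the
sketch assumes that a country/language escape sequence is laid out as
U+001B, the codes, U+001B, always aligned on 2-byte boundaries:

```c
#include <assert.h>
#include <stddef.h>

#define ESC_HI 0x00  /* U+001B in UTF-16BE is the byte pair 0x00 0x1B */
#define ESC_LO 0x1B

/* Hypothetical helper (not part of the GNU PDF API): byte offset of the
   first U+001B code unit at or after `from`, or `length` if there is none.
   `from` must be even so the scan stays aligned on 16-bit code units. */
static size_t next_escape(const unsigned char *s, size_t length, size_t from)
{
  size_t i;

  assert(s != NULL && from % 2 == 0);
  for (i = from; i + 1 < length; i += 2)
    if (s[i] == ESC_HI && s[i + 1] == ESC_LO)
      return i;
  return length;
}

/* Hypothetical helper: length in bytes of the first chunk of `s`, that is,
   an optional leading escape sequence (ESC, codes, ESC) plus the text that
   follows it, up to but excluding the next escape sequence. */
static size_t first_chunk_length(const unsigned char *s, size_t length)
{
  size_t text_start = 0;

  if (length >= 2 && next_escape(s, length, 0) == 0)
    {
      size_t closing = next_escape(s, length, 2);
      if (closing == length)
        return length;          /* unterminated escape: keep the rest whole */
      text_start = closing + 2; /* text begins after the closing ESC */
    }
  return next_escape(s, length, text_start);
}
```

A function like pdf_text_new_from_pdf could then convert the first
first_chunk_length() bytes, and set (*remaining) and
(*remaining_length) to whatever follows them.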

3. I see the need for an extra parameter specifying the length of the
data array given as input or output in the following functions:
 * pdf_text_new_from_unicode (length of input data array is needed, as
UTF encodings can have NUL bytes within the string).
 * pdf_text_get_host (length of output data array is needed, as this
function can involve UTF encodings with NUL bytes within the string)
 * pdf_text_get_unicode (length of output data array is needed, as UTF
encodings can have NUL bytes within the string)
 * pdf_text_set_host (length of input data array is needed, as this
function can involve UTF encodings with NUL bytes within the string)
 * pdf_text_set_unicode (length of input data array is needed, as UTF
encodings can have NUL bytes within the string)
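
A minimal illustration of why the length is needed (the names below are
made up for the example, they are not API functions): for ASCII-range
text in UTF-16BE, the very first byte is already a NUL, so a callee
that only receives a pointer cannot recover the real size.

```c
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16BE: each ASCII character gets a leading NUL byte. */
static const char utf16be_hi[] = { 0x00, 'H', 0x00, 'i' };

/* What a callee could compute if it only received a char pointer. */
static size_t naive_length(const char *data)
{
  return strlen(data);   /* stops at the first NUL byte */
}
```

Here naive_length(utf16be_hi) gives 0, while the array really holds 4
bytes, so the size has to travel with the pointer.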

4. By the same reasoning, the size parameter doesn't seem to be needed
in pdf_text_set_pdf, as PDFDocEncoding data should not contain any NUL
byte other than the end-of-string marker.

5. The pdf_text_get_best_encoding function will need system-specific
functions to get the range of Unicode covered by each host encoding;
if no such function is available on a given operating system, a
default Unicode encoding will be returned.

6. In function pdf_text_new_from_u32, I think the comment about
leading zeros is useless. If leading zeros are included in an integer
literal, the compiler will assume that the value is given in octal,
not decimal, so this may be confusing.
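
For instance, taking U+0041 (65 in decimal) as the code point, and with
variable names that are only illustrative:

```c
/* U+0041 (LATIN CAPITAL LETTER A) is 65 in decimal. */
static const unsigned int padded_hex     = 0x00000041; /* hexadecimal: 65 */
static const unsigned int plain_decimal  = 65;         /* decimal: 65 */
static const unsigned int padded_decimal = 0065;       /* octal: 53, not 65 */
```

Leading zeros are harmless after the 0x prefix, but padding a decimal
value with them silently changes it.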

7. An additional function like pdf_text_clear(pdf_text_t text) is
needed to free any memory allocated when the variable is initialized,
so that pdf_text_t can really be treated as a black box.


What do you think?

Aleksander



