From: Aleksander Morgado
Subject: [pdf-devel] String common data structures and charset conversions
Date: Thu, 4 Oct 2007 10:07:02 +0200

Hi all,

I've been looking at the implementation of string objects in pdf_obj.h, trying to understand what should be done for the charset conversions the PDF library needs. I have also checked how charset conversions are done in libc (iconv), and the standard wide-character type wchar_t.

First of all, I understand that all the common string-related data structures can be implemented with the string object type (pdf_string_t), since they differ only in the encoding used. Is this correct? Should we extend pdf_string_t to carry not only the text data but also its encoding?
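
Just to make the question concrete, here is a minimal sketch of what such an extended type could look like. Everything below except the pdf_string_t name is invented for illustration, not taken from pdf_obj.h:

#include <stddef.h>

/* Hypothetical encoding tag; which encodings the library should
   support is exactly what is being discussed here.  */
typedef enum
{
  PDF_TEXT_ENC_ASCII,
  PDF_TEXT_ENC_PDFDOC,    /* PDFDocEncoding */
  PDF_TEXT_ENC_UTF16BE
} pdf_text_enc_t;

/* A string object that knows the encoding of its own bytes.  */
typedef struct
{
  unsigned char *data;    /* raw bytes, not necessarily NUL-terminated */
  size_t size;            /* length in bytes, not in characters */
  pdf_text_enc_t enc;     /* encoding of the bytes in data */
} pdf_string_t;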

Is the library going to use a single internal charset to store all text? The GNU C library always uses UCS-4-encoded ISO 10646 for its wchar_t (covering the whole Unicode set). This means 32 bits are spent on every encoded character, even though the vast majority of characters fall in the Basic Multilingual Plane, which needs only two bytes. I don't know what you think, but it seems a waste of memory, even if the fixed size of each code point makes it easier to work with. The widely used UTF-8, or even UTF-16, could be better options.
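
To put numbers on the overhead, this small self-contained program (plain C, nothing PDF-specific assumed) compares the storage needed for the same four-character text in UTF-8 and in glibc's UCS-4 wchar_t, where sizeof (wchar_t) is 4:

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main (void)
{
  const char *utf8 = "caf\xC3\xA9";    /* "café": 4 characters, 5 UTF-8 bytes */
  const wchar_t *wide = L"caf\xE9";    /* same text as wide characters */

  printf ("UTF-8: %zu bytes\n", strlen (utf8));                      /* 5 */
  printf ("UCS-4: %zu bytes\n", wcslen (wide) * sizeof (wchar_t));   /* 16 on glibc */
  return 0;
}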


The charset conversions needed for the PDF library are basically these:
- PDFDocEncoding to UTF-16BE
- UTF-16BE to PDFDocEncoding (with loss of information in some cases)
- UTF-16BE to ASCII (with loss of information in some cases)
- ASCII to UTF-16BE
- PDFDocEncoding to ASCII (with loss of information in some cases)
- ASCII to PDFDocEncoding

I understand that some of these conversions may lose information, since ASCII and PDFDocEncoding don't cover the whole of Unicode. How will this be handled in the library?
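
Not an answer, just one possible shape for it (every identifier below is invented for discussion, and pdf_string_t / pdf_text_enc_t refer to the hypothetical sketch above): the converter could return a distinct status when characters had to be dropped or replaced, so callers can decide whether a lossy result is acceptable.

/* Hypothetical API sketch, for discussion only.  */
typedef enum
{
  PDF_TEXT_OK,       /* converted without loss */
  PDF_TEXT_LOSSY,    /* converted, but some characters were replaced */
  PDF_TEXT_ERROR     /* hard failure: bad input, no memory, ...  */
} pdf_text_status_t;

pdf_text_status_t pdf_text_convert (const pdf_string_t *in,
                                    pdf_text_enc_t to_enc,
                                    pdf_string_t *out);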

GNU libiconv (GNU libc ships its own implementation of the same iconv interface) covers at least the conversions between ASCII and UTF-16BE. From its documentation I also understood that adding new charset conversions (such as PDFDocEncoding) is quite easy, so it could be a good candidate for inclusion in the GNU PDF library.
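
For reference, this is how the lossy case shows up with the stock iconv API: converting UTF-16BE text containing a non-ASCII character to US-ASCII stops with errno set to EILSEQ (GNU iconv additionally accepts a //TRANSLIT suffix on the target charset name to approximate characters instead of failing). The following compiles and runs as-is on glibc:

#include <stdio.h>
#include <errno.h>
#include <iconv.h>

int main (void)
{
  /* "Añ" in UTF-16BE: U+0041, U+00F1.  The second character has no
     ASCII equivalent, so the conversion stops with errno == EILSEQ.  */
  char input[] = { 0x00, 0x41, 0x00, (char) 0xF1 };
  char output[16];

  char *in = input;
  char *out = output;
  size_t inleft = sizeof (input);
  size_t outleft = sizeof (output);

  iconv_t cd = iconv_open ("US-ASCII", "UTF-16BE");
  if (cd == (iconv_t) -1)
    {
      perror ("iconv_open");
      return 1;
    }

  if (iconv (cd, &in, &inleft, &out, &outleft) == (size_t) -1
      && errno == EILSEQ)
    printf ("unconvertible character; output bytes written before it: %zu\n",
            sizeof (output) - outleft);   /* 1, for the 'A' */

  iconv_close (cd);
  return 0;
}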

And a last question: when deciding which glyph to use to display a given character, which encoding is used? That is, how are the 'glyph databases' of each font type stored?

Thanks and regards,

Aleksander Morgado

