Re: [pdf-devel] String common data structures and charset conversions
From: jemarch
Subject: Re: [pdf-devel] String common data structures and charset conversions
Date: Sun, 07 Oct 2007 20:36:36 +0200
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/23.0.50 (powerpc-unknown-linux-gnu) MULE/5.0 (SAKAKI)
> First of all, I understand that all common data structures related to
> strings can be implemented using the string object type (pdf_string_t),
> as they only differ in the encoding being used. Is this true?
Yes, it is. The base object is a pdf_string_t, which contains an
unsigned byte string (with no restrictions). The various string common
data types impose restrictions on how the contents of the strings are
interpreted.
> Should we use a modified pdf_string_t to include not only the text
> information, but also the encoding being used?
If all we need is a type selector field in the pdf_string_t structure,
then I find it reasonable to put the extra field directly in
pdf_string_t.

Is a type selector (one that indicates how to interpret the contents
of the string) enough?
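To make the idea concrete, here is a minimal sketch of what such a
structure and its constructor could look like. All names here
(pdf_str_enc_t, pdf_string_s, pdf_string_new) are hypothetical
illustrations, not the actual GNU PDF API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: a type selector enumerating the interpretations
   discussed in this thread.  Names are illustrative only. */
typedef enum
{
  PDF_STR_PDFDOCENC,   /* PDFDocEncoding */
  PDF_STR_UTF16BE,     /* UTF-16BE text string */
  PDF_STR_ASCII        /* ASCII */
} pdf_str_enc_t;

typedef struct
{
  unsigned char *data;   /* raw bytes, no restrictions */
  size_t size;           /* length in bytes */
  pdf_str_enc_t enc;     /* type selector: how to interpret `data' */
} pdf_string_s;

/* Create a string object, copying SIZE bytes from DATA.
   Returns NULL on allocation failure. */
static pdf_string_s *
pdf_string_new (const unsigned char *data, size_t size, pdf_str_enc_t enc)
{
  pdf_string_s *s = malloc (sizeof *s);
  if (s == NULL)
    return NULL;
  s->data = malloc (size);
  if (s->data == NULL)
    {
      free (s);
      return NULL;
    }
  memcpy (s->data, data, size);
  s->size = size;
  s->enc = enc;
  return s;
}
```

The point of the sketch is that the byte buffer itself stays
unrestricted; only the selector changes how consumers interpret it.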
> Is the library going to use a single internal charset to store all
> text information? The GNU C library always uses UCS-4-encoded ISO
> 10646 in its wchar_t (covering the whole Unicode set). This means
> that 32 bits are always used for every encoded character, even though
> the majority of characters fall in the Basic Multilingual Plane,
> which needs at most two bytes per code point (or three in UTF-8). I
> don't know what you think, but it seems a waste of memory, even if
> it's easier to use due to the fixed size of each code point. The
> widely used UTF-8, or even UTF-16, could be a better option.
I am not sure about the convenience of using a single CCS+encoding to
store all text information. The interpretation of the text strings in
a PDF file varies depending on the specific operation and situation:
sometimes the local encoding (determined by the OS) is used.
> The charset conversions needed for the PDF library are basically these:
>
> - PDFDocEncoding to UTF-16BE
> - UTF-16BE to PDFDocEncoding (with loss of information in some cases)
> - UTF-16BE to ASCII (with loss of information in some cases)
> - ASCII to UTF-16BE
> - PDFDocEncoding to ASCII (with loss of information in some cases)
> - ASCII to PDFDocEncoding
>
> I understand that some of these conversions may cause a loss of
> information, since ASCII and PDFDocEncoding don't cover the whole of
> Unicode. How will this be handled in the library?
Since those conversions are lossy by nature, I don't think it will be
a problem: losing information there is expected.
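As a concrete illustration of what "expected loss" could look like,
here is a minimal sketch of one of the lossy conversions listed above,
UTF-16BE to ASCII, in which every code unit outside the ASCII range is
replaced by '?'. This is not the library's actual conversion routine;
the function name, the '?' replacement policy, and the treatment of
surrogates (each surrogate half simply becomes a '?') are assumptions
made for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of a lossy UTF-16BE -> ASCII conversion.
   Writes at most IN_SIZE / 2 bytes into OUT, sets *LOSSY to 1 if any
   information was lost, and returns the number of output bytes, or
   (size_t) -1 if IN_SIZE is odd (not valid UTF-16BE). */
static size_t
utf16be_to_ascii (const unsigned char *in, size_t in_size,
                  unsigned char *out, int *lossy)
{
  size_t i, o = 0;

  *lossy = 0;
  if (in_size % 2 != 0)
    return (size_t) -1;

  for (i = 0; i < in_size; i += 2)
    {
      /* Read one big-endian 16-bit code unit.  Surrogate pairs are not
         decoded here: each surrogate half is outside ASCII and so is
         replaced individually. */
      unsigned int cu = ((unsigned int) in[i] << 8) | in[i + 1];
      if (cu <= 0x7F)
        out[o++] = (unsigned char) cu;
      else
        {
          out[o++] = '?';   /* information loss */
          *lossy = 1;
        }
    }
  return o;
}
```

A real implementation would likely want to report *where* information
was lost, or let the caller choose between replacement and failure, but
the basic shape of the conversion is the same.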
> GNU libiconv, which I think is also included in GNU libc, covers at
> least conversions from ASCII to UTF-16BE and vice versa. I also
> understood, after reading the documentation of that project, that
> adding new charset conversions (such as PDFDocEncoding) is quite
> easy, so it could be a good candidate for inclusion in the GNU PDF
> library.
Please make sure you understand how GNU libiconv works and study the
possibility of using it. Keep in mind that portability is an important
issue for this library. We want to use the GNU libc goodies when we
are running on GNU systems, but we should not depend on those
capabilities in order to be able to run on other platforms with other
libc implementations.
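For reference, this is roughly what one of the listed conversions
(ASCII to UTF-16BE) looks like through the standard POSIX iconv
interface. The wrapper function name and its error convention are
assumptions for the example; note also that PDFDocEncoding is not a
charset iconv knows about by default, so that conversion would have to
be added to libiconv or implemented by hand (it is a simple 256-entry
table):

```c
#include <assert.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical wrapper: convert an ASCII buffer to UTF-16BE using
   POSIX iconv.  Returns 0 on success and -1 on failure; on success
   *OUT_SIZE receives the number of bytes written to OUT. */
static int
ascii_to_utf16be (const char *in, size_t in_size,
                  char *out, size_t out_cap, size_t *out_size)
{
  iconv_t cd = iconv_open ("UTF-16BE", "US-ASCII");
  char *inp = (char *) in;     /* iconv takes char **, not const */
  char *outp = out;
  size_t inleft = in_size;
  size_t outleft = out_cap;

  if (cd == (iconv_t) -1)
    return -1;                 /* conversion not supported */
  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    {
      iconv_close (cd);
      return -1;               /* invalid input or output too small */
    }
  iconv_close (cd);
  *out_size = out_cap - outleft;
  return 0;
}
```

Since iconv is specified by POSIX and GNU libiconv provides the same
interface as a standalone library, coding against this interface (and
shipping or bundling libiconv where libc lacks it) is one way to keep
the portability property mentioned above.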
> And the last question... when deciding which glyph to use to display
> a given character, which encoding is used? I mean, how are the 'glyph
> databases' of each font type stored?
It depends on the type of the specific font used in each
situation. Each font dictionary provides an array that maps from
character codes to glyph description dictionaries. Anyway, it should
not be a concern now: what we need now is an implementation of the
several CCS+encodings you listed above and conversions between those
CCS+encodings.
--
Jose E. Marchesi <address@hidden>
<address@hidden>
GNU Spain http://es.gnu.org
GNU Project http://www.gnu.org