Re: [pdf-devel] String common data structures and charset conversions
From: jemarch
Subject: Re: [pdf-devel] String common data structures and charset conversions
Date: Sun, 07 Oct 2007 20:36:36 +0200
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/23.0.50 (powerpc-unknown-linux-gnu) MULE/5.0 (SAKAKI)
> First of all, I understand that all common data structures related to
> strings can be implemented using the string object type (pdf_string_t),
> as they only differ in the encoding being used. Is this true?
Yes, it is. The base object is a pdf_string_t, which contains an
unsigned byte string (with no restrictions). The various string common
data types impose restrictions on how the contents of the strings are
interpreted.
> Should we use a modified pdf_string_t to include not only the text
> information, but also the encoding being used?
If all we need is a type selector field in the pdf_string_t structure,
then I find it reasonable to put the extra field directly in
pdf_string_t.

Is a type selector (one that indicates how to interpret the contents
of the string) enough?
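To make the idea concrete, here is a minimal sketch of what such a
structure and its constructor could look like. All names here
(pdf_str_enc_t, pdf_string_s, pdf_string_new) are hypothetical
illustrations, not the actual GNU PDF API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: a type selector enumerating the interpretations
   discussed in this thread.  Names are illustrative only. */
typedef enum
{
  PDF_STR_PDFDOCENC,   /* PDFDocEncoding */
  PDF_STR_UTF16BE,     /* UTF-16BE text string */
  PDF_STR_ASCII        /* ASCII */
} pdf_str_enc_t;

typedef struct
{
  unsigned char *data;   /* raw bytes, no restrictions */
  size_t size;           /* length in bytes */
  pdf_str_enc_t enc;     /* type selector: how to interpret `data' */
} pdf_string_s;

/* Create a string object, copying SIZE bytes from DATA.
   Returns NULL on allocation failure. */
static pdf_string_s *
pdf_string_new (const unsigned char *data, size_t size, pdf_str_enc_t enc)
{
  pdf_string_s *s = malloc (sizeof *s);
  if (s == NULL)
    return NULL;
  s->data = malloc (size);
  if (s->data == NULL)
    {
      free (s);
      return NULL;
    }
  memcpy (s->data, data, size);
  s->size = size;
  s->enc = enc;
  return s;
}
```

The point of the sketch is that the byte buffer itself stays
unrestricted; only the selector changes how consumers interpret it.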
> Is the library going to use a single internal charset to store all
> text information? The GNU C library always uses UCS-4-encoded ISO
> 10646 in its wchar_t (covering the whole Unicode set). This means
> that 32 bits are always used for every encoded character, even though
> the majority of characters fall in the Basic Multilingual Plane,
> which needs at most two bytes per code point (or three in UTF-8). I
> don't know what you think, but it seems a waste of memory, even if
> it's easier to use due to the fixed size of each code point. The
> widely used UTF-8, or even UTF-16, could be a better option.
I am not sure about the convenience of using a single CCS+encoding to
store all text information. The interpretation of the text strings in
a PDF file varies depending on the specific operation and situation:
sometimes the local encoding (determined by the OS) is used.
> The charset conversions needed for the PDF library are basically these:
>
> - PDFDocEncoding to UTF-16BE
> - UTF-16BE to PDFDocEncoding (with loss of information in some cases)
> - UTF-16BE to ASCII (with loss of information in some cases)
> - ASCII to UTF-16BE
> - PDFDocEncoding to ASCII (with loss of information in some cases)
> - ASCII to PDFDocEncoding
>
> I understand that some of these conversions may cause a loss of
> information, since ASCII and PDFDocEncoding don't cover the whole of
> Unicode. How will this be handled in the library?
Since those conversions are lossy by nature, I don't think it will be
a problem: losing information there is expected.
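As a concrete illustration of what "expected loss" could look like,
here is a minimal sketch of one of the lossy conversions listed above,
UTF-16BE to ASCII, in which every code unit outside the ASCII range is
replaced by '?'. This is not the library's actual conversion routine;
the function name, the '?' replacement policy, and the treatment of
surrogates (each surrogate half simply becomes a '?') are assumptions
made for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of a lossy UTF-16BE -> ASCII conversion.
   Writes at most IN_SIZE / 2 bytes into OUT, sets *LOSSY to 1 if any
   information was lost, and returns the number of output bytes, or
   (size_t) -1 if IN_SIZE is odd (not valid UTF-16BE). */
static size_t
utf16be_to_ascii (const unsigned char *in, size_t in_size,
                  unsigned char *out, int *lossy)
{
  size_t i, o = 0;

  *lossy = 0;
  if (in_size % 2 != 0)
    return (size_t) -1;

  for (i = 0; i < in_size; i += 2)
    {
      /* Read one big-endian 16-bit code unit.  Surrogate pairs are not
         decoded here: each surrogate half is outside ASCII and so is
         replaced individually. */
      unsigned int cu = ((unsigned int) in[i] << 8) | in[i + 1];
      if (cu <= 0x7F)
        out[o++] = (unsigned char) cu;
      else
        {
          out[o++] = '?';   /* information loss */
          *lossy = 1;
        }
    }
  return o;
}
```

A real implementation would likely want to report *where* information
was lost, or let the caller choose between replacement and failure, but
the basic shape of the conversion is the same.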
> GNU libiconv, which I think is also included in GNU libc, covers at
> least conversions from ASCII to UTF-16BE and vice versa. I also
> understood, after reading the documentation of that project, that
> adding new charset conversions (such as PDFDocEncoding) is quite
> easy, so it could be a good candidate for inclusion in the GNU PDF
> library.
Please make sure you understand how GNU libiconv works and study the
possibility of using it. Keep in mind that portability is an important
issue for this library. We want to use the GNU libc goodies when we
are running on GNU systems, but we should not depend on those
capabilities in order to be able to run on other platforms with other
libc implementations.
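For reference, this is roughly what one of the listed conversions
(ASCII to UTF-16BE) looks like through the standard POSIX iconv
interface. The wrapper function name and its error convention are
assumptions for the example; note also that PDFDocEncoding is not a
charset iconv knows about by default, so that conversion would have to
be added to libiconv or implemented by hand (it is a simple 256-entry
table):

```c
#include <assert.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical wrapper: convert an ASCII buffer to UTF-16BE using
   POSIX iconv.  Returns 0 on success and -1 on failure; on success
   *OUT_SIZE receives the number of bytes written to OUT. */
static int
ascii_to_utf16be (const char *in, size_t in_size,
                  char *out, size_t out_cap, size_t *out_size)
{
  iconv_t cd = iconv_open ("UTF-16BE", "US-ASCII");
  char *inp = (char *) in;     /* iconv takes char **, not const */
  char *outp = out;
  size_t inleft = in_size;
  size_t outleft = out_cap;

  if (cd == (iconv_t) -1)
    return -1;                 /* conversion not supported */
  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    {
      iconv_close (cd);
      return -1;               /* invalid input or output too small */
    }
  iconv_close (cd);
  *out_size = out_cap - outleft;
  return 0;
}
```

Since iconv is specified by POSIX and GNU libiconv provides the same
interface as a standalone library, coding against this interface (and
shipping or bundling libiconv where libc lacks it) is one way to keep
the portability property mentioned above.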
> And the last question... when deciding which glyph to use to display
> a given character, which encoding is used? I mean, how are the 'glyph
> databases' of each font type stored?
It depends on the type of the specific font used in each
situation. Each font dictionary provides an array that maps from
character codes to glyph description dictionaries. Anyway, it should
not be a concern now: what we need now is an implementation of the
several CCS+encodings you listed above and conversions between those
CCS+encodings.
--
Jose E. Marchesi <address@hidden>
<address@hidden>
GNU Spain http://es.gnu.org
GNU Project http://www.gnu.org