Re: [Groff] unicode support - questions
From: Werner LEMBERG
Subject: Re: [Groff] unicode support - questions
Date: Wed, 25 Jan 2006 09:05:04 +0100 (CET)
> > until we have real 32bit input slots
>
> I'm not sure what you expect here. The way it's currently done is that
> characters with name "uNNNN" are used.
[Please read section `gtroff' internals for more information too.]
There are different levels.
. The first level is input characters -- a future groff version
should expect UTF-8, which is stored internally as 32bit values
(element `c' in class `token', assigned in function
`token::next').
Currently, this is an `unsigned char', and all values derived from
it (hyphenation, for example) have the same type.
. Next, still on the input level, we have the entity names GNU troff
uses for further processing, `A:', `uNNNN', etc. This is
represented by the `charinfo' class, eventually collected with
function `environment::add_char' to form the current line.
Again we have a bottleneck because of the simplistic
`charset_table' array, mapping from input characters to the
`charinfo' class, which expects 256 elements.
. The topmost input level is class `token' which represents all data
possible on the input side -- this includes both processed (for
example, diversions) and unprocessed data. Its job is to feed
everything to the output at the right time. I won't go into
details here since no improvements are necessary.
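A minimal sketch of the first two levels, written in Python rather than
groff's C++ and not taken from the groff sources: UTF-8 input is decoded
to code points, and each code point is mapped to an entity name through
a dictionary instead of a fixed 256-element array. The names
`entity_name', `lookup_charinfo', `read_input', and `charinfo_table' are
hypothetical, and the uppercase, four-digit-minimum form of the `uNNNN'
name is an assumption.

```python
def entity_name(code_point):
    """Return a `uNNNN'-style name for a code point (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Assumption: uppercase hex, zero-padded to at least four digits.
    return "u%04X" % code_point


# A dictionary keyed by code point has no 256-entry limit; it stands in
# for the fixed-size table described above.
charinfo_table = {}


def lookup_charinfo(code_point):
    """Create the entry for a code point on first use, then reuse it."""
    return charinfo_table.setdefault(code_point, entity_name(code_point))


def read_input(data):
    """Decode UTF-8 bytes and map each code point to its entity name."""
    return [lookup_charinfo(ord(ch)) for ch in data.decode("utf-8")]
```

Calling `read_input' on the UTF-8 encoding of "A" followed by U+3042
yields the two names `u0041' and `u3042', and the table then holds
exactly two entries.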
> What is the need to use a 32-bit 'int' value for this instead
> (except for optimization - and optimizations come afterwards, after
> profiling)?
I hope I've answered your questions with the above explanations. We
need both: the named entity and its corresponding input character
code.
> This first step is to make the treatment of the Unicode glyphs
> algorithmic rather than table-based.
I fully agree.
> _If_ tables are needed that the user needs to customize - the Asian
> double-width property comes to mind: it depends on the terminal
> emulator being used -
A different terminal emulator represents a different output device if
the glyph widths differ (otherwise troff won't be able to produce
justified output). Or do you mean something else?
> it should IMO be done through a specialized representation that is
> economic both in space in the font file format and in memory, rather
> than a representation that enumerates character after character.
This is what I mean by `classes', something like this in a font
description file, using two new sections:
  classes
    <Alike> = A :A 'A `A ... ;
    <CJKpunct> = U+3000 - U+303F;
    <Hiragana> = U+3040 - U+309F;
    ...
    <CJK> = <CJKpunct> <Hiragana> ... ;

  properties
    <CJK> width 24
    ...
    <Alike> kern V -3
I've no idea how to store such information efficiently within memory.
Maybe something similar to the `sparse arrays' as used in Emacs...
Suggestions?
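One possible answer to the storage question, as a sketch only: keep the
data as sorted, non-overlapping code point ranges and look up a glyph by
binary search, so memory grows with the number of ranges rather than the
number of characters. This is neither groff's file format nor its
implementation; `RangeTable' and its methods are hypothetical, and the
width value 24 is simply copied from the example above.

```python
import bisect


class RangeTable:
    """Map inclusive code point ranges to a value; ranges must not overlap."""

    def __init__(self):
        self._starts = []   # sorted first code point of each range
        self._ends = []     # inclusive last code point, same order
        self._values = []   # value attached to each range

    def add(self, first, last, value):
        if first > last:
            raise ValueError("empty range")
        i = bisect.bisect_left(self._starts, first)
        # Reject overlap with the range before and the range after.
        if i > 0 and self._ends[i - 1] >= first:
            raise ValueError("overlapping range")
        if i < len(self._starts) and self._starts[i] <= last:
            raise ValueError("overlapping range")
        self._starts.insert(i, first)
        self._ends.insert(i, last)
        self._values.insert(i, value)

    def get(self, code_point, default=None):
        # Find the last range that starts at or before code_point.
        i = bisect.bisect_right(self._starts, code_point) - 1
        if i >= 0 and code_point <= self._ends[i]:
            return self._values[i]
        return default


widths = RangeTable()
widths.add(0x3000, 0x303F, 24)  # <CJKpunct>
widths.add(0x3040, 0x309F, 24)  # <Hiragana>
```

Here two table entries cover 160 code points. A lookup such as
`widths.get(0x3042)' costs one binary search over the range starts, and
code points outside every range fall back to the default.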
> Also, do you think these glyph classes depend on the font, or only
> on the device to which the font belongs?
Glyph classes are a property of the font only. Maybe it is useful to
provide a generic `glyphclass' file which provides default classes, to
be overridden in the particular font, but this is a refinement which
we can ignore now.
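The default-plus-override idea could look like the following sketch,
which assumes that class definitions are plain mappings from a class
name to a list of code point ranges. The names `generic_classes' and
`font_classes' are hypothetical, and the narrowed range in the font is
invented purely for illustration.

```python
from collections import ChainMap

# Defaults that a generic `glyphclass' file might provide (illustrative data).
generic_classes = {
    "CJKpunct": [(0x3000, 0x303F)],
    "Hiragana": [(0x3040, 0x309F)],
}

# A particular font redefines one class and leaves the other alone.
font_classes = {
    "CJKpunct": [(0x3000, 0x3003)],
}

# Lookups consult the font first, then fall back to the generic defaults.
classes = ChainMap(font_classes, generic_classes)
```

With this arrangement `classes["CJKpunct"]' returns the font's own
definition, while `classes["Hiragana"]' falls through to the generic one.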
> Thanks for your agreement. Then this will be the next step, after
> the patches that I've already submitted.
Excellent.
> Up to now I didn't even know that these were three different data
> types; I was only looking at the font class.
Aah, this explains the difficulties I have answering your questions in
a simple way.
> I assume, an element of a font - often called "index" - is a glyph.
> What is an "output character" then?
Just sloppy wording on my part :-) Well, TTY devices basically convert
troff glyphs back to output characters, but this is just nit-picking.
Werner