emacs-devel


Re: size of emacs executable after unicode merge


From: Stephen J. Turnbull
Subject: Re: size of emacs executable after unicode merge
Date: Sat, 17 May 2008 07:07:58 +0900

Thomas Lord writes:

 > Jason Rumney wrote:
 > > How big are the data structures holding all the unicode character info 
 > > and translation tables for encodings? 

Is it possible that the whole Unicode range (17*2^16 code points) is
being dumped?  That would lead to about the size change observed,
extrapolating from the "naive estimate" given below for the XEmacs
implementation of the BMP.  But surely no characters outside of the
BMP are needed to dump Emacs.

 > If that turns out to be the problem, will someone please contact me 
 > directly?
 > (I ask that because I mostly just skim this list and so miss things.)
 > 
 > Several years back I devoted a pretty decent number of hours to working
 > out good ways to compress the run-time representation of such tables
 > without sacrificing much performance on accesses.

Loading on demand is generally a better solution, as most non-Asians
use fewer than 500 characters, highly localized to about 3 ranges that
can be loaded individually.
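
As a rough illustration of loading on demand, here is a minimal Python
sketch.  It is not how Emacs or XEmacs does it: the class name
LazyCharTable, the 256-entry block size, and demo_loader are all
assumptions made up for the example.  A block of character data is
fetched only the first time a character in it is looked up.

```python
BLOCK_SIZE = 256  # assumed granularity; any power of two would do


class LazyCharTable:
    """Character table whose blocks are loaded on first access."""

    def __init__(self, load_block):
        # load_block(index) must return a sequence of BLOCK_SIZE entries.
        self._load_block = load_block
        self._blocks = {}

    def get(self, code_point):
        index = code_point // BLOCK_SIZE
        block = self._blocks.get(index)
        if block is None:
            # First touch of this block: load it and keep it resident.
            block = self._load_block(index)
            self._blocks[index] = block
        return block[code_point % BLOCK_SIZE]

    def resident_blocks(self):
        return len(self._blocks)


def demo_loader(index):
    # Hypothetical stand-in for reading one block of data from disk.
    base = index * BLOCK_SIZE
    return ["U+%04X" % (base + i) for i in range(BLOCK_SIZE)]


table = LazyCharTable(demo_loader)
for cp in (0x41, 0xE9, 0x20AC):  # 'A', e-acute, euro sign
    table.get(cp)
```

After those three lookups only two blocks are resident, since 'A' and
e-acute share the first block.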

Nor do you really need "good solutions", as half of the BMP is hanzi
and Hangul, which are basically constant ranges for the character info
tables, and another 10% is private space and surrogates, leading to
approximately 60% savings by using ranges and appropriate defaults for
these four classes.  The non-BMP planes surely can be loaded on demand.
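
A minimal sketch of the ranges-with-defaults idea, in Python.  The
block boundaries are the standard Unicode ones for CJK Unified
Ideographs, Hangul Syllables, surrogates, and the BMP Private Use
Area; the class labels, the lookup function, and the "explicit"
dictionary for everything else are assumptions for illustration only.

```python
import bisect

# (first, last, default class) for the four large uniform ranges.
RANGES = [
    (0x4E00, 0x9FFF, "hanzi"),        # CJK Unified Ideographs block
    (0xAC00, 0xD7A3, "hangul"),       # Hangul Syllables
    (0xD800, 0xDFFF, "surrogate"),    # surrogate code points
    (0xE000, 0xF8FF, "private-use"),  # BMP Private Use Area
]
STARTS = [first for first, _, _ in RANGES]


def lookup(code_point, explicit):
    """Return the range default if one applies, else the explicit entry."""
    i = bisect.bisect_right(STARTS, code_point) - 1
    if i >= 0 and code_point <= RANGES[i][1]:
        return RANGES[i][2]
    return explicit.get(code_point)


def covered_fraction():
    """Share of the 65536 BMP code points handled by range defaults."""
    covered = sum(last - first + 1 for first, last, _ in RANGES)
    return covered / 0x10000
```

With these four ranges, 40612 of the 65536 BMP code points need no
per-character entry at all, which is about 62% and agrees with the
estimate above.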

 > If it would be helpful,

Did you do much better than 60% savings?  If not, it's probably not
really worth much effort, given that an efficient range table
representation is already available.  In any case, something else is
going on here besides naive representation (assuming we're restricted
to the BMP).

In XEmacs, where all coding tables for the BMP are loaded by default,
much more naive strategies than those outlined above give 891800 bytes
total for the to-unicode and from-unicode tables.  I think we're
missing a couple of charsets that Emacs Mule provides, but they're
minor.  We don't currently implement the Unidata database, but most (all?)
of the character properties can be compactly represented as a small
number of Booleans each, so a table of bitvectors for the BMP "should"
only be about 256KB or maybe 512KB.  IIRC XEmacs/UTF-2000 implemented
the BMP Unidata as a Lisp array of Lisp bitvectors in about 1MB (most
of which is Lisp object overhead).
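
To make the bitvector estimate concrete, here is a small Python
sketch of a flat property table for the BMP.  The 32-flags-per-
character width and the three property names are assumptions chosen
for the example, not the actual Unidata property set.

```python
BMP_SIZE = 1 << 16    # 65536 code points
BYTES_PER_CHAR = 4    # assumption: at most 32 Boolean properties

# Hypothetical property flags, one bit each.
PROP_ALPHABETIC = 1 << 0
PROP_UPPERCASE = 1 << 1
PROP_WHITESPACE = 1 << 2


class PropertyTable:
    """One fixed-width bitvector per BMP code point, stored flat."""

    def __init__(self):
        self._bits = bytearray(BMP_SIZE * BYTES_PER_CHAR)

    def _read(self, code_point):
        off = code_point * BYTES_PER_CHAR
        chunk = self._bits[off:off + BYTES_PER_CHAR]
        return off, int.from_bytes(chunk, "little")

    def set(self, code_point, flag):
        off, value = self._read(code_point)
        new = (value | flag).to_bytes(BYTES_PER_CHAR, "little")
        self._bits[off:off + BYTES_PER_CHAR] = new

    def has(self, code_point, flag):
        return bool(self._read(code_point)[1] & flag)

    def nbytes(self):
        return len(self._bits)


props = PropertyTable()
props.set(ord("A"), PROP_ALPHABETIC | PROP_UPPERCASE)
```

At 4 bytes per character the table occupies 262144 bytes (256KB);
doubling the width to 64 flags gives 512KB.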

In other words, even with a naive strategy, the Unicode BMP database
should only add about 1.1MB to 1.4MB, i.e., about 10% of the size
increase seen here, if coded compactly but straightforwardly in C.
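
The arithmetic behind that range can be checked in a few lines of
Python.  The inputs are the figures quoted in this message; the
variable names are mine, and the "implied increase" simply takes the
10% remark literally, so treat it as a rough reading rather than a
measurement.

```python
KIB = 1024

CODING_TABLES = 891_800        # to-unicode + from-unicode tables, bytes
PROPS_LOW = 256 * KIB          # bitvector table, lower estimate
PROPS_HIGH = 512 * KIB         # bitvector table, upper estimate

low = CODING_TABLES + PROPS_LOW     # 1,153,944 bytes
high = CODING_TABLES + PROPS_HIGH   # 1,416,088 bytes

# If this is "about 10%" of the observed growth, the growth itself
# would be roughly ten times larger.
implied_increase_low = 10 * low
implied_increase_high = 10 * high

print("BMP database: %.2f to %.2f million bytes" % (low / 1e6, high / 1e6))
```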

A few straightforward optimizations can probably get that down to
500KB to 700KB, and for an on-demand setup, most Western users should
only see a footprint of about 10-15KB for Unicode data, if that.




