
From: Eric Abrahamsen
Subject: advice on hash tables?
Date: Fri, 04 Jul 2014 14:00:54 -0700
User-agent: Gnus/5.130012 (Ma Gnus v0.12) Emacs/24.4.50 (gnu/linux)

I'm (very slowly) chewing on some Chinese-English translation functions
based on the freely available CEDICT dictionary[1]; this is related to a
question about Chinese word boundaries I raised earlier.

The first stage is just slurping the text-file dictionary into an elisp
data structure, for simple dictionary lookups.

This is the first time I've made anything where performance might
actually be an issue, so I'm asking for a general pointer on how to do
this. The issue is that the dictionary provides Chinese words in both
simplified and traditional characters. The typical entry looks like
this:

理性認識 理性认识 [li3 xing4 ren4 shi5] /cognition/rational knowledge/

So that's the traditional characters, simplified characters,
pronunciation in brackets, then an arbitrary number of slash-delimited
definitions. There are 108,296 such entries, one per line.
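
For concreteness, here's a rough sketch of how I'd parse a single line
(the function name is just a placeholder; comment lines starting with
"#" simply fail the match and return nil):

(defun my-cedict-parse-line (line)
  "Parse one CC-CEDICT LINE into (TRAD SIMP PRONUNCIATION DEFINITIONS)."
  (when (string-match
         "\\`\\([^ ]+\\) \\([^ ]+\\) \\[\\([^]]+\\)\\] /\\(.+\\)/\\'"
         line)
    (list (match-string 1 line)
          (match-string 2 line)
          (match-string 3 line)
          ;; The definitions are slash-delimited, so split group 4.
          (split-string (match-string 4 line) "/"))))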

So I'd like a hash table where characters are keys, and values are
lists holding (pronunciation definition1 ...).

I don't want to have to specify what type of characters I'm using; I'd
like to just treat all types of characters the same. The brute-force
solution would be redundant hash-table entries, one each for the
simplified and the traditional form. That would double the size of the
hash table to 200,000+ entries.
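
In code, the brute-force version would look something like this
(names hypothetical, using the parser sketched above):

(defvar my-cedict-table (make-hash-table :test #'equal :size 220000)
  "Table mapping a word in either script to (PRONUNCIATION DEF...).")

(defun my-cedict-load (file)
  "Read the CC-CEDICT FILE into `my-cedict-table'.
Each entry is stored under both its traditional and simplified form."
  (with-temp-buffer
    (insert-file-contents file)
    (goto-char (point-min))
    (while (not (eobp))
      (pcase (my-cedict-parse-line
              (buffer-substring (line-beginning-position)
                                (line-end-position)))
        (`(,trad ,simp ,pron ,defs)
         (let ((entry (cons pron defs)))
           (puthash trad entry my-cedict-table)
           ;; When the two scripts coincide, one key suffices.
           (unless (equal trad simp)
             (puthash simp entry my-cedict-table)))))
      (forward-line 1))))

One mitigating thought: both keys can share a single value cons, as
above, so the duplication only costs the extra key strings and table
slots, not a second copy of the pronunciations and definitions.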

Some characters don't differ between traditional and simplified: in the
example above, only the last two characters differ. So I could also
define a hash table test that uses string-match-p, and construct the
hash table keys as regexps:

"理性[認认][識识]"
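
Something like this, maybe. The awkward part is the hash function: a
plain lookup string and its regexp key have to land in the same
bucket, and about the only property they share is the number of
logical characters, counting each [...] group as one (untested
sketch, names hypothetical):

(defun my-cedict-key-length (s)
  "Count the characters in S, treating each [...] group as one.
This is about the only hash a plain word and its regexp key share."
  (let ((n 0) (in-group nil))
    (dolist (c (string-to-list s) n)
      (cond ((eq c ?\[) (setq in-group t n (1+ n)))
            ((eq c ?\]) (setq in-group nil))
            ((not in-group) (setq n (1+ n)))))))

(define-hash-table-test 'cedict-match
  (lambda (a b)
    ;; Either argument may be the stored regexp, so try both ways;
    ;; the `equal' check keeps a regexp key equal to itself.
    (or (equal a b)
        (string-match-p (concat "\\`" a "\\'") b)
        (string-match-p (concat "\\`" b "\\'") a)))
  #'my-cedict-key-length)

(defvar my-cedict-regexp-table (make-hash-table :test 'cedict-match))

Hashing on length alone gives enormous buckets, though, so I suspect
this would end up slower than the redundant-key table.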

Or I could try using the nested alists from mule-util.el, which,
frankly, I don't understand. It's possible you're meant to use nested
alists *instead* of something like a hash table. But if not, the keys
might look something like:

("理性" ("認識") ("识认"))

Or perhaps it would be faster to do:

(29702 24615 (35469 35672) (35748 35782))

But again, I'm likely misunderstanding how a nested alist works.
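
Reading mule-util.el again, a nested alist looks like a character-keyed
trie, where each node is (ENTRY . BRANCHES), so perhaps the intended
usage is more like this (assuming I'm reading set-nested-alist and
lookup-nested-alist right):

(require 'mule-util)

(defvar my-cedict-trie (list nil)
  "Trie mapping a word to (PRONUNCIATION DEF...); (list nil) is empty.")

(let ((entry '("li3 xing4 ren4 shi5" "cognition" "rational knowledge")))
  ;; Insert both forms; the shared prefix 理性 shares trie nodes, so
  ;; this should be cheaper than two full hash-table entries.
  (set-nested-alist "理性認識" entry my-cedict-trie)
  (set-nested-alist "理性认识" entry my-cedict-trie))

;; `lookup-nested-alist' returns the node whose car is the entry.
(car (lookup-nested-alist "理性认识" my-cedict-trie nil nil t))
;; => ("li3 xing4 ren4 shi5" "cognition" "rational knowledge")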

Anyway, dictionary lookups don't need to be super fast, but I'd like to
use the same or a similar data structure for finding word boundaries, so
it would be nice to have something speedy. In any event, it's a good
opportunity to learn a bit about efficiency.
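
If the trie pans out, the LEN and START arguments of
lookup-nested-alist suggest a longest-match probe for finding a word
boundary, along these lines (again untested):

(defun my-cedict-longest-word (string start)
  "Return the length of the longest dictionary word in STRING at START.
Return nil if no word starts there.  Probes successively longer
prefixes until the prefix falls off the trie."
  (let ((limit (length string)) (len 1) (best nil) node)
    (while (and (<= (+ start len) limit)
                ;; The final t makes a too-long prefix return nil, at
                ;; which point no longer word can match either.
                (setq node (lookup-nested-alist string my-cedict-trie
                                                (+ start len) start t)))
      (when (car node)            ; non-nil entry: a word ends here
        (setq best len))
      (setq len (1+ len)))
    best))

;; (my-cedict-longest-word "理性认识很重要" 0) => 4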

My actual question is: does anyone have any advice on a clever way to
approach this?

Thanks!
Eric



[1]: http://www.mdbg.net/chindict/chindict.php?page=cc-cedict



