bug-gnu-emacs

bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctua


From: Eli Zaretskii
Subject: bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
Date: Fri, 12 Jun 2015 11:28:09 +0300

> From: Glenn Morris <rgm@gnu.org>
> Date: Thu, 11 Jun 2015 18:24:06 -0400
> 
> Glenn Morris wrote:
> 
> >   Error (initialization): Creation of the default fontsets failed: (error
> >   Invalid script or charset name: cuneiform-numbers-and-punctuation)
> 
> I fixed a typo that seems to have caused that.

Sorry about that.

> I don't suppose that big list can be auto-generated from the inputs?

It's not trivial.  I describe below some of the issues, in the hope
that Someone™ will volunteer:

  . Most of the script names come from the corresponding Unicode
    blocks, with trivial transformations (downcase words and replace
    blanks with a hyphen).  So basically, we will need to use the
    information in Blocks.txt, a file that is part of the Unicode
    Character Database (UCD), but with quirks described below.

  . The first quirk is that we lump together all the blocks that
    belong to the same script, like "Basic Latin", "Latin Extended-A",
    "Latin-1 Supplement", etc. -- these all go to the single script
    called 'latin'.  Likewise with other similar blocks that are
    either "SOMETHING Extended" or "Supplement" or whatever.

  . The second quirk is with the CJK characters: those are divided
    into several broad scripts like 'han', 'kana', and 'cjk-misc'
    whose exact rules I don't know.

  . The third quirk is with the 'symbol' pseudo-script: we lump there
    all punctuation characters and all symbol characters (those for
    which the General Category is one of Pc, Pd, Ps, Pe, Pi, Pf, Po,
    Sm, Sc, Sk, So), but with the following notable exception:
    punctuation characters that belong to blocks that include
    non-punctuation characters are left in those blocks -- those are
    punctuation characters used only with the scripts named by those
    blocks, like U+05BE HEBREW PUNCTUATION MAQAF, which is only used
    by the Hebrew script.

  . Another quirk is that mathematical alphanumerics (which are just
    letters from the Unicode POV) are lumped into a separate script
    'mathematical'.
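
As a rough illustration of the first two items above (not the actual
Emacs code), here is a minimal Python sketch that reads Blocks.txt-style
lines and derives a script name by downcasing and hyphenating the block
name.  The function names and the lumping regexp are hypothetical; the
regexp only covers the "Basic", "Extended", "Extended-X" and "Supplement"
forms and would need extending for the other quirks (CJK, 'symbol',
'mathematical').

```python
import re

# Hypothetical heuristic: strip markers that make a block an extension
# of a base script.  The real lumping rules are broader than this.
_LUMP_RE = re.compile(
    r"^Basic\s+|(?:-\d+)?\s+Supplement$|\s+Extended(?:-[A-Z])?$"
)


def parse_blocks(text):
    """Yield (start, end, block_name) from Blocks.txt-style text."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        range_part, name = (f.strip() for f in line.split(";", 1))
        start, end = (int(x, 16) for x in range_part.split(".."))
        yield start, end, name


def block_to_script(name):
    """Lump extension blocks, then downcase and replace blanks with '-'."""
    base = _LUMP_RE.sub("", name)
    return base.lower().replace(" ", "-")
```

With this sketch, "Basic Latin", "Latin Extended-A" and "Latin-1
Supplement" all map to 'latin', while a block name with no such marker
is only downcased and hyphenated.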

Alternatively, one could use Scripts.txt from the UCD, and then the
only problem is to subdivide what they call "Common" into the scripts
we use.

For the general category of a character, one can do in Emacs:

      (get-char-code-property CHAR 'general-category)

Alternatively, one can search UnicodeData.txt directly: the General
Category is the 3rd field there.
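
For the UnicodeData.txt route, a minimal Python sketch (an illustration
only; the function names are hypothetical) that pulls the General
Category out of one semicolon-separated line and checks it against the
punctuation and symbol categories listed above for 'symbol'.  It does
not handle the "<..., First>"/"<..., Last>" range entries, and it does
not implement the exception for punctuation that stays in its own
block.

```python
# General Category values listed above for the 'symbol' pseudo-script.
SYMBOL_CATEGORIES = frozenset(
    {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc", "Sk", "So"}
)


def general_category(line):
    """Return (codepoint, category) from one UnicodeData.txt line."""
    fields = line.rstrip("\n").split(";")
    # Field 1 is the code point, field 2 the name, field 3 the category.
    return int(fields[0], 16), fields[2]


def is_symbol_candidate(line):
    """True if the character's category is in the punctuation/symbol set."""
    return general_category(line)[1] in SYMBOL_CATEGORIES
```

A real generator would still have to consult the block of each candidate
before assigning it to 'symbol', to honor the exception described above.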

Patches are welcome to do all of the above automatically, perhaps with
some small database that expresses the more tricky of the above rules.




