bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation

bug-gnu-emacs

From:	Eli Zaretskii
Subject:	bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
Date:	Sun, 21 Jun 2015 18:00:20 +0300

> From: Glenn Morris <rgm@gnu.org>
> Cc: 20789@debbugs.gnu.org
> Date: Sat, 20 Jun 2015 19:34:01 -0400
> 
> I spent some time looking at some of these.
> In no case could I see a clear path from the inputs to the outputs.

Thanks for looking into this.

Let me first make a general comment: we can always convert only
certain parts of the setup to an automated procedure, and leave the
rest in its present form, more or less.  That's especially true where
Emacs has specialized needs or defines properties not in Unicode.

> >   . characters.el:
> >
> >     . The modify-category-entry calls -- they basically can be derived
> >       from Blocks.txt
> 
> I looked at it briefly. I can see that they are somewhat related, but
> not precisely how. Eg:
> 
> Emacs: 2E80:312F and 3190:33FF are "line breakable".
> Which means that "Hangul Compatibility Jamo" isn't. I have no idea why.
> 
> Emacs: 3400:4DBF and 4E00:9FAF are "2-byte han".
> Which means that "Yijing Hexagram Symbols" aren't. Again, I have no idea why.
> 
> I didn't look any further.

When I said "derived from Blocks.txt", I meant the categories that are
related to script names, like ASCII, Latin, Arabic, Chinese, etc.
Sorry for not saying that explicitly.

Other categories need other sources.  Here's my attempt to decipher
some of them:

 . ?| -- "line breakable"

   The data seems to be in LineBreak.txt, described in detail in
   UAX#14 (http://unicode.org/reports/tr14/).  It looks like
   characters with the ?| category are those whose line-break
   properties are ID or CJ or NS.  Therefore, the exclusion of Hangul
   Compatibility Jamo is a mistake (or maybe an omission, since the
   comment says "Chinese"); in particular, UAX#14 explicitly says, in
   section 5.1 under "ID", that the characters in the range 3130..318F
   are treated as class ID.

   This category is currently used only by kinsoku.el, which has its
   own data (and sets the ?< and ?> categories).  So this will only
   become important if we ever implement in Emacs something more
   general, like the algorithm described in UAX#14.

 . "2-byte han" -- I think this is related to their legacy encoding; I
   don't see this used anywhere.  Likewise with other 2-byte
   categories.  Perhaps Handa-san (CC'ed) could comment on their
   necessity.  If this is still needed, we should probably leave these
   alone.

 . ?0 - ?9 -- I don't see how to get this data from the UCD or any
   other source.  Some of it seems to be in IndicSyllabicCategory.txt,
   FWIW.

 . ?R and ?L -- already set up using the Unicode data, so no change is
   needed.

 . ?^ -- should be set for any character whose general-category is
   Mn.  Since we already do this, the manual setting around line 820
   is redundant and should be deleted.

 . ?. -- already set using Unicode data, no change needed.
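
As a rough illustration of the ?| item above, here is a minimal Python
sketch that collects the ranges whose line-break class is ID, CJ or NS.
It assumes the "range;class # comment" layout of LineBreak.txt; the
sample lines are illustrative and not copied from a particular release,
and the function name is hypothetical.

```python
# Illustrative lines in the "range;class # comment" layout of LineBreak.txt.
# They are a hand-written sample, not an excerpt from a specific release.
SAMPLE = """\
# comment line
0041;AL # LATIN CAPITAL LETTER A
3005;NS # IDEOGRAPHIC ITERATION MARK
3041;CJ # HIRAGANA LETTER SMALL A
3042;ID # HIRAGANA LETTER A
4E00..9FFF;ID # CJK ideographs (sample range)
"""

# Classes that, per the guess above, correspond to the ?| category.
LINE_BREAKABLE_CLASSES = {"ID", "CJ", "NS"}


def line_breakable_ranges(text, classes=LINE_BREAKABLE_CLASSES):
    """Return (first, last) code point pairs whose class is in CLASSES."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        codes, lb_class = (field.strip() for field in line.split(";")[:2])
        if lb_class not in classes:
            continue
        first, _, last = codes.partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


for first, last in line_breakable_ranges(SAMPLE):
    print(f"{first:04X}..{last:04X}")
```

Each resulting range would then map to one modify-category-entry call
for ?|; merging adjacent ranges is left out to keep the sketch short.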

> >     . The setup of char-width-table -- I think the information is in
> >       EastAsianWidth.txt, with background information described in
> >       UAX#11 (http://www.unicode.org/reports/tr11/)
> 
> Looks somewhat promising, but could you be more specific?
> There's nothing in that file that defines "zero width" characters, so I
> don't see where Emacs's width 0 characters come from.

The following rules regarding zero-width characters are due to Markus
Kuhn, and are excerpted from the description in comments to his
implementation of 'wcwidth' (http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c):

 . The null character (U+0000) has a column width of 0.
 . Non-spacing and enclosing combining characters (general category
   code Mn or Me in the Unicode database) have a column width of 0. 
 . ZERO WIDTH SPACE (U+200B) and format characters (general category
   code Cf in the Unicode database), except SOFT HYPHEN (U+00AD), have
   a column width of 0.
 . Hangul Jamo medial vowels and final consonants (U+1160-U+11FF) have
   a column width of 0.
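
The four rules above can be sketched in Python as follows.  This is
only an illustration of the rules as listed, not Kuhn's code and not
what Emacs does; the function name is hypothetical, and the general
categories come from whatever Unicode version Python's unicodedata
module bundles.

```python
import unicodedata


def is_zero_width(ch):
    """Return True if CH has column width 0 under the rules listed above."""
    cp = ord(ch)
    if cp == 0x0000:              # the null character
        return True
    if cp == 0x00AD:              # SOFT HYPHEN is the stated exception
        return False
    if cp == 0x200B:              # ZERO WIDTH SPACE
        return True
    if 0x1160 <= cp <= 0x11FF:    # Hangul Jamo medial vowels, final consonants
        return True
    # Non-spacing (Mn) and enclosing (Me) marks, and format characters (Cf).
    return unicodedata.category(ch) in ("Mn", "Me", "Cf")


print(is_zero_width("\u0301"))   # COMBINING ACUTE ACCENT
print(is_zero_width("a"))
```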

> The width 2 characters look like they might be the "W" and "F" characters,

Yes.

> but just doing that gives a list that has many differences to the list
> Emacs uses.

I don't see any significant differences, except perhaps in unassigned
codepoints (see paragraph 6.1 of UAX#11 for the treatment of
unassigned CJK codepoints).  I think any differences beyond that
should be treated as errors in Emacs in this case.
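
For comparison purposes, a small Python sketch of the "W or F means
width 2" rule could look like this.  It relies on the East Asian Width
data bundled with Python's unicodedata module rather than on a freshly
downloaded EastAsianWidth.txt, so its answers for unassigned codepoints
depend on the Python version; the helper names are hypothetical.

```python
import unicodedata


def is_double_width(ch):
    """Return True if CH is Wide (W) or Fullwidth (F) per East Asian Width."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def double_width_ranges(first, last):
    """Return contiguous (start, end) runs of width-2 code points."""
    ranges = []
    start = None
    for cp in range(first, last + 1):
        if is_double_width(chr(cp)):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, last))
    return ranges


for start, end in double_width_ranges(0x3041, 0x3043):
    print(f"{start:04X}..{end:04X}")
```

The output of such a scan could be compared range by range against
char-width-table to find the differences mentioned above.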

> >     . The setup of char-acronym-table: at least some of the data is in
> >       NameAliases.txt and NameList.txt
> 
> Looks somewhat promising.
> I can see how most of this comes from NameAliases.txt.
> But there are many oddities:
> 
> Why does Emacs not have anything for 0009 (HT or TAB) or 000A (LF, NL,
> or EOF)?

This table is set for the 'acronym' method of glyphless-char-display,
so I guess these characters were omitted because no one envisioned
them ever being displayed as glyphless.  I'd include them in the
table anyway, just in case, and also to keep our exceptions vs the
UCD to the bare minimum.

> 0019 is EOM in the source but EM in Emacs.

Typo, I think.

> 0080 is PAD in the source but XXX in Emacs.
> 0081 is HOP in the source but XXX in Emacs.
> 008F is SS3 in the source but SS1 in Emacs.
> 0099 is SGC in the source but XXX in Emacs.

I think these are typos and perhaps acronyms that whoever wrote this
didn't know.

> How does Emacs choose which entries to list? There are many more in the
> source. Could it do any harm to add more?

As long as you take only "abbreviations", i.e. they are short, I think
we should use all of them, given their use in Emacs.
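
A minimal Python sketch of taking only the abbreviations, assuming the
"code;alias;type" layout of NameAliases.txt and keeping the entries
whose type field is "abbreviation".  The sample is a short excerpt in
that layout, limited to codepoints discussed in this thread; the
function name is hypothetical.

```python
# A few lines in the "code;alias;type" layout of NameAliases.txt,
# restricted to code points mentioned in this thread.
SAMPLE = """\
0009;CHARACTER TABULATION;control
0009;HT;abbreviation
0009;TAB;abbreviation
0080;PAD;abbreviation
0081;HOP;abbreviation
"""


def abbreviations(text):
    """Map each code point to its list of abbreviation aliases."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        code, alias, kind = (field.strip() for field in line.split(";"))
        if kind == "abbreviation":
            table.setdefault(int(code, 16), []).append(alias)
    return table


for code, aliases in sorted(abbreviations(SAMPLE).items()):
    print(f"{code:04X}: {', '.join(aliases)}")
```

Where a codepoint has several abbreviations, something still has to
pick the one to show in char-acronym-table; the sketch keeps them all.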

> Where does "KIVAQ" come from? That appears nowhere in the source AFAICS.

AFAIK, that's the official name of that character.  At least that's
what I glean from Google; I know nothing about the Khmer script.

> Why does Emacs list two Khmer entries, and nothing else? There are loads
> more of them.

These are the only 2 that have such abbreviations; see
https://en.wikipedia.org/wiki/Khmer_alphabet (assuming by "loads more"
you meant the Khmer letters).

> >   . fontset.el:
> >
> >     . The setup of script-representative-chars
> 
> I don't see how. It seems to be "for some of, but not all, the entries
> in char-script-table, choose a single character somewhere in the range."

We should have a representative character for each entry in
char-script-table.  They are used with some font back-ends (xfont and
x?ftfont, AFAIR) to probe candidate fonts for coverage of the required
script, so we should have the full information about that.  I think
the only reason for the partial information we have now is that it is
maintained manually, so it includes whatever the people who worked on
that bothered to add.

> There seems to be no pattern to how the character is chosen within the
> range. Often the first one, but by no means always.

I think the rule is to choose the first one that is a letter, i.e. its
general-category is one of Lu, Ll, Lt, Lo, or Lm.
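
The guessed rule can be sketched in Python like this.  It is only an
illustration of "first letter in the range", with a hypothetical
function name, and it uses the general categories bundled with
Python's unicodedata module.

```python
import unicodedata

# General categories that count as "a letter" under the guessed rule.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Lm"}


def representative_char(first, last):
    """Return the first letter code point in FIRST..LAST, or None."""
    for cp in range(first, last + 1):
        if unicodedata.category(chr(cp)) in LETTER_CATEGORIES:
            return cp
    return None


# Devanagari block: the leading code points are combining marks,
# so the first letter is not the first character of the range.
print(hex(representative_char(0x0900, 0x097F)))
```

This would explain why the representative character is often, but not
always, the first one in the range: blocks that start with marks,
digits or punctuation get a later codepoint.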

> >   . mule-cmds.el:
> >
> >     . The setting of locale-language-names -- the data is available in
> >       IANA's Language Subtag Registry
> >       
> > (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
> >       and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
> >       http://www.loc.gov/standards/iso639-2/php/English_list.php)
> 
> Again, I don't see how. Eg nowhere in those source files do I see Welsh
> associated with iso-8859-14, and the comment in mule-cmds says that the
> last part is "implementation dependent".

The bulk of the data is the correspondence between the ISO 639
2-letter names and the country/culture name.  The few cases where we
also have the encoding could be set up with a very small database once
the main data is set, by adding the encoding to those few that need
it.
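
A rough Python sketch of that two-step setup, assuming the record
layout of the IANA registry ("%%" separators, "Field: value" lines).
The sample records are trimmed to the fields used here, the override
table holds only the Welsh case mentioned above, and all names in the
code are hypothetical.

```python
# Records in the layout of the IANA Language Subtag Registry,
# trimmed to the fields this sketch reads.
SAMPLE = """\
%%
Type: language
Subtag: cy
Description: Welsh
%%
Type: language
Subtag: he
Description: Hebrew
%%
Type: script
Subtag: Latn
Description: Latin
"""

# The "very small database" of encodings, added after the main data.
# Only the Welsh entry comes from this thread.
ENCODING_OVERRIDES = {"cy": "iso-8859-14"}


def language_names(text, overrides=ENCODING_OVERRIDES):
    """Return (code, name, encoding-or-None) for 2-letter language subtags."""
    entries = []
    for record in text.split("%%"):
        fields = {}
        for line in record.strip().splitlines():
            if line[:1].isspace():        # skip folded continuation lines
                continue
            key, sep, value = line.partition(":")
            if sep and key.strip() not in fields:   # keep first Description
                fields[key.strip()] = value.strip()
        subtag = fields.get("Subtag", "")
        if fields.get("Type") != "language" or len(subtag) != 2:
            continue
        entries.append((subtag, fields.get("Description"),
                        overrides.get(subtag)))
    return entries


for entry in language_names(SAMPLE):
    print(entry)
```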

If by "last part" you mean IPA and "Nonstandard or obsolete language
codes", then these are very few and can be added manually.

> > P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
> > reminder to fetch all those reference files and regenerate their
> > dependencies, before we prepare a release.
> 
> admin/FOR-RELEASE contains that kind of thing.

Right, I will add the information there.

Thanks.
