emacs-devel

Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF


From: stephen
Subject: Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
Date: Sun, 27 Sep 2015 15:20:56 +0900

Paul Eggert writes:

 > I think your information is out of date.

Rather, I think that yours is superficial.  Really, you should listen
to those of us who live and work outside of the ASCII hemisphere.

I live and teach in Japan (a stone's throw from ETL, as it happens),
and most of the students I supervise are Chinese.  I regularly need to
access Chinese and Japanese government and corporate data, and
retrieve preprints and data (and sometimes code) from the personal
pages of other scholars.  Mojibake in the HTML pages is frequent, in
both Firefox and Chrome (of course it's almost always easy to guess
the actual coded character set in use, but it is mojibake).  A
frequent cause is a web server configured to send "Content-Type:
text/html; charset=utf-8" for a page that is encoded in something else.
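
To make the failure mode concrete, here is a minimal sketch in Python of
what a client sees when it trusts such a label.  The sample text and the
helper name decode_as_labelled are my own assumptions for illustration;
no real page or browser behaviour is being reproduced here.

```python
# Hypothetical illustration: a page body encoded in Shift JIS but served
# with a "charset=utf-8" label.
text = "日本語"                    # sample text (an assumption, not from a real page)
body = text.encode("shift_jis")   # the bytes the server actually sends


def decode_as_labelled(data: bytes, label: str) -> str:
    """Decode as a client trusting the label would, replacing invalid bytes."""
    return data.decode(label, errors="replace")


mojibake = decode_as_labelled(body, "utf-8")      # trusts the wrong label
correct = decode_as_labelled(body, "shift_jis")   # uses the real encoding

print(repr(mojibake))
print(correct)
```

The bytes are intact in both cases; only the label differs, which is why
guessing the real coded character set is usually enough to recover the
text.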

 > Yes, ten years ago there was a lot of non-UTF-8 out there, but
 > nowadays they've largely moved on to UTF-8.

"Beauty is only skin-deep."  The *top* pages, and some whole sites,
have moved on, because having beautiful (if mostly useless) top pages
is a matter of "face", so they buy new ones from companies with fancy
up-to-date web design software every couple of years.  Perhaps most
recently authored pages are UTF-8.  But the data sets themselves are
typically flat files, either CSV or plaintext.  The explanatory pages,
even if in HTML, often haven't been revised in decades.  Such useful
content is typically in a national standard coded character set rather
than Unicode.

And Emacs is hardly limited to the web.  In practice, almost all mail
I receive from Chinese correspondents (even when it is in English or
Japanese) is labelled GB2312, GBK, or GB18030.  The great majority of
Japanese mail is either Shift JIS or ISO 2022 JP (sometimes with "OEM
characters" that even today aren't in Unicode because they're not in
JIS).
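
As a rough sketch of what handling such mail by its label involves, the
following Python uses only the standard library's codec registry.  The
label spellings in LABELS and the helper decode_labelled are assumptions
for illustration, and this says nothing about how Emacs itself does it.

```python
import codecs

# Charset labels of the kind named above, as they might appear in a MIME
# "charset=" parameter (the exact spellings are assumptions).
LABELS = ["GB2312", "GBK", "GB18030", "Shift_JIS", "ISO-2022-JP", "Big5"]


def decode_labelled(data: bytes, label: str) -> str:
    """Decode bytes according to their declared label, strictly."""
    codec = codecs.lookup(label)      # raises LookupError for an unknown label
    return data.decode(codec.name)    # raises UnicodeDecodeError if mislabelled


# Each label resolves to a codec shipped with CPython.
resolved = {label: codecs.lookup(label).name for label in LABELS}

# GB18030 extends GBK, which extends GB2312, so bytes written under the
# older label still decode under the newer one.
sample = "中文".encode("gb2312")
as_gb18030 = decode_labelled(sample, "GB18030")
print(as_gb18030)
```

Strict decoding is a deliberate choice in this sketch: a wrong label then
fails loudly instead of silently producing mojibake.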

 > Of course one can still find a few web sites using other encodings,
 > but like it or not, UTF-8 dominates now.

What's not to like about UTF-8?!  I *wish* non-UTF-8 was a matter of
information archaeology and Buddhist scholarship!  I'm sad to say, it
is not: GB variants, Big5, and JIS variants are the *majority* of the
non-ASCII data I handle every day in my Emacs.  (It's not the "great
majority" only because about 30% of the non-ASCII text I handle in
Emacs is authored by me, in UTF-8, of course.)

Regards,


