emacs-devel

Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF


From: Eli Zaretskii
Subject: Re: [Emacs-diffs] master db828f6: Don't rely on defaults in decoding UTF-8 encoded Lisp files
Date: Sun, 27 Sep 2015 10:27:58 +0300

> Cc: address@hidden, address@hidden, address@hidden
> From: Paul Eggert <address@hidden>
> Date: Sat, 26 Sep 2015 13:32:33 -0700
> 
> Eli Zaretskii wrote:
> > The relevant statistics for Emacs is of source files, not of HTML
> > pages.
> 
> Sure, and source files are how this thread got started: nowadays in GNU
> projects they're typically UTF-8 regardless of system locale settings, and
> Emacs should be better about supporting this typical situation.  UTF-8 is
> common partly because source files are shared widely via the Internet, on
> sites like Savannah.
>
> The days of lonely hackers writing code in their own private Shift-JIS
> directories are largely over.  Of course Emacs can still support such users,
> but the default should be tailored to what's more typical nowadays.

Emacs supports the typical situation quite well already, definitely so
in a typical (i.e. UTF-8) locale.  The issue at hand is not how to
support the typical situation; it's whether that typical situation is
the _only_ situation that matters, so much so that we can ignore the
locale-derived defaults.

In any case, I said we needed _statistics_, i.e. numbers, not just
impressions and opinions.

I don't know how to find a representative set of C sources, not even
for European locales.  I looked at the C files of GNU projects from
recent years on my main development system, which is probably not
very representative.  There are more than 142,000 C files there.
Using the 'file' utility, I found that about 1.8% of them are UTF-8
encoded and about 0.2% are ISO-8859 encoded (the vast majority are
US-ASCII, of course).  That's still more than 250 ISO-8859 encoded files.
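
A survey along these lines can be roughly approximated without the
'file' utility.  The sketch below is only an illustration, not the
procedure used for the numbers above: it labels each file as US-ASCII,
valid UTF-8, or "other 8-bit" by trial decoding, so it cannot tell
ISO-8859 apart from other legacy encodings the way 'file' tries to.
The function names and the "*.c" pattern are assumptions made for the
example.

```python
from collections import Counter
from pathlib import Path


def classify(data: bytes) -> str:
    """Roughly label raw bytes as US-ASCII, UTF-8, or other 8-bit text."""
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: some legacy encoding (ISO-8859-x, Shift-JIS, ...)
        # or binary data; this sketch does not try to distinguish them.
        return "other-8bit"


def survey(root: str, pattern: str = "*.c") -> Counter:
    """Count encoding labels for all files under ROOT matching PATTERN."""
    counts = Counter()
    for path in Path(root).rglob(pattern):
        if path.is_file():
            counts[classify(path.read_bytes())] += 1
    return counts


def percentages(counts: Counter) -> dict:
    """Convert raw counts to percentages of the total number of files."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}
```

Calling percentages(survey("/path/to/sources")) on a source tree gives
figures comparable in kind to the ones quoted above, with the caveat
that every non-UTF-8 file lands in a single bucket.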

I've also looked at the *.po files in the latest releases of GNU Make,
Gawk, Texinfo, and Binutils, and I find that between 20% and 25% of
such files still use non-UTF-8 encodings.  I see similar figures for
the txi-*.tex files that came with Texinfo 6.0.  Presumably, that
follows the default conventions of the respective locales.
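
For *.po files the declared encoding can be read from the header
entry, which normally carries a line of the form
"Content-Type: text/plain; charset=...".  The following is a minimal
sketch under that assumption, not the method used for the figures
above; po_charset and non_utf8_share are hypothetical helper names,
and a file whose header lacks a charset is counted as non-UTF-8 here.

```python
import re

# Matches the charset declared in a PO header entry, e.g.
#   "Content-Type: text/plain; charset=ISO-8859-1\n"
_CHARSET_RE = re.compile(
    rb"Content-Type:[^\n]*?charset=([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def po_charset(data: bytes):
    """Return the declared charset of a PO file in upper case, or None."""
    match = _CHARSET_RE.search(data)
    if match is None:
        return None
    return match.group(1).decode("ascii").upper()


def non_utf8_share(charsets) -> float:
    """Fraction of entries whose declared charset is not UTF-8."""
    charsets = list(charsets)
    if not charsets:
        return 0.0
    other = sum(1 for name in charsets if name != "UTF-8")
    return other / len(charsets)
```

Feeding the contents of each *.po file in a release through po_charset
and the results through non_utf8_share gives the kind of ratio
discussed here.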

So, while I agree with you that UTF-8 encoded files are the majority
among non-ASCII files (and Emacs development aligns itself with that
fact very well), the non-UTF-8 minority, even in the Posix world, is
still significant enough, and we cannot possibly ignore it.


