Re: sort
From: Nathan Moore
Subject: Re: sort
Date: Mon, 29 Aug 2005 23:42:57 -0400
User-agent: Mozilla Thunderbird 1.0.6 (X11/20050716)
Bob Proulx wrote:
Nathan Moore wrote:
I guess that the best way to put it is, what is the correct behavior
when none of the LC_ environmental variables
are set?
What is the output of 'locale'?
locale
address@hidden:~> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
address@hidden:~>
address@hidden:~> set | grep LC_
address@hidden:~> set | grep LANG
LANG=en_US.UTF-8
address@hidden:~>
This is what I got w/o me actually setting anything.
That will display the settings according to the environment
variables. If none are set then you will get a C/POSIX locale by
default. But that command will display them individually.
This really isn't mentioned in the documentation (or I wasn't
able to find it). My version of coreutils
is 5.2.1, which is the most recent.
Please suggest improvements to the documentation so that it can be
updated. The info docs currently say this:
(1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
`en_US'), then `sort' may produce output that is sorted differently
than you're accustomed to. In that case, set the `LC_ALL' environment
variable to `C'. Note that setting only `LC_COLLATE' has two problems.
First, it is ineffective if `LC_ALL' is also set. Second, it has
undefined behavior if `LC_CTYPE' (or `LANG', if `LC_CTYPE' is unset) is
set to an incompatible value. For example, you get undefined behavior
if `LC_CTYPE' is `ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.
How might that be improved?
Looking at this now I think suggesting to set LC_ALL=C is too strong.
I know why it was done, so that it would override LANG. But now I
think it should probably just suggest setting LANG, with the warning
that LC_COLLATE overrides LANG and LC_ALL overrides LC_COLLATE.
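The precedence described above can be sketched as a small shell
function. This is only an illustration, not anything taken from
coreutils: the name effective_collate is made up for this example, and
the final fallback to "C" assumes a system where an empty environment
yields the C/POSIX locale.

```shell
# effective_collate: print the locale name that governs collation,
# using the precedence LC_ALL > LC_COLLATE > LANG.
# A variable that is set but empty is treated the same as unset.
# The last branch ("C") is an assumption about the system default.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}

effective_collate
```

With only LANG=en_US.UTF-8 set, as in the locale output earlier in this
message, this prints en_US.UTF-8, so sort would collate by that locale
rather than by byte value.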
A pointer in the man and info pages to locale(1) would be nice. One
problem I had was that I really didn't know what the legal settings for
the variables were, and the sort docs really didn't point me anywhere.
Also, the default behavior should be documented better. Knowledge of
glibc locales should not be assumed of users of utility programs. I'm a
programmer and I've never actually messed with that stuff until today
(I just never had a need for it; I probably would have noticed if I
weren't US/English).
I've never really messed w/ the LC_ environmental variables before
and some of mine were not set (on SuSE 9.2).
You don't need to set all of them. Just the ones you want. Don't try
to set them all. Personally I use this:
export LANG=en_US.UTF-8
export LC_COLLATE=C
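If byte ordering is only wanted for a single command, another option is
to put the assignment on that command's line instead of exporting it.
A minimal sketch follows; the helper name c_sort is hypothetical, and it
uses LC_ALL so that it wins regardless of what else is set.

```shell
# c_sort: run sort with C (byte-value) collation for this one invocation.
# LC_ALL is placed only in sort's environment, so the calling shell keeps
# whatever LANG / LC_* values it already had.
c_sort() {
    LC_ALL=C sort "$@"
}

# Example with a few of the operators from this thread.
# Prints & ( [ { one per line (ascending byte values).
printf '%s\n' '{' '(' '&' '[' | c_sort
```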
I've figured it out (export `locale`), but it seems like that is one
of those things that just isn't written down anywhere.
You should not need to do that. I recommend against it.
Since sending the initial report, I had figured out that
"LC_COLLATE=ascii sort" did what I wanted.
Hmm... I think "ascii" is actually unrecognized and that is causing a
fallback to C/POSIX. I think other programs will complain when they
can't find that locale data. So this will actually create other
errors. Better to set this to C or POSIX instead.
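One way to check a name before using it is to look for it in the list
printed by "locale -a". The sketch below assumes the locale utility is
installed; the function name locale_is_installed is invented for this
example, and the exact-match test is deliberately simplistic.

```shell
# locale_is_installed NAME: succeed if NAME appears verbatim in `locale -a`.
# Caveat: glibc lists names in a normalized spelling (e.g. en_US.utf8),
# so a name can be usable even though this exact-match check says no.
locale_is_installed() {
    locale -a 2>/dev/null | grep -Fqx -- "$1"
}

for name in C POSIX ascii; do
    if locale_is_installed "$name"; then
        echo "$name: available"
    else
        echo "$name: not listed by locale -a"
    fi
done
```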
Well, that is odd. I would have thought that LC_COLLATE being
undefined, being set to empty, or being set to something invalid would
all have the same effect. But from sort I got ... Just noticed
something -- I actually overlooked the fact that the lines were
partially sorted with the defaults, but completely sorted with the
correct LC_COLLATE.
Ok... I'm attaching operators.tx_ (edited down version of a much longer
file operators.txt) that has 1 or 2 columns per line. The first column
is an operator for a C-like programming language and the optional
second column is a lex action for that operator. I was trying to use
sort as a quick way to make sure I hadn't missed an operator in one of
2 lists, since the sorted lines would have duplicates on adjacent lines
which would be easy to spot, leaving the operators without a match
needing further attention in either the lex file or the other file.
I had just "cat"ed the 2 file segments together (was not using sort's
merge features).
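The cross-check described above could also be automated rather than
eyeballed. A rough sketch, assuming the operator is always the first
whitespace-separated column and that each operator appears at most once
per list; the function name unmatched_operators is hypothetical.

```shell
# unmatched_operators FILE...: print operators that occur exactly once
# across all the given files, i.e. those with no partner in another list.
unmatched_operators() {
    # Keep only column 1 (the operator), skipping blank lines, then sort
    # by byte value so identical operators end up adjacent for uniq.
    awk 'NF { print $1 }' "$@" | LC_ALL=C sort | LC_ALL=C uniq -u
}
```

Called with the two lists as arguments, it prints only the operators
that still need attention. An operator listed twice in one file and
missing from the other would be hidden, so this is a quick check, not a
full comparison.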
I'm also attaching the output of a couple of runs of sort on this file.
The filenames have the environmental variables involved encoded into
them and should be easy to figure out.
LANG="en_US.UTF-8" for all runs.
LC_COLLATE="en_US.UTF-8" gave an empty file as output, but if export
`locale` is run prior to running the sort (which sets LC_COLLATE and a
bunch of other stuff to "en_US.UTF-8"), then the output is the same as
if LC_COLLATE= any one of "POSIX", "C", "ascii", or "your_mama". This
was actually the behavior that I wanted, but was not what I got w/
LANG="en_US.UTF-8" and LC_COLLATE not set.
I'm going to go investigate the locale settings more on my own. Are
there any pointers to places to look for C, shell, and system
configuration stuff related to this?
Thanks for your replies, and please tell me what the behavior is
without any LC_ settings. I'm just curious.
You get C/POSIX sort ordering by default if none of the LC_ variables
nor LANG (don't forget LANG) is set.
So, if the LC_ variables are not set but LANG is, what method of
comparison is used?
Note that GNU coreutils does not set any of the locale settings in
your environment. This was very likely done by your distro. I
believe that doing this without notifying the user is a distro problem
and not a coreutils problem. You might need to address this problem
with your distro.
I know that they (coreutils) do not set up the environment. Distro
setups should probably have options to delve into these settings a bit
more during installs.
(funny aside -- I had a Red Hat distro once that didn't come w/ stat.
That should have been illegal)
Thanks again for all of the time and help y'all have given me.
Nathan
=
,
:
::
?
?=
.
...
.@
(
)
[
]
{
}
&
&=
&&
& {ASCIIOP_RETURN(AND);}
. {ASCIIOP_RETURN(DOT);}
[ {ASCIIOP_RETURN(LB);}
{ {ASCIIOP_RETURN(LC);}
- {ASCIIOP_RETURN(MINUS);}
~ {ASCIIOP_RETURN(NEGATE);}
! {ASCIIOP_RETURN(NOT);}
+ {ASCIIOP_RETURN(PLUS);}
] {ASCIIOP_RETURN(RB);}
} {ASCIIOP_RETURN(RC);}
* {ASCIIOP_RETURN(STAR);}
# {NAMED_PPOP_RETURN('#') ;}
## {NAMED_PPOP_RETURN(POUNDPOUND);}
, {PPOP_RETURN(COMMA);}
( {PPOP_RETURN(LP);}
) {PPOP_RETURN(RP);}
=
::
:
?=
?
&
&=
&&
.
.@
...
,
[
]
{
}
(
)
( {PPOP_RETURN(LP);}
) {PPOP_RETURN(RP);}
, {PPOP_RETURN(COMMA);}
# {NAMED_PPOP_RETURN('#') ;}
## {NAMED_PPOP_RETURN(POUNDPOUND);}
{ {ASCIIOP_RETURN(LC);}
} {ASCIIOP_RETURN(RC);}
[ {ASCIIOP_RETURN(LB);}
] {ASCIIOP_RETURN(RB);}
. {ASCIIOP_RETURN(DOT);}
& {ASCIIOP_RETURN(AND);}
* {ASCIIOP_RETURN(STAR);}
+ {ASCIIOP_RETURN(PLUS);}
- {ASCIIOP_RETURN(MINUS);}
~ {ASCIIOP_RETURN(NEGATE);}
! {ASCIIOP_RETURN(NOT);}
! {ASCIIOP_RETURN(NOT);}
# {NAMED_PPOP_RETURN('#') ;}
## {NAMED_PPOP_RETURN(POUNDPOUND);}
&
& {ASCIIOP_RETURN(AND);}
&&
&=
(
( {PPOP_RETURN(LP);}
)
) {PPOP_RETURN(RP);}
* {ASCIIOP_RETURN(STAR);}
+ {ASCIIOP_RETURN(PLUS);}
,
, {PPOP_RETURN(COMMA);}
- {ASCIIOP_RETURN(MINUS);}
.
. {ASCIIOP_RETURN(DOT);}
...
.@
:
::
=
?
?=
[
[ {ASCIIOP_RETURN(LB);}
]
] {ASCIIOP_RETURN(RB);}
{
{ {ASCIIOP_RETURN(LC);}
}
} {ASCIIOP_RETURN(RC);}
~ {ASCIIOP_RETURN(NEGATE);}