Re: IFS whitespace definition
From: Robert Elz
Subject: Re: IFS whitespace definition
Date: Mon, 21 Oct 2024 21:13:48 +0700
Date: Fri, 18 Oct 2024 16:28:06 -0400
From: Chet Ramey <chet.ramey@case.edu>
Message-ID: <a0363635-e1bf-4c0d-951c-a74b73f00ccc@case.edu>
| At the time (previous edition of the standard), POSIX defined whitespace
| as "In the POSIX locale, white space consists of one or more <blank> (
| <space> and <tab> characters), <newline>, <carriage-return>, <form-feed>,
| and <vertical-tab> characters."
Yes, that was from XBD 3.142 (in Issue 7), labelled "White Space".
| The word splitting section wasn't quite
| as rigorous as the current version's, but it referenced this definition.
Actually, it didn't, which was one of that section's many problems.
What it said was

    The term ``IFS white space'' is used to mean any sequence (zero or
    more instances) of white-space characters that are in the IFS value

There's no reference to anything in XBD, and the term it uses is
"white-space", not the "white space" that the definitions define. And
yes, that hyphen really makes a difference in things like this.
| However, the conformance suite tests for this.
The test suite has, before now, tested for what its developers thought
the standard said rather than what it actually says, and it probably
will again.
| The comment in locale_setblanks explains this: some systems, like macOS,
| return true from isspace() for characters between 0x80 and 0xff even
| though they introduce multibyte characters (every locale besides "C"
| in macOS uses UTF-8 encoding).
I assume that the macOS people assume that if you're fetching multi-byte
characters you should be fetching the whole character before testing what
kind of object it is.
That's certainly what the new standard requires of processing IFS - even
though what it is splitting is just treated as bytes, deciding what is
IFS white space needs to use properly decoded characters from IFS, not
just treat it as a byte string. Then, when testing the field being
split (or the line read, in the case of the read builtin), if the sequence
of bytes at the current position matches a character in IFS, then that's
a match. If not, one byte is passed over, and the test is tried again at
the next byte (that's processing the input as a byte sequence). And yes,
that's still something of a mess, but (at least) when using UTF-8 encoding
it all ends up working in any case where it possibly can. In other
multi-byte locales, anything is possible.
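
As a rough illustration of the matching described above (not code from
bash or any other shell, and not the full field splitting rules), here is
a minimal Python sketch. The function name scan_ifs and its arguments are
made up for the example. It assumes IFS has already been decoded into
characters and that the locale encoding is UTF-8, and it only classifies
the input into delimiter and data chunks; it does not collapse IFS white
space or build fields.

```python
def scan_ifs(data: bytes, ifs: str, encoding: str = "utf-8"):
    """Classify `data` into (is_delimiter, chunk) pairs.

    Hypothetical helper: `ifs` is the already-decoded IFS value, `data`
    is the raw byte string being split.
    """
    # IFS is handled as whole characters; keep each one's byte encoding.
    delims = [ch.encode(encoding) for ch in ifs]
    out = []
    pos = 0
    while pos < len(data):
        # Does the byte sequence at the current position match an IFS character?
        match = next((d for d in delims if data.startswith(d, pos)), None)
        if match is not None:
            out.append((True, match))
            pos += len(match)
        else:
            # No match: pass over exactly one byte of input and try again.
            byte = data[pos:pos + 1]
            if out and not out[-1][0]:
                out[-1] = (False, out[-1][1] + byte)  # extend current data run
            else:
                out.append((False, byte))
            pos += 1
    return out


if __name__ == "__main__":
    # IFS holds <space> and U+3000 (IDEOGRAPHIC SPACE, bytes E3 80 80 in UTF-8).
    for is_delim, chunk in scan_ifs("a\u3000b c".encode("utf-8"), " \u3000"):
        print("delim" if is_delim else "data ", chunk)
```

With IFS set that way, the bytes E3 80 81 (U+3001) in the input are not a
match even though they share a two-byte prefix with U+3000; they are just
passed over one byte at a time as data.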
kre