bug#21251: sed: POSIX and the z command

From:	Stephane Chazelas
Subject:	bug#21251: sed: POSIX and the z command
Date:	Sat, 28 Jan 2017 10:01:55 +0000
User-agent:	Mutt/1.5.24 (2015-08-30)

2017-01-28 01:48:19 +0000, Assaf Gordon:
[...]
> On Thu, Aug 13, 2015 at 03:55:20PM +0100, Stephane Chazelas wrote:
> >[...] The behaviour
> >of sed on non-text input is unspecified, so it doesn't require
> >that . not match a byte that is not part of a valid character.
> >[...]
> >That POSIX requirement is true for regexec() but not for text
> >utilities.
> 
> I'm far from familiar with POSIX intricacies, but doesn't that sound a bit
> strange?  I would naively think that POSIX would encourage POSIX-compliant
> text utilities to use the system's native regexec implementation, instead of
> supporting slightly different semantics...

Hi Assaf,

It doesn't preclude the use of regexec. It just leaves the
behaviour unspecified when the input is not text, like when
lines are longer than LINE_MAX or when they contain NUL bytes or
when they contain sequences of bytes not forming valid
characters or when there are characters after the last newline
character.

Upon sequences of bytes that don't form valid characters, you're
free to exit with an error, shut down the computer, or whatever
you like; POSIX doesn't care.

What POSIX tells the users of the POSIX API (that is, script
writers and sed users) is that they can't expect anything on
non-text input.

GNU sed already handles lines longer than LINE_MAX nicely, as
well as lines containing NUL bytes or an unterminated last line.

I'd argue that for sequences of bytes that don't form valid
characters, it would be nicer if "." or "[^anything]" matched
each of the individual bytes. It's what bash's * and ? and
[!anything] fnmatch() patterns do (even though in that case
POSIX seems to forbid it; that has been discussed on the Austin
Group mailing list as well).
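
To make the difference concrete, here is a minimal Python sketch (not
sed itself) of the two behaviours for an idiom like s/.*//. It assumes
a UTF-8 locale and models the character-based reading as ".* stops at
the first byte that does not start a valid character". The function
names are made up for illustration only.

```python
import re


def clear_char_based(pattern_space: bytes) -> bytes:
    """Model s/.*// when '.' only matches valid UTF-8 characters.

    '.*' matches at offset 0 and extends up to the first byte that is
    not part of a valid character; everything from there on is left.
    """
    try:
        pattern_space.decode("utf-8")
    except UnicodeDecodeError as err:
        return pattern_space[err.start:]
    return b""


def clear_byte_based(pattern_space: bytes) -> bytes:
    """Model s/.*// when '.' also matches each stray byte on its own."""
    return re.sub(rb".*", b"", pattern_space, count=1, flags=re.DOTALL)


line = b"St\xe9phane"          # "Stephane" with e-acute in ISO 8859-1
print(clear_char_based(line))  # b'\xe9phane'
print(clear_byte_based(line))  # b''
```

Under the first model the pattern space is not emptied, which is the
situation the z command was added for; under the second it is.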

> >See that discussion on the Austin Group mailing list:
> >http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098
> 
> This link seems broken. Would you know where to find this discussion
> online?
[...]

Yes. That link relied on gmane for the mailing list archive. The web
interface was discontinued
(https://lars.ingebrigtsen.no/2016/07/28/the-end-of-gmane/),
then taken over by somebody else, but not everything is back
(https://lars.ingebrigtsen.no/2016/09/06/gmane-alive/comment-page-1/).

You can still find the discussion using the NNTP interface. I
attach the most relevant message (from Geoff Clare of the Austin
group). I can send you the whole discussion as a mailbox file if
you like.

-- 
Stephane

--- Begin Message ---
Subject:	Re: UTF-8 and non-characters
Date:	Wed, 1 Jul 2015 10:55:14 +0100
User-agent:	Mutt/1.5.21 (2010-09-15)

Stephane Chazelas <address@hidden> wrote, on 30 Jun 2015:
>
> Speaking of which, would a pseudo-UTF-8 locale where bytes that
> don't form valid characters are mapped to a character like
> U+FFFD (�) be POSIX compliant?
> 
> Like c3 a9 is é, but c3 41 a9 is �A�
> 
> or if not all mapped to a single character, mapped to dedicated
> unassigned code points (0x7fffff80 to 0x7fffffff for instance)? 
> 
> For instance, above c3 41 a9 being <U+7fffffc3>A<U+7fffffa9>
> 
> If allowed, would that not be desirable (I can see it
> potentially be a problem when processing partial input)?

I think this would cause inconsistency between btowc() and the various
multi-byte to wide-character conversion functions.

If btowc(0xc3) returns a wide character, then mbtowc() on c3 a9 ought
to convert the c3 to that wide character and return 1, instead of
converting c3 a9 to a wide é and returning 2.

Conversely, if btowc(0xc3) returns WEOF, then mbtowc() on c3 41 a9
ought not to convert the c3 to a wide character.
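
The consistency argument can be sketched with a small Python model. It
is only a stand-in for the C functions: btowc_strict, btowc_lenient and
mbtowc below are hypothetical helpers, None plays the role of WEOF, and
Python's "surrogateescape" handler (which maps stray bytes to U+DC80
through U+DCFF) is assumed in place of the 0x7fffff80 range proposed
above.

```python
WEOF = None  # stand-in for the WEOF / -1 error returns of the C functions


def btowc_strict(byte: int):
    """A single byte is a character only if it is complete by itself."""
    try:
        return bytes([byte]).decode("utf-8")
    except UnicodeDecodeError:
        return WEOF


def btowc_lenient(byte: int):
    """Pseudo-locale: a stray byte maps to a reserved code point instead."""
    return bytes([byte]).decode("utf-8", errors="surrogateescape")


def mbtowc(data: bytes):
    """Return (character, length) for the first character, else (WEOF, -1)."""
    for n in range(1, min(4, len(data)) + 1):
        try:
            return data[:n].decode("utf-8"), n
        except UnicodeDecodeError:
            continue
    return WEOF, -1


print(btowc_strict(0xC3))               # None
print(ascii(mbtowc(b"\xc3\xa9")))       # ('\xe9', 2)
print(mbtowc(b"\xc3\x41\xa9"))          # (None, -1)
print(ascii(btowc_lenient(0xC3)))       # '\udcc3'
```

With the lenient variant, c3 on its own is a character, yet c3 a9 still
converts to a single wide character of length 2, which is the mismatch
described above.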

> A common source of bugs and security vulnerabilities with
> UTF-8 is the fact that not all sequences of bytes map to
> characters and in particular that they're not matched by RE's
> "." or ".*" or fnmatch()'s ? or *.
> 
> That's a common problem when you can't guarantee the input is
> valid text for instance for arbitrary file names from the file
> system. That's quite common when dealing with file names that
> were written in a single-byte character set in UTF-8 locales.
> 
> For instance,
> 
> find . -name '*'
> 
> With GNU find at least doesn't match on $'St\xe9phane.txt'
> (Stéphane.txt in the iso8859-1 charset).
> 
> An example of a more serious problem:
> 
> find . ! -name "* *" -exec cmd-that-would-break-with-spaces {} +

It looks like the pattern matching sections of the standard have
some problems with the use of the terms character and string.

2.13.1 says * matches "multiple characters", but 2.13.2 says it
matches "any string" in item 1 and then says it matches "a string
of zero or more characters" (i.e. any character string) in item 3.
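
The two readings of * can be contrasted with a short Python sketch. It
is a model only, not what any particular find or fnmatch()
implementation does: match_character_string and match_any_string are
hypothetical names, a UTF-8 locale is assumed, and the file name is an
invented one in the spirit of the find example quoted above.

```python
import fnmatch


def match_character_string(name: bytes, pattern: str) -> bool:
    """'*' matches "a string of zero or more characters" only."""
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not a character string, so nothing matches
    return fnmatch.fnmatchcase(text, pattern)


def match_any_string(name: bytes, pattern: str) -> bool:
    """'*' matches "any string", stray bytes included."""
    return fnmatch.fnmatchcase(name, pattern.encode("utf-8"))


name = b"St\xe9phane notes.txt"  # ISO 8859-1 name that contains a space
print(match_character_string(name, "* *"))  # False
print(match_any_string(name, "* *"))        # True
```

Under the first reading, a test such as ! -name "* *" lets this name
through even though it contains a space.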

> GNU sed even went as far as defining a new command for emptying
> the pattern space to work around that problem:
> 
> `z'
>      This command empties the content of pattern space.  It is usually
>      the same as `s/.*//', but is more efficient and works in the
>      presence of invalid multibyte sequences in the input stream.
>      POSIX mandates that such sequences are _not_ matched by `.', so
>      that there is no portable way to clear `sed''s buffers in the
>      middle of the script in most multibyte locales (including UTF-8
>      locales).
> 
> Is that claim (about it being a POSIX requirement) true?

I think it's true for regexec(), but not for sed.

(Perhaps we should add a REG_EILSEQ error return for regexec().)

> I'd expect the behaviour to be unspecified if the input is not
> text (as would be the case if there are invalid multi-byte
> sequences).

Exactly.

> See also
> http://unix.stackexchange.com/questions/6516/filtering-invalid-utf8
> where we wondered whether grep -vx '.*' was required to report
> lines with invalid multi-byte sequences.

Unspecified, for the same reason as for sed.
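
For comparison, a filter that does not depend on how the regex engine
treats invalid sequences can be sketched in Python. This is a minimal
illustration assuming UTF-8 is the intended encoding; invalid_utf8_lines
is a hypothetical helper, not part of grep or sed.

```python
from typing import List


def invalid_utf8_lines(data: bytes) -> List[bytes]:
    """Return the lines of data that do not decode as UTF-8 text."""
    bad = []
    for line in data.split(b"\n"):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(line)
    return bad


sample = b"caf\xc3\xa9\ncaf\xe9\nplain\n"
print(invalid_utf8_lines(sample))  # [b'caf\xe9']
```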

> There was also a discussion earlier here about shells' ? and *
> on invalid byte sequences and most shells seem to match
> individual bytes from invalid multibyte sequences as one
> character (except for yash that won't deal with those at all)
> which seems to me like the safest thing to do.
> 
> What's the OpenGroup position on that?

2.13.1 is clear that ? matches a character.

The requirements for * are ambiguous because of the conflicting text
I pointed out above.

-- 
Geoff Clare <address@hidden>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England

--- End Message ---
