
bug#21251: sed: POSIX and the z command


From: Assaf Gordon
Subject: bug#21251: sed: POSIX and the z command
Date: Sat, 28 Jan 2017 21:04:25 +0000
User-agent: Mutt/1.5.23 (2014-03-12)

Hello Stephane,

On Sat, Jan 28, 2017 at 10:01:55AM +0000, Stephane Chazelas wrote:
It doesn't preclude the use of regexec. It just leaves the
behaviour unspecified when the input is not text

Thanks for the clarification.

I'd argue that for sequences of bytes that don't form valid
characters, it would be nicer if "." or "[^anything]" matched
each of the individual bytes.

Concretely, GNU sed uses several regex engines now (gnulib's dfa for
fast matching, then either glibc's or gnulib's RE for general matching and substitution).

To support this behaviour we'll need to ensure all of them behave in
the same reproducible and reliable manner (not impossible, just a TODO).
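
As a rough illustration of the per-byte matching suggested above (in Python, not in sed, and not a description of how any of GNU sed's regex engines work), the sketch below decodes input so that every byte that is not part of a valid UTF-8 character becomes its own one-unit "character", which "." then matches individually. The function name `match_units` and the sample input are hypothetical, chosen only for this example.

```python
import re

def match_units(data, pattern="."):
    """Return the byte strings matched by `pattern`, treating every
    undecodable byte as a single one-byte 'character'."""
    # surrogateescape maps each invalid byte 0xNN to the lone surrogate
    # U+DCNN, so it occupies exactly one position in the decoded string.
    text = data.decode("utf-8", errors="surrogateescape")
    # Encoding each match back with the same handler restores the raw bytes.
    return [m.group(0).encode("utf-8", errors="surrogateescape")
            for m in re.finditer(pattern, text)]

# "a", a stray 0xff byte, a valid two-byte "é", then a truncated sequence (0xc3).
sample = b"a\xff\xc3\xa9\xc3"
units = match_units(sample)
```

The valid two-byte character stays one unit while each invalid byte becomes its own match, and joining the matches gives back the original input unchanged.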

You can still find the discussion using the NNTP interface. I
attach the most relevant message (from Geoff Clare of the Austin
group). I can send you the whole discussion as a mailbox file if
you like.

I would appreciate it if you could send it to me - I'm interested
in multibyte processing for other GNU programs as well.


From: Geoff Clare <address@hidden>
GNU sed even went as far as defining a new command for emptying
the pattern space to work around that problem:
[...]
Is that claim (about it being a POSIX requirement) true?

I think it's true for regexec(), but not for sed.

(Perhaps we should add a REG_EILSEQ error return for regexec().)

I'd expect the behaviour to be unspecified if the input is not
text (as would be the case if there are invalid multi-byte
sequences).

Exactly.

So the above somewhat confuses me (as I wrote in my previous email):

Let's say I was to write a new simple 'sed' for POSIX systems.
If POSIX/OpenGroup encourages me (as a software writer for POSIX
systems) to use the POSIX regexec API, then implicitly my 'sed'
program wouldn't match invalid multibyte sequences.
But if OpenGroup wants me to match invalid multibyte sequences in 'sed',
it means that in practical terms I shouldn't use the POSIX API and
should implement my own regex engine...
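
To make the first half of that trade-off concrete, here is a minimal sketch of a matcher that refuses non-text input outright. It is a Python analogy only: it does not call the POSIX regexec API, and the `IllegalSequence` exception is a hypothetical stand-in for the REG_EILSEQ error return floated above, which does not exist in POSIX today.

```python
import re

class IllegalSequence(Exception):
    """Hypothetical analogue of a REG_EILSEQ-style error (an assumption)."""

def strict_search(pattern, data, encoding="utf-8"):
    """Search `data` for `pattern`, rejecting input that is not valid text."""
    try:
        # Strict decoding: any invalid multibyte sequence raises an error.
        text = data.decode(encoding)
    except UnicodeDecodeError as exc:
        raise IllegalSequence(
            "invalid byte at offset %d" % exc.start) from exc
    m = re.search(pattern, text)
    return m.group(0) if m else None

ok = strict_search("b.d", b"abcd")        # valid text: matching proceeds
try:
    strict_search("b.d", b"ab\xffd")      # stray 0xff: input is not text
    rejected = False
except IllegalSequence:
    rejected = True
```

A tool built this way never matches across invalid bytes; it simply reports that the input is not text, which is the behaviour a sed relying on a strict regex API would inherit.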

You compared it with LINE_MAX, but realistically, implementing support for lines longer than LINE_MAX is a very different scale of effort than implementing a new regex engine...

What am I missing?

Thanks!
- assaf






