bug-grep

bug#20638: BUG: standard & extended RE's don't find NUL's :-(


From: Linda Walsh
Subject: bug#20638: BUG: standard & extended RE's don't find NUL's :-(
Date: Sun, 24 May 2015 23:48:03 -0700
User-agent: Thunderbird



Eric Blake wrote:
> On 05/23/2015 06:04 PM, L. A. Walsh wrote:
>> the standard & extended RE's don't find NUL's:
>
> Because NULs imply binary data,
I can think of multiple cases where at least one NUL
would be found in text data -- the prime example
being a Microsoft text file.
While MS usually uses a BOM at the beginning of
files, NT's original format was only LSB/UCS-2, so one
still runs into the occasional file -- but just rarely enough that
I never remember the vim command to convert the buffer to a compatible
format, and I waste time looking it up.

But more to the point, some unix utilities were designed to
work on files -- not just limited to text -- 'strings' for
example.  Right now, it seems grep has lost much in the
'robust' category -- I had one file that it bailed on,
saying it had an invalid UTF-8 encoding -- but the command line was
recursive, starting from '.' -- and it didn't name the file.

"-a" doesn't work, BTW:

Ishtar:/tmp> grep -a '\000\000' zeros
Ishtar:/tmp> echo $?
1
Ishtar:/tmp> grep -P '\000\000' zeros
Binary file zeros matches

But there it is -- if grep wasn't meant to handle binary files,
it wouldn't know to call 'zeros' a binary file.
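
One way to confirm the NULs really are in the file, without relying on grep at all, is to count them with tr and wc. This is only a minimal sketch, assuming a POSIX shell; 'count_nuls' is a hypothetical helper name, not an existing utility:

```shell
# count_nuls is a hypothetical helper name, not an existing utility.
# Usage: count_nuls FILE
count_nuls() {
    # tr -cd '\000' deletes every byte except NUL; wc -c counts what is left.
    # LC_ALL=C keeps tr byte-oriented, and the arithmetic expansion strips
    # the padding that some wc implementations print around the number.
    echo $(( $(LC_ALL=C tr -cd '\000' < "$1" | wc -c) ))
}
```

For a 4k file of nulls like 'zeros', 'count_nuls zeros' should report 4096, whatever grep makes of the same file.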

Many of the coreutils work equally well on binary
as well as text (cat, split, tr, wc to name a few).  But how
can 'shuf' claim to work on input lines yet have this allowed:

  -z, --zero-terminated
         line delimiter is NUL, not newline

'nl' claims the file, 'zeros' (4k of nulls -- created
by bash, that can write a file of zeros, but not read it)
is 1 line.

'pr' will print it (though not too well).

'xargs':
  <zeros xargs -0 |wc
  1 0 4096

POSIX is a least common denominator -- it is not a standard
of quality in any way.  People argue to dumb down POSIX
utils because some corp wants to get a POSIX label but
has a few shortcomings -- so they donate enough money and
POSIX changes its rules.

'less' works with it, but 'more' works faster (it just doesn't
display ctl chars). --- but one of the files I searched through
was base64 encoded, and in at least 2 places in the file there was
a run of ~100-200 zeros (in a 10k or more file).
(That's what I'm looking for -- signs of corruption)...
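
Hunting for runs like that can be done without grep by dumping the bytes with od and letting awk track consecutive zeros. A rough sketch, assuming a POSIX shell with od, tr and awk; 'nul_runs' and its minimum-length argument are made-up names for illustration:

```shell
# nul_runs is a hypothetical helper name, not an existing utility.
# Usage: nul_runs FILE [MIN_LENGTH]
# Prints "offset length" for every run of at least MIN_LENGTH NUL bytes
# (default 2); offsets are 0-based byte positions.
nul_runs() {
    # od prints each byte as an unsigned decimal number; tr puts one
    # number per line so awk can count byte positions.
    od -An -v -tu1 "$1" | tr -s ' ' '\n' | awk -v min="${2:-2}" '
        BEGIN { off = 0; len = 0 }
        $0 == "" { next }                     # blank line left by od padding
        {
            if ($1 == 0) {                    # NUL byte: start or extend a run
                if (len == 0) start = off
                len++
            } else {                          # any other byte ends the run
                if (len >= min) print start, len
                len = 0
            }
            off++
        }
        END { if (len >= min) print start, len }'
}
```

Something like 'nul_runs somefile 100' would then list only the long runs, which is the kind of thing that points at corruption rather than at a stray NUL.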

> and grepping binary data has unspecified
> results per POSIX.  What's more, the NEWS for 2.21 documents that grep
> is now taking the liberty of treating NUL as a line terminator when -a
> is not in effect, thanks to the behavior being otherwise unspecified by
> POSIX.
----
With a "-0" switch, I presume (not default behavior -- that would
be ungood :^/ )

> Try using 'grep -a' to force grep to treat the file as non-binary, in
> spite of the NULs.
doesn't work -- as mentioned above.  I'd say it's a bug
fair and square...




