bug-grep
bug#30326: grep not searching through a text file (thinking it binary)


From: Paul Jackson
Subject: bug#30326: grep not searching through a text file (thinking it binary)
Date: Mon, 05 Feb 2018 23:38:05 -0600

Paul Eggert wrote:
>>  I was referring to text containing encoding errors without
>>  containing NULs
Ah - that makes sense.

The following experiment leads me to conclude that grep entirely
suppresses emitting any portion of a match that would contain an encoding
error, rather than emitting some substring of the match that can be
correctly encoded.
That is, if grep is asked to emit what it thinks would be a match with
an encoding error, it seems to suppress that output line entirely and
continue looking for matches that it can emit without encoding errors.
Then at the end, if it saw a match that would have emitted an encoding
error, it issues the "*Binary file ... matches*" error, just before
exiting (or ending processing of that particular file).
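
To make "encoding error" concrete for a UTF-8 locale, here is a minimal sketch of a well-formedness check. It is only an illustration and is not grep's code; as far as I know grep relies on the locale's multibyte functions, and the names `utf8_seq_len` and `has_encoding_error` are made up for this example. NUL bytes are deliberately not treated as errors here, since the question in this thread is about text without NULs.

```c
#include <stddef.h>

/* Length of the well-formed UTF-8 sequence starting at p (n bytes are
   available), or 0 if the bytes at p are an encoding error. */
size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    unsigned long cp, min;
    size_t len, k;

    if (n == 0)
        return 0;
    if (p[0] < 0x80)
        return 1;                       /* plain ASCII */
    else if (p[0] >= 0xC2 && p[0] <= 0xDF) { len = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if (p[0] >= 0xE0 && p[0] <= 0xEF) { len = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if (p[0] >= 0xF0 && p[0] <= 0xF4) { len = 4; cp = p[0] & 0x07; min = 0x10000; }
    else
        return 0;                       /* stray continuation or invalid lead byte */

    if (n < len)
        return 0;                       /* sequence cut short */
    for (k = 1; k < len; k++) {
        if ((p[k] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        cp = (cp << 6) | (p[k] & 0x3F);
    }
    if (cp < min)
        return 0;                       /* overlong form */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogate */
    if (cp > 0x10FFFF)
        return 0;                       /* beyond Unicode */
    return len;
}

/* Nonzero if the n bytes at s are not entirely well-formed UTF-8. */
int has_encoding_error(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len == 0)
            return 1;
        i += len;
    }
    return 0;
}
```

Under this check a line such as "case 'N':" is clean, while a line containing an isolated byte in the range 0x80..0xBF is not.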

I demonstrated this by replacing the ELF executable of my previous
example with the output of the following C program, which issues every
possible pair of bytes, except that it emits no nul bytes and no 255 bytes:

    #include <stdio.h>

    /* Emit every pair of bytes (i, j) with both bytes in 1..254. */
    int main(void)
    {
        int i, j;

        for (i = 1; i < 255; i++) {
            for (j = 1; j < 255; j++)
                printf("%c%c", i, j);
        }
        puts("");
        return 0;
    }

So I tested on a file (*/tmp/pjcc*) containing (1) a bunch of
ASCII C code, (2) output from the above program, and (3) another copy
of the same ASCII C code.
Then, with the following settings:

*LC_COLLATE=C*
*LANGUAGE=en_US.UTF-8*
*LC_ALL=en_US.UTF-8*
*LANG=en_US.UTF-8*

I ran the command:

*grep "'N'" /tmp/pjcc*

I got the following output:

*         case 'N':*
*         case 'N':*
*Binary file /tmp/pjcc matches*

The "*case 'N':*" string appears once in the C code used in the file,
but there are two copies of that C code in the file, so grep prints
that line twice.
I also double checked that my file */tmp/pjcc* did not contain any
nul bytes.
The three character sequence *'N'* also appears in the middle section of
all non-nul, non-255 pairs of bytes, as well as in the ASCII C code, and
it was (I presume) the match on that section of the file that caused
grep to issue the "*Binary file /tmp/pjcc matches*" complaint at the
end of its processing of that file.
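
The claims about where each pattern occurs in the byte-pair section can be checked without grep. The following is a minimal sketch that rebuilds the same bytes in memory (everything the program prints before its final newline) and counts plain substring occurrences. The helper names `fill_pair_stream` and `count_occurrences` are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Number of bytes in the pair section: 254 * 254 pairs, 2 bytes each. */
#define PAIR_STREAM_LEN (254u * 254u * 2u)

/* Fill buf (at least PAIR_STREAM_LEN bytes) with every pair (i, j),
   both in 1..254, in the same order the test program prints them. */
size_t fill_pair_stream(unsigned char *buf)
{
    size_t n = 0;
    int i, j;

    for (i = 1; i < 255; i++)
        for (j = 1; j < 255; j++) {
            buf[n++] = (unsigned char)i;
            buf[n++] = (unsigned char)j;
        }
    return n;
}

/* Count occurrences (overlapping ones included) of pat in buf[0..n). */
size_t count_occurrences(const unsigned char *buf, size_t n, const char *pat)
{
    size_t m = strlen(pat), hits = 0, k;

    if (m == 0 || m > n)
        return 0;
    for (k = 0; k + m <= n; k++)
        if (memcmp(buf + k, pat, m) == 0)
            hits++;
    return hits;
}
```

On this stream, counting *'N'* gives one occurrence (in the run where every other byte is an apostrophe, so the next byte after the match is O rather than a colon), counting *'N':* gives zero, and `memchr(buf, 0, n)` finds no nul byte.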

If on the other hand, I ran the command:

*grep "'N':" /tmp/pjcc*

then I got the output:

*         case 'N':*
*         case 'N':*

with*out* any "*Binary file /tmp/pjcc matches*" complaint.

The four character sequence *'N':*  appears (twice) in the C code,
but zero times in the middle section of all non-nul, non-255
pairs of bytes.
From this I conclude that if grep, in its default mode, is asked to emit
a matching pattern that would contain encoding errors, it does not trim
the output to what would encode correctly and continue onward, but rather
emits nothing for that match, continues onward looking for more matches
that it can emit correctly, and then prints the "*Binary file ... matches*"
error just before it exits or goes to the next file.
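
The behavior I am describing can be written down as a toy model. This is only a sketch of what I observed, not grep's implementation: `scan_buffer`, `line_contains`, and `line_is_clean` are hypothetical names, fixed-string matching stands in for regex matching, and `line_is_clean` uses "every byte is ASCII" as a crude stand-in for the locale-dependent encoding check.

```c
#include <stdio.h>
#include <string.h>

struct scan_result {
    size_t printed;     /* matching lines written to out */
    int suppressed;     /* nonzero if some matching line was withheld */
};

/* Stand-in (assumption) for "this line has no encoding errors". */
int line_is_clean(const unsigned char *line, size_t len)
{
    size_t k;

    for (k = 0; k < len; k++)
        if (line[k] >= 0x80)
            return 0;
    return 1;
}

/* Nonzero if pat occurs in line[0..len) as a plain substring. */
int line_contains(const unsigned char *line, size_t len, const char *pat)
{
    size_t m = strlen(pat), k;

    for (k = 0; k + m <= len; k++)
        if (memcmp(line + k, pat, m) == 0)
            return 1;
    return 0;
}

/* Print clean matching lines, withhold unclean ones, and report the
   withheld matches once, after the whole buffer has been scanned. */
struct scan_result scan_buffer(const unsigned char *buf, size_t n,
                               const char *pat, const char *name, FILE *out)
{
    struct scan_result r = {0, 0};
    size_t start = 0;

    while (start < n) {
        const unsigned char *line = buf + start;
        const unsigned char *nl = memchr(line, '\n', n - start);
        size_t len = nl ? (size_t)(nl - line) : n - start;

        if (line_contains(line, len, pat)) {
            if (line_is_clean(line, len)) {
                fwrite(line, 1, len, out);
                fputc('\n', out);
                r.printed++;
            } else {
                r.suppressed = 1;   /* emit nothing for this match */
            }
        }
        start += len + 1;           /* skip past the newline */
    }
    if (r.suppressed)
        fprintf(out, "Binary file %s matches\n", name);
    return r;
}
```

Fed a buffer with a clean "case 'N':" line, then a line holding *'N'* next to a high byte, then the clean line again, the model prints the two clean lines followed by the binary-file message for the pattern *'N'*, and only the two clean lines for the pattern *'N':*.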

If I were designing grep from scratch, and had infinite resources, I
might prefer to have grep emit some substring of each match that it can
encode correctly, rather than emit nothing in case of an encoding error.

However, I can't imagine that this is worth the effort, and
(being a stick in the mud old fart) I usually recommend against
incompatible changes unless strongly necessary.

So ... whatever ... nevermind ... as they say.

--
                Paul Jackson
                address@hidden


