bug-grep



From: Bjoern Voigt
Subject: bug#23763: Bug report: Grep stops, if a text file contains a null character after 32768 bytes
Date: Mon, 13 Jun 2016 22:52:38 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0 SeaMonkey/2.40

Eric Blake wrote:
> POSIX allows this behavior, in that it says that grep's behavior is
> undefined on non-text files (which you have by virtue of your NUL
> byte). Since this is documented behavior of GNU grep when -a is not
> used, I'm closing this as not a bug. But feel free to add further
> comments to this thread. 
If I start grep with the "-a" option or "--binary-files=text", the bug
does not show up.

"grep --binary-files=binary", which is the default, shows the bug.
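
To make the comparison above easier to repeat, here is a minimal sketch that builds a test file of the kind named in the subject line: plain text lines with a single null character placed after the first 32768 bytes. The file name, the line content, and the helper make_testfile are assumptions for illustration only; they are not the file used in the original report.

```python
import os
import tempfile

CHUNK = 32768  # offset named in the bug subject


def make_testfile(path, nul_offset=CHUNK + 100):
    """Write text lines, then a single NUL byte once nul_offset is passed.

    Returns the byte offset at which the NUL was written.
    """
    line = b"match me\n"  # hypothetical line content
    with open(path, "wb") as f:
        written = 0
        while written <= nul_offset:
            f.write(line)
            written += len(line)
        f.write(b"\0\n")  # the first NUL lands past the 32768-byte mark
        f.write(line)     # one more matching line after the NUL
    return written


# Hypothetical file name, created in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "testfile.txt")
nul_at = make_testfile(path)
```

The generated file can then be searched once with plain "grep match testfile.txt" and once with "grep -a match testfile.txt" to compare the two modes. What the first command prints depends on the grep version installed.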

I am fairly sure that the auto-guessing code is incorrect or limited
when a null character is found after the first 32KB. The manual page
says this about the auto-guessing code:

       -U, --binary
              Treat the file(s) as binary.  By default, under MS-DOS and
              MS-Windows, grep guesses the file type by looking at the
              contents of the first 32KB read from the file.  If grep
              decides the file is a text file, it strips the CR
              characters from the original file contents (to make
              regular expressions with ^ and $ work correctly).
              Specifying -U overrules this guesswork, causing all files
              to be read and passed to the matching mechanism verbatim;
              if the file is a text file with CR/LF pairs at the end of
              each line, this will cause some regular expressions to
              fail.  This option has no effect on platforms other than
              MS-DOS and MS-Windows.
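
To illustrate the limitation I suspect, here is a minimal sketch of a guess that inspects only the first 32KB, next to one that scans the whole input. This is not GNU grep's code. The function names and the "a NUL byte means binary" rule are assumptions made for the example.

```python
GUESS_WINDOW = 32 * 1024  # the "first 32KB" from the manual text


def guess_is_text(data: bytes, window: int = GUESS_WINDOW) -> bool:
    """Guess 'text' if no NUL byte appears in the first `window` bytes."""
    return b"\0" not in data[:window]


def is_text_full_scan(data: bytes) -> bool:
    """Classify as text only if no NUL byte appears anywhere."""
    return b"\0" not in data


# 40000 bytes of text lines followed by a NUL: the NUL sits outside the window.
sample = b"x\n" * 20000 + b"\0" + b"y\n"
```

With this sample, the windowed guess reports text while the full scan reports binary, which is the kind of disagreement that could explain a late switch to binary mode.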

I see these problems:

 1. Binary mode is implemented inconsistently. It would be acceptable
    for grep to produce either no output (no match, exit code >0) or
    exactly one output line ("Binary file testfile.txt matches", exit
    code 0). It is not acceptable for grep to write some matching text
    lines, then "Binary file testfile.txt matches", and exit with
    code 0.
 2. Linux users, or more precisely users of anything other than MS-DOS
    and MS-Windows, will overlook the auto-guessing section in the
    manual page because of the notes "By default, under MS-DOS and
    MS-Windows, grep guesses the file type by looking at the contents
    of the first 32KB read from the file." and "This option has no
    effect on platforms other than MS-DOS and MS-Windows."
 3. The auto-guessing mechanism is not documented anywhere else in the
    documentation.
 4. The auto-guessing limitations are documented, after a fashion, in
    the manual page, but not in its BUGS section.
 5. The exit code should not be 0 if grep finds an error in the input
    from which it cannot recover.
 6. The error message "Binary file testfile.txt matches" must not be
    written to standard output if matching text lines have already
    been written.
 7. POSIX defines only minimal guarantees for grep. GNU grep can, and
    should, do better.
 8. Other implementations (such as the FreeBSD version I tested) do not
    show the bug. BusyBox also works correctly.
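
The difference described in point 1 can be sketched with a toy model. The first function processes the input in 32768-byte chunks and only switches to binary mode when a chunk contains a NUL, so earlier matching lines have already been emitted. The second decides once, before emitting anything. Both functions, their names, and the chunking scheme are assumptions made for illustration; this is not how GNU grep is actually implemented.

```python
def grep_streaming(data: bytes, pattern: bytes, name: str, chunk: int = 32768):
    """Toy model: emit matches chunk by chunk, go binary on the first NUL."""
    out = []
    pending = b""  # incomplete last line carried into the next chunk
    for start in range(0, len(data), chunk):
        buf = pending + data[start:start + chunk]
        if b"\0" in buf:
            # Lines matched in earlier chunks are already in `out`.
            if pattern in buf or pattern in data[start + chunk:]:
                out.append(f"Binary file {name} matches")
            return out
        lines = buf.split(b"\n")
        pending = lines.pop()
        out.extend(l.decode(errors="replace") for l in lines if pattern in l)
    if pending and pattern in pending:
        out.append(pending.decode(errors="replace"))
    return out


def grep_consistent(data: bytes, pattern: bytes, name: str):
    """Toy model: classify the whole input first, then emit one kind of output."""
    if b"\0" in data:
        return [f"Binary file {name} matches"] if pattern in data else []
    return [l.decode(errors="replace") for l in data.split(b"\n") if pattern in l]


# 40000 bytes of matching lines, then a NUL, then one more matching line.
data = b"hit\n" * 10000 + b"\0\n" + b"hit\n"
mixed = grep_streaming(data, b"hit", "testfile.txt")
single = grep_consistent(data, b"hit", "testfile.txt")
```

In this model, `mixed` contains text lines followed by the "Binary file" message, while `single` contains only the message. The second form is the behaviour I would consider acceptable.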





