bug-grep
bug#30326: grep not searching through a text file (thinking it binary)


From: L A Walsh
Subject: bug#30326: grep not searching through a text file (thinking it binary)
Date: Fri, 02 Feb 2018 16:51:55 -0800
User-agent: Thunderbird



Paul Eggert wrote:
> On 02/02/2018 03:30 PM, L A Walsh wrote:
>> most computer files (vs. user-files) are still single-byte.
>
> That's because so many of them are ASCII. But ASCII files are not the
> issue here. grep's behavior hasn't changed when operating on ASCII files
> in typical locales. The issue is text using a non-ASCII encoding that is
> not compatible with your locale; e.g., if your text file uses ISO 8859-1
> but your locale specifies UTF-8.
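
A minimal Python sketch of the mismatch described above, for illustration only. It does not reproduce grep's actual detection logic; the sample word and the helper name `is_valid_utf8` are made up for this example. It shows that text encoded as ISO 8859-1 is, in general, not a valid UTF-8 byte sequence, while pure ASCII is valid in both:

```python
# Illustration only: the same non-ASCII text stored in two encodings.
text = "café"
latin1_bytes = text.encode("iso-8859-1")  # b'caf\xe9'
utf8_bytes = text.encode("utf-8")         # b'caf\xc3\xa9'


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(utf8_bytes))      # True
print(is_valid_utf8(latin1_bytes))    # False: a lone 0xE9 byte is not valid UTF-8
print(is_valid_utf8(b"plain ASCII"))  # True: ASCII is a subset of UTF-8
```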
----
   I've had my locale set to UTF-8 since around 2000.  My music collection
needed French, English, Middle Eastern, and now Japanese chars -- so I set things
to UTF-8.  I didn't need perfection.  For the email, I needed to know which
files the text was in so I could look at those mboxes with a mail reader
or with a text editor.  I needed grep to work as a first-level search tool.
It has failed on that score.

Still, if it just searched for the bytes that I put in the search string, I'm
not sure how it would "go wrong".



> In my experience, UTF-8 has long been winning this battle, in the sense
> that UTF-8 is by far the dominant encoding for the non-ASCII files I
> regularly use. So I use a UTF-8 locale, and suggest this as a good
> default for most users nowadays.
>
> It's not possible to get direct statistics about encoding for all user
> files. However, we can see what's being published on the web. Currently
> UTF-8 is being used by about 90% of public websites whose character
> encoding can be determined, according to the latest W3Techs survey. ISO
> 8859-1 is in second place, at about 4%. See:
>
> https://w3techs.com/technologies/overview/character_encoding/all

Whereas this one was:
Domain: Non-ISO extended-ASCII text, with very long lines

So theoretically, it would never match any locale.

The problem is that in a mailbox, different emails can have different encodings.

But I didn't care -- I typed in an ASCII string -- so let it search in octets
with no encoding.
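
As a rough sketch of what "search in octets with no encoding" means, the following Python reads each file in binary mode and looks for the raw bytes of the search string, never decoding anything. This is only an illustration of the idea, not how grep is implemented; the function name `files_containing` is hypothetical.

```python
from typing import Iterable, List


def files_containing(needle: bytes, paths: Iterable[str]) -> List[str]:
    """Return the paths whose raw bytes contain needle on some line.

    Files are opened in binary mode, so no locale or character
    encoding is involved; lines are split on the newline byte only.
    """
    hits = []
    for path in paths:
        with open(path, "rb") as handle:
            # Iterate line by line so a large mbox is not loaded at once.
            if any(needle in line for line in handle):
                hits.append(path)
    return hits
```

For example, `files_containing(b"some ASCII text", ["inbox", "sent"])` would list whichever of those (made-up) mbox files contain the string, regardless of what other bytes appear in them.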

Also, a mailbox is very likely to contain long lines (maybe "very long
lines"), but the text I was searching for was <80 chars.

I'm really surprised it was decided to break compatibility -- I've been
doing searches like this for over two decades. Not often, mind you, but
it's one of the big advantages for me of keeping mailboxes for my IMAP
server in mbox format.  Maildir format or others would kill search ability
with slow file-IO.  ;^/







