Re: removing blank lines: "grep ." is really slow
From: Ivan
Subject: Re: removing blank lines: "grep ." is really slow
Date: Sun, 18 Apr 2010 00:32:58 -0400
On Apr 16, 2010, at 3:37 AM, Paolo Bonzini wrote:
> True. You'd need to expand UTF-8 period characters to the
> appropriate character sets, then you can use the faster single-byte
> character set matcher. It's on my todo list.
>
> It wouldn't be exactly as fast as your grep -v solution (which is
> optimal and preferred) however, because it will check that a
> character in the line is a valid UTF-8 character. In particular it
> would be slow and have false negatives if your document is not
> UTF-8.
So... right now, "." means "valid UTF-8 character"? Or not? I'm a
little confused about the difference between the current behavior and
the behavior after you accomplish your todo list.
Anyway, I sent my original email because I couldn't think of any
non-buggy reason for "grep ." to take an entire millisecond per line.
That seems insanely slow even if some kind of UTF-8 checking is taking
place. Here are some tests showing the non-linearity that I mentioned
before:
bash$ time yes | head -n 1000 | grep . >/dev/null
real 0m0.311s
user 0m0.224s
sys 0m0.028s
bash$ time yes | head -n 5000 | grep . >/dev/null
real 0m3.730s
user 0m3.125s
sys 0m0.269s
bash$ time yes | head -n 10000 | grep . >/dev/null
real 0m10.282s
user 0m8.646s
sys 0m0.732s
bash$ time yes | head -n 20000 | grep . >/dev/null
real 0m21.156s
user 0m17.883s
sys 0m1.524s
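For anyone who wants to repeat the measurement, here is a minimal sketch that runs the same pipeline at each input size and prints only the elapsed time. The function name time_grep is made up for this example, and the script assumes bash, since it relies on the `time` keyword and the TIMEFORMAT variable.

```shell
#!/usr/bin/env bash
# Make bash's `time` keyword print only the elapsed (real) seconds.
TIMEFORMAT=%R

# time_grep PATTERN LINES
# Hypothetical helper: feeds LINES lines of "y" to grep and prints the
# elapsed time of the whole pipeline on stdout.
time_grep() {
    local pattern=$1 lines=$2
    # `time` reports on the shell's stderr, so the brace group redirects
    # it to stdout; grep's own output is discarded.
    { time yes 2>/dev/null | head -n "$lines" | grep "$pattern" >/dev/null; } 2>&1
}

for n in 1000 5000 10000 20000; do
    printf '%6d lines: %s s\n' "$n" "$(time_grep . "$n")"
done
```

If the cost were linear, the elapsed time would grow in step with the line count. In the figures above, the real time per line goes from roughly 0.3 ms at 1000 lines to roughly 1 ms at 10000 and 20000 lines.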
I'm also puzzled by this:
bash$ time yes | head -n 5000 | grep '[a-b]' >/dev/null
real 0m0.159s
user 0m0.053s
sys 0m0.028s
bash$ time yes | head -n 5000 | grep '[y-z]' >/dev/null
real 0m3.755s
user 0m3.089s
sys 0m0.262s
bash$ time yes | head -n 5000 | grep '[yz]' >/dev/null
real 0m0.168s
user 0m0.021s
sys 0m0.028s
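One way to narrow this down, as a rough sketch only: time the same three bracket expressions in the current locale and again with LC_ALL=C, which makes grep treat the input as single-byte text. This assumes the slowdown is tied to multibyte (UTF-8) handling, as the quoted reply suggests; I have not confirmed that. The name compare_patterns is made up for this example.

```shell
#!/usr/bin/env bash
# Make bash's `time` keyword print only the elapsed (real) seconds.
TIMEFORMAT=%R

# compare_patterns LINES
# Hypothetical helper: times each bracket expression on LINES lines of
# "y", once in the current locale and once in the C locale.
compare_patterns() {
    local lines=$1 pat cur c
    for pat in '[a-b]' '[y-z]' '[yz]'; do
        cur=$( { time yes 2>/dev/null | head -n "$lines" | grep "$pat" >/dev/null; } 2>&1 )
        c=$( { time yes 2>/dev/null | head -n "$lines" | LC_ALL=C grep "$pat" >/dev/null; } 2>&1 )
        printf '%-7s current locale: %s s   C locale: %s s\n' "$pat" "$cur" "$c"
    done
}

compare_patterns 5000
```

If only the range pattern '[y-z]' is slow, and only in the current locale, that would point at how ranges are matched in multibyte locales rather than at bracket expressions in general.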
Are these behaviors expected?
Ivan