From: Levi Waldron
Subject: [bug #16305] grep much less efficient when matching multiple patterns than when matching each pattern sequentially
Date: Sat, 8 Apr 2006 20:57:19 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20051010 Firefox/1.0.7 (Ubuntu package 1.0.7)
URL:
<http://savannah.gnu.org/bugs/?func=detailitem&item_id=16305>
Summary: grep much less efficient when matching multiple
patterns than when matching each pattern sequentially
Project: grep
Submitted by: lwaldron
Submitted on: Saturday 04/08/06 at 20:57
Category: None
Severity: 3 - Normal
Item Group: None
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
_______________________________________________________
Details:
I have a list of about 4,000 patterns, one per line, like this
(patterns.txt):
PTPAFFX.131946.1.S1_S_AT
PTPAFFX.11573.1.A1_AT
PTPAFFX.209184.1.S1_AT
PTP.3766.1.S1_AT
PTP.3804.1.S1_AT
And a data file of about 7,000 lines, roughly 1MB in total, like this
(data.txt):
AFFX-BIOB-3_AT 1429.6 2545.4 816.966666666667 1646.9 1698.96666666667 2819.06666666667 1085.33333333333 1915.26666666667 0.99999721095669 0.999997210956687
PTPAFFX.126566.1.S1_AT 2442.5 2636.96666666667 2341.06666666667 2244.76666666667 2604.96666666667 2997.93333333333 2399.96666666667 2207.4 0.999995178917582 0.999995178917537
PTPAFFX.212425.1.S1_AT 496.366666666667 551 430.433333333333 482.466666666667 517.6 642.966666666667 371.533333333333 487.766666666667 0.99989956995976 0.999899569959758
(Each line in data.txt starts with an identifier string that might match a
pattern in patterns.txt.)
Every pattern in patterns.txt has a match somewhere in data.txt. When I run
the search like this:
grep --file=patterns.txt data.txt > matches.txt
it is *extremely* memory- and CPU-hungry. On my 2GHz Celeron with 512MB RAM
it uses almost all of the 1GB swap space and would probably take 12 hours if
I let it finish. I've also had the opportunity to run it on a large Beowulf
cluster
(http://www.botany.utoronto.ca/bbc_access/Botany_Beowulf_Cluster.htm), and
after 3 minutes this method still hadn't found a single match.
If I instead run the search one pattern at a time like:
for line in $(cat patterns.txt); do grep "$line" data.txt >> matches.txt; done
it uses a small amount of memory and completes all 4,000 matches on my home
computer in roughly 15-30 minutes, and on the Beowulf cluster in less than 30
seconds.
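A scaled-down sketch of the two approaches, for anyone who wants to check that they select the same lines before timing them on real data. The tiny patterns.txt and data.txt it writes are hypothetical stand-ins for the real files, and the file names matches_all.txt and matches_seq.txt are chosen only for this illustration (-f is the short form of --file):

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical miniature stand-ins for the real patterns.txt and data.txt.
printf '%s\n' 'PTPAFFX.126566.1.S1_AT' 'PTPAFFX.212425.1.S1_AT' > patterns.txt
printf '%s\n' \
  'AFFX-BIOB-3_AT 1429.6 2545.4' \
  'PTPAFFX.126566.1.S1_AT 2442.5 2636.96666666667' \
  'PTPAFFX.212425.1.S1_AT 496.366666666667 551' > data.txt

# Method 1: all patterns handed to a single grep invocation.
grep -f patterns.txt data.txt > matches_all.txt

# Method 2: one grep invocation per pattern.
: > matches_seq.txt
while IFS= read -r line; do
  grep -- "$line" data.txt >> matches_seq.txt
done < patterns.txt

# The two methods may emit lines in a different order, so sort before comparing.
sort matches_all.txt > all.sorted
sort matches_seq.txt > seq.sorted
cmp all.sorted seq.sorted && echo "same matches"
```

On the real files, prefixing each method with `time` would give a rough side-by-side measurement.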
Perhaps GNU grep would generally run faster with numerous patterns if it
searched for them one at a time? Also, why is this job so CPU and memory
intensive?
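One thing that may be worth trying, since the patterns look like literal identifiers rather than regular expressions: the -F option makes grep treat each pattern as a fixed string, so the dots are no longer "match any character". Whether this avoids the slowdown in grep 2.5.1 is an assumption not tested here; the sketch below, using hypothetical sample lines, only shows the difference in matching behaviour:

```shell
#!/bin/sh
set -eu

# Scratch directory with hypothetical sample files.
workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' 'PTP.3766.1.S1_AT' > patterns.txt
printf '%s\n' \
  'PTP.3766.1.S1_AT 1.0 2.0' \
  'PTPX3766Y1ZS1_AT 3.0 4.0' > data.txt

# As regular expressions, each '.' matches any character,
# so the second (made-up) line matches as well.
regex_count=$(grep -c -f patterns.txt data.txt)

# With -F the patterns are fixed strings; only the literal identifier matches.
fixed_count=$(grep -c -F -f patterns.txt data.txt)

echo "regex matches: $regex_count, fixed-string matches: $fixed_count"
```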
Both the cluster and my personal computer are running GNU grep 2.5.1.
_______________________________________________________
Carbon-Copy List:
CC Address | Comment
------------------------------------+-----------------------------
lwaldron |
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/bugs/?func=detailitem&item_id=16305>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/