pan-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Pan-devel] more GMime optimisations


From: Jeffrey Stedfast
Subject: [Pan-devel] more GMime optimisations
Date: 09 Dec 2002 00:41:18 -0500

Okay, so I've been thinking about rewriting gmime-filter-html.c again
for weeks now and I've finally done it (we are now at Generation 3).

the first implementation was a gross hack that didn't even work all that
well

then a few months back I rewrote it to use regex for url pattern
matching and cleaned up the code a lot. the problem with this
implementation was that I needed to strdup() each line before running 3
regex patterns on it to find the first matching regex before writing out
the pre-match buffer, then converting the matching buffer into an <a
href> tag and looping again in case there were any more url patterns
later on the same line. yuck.

what I really wanted at the time was to have a regex lib that allowed me
to compile all 3 patterns into the same context and have it just tell me
which pattern it matched along with the start and end offsets of the
match. As an additional bonus, it'd be great if it could be used on
non-nul-terminated strings by passing it in a buflen argument.
Unfortunately... this lib doesn't exist :\

thinking of the 3 compiled regex patterns being scanned for in parallel
reminded me of a Trie graph, so I turned to Aho-Corasick's Trie
algorithm for pattern matching against multiple strings.

so what I did was to implement gtrie.c (I want to try to get it into
glib?) and url-scanner.c which is a wrapper around gtrie's with some
aditional functionality (gtrie can only be used to scan for literal
strings, nothing as complex as regex).

url-scanner.c scans for a list of literal string patterns using the trie
and then uses user-defined functions for scanning back and forward to
find the start-offset and end-offset of the full pattern. so, in
essence... what we end up with is very similar to that multi-pattern
regex lib I dreamed of :-)

using my url-scanner.c code, html_convert()'s url section became
drastically simpler (and lots faster too!) since we don't have to loop
over 3 regex contexts and strdup the line buffer.

here are the results of converting a test file into html (including
converting anything that looked like a url into an <a href> tag):

test file size: 40559695 bytes

trie:
real    0m18.954s
user    0m18.190s
sys     0m0.300s

regex:
real    12m55.140s
user    12m28.200s
sys     0m4.960s

Does that not scream "holy fuck!" ? :-)

Jeff

-- 
Jeffrey Stedfast <address@hidden>




reply via email to

[Prev in Thread] Current Thread [Next in Thread]