nmh-workers

Re: [Nmh-workers] nmh internals: full MIME integration


From: Ralph Corderoy
Subject: Re: [Nmh-workers] nmh internals: full MIME integration
Date: Sat, 26 Jul 2014 20:07:12 +0100

Hi Ken,

> > If we're having lazy evaluation of MIME parts, which is good, can it
> > also cover the headers?  `pick --list-id <address@hidden>' isn't
> > concerned with decoding Subject and all those Received headers.  It
> > may not sound like much, but we have folders with tens of thousands
> > of emails.  get_header() could note minimal details of each header
> > it comes across whilst searching for the List-ID but not bother too
> > much about their contents.
>
> I wasn't actually thinking of decoding the headers for things like
> MIME content, at least upon read (I assume you're talking about RFC
> 2047 encoding

No, less than that.  I'm hoping this change will also improve searching
for split-line headers.

    $ grep -A 1 '^foo:' `mhpath .`
    foo: bar
     xyzzy
    $ pick --foo 'bar xyzzy' .
    pick: no messages match specification
    $ pick --foo 'bar  xyzzy' .
    1 hit
    $

pick may have changed a bit since the above version, but I still
shouldn't have to care how much whitespace continuation lines are
indented.  Shouldn't pick be matching against a logical view of a single
line, with `CRLF WS*' becoming a single space?
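
A minimal sketch of that logical view, in Python rather than nmh's C, and not how pick works today. The names `unfold` and `field_matches` are hypothetical, and it assumes a continuation is a line break followed by at least one space or tab, with a plain substring test standing in for pick's pattern matching.

```python
import re

# Assumption: a continuation line starts with at least one space or tab.
# The line break plus all of that leading whitespace becomes one space.
_FOLD = re.compile(rb"\r?\n[ \t]+")


def unfold(raw: bytes) -> bytes:
    """Return a raw, possibly folded, header field as one logical line."""
    return _FOLD.sub(b" ", raw).rstrip(b"\r\n")


def field_matches(raw: bytes, name: bytes, pattern: bytes) -> bool:
    """True if raw is the field `name` and its unfolded body contains pattern."""
    field_name, sep, body = unfold(raw).partition(b":")
    if not sep or field_name.strip().lower() != name.lower():
        return False
    return pattern in body
```

With this, `field_matches(b"foo: bar\n xyzzy\n", b"foo", b"bar xyzzy")` is true however deeply the continuation line is indented.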

> Okay, I guess I could see that.  The normal case would be to decode
> the contents completely

Yep, to UTF-8 single lines?

> > the kind of overhead that would be nice to see done only on demand.
>
> I'm still skeptical that you'd even notice (it isn't 1988 anymore!),
> but I think if the API was well designed it should be easy to
> implement.

Well, you might be thinking the 2047-decoding might not make a lot of
difference, whereas I'm thinking a block can be read into a page-aligned
buffer that has an \n beyond it as a sentinel, then check for
/foo[ \t]*:/i, ignore any non-foo headers, hunt for the next \n and repeat
if it's not the sentinel, else read another block and try again.  Stop
if no more blocks or \n\n.  The detail's a bit more complex but there's
no allocation and copying for headers seen along the way;  they'll be
found when they're looked for in turn.  The file's blocks aren't being
modified so no copy-on-write's occurring.
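
A rough sketch of the shape of that scan, in Python for brevity. It is a simplification, not the zero-copy version: Python can't do the page-aligned buffer or the \n sentinel, so this accumulates the blocks it reads and uses a regex where the C would walk the buffer. `read_header_block` and `find_header` are hypothetical names, and LF line endings are assumed.

```python
import os
import re

BLOCK = 4096  # same read size as a page; an assumption of this sketch


def read_header_block(path: str, block: int = BLOCK) -> bytes:
    """Read block by block, stopping at the blank line that ends the headers."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray()
        while True:
            chunk = os.read(fd, block)
            if not chunk:
                return bytes(buf)  # no more blocks
            # Start one byte early so a "\n\n" straddling two blocks is seen.
            search_from = max(len(buf) - 1, 0)
            buf += chunk
            if buf[:1] == b"\n":
                return b""  # message starts with a blank line: no headers
            end = buf.find(b"\n\n", search_from)
            if end != -1:
                return bytes(buf[: end + 1])
    finally:
        os.close(fd)


def find_header(path: str, name: bytes):
    """Return the first `name` header, still folded, or None if absent."""
    headers = read_header_block(path)
    # /^name[ \t]*:/i at a line start, then the rest of that line and any
    # continuation lines that begin with a space or tab.
    pat = re.compile(
        rb"^" + re.escape(name) + rb"[ \t]*:[^\n]*(?:\n[ \t][^\n]*)*",
        re.I | re.M,
    )
    m = pat.search(headers)
    return m.group(0) if m else None
```

Other headers are skipped without being decoded or split up, and the body is never read past the block holding the blank line.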

I agree modern hardware is quick;  this is on about 22,500 emails.

    $ LC_ALL=C \time -v perl -e 'for (<[0-9]*>) {sysopen F, $_, 0 and sysread F, $b, 4096 or die}'
            Command being timed: "perl -e for (<[0-9]*>) {sysopen F, $_, 0 and sysread F, $b, 4096 or die}"
            User time (seconds): 0.40
            System time (seconds): 0.52
            Percent of CPU this job got: 98%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.93
            Average shared text size (kbytes): 0
            Average unshared data size (kbytes): 0
            Average stack size (kbytes): 0
            Average total size (kbytes): 0
            Maximum resident set size (kbytes): 24112
            Average resident set size (kbytes): 0
            Major (requiring I/O) page faults: 0
            Minor (reclaiming a frame) page faults: 1688
            Voluntary context switches: 1
            Involuntary context switches: 19
            Swaps: 0
            File system inputs: 0
            File system outputs: 0
            Socket messages sent: 0
            Socket messages received: 0
            Signals delivered: 0
            Page size (bytes): 4096
            Exit status: 0
    $

It would be nice if a simple pick didn't add much to that roughly
one-second 100%-CPU-utilisation wall-clock time.  :-)  Running pick
tends to be an iterative process where the query is honed.

Cheers, Ralph.


