[Pan-users] Re: Searching headers problem


From: Duncan
Subject: [Pan-users] Re: Searching headers problem
Date: Mon, 15 Nov 2004 01:18:19 -0700
User-agent: Pan/0.14.2.91 (As She Crawled Across the Table)

Molotov posted <address@hidden>, excerpted below,  on Sat,
13 Nov 2004 13:07:36 +0100:

> Searching headers is not working on my system. In fact, I see the
> "sorting articles" (it is in french) task in the status bar, but nothing
> happens. I can use the menus or read an article (I mean Pan doesn't
> hang), but no matter how many hours I let it "work": I cannot do a
> search...
> I am using a Giganews account, so I have always a big number of headers.
> For example, I tried to find the articles containing a word in the
> subject line yesterday. There is still nothing. There are 1 200 000
> headers on this group.

Pan does not scale well beyond roughly a million overviews in a group.
That's one of the current problems one would hope gets addressed with the
switch to the database backend.  I don't use the search function often
enough to know just where it becomes essentially non-functional, but
knowing how dramatically Pan slows down with a million overviews
(commonly but incorrectly called "headers"), it wouldn't surprise me in
the least if your problem is simply that it's spending so much time
juggling memory that it fails to make headway on a million-overview
search.  If it's taking even a tenth of a second apiece churning memory,
then for a million overviews that's a hundred thousand seconds, or over a
day's worth (27.778 hours) of churning!  Unfortunately, with a million
overviews to manage, a tenth of a second of churning on each one isn't
out of the realm of possibility, given the current setup.
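
For anyone who wants to redo that estimate with their own numbers, here
is the arithmetic as a few lines of Python.  The 0.1 second figure is
only my guess at the per-overview cost, not something measured in Pan.

```python
# Back-of-the-envelope estimate of total search time.
# ASSUMPTION: 0.1 s of memory churn per overview is a guess, not a
# measured value; change it to match what you observe.
overviews = 1_000_000
seconds_per_overview = 0.1

total_seconds = overviews * seconds_per_overview
total_hours = total_seconds / 3600

print(f"{total_seconds:,.0f} seconds = {total_hours:.3f} hours")
```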

(One of the optimizations they are looking at for the database backend is
merging duplicate copies of strings.  Currently, strings aren't merged at
all, so an author with a thousand multi-part messages averaging three
parts apiece means 3000 copies of that author's name in memory!  Likewise
with subject and other often-duplicated headers.  If you are searching by
author, that's 3000 separate instances of the author header that must be
searched for just that one author, whether or not the author matches your
search!  Manage that string using conventional database optimization
techniques, and instead of 3000 searches on the SAME author, it becomes
ONE search, on ONE string in memory, with the other entries being simply
placeholders that point to it.  Searching the 3000 entries for that
single author might therefore take only about twice as long as searching
one entry: once for the actual search, and roughly as long again to note
that the other 2999 entries point to the already-searched string.
Subjects are of course generally much longer and not identically
duplicated as many times, though similar substring optimization
techniques could be used there too.  Thus, searching 3000 similar subject
entries might take, say, five times as long as searching just one.  That's
longer than the twice-as-long for the fully identical author entries, but
still FAR shorter than the 3000 times as long it takes now.  Compound THAT
effect with the memory saved and the lower memory churn THAT will mean,
and it becomes QUITE obvious that some MAJOR optimizations are possible.
However, implementing the database backend and all those optimizations is
a MAJOR undertaking.  While there's some work being done on it, it'll take
a serious investment of time and effort before it bears fruit.)
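
To make the string-merging idea concrete, here is a minimal sketch in
Python.  It is NOT Pan's code, nor the planned backend design; the
function names and the sample data are made up purely for illustration.
Each distinct author string is stored once, with a list of the overviews
that point to it, so a search compares each distinct string only once.

```python
from collections import defaultdict


def build_author_index(overviews):
    """Map each distinct author string to the positions of its overviews."""
    index = defaultdict(list)
    for position, (author, _subject) in enumerate(overviews):
        index[author].append(position)
    return index


def search_authors(index, needle):
    """Match each distinct author once, then expand to all its overviews."""
    needle = needle.lower()
    hits = []
    comparisons = 0
    for author, positions in index.items():
        comparisons += 1  # one string comparison per distinct author
        if needle in author.lower():
            hits.extend(positions)
    return sorted(hits), comparisons


# Hypothetical data: one author with 1000 three-part posts (3000
# overviews), plus a single post from someone else.
overviews = [
    ("Alice <alice@example.invalid>", f"file{msg:04d} ({part}/3)")
    for msg in range(1000)
    for part in range(1, 4)
]
overviews.append(("Bob <bob@example.invalid>", "hello"))

index = build_author_index(overviews)
hits, comparisons = search_authors(index, "alice")
print(len(overviews), "overviews,", comparisons, "comparisons,",
      len(hits), "hits")
```

With the naive layout, the same search would compare all 3001 author
strings; here the comparison count equals the number of distinct authors,
and the remaining work is just following the placeholders.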

One thing you /could/ try, if you can't find a group with a sufficiently
low overview count to experiment in, is to copy a couple thousand posts to
a folder (or use your pan.sent folder if it's not too small or too big).
Pan treats folders almost exactly like groups, so you can see whether
search actually works with a couple thousand overviews.  If it doesn't,
you have a serious problem with functionality that /should/ be working.
If it works with 2000, try 10K, then 100K, then 500K, then a million
messages, and you should be able to see just where it starts to bog down
to unworkable levels.  Again, if it works with a million messages in a
folder but not with a million in a group, there's a problem (a bug), but I
expect you'll see similar behavior in both, and that it's simply the
scaling issue that Pan is unfortunately known to have at this point.

It /is/ better than it was!  Pan /used/ to have problems at 100K messages
similar to the problems it now has with a million, so that's an order of
magnitude improvement.  Scaling issues remain a serious problem, however,
and I'd guess that's what you are seeing.

-- 
Duncan - List replies preferred.   No HTML msgs.
"They that can give up essential liberty to obtain a little
temporary safety, deserve neither liberty nor safety." --
Benjamin Franklin





