[Pan-users] Re: Big XML files... (was Re: Re: Better processing of very

From:	Duncan
Subject:	[Pan-users] Re: Big XML files... (was Re: Re: Better processing of very large groups?)
Date:	Sat, 4 Jul 2009 15:37:51 +0000 (UTC)
User-agent:	Pan/0.133 (House of Butterflies)
Ron Johnson <address@hidden> posted
address@hidden, excerpted below, on  Fri, 03 Jul 2009 21:56:36
-0500:

> Also (and maybe because I'm a DBA), this problem just *screams* for
> SQLite and a database in the "First Normal Form".

[ OK, this is a very long post, I know (tho I haven't counted the lines, 
200? 250? More? I'll let pan show me that when I post and download it).  
But reading it and following even a few of the included tips should 
vastly improve your pan experience.  =:^) Following all of them... well, 
that's up to you, but it works well for me! ]

Actually, before the C++ rewrite (the original was C coded) and the 
changes that allowed pan to scale to millions of headers/overviews per 
group from 100k, Charles' plan was, for quite some time, to eventually 
switch to just that, an sqlite backend.

I don't know why he didn't.  But during the 3-ish years in which pan 
seemed to be abandoned (we later learned he used at least part of that 
time to do the rewrite), several others (K. Haley I believe being one of 
them) began to experiment with pan, and some of those folks were database 
folks (I'm not sure if K. Haley is one of /them/).  By the time Charles 
announced the C++ rewrite (aka new-pan, what we use now), some 
preliminary database numbers had actually been posted to the pan-devel 
list.

I think the data management techniques Charles /did/ use in new-pan got 
it to scale "reasonably".  Now it /does/ work when you throw even several 
million headers at it, with memory use scaling accordingly.  Before, 100k 
headers was bad, and above 200k or so pan would literally sit there for 
days, not really increasing memory usage too badly, but just not getting 
anywhere, memory or no memory.  The new-pan numbers probably looked close 
enough to the preliminary database numbers that he judged the switch not 
worth the trouble, given the clear benefit of plain text files.

But, meanwhile, for those dealing with those huge groups, there are some 
usage patterns that work rather better than others, and thus some usage 
patterns that users should avoid in the large groups, if they want a 
reasonably working pan.

Tip #1, and the most important, particularly since pan is a GNOME family 
app and, as many Ubuntu users can attest: PAN AND THE GNOME ASSISTIVE 
TECHNOLOGIES APPLET DO NOT GET ALONG WELL AT ALL!!!  When that applet is 
running, it 
apparently polls /something/ often enough to keep pan from making 
efficient progress at header sorting, in particular.  What might 
otherwise take 30 seconds or maybe two minutes (still long enough), ends 
up taking half an hour... two hours... more...  So if you're running 
that, do yourself a favor and at LEAST shut it off when running pan.  
Either that, or switch to something other than pan, as the two simply 
don't get along.  For more details, see the list archives.

With that out of the way...

The key to working /with/ pan on large groups, not fighting it, thus 
making the problem worse, is to understand what takes it the most time, 
and do what you can, including changing your behavior where necessary, to 
minimize that issue.  (Yeah, I know, but the alternatives are to simply 
wait pan out, as at least it /does/ continue making progress now, unlike 
with the old code, or to switch to something other than pan.  If you'd 
rather do the latter than change your behavior while using pan, well, 
there /are/ other solutions out there, tho none quite like pan.)

So here's a bit of a peek behind the curtain, explaining in plain English 
a bit of what pan's actually doing...

What takes pan the longest time (and uses the most memory too, I believe) 
isn't the actual downloading of either messages or headers/overviews, but 
sorting those overviews/headers, plugging new ones in at the correct 
location in the thread or multipart message as necessary, doing the 
subject and author string manipulations that help it keep a reasonable 
handle on memory, etc. 

It saves that threaded list on exit (of the group or pan itself), so it 
doesn't have to rethread existing overviews when it comes back to them, 
but when it starts up, once that list gets above a certain size, it still 
does enough disk churning verifying the list, and checking what's in 
cache so the little cached icon displays correctly, that it gets painful 
on a cold cache.  (Once the data's all in cache, unless it's flushed, pan 
starts up quite fast.)

Here's how I know the effect of that.  I take advantage of the fact that 
pan checks the PAN_HOME environment variable when it starts to see where 
its config is (defaulting to ~/.pan2 if the variable isn't set; I'm not 
sure what it does if the variable is set but doesn't contain a sane path) 
to run several separate pan instances, each with its own config.  On my 
text group instance, I set no-expire on the overviews/headers, and 
expanded the cache (the setting for that isn't in the GUI, so it's a 
direct config file edit, preferences.xml) from the default 10 MB to 
several gigs, so I could save a decent history.  I have posts going 
back a couple years in several groups, and on some of the gmane list2news 
list archive groups, I have the entire group history as it appears at 
gmane.   Thus, I have quite a number of overviews/headers archived, but 
(for my text instance) they're all text groups, so it's only... half a 
gig or so of actual message cache.

Loading that text instance of pan, cold cache, takes probably a couple or 
three minutes of disk thrashing -- and that's on a 4-spindle RAID-6, so 
it goes MUCH faster than it would on a typical single-spindle pan storage 
dir machine.   Of course, as I continue to accumulate message overviews 
and history, that load time continues to increase. =:^(  But once pan is 
loaded and thus the cache hot, I can quit pan and restart close enough to 
instantly that I don't notice the delay.

As a result, and here's tip #2, I load pan (the text instance) with my 
KDE session and keep it running more or less constantly, as long as I'm 
in X.  I have 8 gigs RAM, so it's no big deal there, and if I do 
something that flushes cache, with pan running, I don't lose all of it, 
at least, so while it might take a few extra seconds to start up, it's 
not like it is from a cold cache.

So tip #2 is, if your headers/overviews and cache are large enough that the 
pan start time is getting uncomfortably long, consider starting it with 
your desktop session, letting it load while you do other things.  Then 
it'll be loaded when you get to it.  Even if you then quit pan, as long 
as it hasn't been quit for too long and the cache flushed, it'll restart 
far faster, since most of that data will still be in cache.  But it's 
generally far more effective to keep pan running while doing anything 
disk cache intensive, than it is to quit pan and restart it afterward.  
This is because pan doesn't take so much memory once all that data is 
loaded -- it's the loading from disk that's a pain.

It should be noted that a good portion of this time, however, would be 
avoidable, if I (1) hadn't fiddled with the default 10 MB cache, and (2) 
had the overview/header expiry set to something more "reasonable".

That's tip #3, then.  There's a tradeoff between saved headers/overviews 
(and to a lesser degree message cache, but even with a default 10MB 
message cache, loading the headers from cold-disk-cache takes time) and 
from-disk load time.  For binaries especially, once you've processed 
them, you don't tend to need the headers any more, so I STRONGLY 
recommend a reasonably short expiry, and for even more effective control 
of the problem, DELETING MESSAGES (not simply letting them mark-as-read 
and expire naturally) AS YOU ARE DONE WITH THEM.  Of course, as I said, 
that's really more workable with binary groups than with text, as often, 
you want to keep text around for awhile.  But you can still set the 
expiry as short as you can reasonably manage for text groups, which 
should be all it'll affect on general purpose text/binary instances if 
you use the delete binaries immediately when done rule, because they 
should already be deleted by the time the expiry comes round.

This #3 is in fact probably the most critical (other than #1) for active 
binary users, especially on servers such as Giganews, with such high 
retention.  If you start actively deleting headers/overviews for binaries 
when you are done, and set expiry (which will now affect text-only, since 
you've deleted the binaries) as short as possible, say two weeks, you 
WILL notice a difference!  Here, we're talking startup time, but as we'll 
see, it affects overview/header update time as well.

OK, time to explain a bit more about pan's processing.  Once it has an 
existing list of threaded messages, when it updates headers/overviews, it 
takes a bit of time to plug the new ones into the appropriate place in 
the existing list.  Obviously, the larger the existing list and the more 
new ones that came in in the update, the longer this sorting process is 
going to take.

That's where tip #3 affects update as well.  If your existing header/
overview list is shorter, because you manually deleted the ones you were 
done with, pan's processing time will be shorter as well.  Thus, it does 
NOT pay to keep a list of already processed binary group headers/
overviews around between sessions (incompletes that you're waiting for 
completion of being an exception), as that just complicates pan's job, 
making it take longer to do that processing than it has to.  Again, 
delete messages (headers/overviews) in the binary groups as soon as 
you're done saving off the binaries and otherwise processing them.  It 
makes a HUGE difference!

Tip #4.  For high volume binary groups, or on high retention servers, for 
ALL binary groups, when you first browse them, DO NOT DOWNLOAD ALL 
HEADERS/OVERVIEWS AT ONCE.  Unfortunately, pan has a get the latest N 
days/number-of-headers option, but not a get the oldest N option.  Thus, 
if you're wanting to go back quite some time, get the N latest, process 
what you can (thus in accord with tip #3, deleting the ones you're 
finished with), then get the next N latest, process them (again deleting 
what you're finished with), until you've gone back as far as you wish or 
hit the retention limit.  Like #3, the reason here is simple.  Keep the 
number of overviews pan has to deal with at one time to something 
reasonable.

Tip #5.  The implications of #3 and 4 should be clear enough.  Don't let 
unseen messages in a binary group build up unnecessarily between visits.  
Just because the server you use has the retention to let you visit a busy 
binary group every couple of weeks, doesn't mean you're going to be 
making it easy on pan -- and thus on yourself waiting for pan -- if you 
wait two weeks between visits.  Every day is nice, tho of course there 
will be days you'll be doing other stuff and don't get to it.  But for 
the busy groups, do at least try to get to them twice a week or so, and 
if they're indeed that busy, expect a bit of extra trouble if you're 
waiting even that long.

It follows then, that if it has been awhile since you visited a group, 
and you know it's a busy group, you may find the incremental approach of 
tip #4 useful to avoid having pan taking such huge bites at once.

Tip #6.  This one isn't directly related to the above or to this problem, 
but it's generally useful and helps with this problem.  It's simple 
enough.  Remember that changing groups triggers pan's save group state 
functionality, as does quitting and restarting pan, but that takes longer 
and is more hassle.  Thus, when processing large groups either text or 
binary, it can be wise to periodically switch to a different group and 
back, just so pan saves the state of where you were, and if pan or the 
system crashes for some reason, you'll only lose track of the read and 
deleted messages back to the last time you switched groups.  When you're 
processing thousands of overviews, having pan or the system crash and 
lose state on a couple thousand overviews worth of work isn't fun, so 
avoid it, by switching out of and back into the group every 200-500 
overviews worth (numbers that seem to work well for me).

As you'll note, I mentioned that pan loses delete state.  When you delete 
a message, it deletes the message itself in cache immediately, but again, 
doesn't update the group state until you switch groups.  If you crash 
before that, the overviews/headers will show up again (but without the 
cached messages) as undeleted and probably unread (unless you'd read 
them, switched out of the group and back, then deleted them, in which 
case they'll show up as read since pan had that state stored when you 
switched out and back in).

Tip #7 follows both from #6, and as a consequence of 3-5.  Turn off pan's 
get-new-headers-when options (under preferences, behavior tab, groups).  
In particular, you don't want it auto-fetching new headers/overviews when 
you switched out and back into a group just so pan would update its disk-
saved state.  However, I've also found that it works better if you let 
pan startup, then switch to a group and manually get new headers, then 
when pan's finished with that, switch to the next group... etc.  Again, 
don't give pan too many things to do at once and it works better.  
(Fortunately for those of us using them, it does seem to cope reasonably 
well with multiple servers, since it keeps only one common threaded list, 
not one per server.)

Tip #8.  Again, this is a general pan tip.  Don't use the mark-entire-
group-read functionality, either in preferences (when leaving group, when 
exiting pan) or manually.  Due to the way modern servers work (new posts 
can come in numbered below the group's sequential high-water-mark), this 
is broken on many of them and you'll miss posts as a result.  It seems to 
have other somewhat unpredictable but generally undesired effects as 
well.  Just don't use it, and avoid them.

Instead, when you are done with a group, you can select-all (headers), 
and use the mark read, or delete (tip #3 again) on them.  One caveat with 
this has to do with ignored and otherwise view-filtered posts.  Since 
they're not displayed, select-all won't select them, and they won't be 
marked as read or deleted.  For that reason, I keep all the match scores 
options enabled in the view, header pane submenu, and depend on the color-
coding in the scores column to alert me to score, including ignored.  For 
groups with many ignored messages, however, it may be easier to either 
leave the match ignored off until the end, or to sort by score 
(unthreading if necessary) and deal with them first.

Tip #9.  This one helps to counteract the negative effects of tips #4 and 
5.  You can use pan's command-line options to tell it to fetch headers 
and quit.  The form is pan headers:group.name (as revealed in the help 
text, pan --help; I don't actually use this one myself, and you may need 
--no-gui too).  You can then create a script that fetches all the headers 
from all 
your groups for you, and use cron or other scheduler to run it 
periodically, say every hour or two, or just once, say an hour before you 
get home from work.  While that'll accumulate headers to some extent 
negating the previous tips, it'll be automated and you won't have to wait 
for pan to do that sorting, as it'll already be done when you get there.
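
As a rough sketch of what such a script might look like (this is not 
something that ships with pan): the function below runs pan once per 
group, using the headers:group.name form mentioned above.  The PAN_BIN 
override and the function name are made up for illustration, and whether 
--no-gui is also needed is untested here.

```shell
# fetch_headers GROUP...
# Ask pan to fetch new headers for each group in turn, one group per run,
# so pan never has more than one update to sort at a time.
# "headers:GROUP" is the argument form quoted in the post (from pan --help).
# PAN_BIN is an assumed override, handy for a dry run with "echo".
fetch_headers() {
    for group in "$@"; do
        "${PAN_BIN:-pan}" "headers:$group" || {
            echo "header fetch failed for $group" >&2
            return 1
        }
    done
}
```

A script that ends with a call such as fetch_headers followed by your 
own group names could then be run from a crontab entry along the lines 
of "0 */2 * * * $HOME/bin/pan-fetch-headers" (a hypothetical path) to 
fire every two hours.  Cron jobs normally run without DISPLAY set, so a 
GTK program may need that exported in the script first.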

Unfortunately, pan does not yet have a similar command line (or 
preferences) option to let you auto-download the messages themselves.  
There's discussion of adding the feature, based on the score category (so 
you could download only watched messages, for instance), and Charles was 
the one who actually mentioned that, so he's definitely thinking about 
it, but it hasn't been implemented yet.

OK, that's the main tips, tho some more optional usage-style ones 
follow.  As I said, #1 is most important for those running GNOME, as 
pan's hardly workable if that assistive technologies thing is running.  
#3 is most critical after that, with #6 and 7 being low-cost-big-effect 
tips.  Follow them all, and I'm quite sure you'll see a marked 
improvement, especially if you were doing all of them differently, before.

Now for the optional, usage style related ones.

Tip #I.  As mentioned, it's possible to set up multiple independent pan 
"instances", with separate settings, cache, everything.  What I did here 
is create a few simple pan starter scripts (bash), calling them pan.bin, 
pan.text, and pan.test, the first two obviously for text and binary 
instances, the latter to use when I'm "just browsing", since pan doesn't 
fully erase group history when you delete messages and unsubscribe, and I 
can manually blow it away much easier when I don't have to worry about 
blowing away regular group history at the same time.  Each session script 
can simply set and export the PAN_HOME environmental variable pointing to 
its separate data (and config) dir, before starting pan.  (Here, I do a 
couple other things as well, like set the gtkrc locations using a 
different config var, since I use kde and that's not always set 
correctly, and I HATE the default color theme pan comes up with if it 
doesn't get those settings.)
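
A starter script along those lines could be as small as the sketch 
below.  It only illustrates the idea and is not the actual script used 
here: the function name, the PAN_BIN override and any directory you pass 
in are assumptions.

```shell
# start_pan DIR [ARGS...]
# Run pan with PAN_HOME pointing at DIR, so that this instance keeps its
# own settings, headers and cache there.
# PAN_BIN is an assumed override that lets you substitute another command.
start_pan() {
    instance_dir=$1
    shift
    # Create the data dir on first use.
    mkdir -p "$instance_dir" || return 1
    PAN_HOME=$instance_dir
    export PAN_HOME
    "${PAN_BIN:-pan}" "$@"
}
```

A script such as pan.text would then end with a line like 
start_pan "$HOME/.pan2-text" "$@" (the directory name is only an 
example), and pan.bin and pan.test would differ only in the directory.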

For shared settings files, such as my scorefile and the accels.txt 
keyboard shortcut config, symlinks work wonderfully, and I only have the 
one common config file to worry about for all three.  Otherwise, the 
separate instances use the files in their respective data dirs.
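
The symlinking can be scripted too.  This is just a sketch under the 
assumption that the shared copies live together in one directory; the 
function name and the directory layout are made up, and only accels.txt 
is a file name taken from above.

```shell
# link_shared SHARED_DIR INSTANCE_DIR FILE...
# Replace each FILE in INSTANCE_DIR with a symlink to the copy in
# SHARED_DIR.  Pass SHARED_DIR as an absolute path so the links resolve
# no matter where they are followed from.
link_shared() {
    shared_dir=$1
    instance_dir=$2
    shift 2
    for name in "$@"; do
        if [ ! -e "$shared_dir/$name" ]; then
            echo "missing shared file: $shared_dir/$name" >&2
            return 1
        fi
        # -s makes a symbolic link, -f replaces an existing file or link.
        ln -sf "$shared_dir/$name" "$instance_dir/$name" || return 1
    done
}
```

For example, link_shared "$HOME/.pan-shared" "$HOME/.pan2-text" accels.txt 
would point the text instance at the common shortcut file (both 
directory names are hypothetical).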

As pertaining to pan efficiency, Tip #I is useful because it allows me to 
keep separate text and binary settings, using those most efficient (or 
that I simply prefer) for each.  As you'll note in the tips below, that 
does help.

Tip #II.  Again as mentioned, it's possible to change the default 10 MB 
message cache size.  I already mentioned that I keep the text instance 
message cache at several gigs, and set no-expire for the servers, altho 
that would interfere with efficient binary processing.  Below I'll 
explain the way I handle binaries.  The setting is in the preferences.xml 
file, in the data dir, as set above or ~/.pan2/ by default.  The setting 
is (the 5120 value being for my text instance):

<int name='cache-size-megs' value='5120'/>
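
If you'd rather not hand-edit the file, something like the following 
sketch could change the value for you.  It assumes the entry already 
exists and looks exactly like the line quoted above; the function name 
is made up, and it is safest to run it while pan is closed, in case pan 
rewrites preferences.xml on exit.

```shell
# set_cache_megs PREFS_FILE MEGS
# Rewrite the value of the cache-size-megs entry in a pan preferences.xml,
# keeping a backup copy of the original as PREFS_FILE.bak.
set_cache_megs() {
    prefs=$1
    megs=$2
    # Accept whole numbers only.
    case $megs in
        ''|*[!0-9]*) echo "MEGS must be a whole number" >&2; return 1 ;;
    esac
    grep -q "name='cache-size-megs'" "$prefs" || {
        echo "no cache-size-megs entry in $prefs" >&2
        return 1
    }
    cp "$prefs" "$prefs.bak" || return 1
    # Swap only the digits inside value='...' on the matching entry.
    sed "s/\(name='cache-size-megs' value='\)[0-9]*'/\1$megs'/" \
        "$prefs.bak" > "$prefs"
}
```

Calling set_cache_megs "$HOME/.pan2/preferences.xml" 5120 would give the 
5 gig setting shown above.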

Tip #III.  In combination with the multiple instances and custom cache 
size of tips #I and II, how I actually deal with binaries is a bit 
different than outlined in tips 3-5.  I set a very large cache, actually 
a dedicated binary message cache partition, 12 gigs, with the cache-size 
option set accordingly, and do the following:

Instead of using the normal download function which automatically saves 
the files and then deletes the cached-messages (but not the headers/
overviews, those are marked read), I prefer the download to cache 
function.  What I'll do is download the headers/overviews, sort thru 
them, deleting what I know I don't want, sometimes downloading a sample 
here and there of stuff I'm not sure about, thus allowing me to delete 
the entire series without actually downloading it, if I don't like it, 
then select-all (or do it with a reasonable size group of overviews at a 
time, if I want to split it into several jobs so I can start working on 
the first one before the others are done) and download to cache.

Then I go do something else, maybe go to sleep or to work, or play a 
computer game, or catch up with my text instance.  Whatever.  Anyway, I 
come back to the binary pan instance later, after it has grabbed 
everything and stashed it in cache.  (Obviously, this won't work with the 
10 MB default cache size as after it hits that 10 MB, it'll be deleting 
them as fast as it downloads them!  So this only works in combination 
with tip #II.)

Then, everything's already downloaded and local, so working with it is 
pretty fast!  I then go thru and do my sorting, saving what I want, 
deleting the messages and headers/overviews as I'm done with them.

This works far better for me than the download-and-auto-save, because 
using the download-and-auto-save functionality, everything has to be 
saved to an intermediate directory, losing the post context in the 
process.  When I then go to the intermediate directory, I have the 
filenames, but that's it, no who posted it, no date posted, no additional 
information that might have been in the subject line, etc.  It gets all 
mixed up, and besides that, it's awful easy to just keep downloading to 
the intermediate dir, without actually going thru and doing the final 
processing, with the intermediate dir thus growing and growing, until one 
gives up and moves everything off to an unsorted dir somewhere, and 
starts over.  Pretty soon one has unsorted1, unsorted2...

But by downloading to cache, then working with everything already local, 
I can select series and save them all to their final location directly.  
As I do so, since I'm working from pan itself, I still have all that 
extra message metadata, who posted it, when, what they said in the 
subject, etc, if I want to use any of that information in deciding where 
I'm going to save to, or if I want to create a text file there with 
additional information.  All that would be lost if I used the auto-save 
functionality and was trying to sort out the jumble of files that 
ultimately ends up there.

Typically what I'll do is set up the downloads for all my usual binary 
groups, then do whatever.  The message cache thus must be big enough to 
contain all the downloaded messages from all groups.  When I come back, I 
can start working thru them, deleting the messages and their headers as 
I'm done with them.

When I'm totally done, I shut down pan (well, the binary instance), and 
manually delete the message cache itself.  Then the next time I start the 
binary pan instance, it's starting with an entirely clean cache.  Because 
I've deleted the headers/overviews as I went as well, pan doesn't have 
but the few I left as incomplete still around to try to properly thread 
new messages into, when I restart.  All it has is the individual article 
numbers that it has already seen (and that I deleted) as tracked in the 
newsrc files for each server.
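
The manual cache wipe can be wrapped in a small helper, sketched below.  
The function name is made up, and the cache directory is deliberately 
left as an argument: check your own data dir (or dedicated partition) 
for where the cached articles actually live before pointing this at 
anything.  Run it only after the pan instance has been shut down.

```shell
# clear_cache CACHE_DIR
# Delete everything inside CACHE_DIR but keep the directory itself,
# so a mount point or a path named in the config stays valid.
clear_cache() {
    cache_dir=$1
    # Refuse an empty or non-directory argument rather than guess.
    if [ -z "$cache_dir" ] || [ ! -d "$cache_dir" ]; then
        echo "not a directory: $cache_dir" >&2
        return 1
    fi
    # -mindepth 1 skips CACHE_DIR itself; -delete removes contents
    # depth-first, so subdirectories are emptied before being removed.
    find "$cache_dir" -mindepth 1 -delete
}
```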

With a clean cache, and no or only the partially complete headers/
overviews to worry about, even with a million or two headers coming down 
in an update, pan performance stays MUCH faster than it would be if it 
were trying to plug that million or two headers into an existing thread 
structure of 10 or 20 million headers!  It still takes a bit of time, but 
given the number of headers, that's entirely reasonable.

It's also worth noting that doing it this way, pan's not trying to 
download the messages, and decode and save the binaries, both at the same 
time.  It downloads them to cache only, then later, I come back and do 
the decode and saving bit.  This makes both steps individually faster, 
since neither pan nor the slow disk is having to try to deal with both at 
the same time.

So there's actually a number of benefits to doing it this way.  As 
mentioned, I still have access to the post metadata when I'm trying to 
sort the binaries into their final location.  That's pretty nice on its 
own.  But it also means pan is far more efficient at processing things, 
since it doesn't have a huge buildup of cruft.  Third, I can set it up 
and let pan do its downloading while I do something else, and when I do 
come back and deal with it, it's all local, thus much faster to access.  
Finally, when pan's downloading, that's all it's doing, it's not trying 
to decode and save at the same time.  And when it's decoding and saving, 
it's not downloading at the same time.  Well, except for those few 
samples I download individually, before I set it to work on the big batch 
download.

The negative is that the encoded messages take up more room than just the 
binaries do.  With yEnc, it's only 5% or so, so that's not too bad, but 
UUE and MIME/Base64 are both 33% overhead, so you need a bit over four 
gigs of cache to store only 3 gigs of actual binary files.  But disk 
space is cheap these days...
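
To put numbers on that, here is a small sizing sketch.  The function 
name is made up; the ratios are the round figures from the paragraph 
above (one-third overhead means the encoded form is 4/3 the size of the 
binary), and article headers and line breaks add a little on top, which 
is presumably where the "bit over" comes from.

```shell
# cache_mb_needed RAW_MB NUM DEN
# Cache space needed to hold RAW_MB of decoded binaries when the encoded
# form is NUM/DEN times the decoded size, rounded up to a whole megabyte.
cache_mb_needed() {
    echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

cache_mb_needed 3072 4 3      # Base64/UUE at one-third overhead: prints 4096
cache_mb_needed 3072 105 100  # yEnc at the 5% figure above: prints 3226
```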

Still, while that's the way that works best for me, it's obviously not 
everyone's style, or pan would default to downloading to cache, instead 
of the download and save default it currently has.  But that's why I 
listed these three tips separately and marked them as distinctly 
optional.  It does work well, but it's not for everybody.  Meanwhile, if 
people just use tips 1-9, or even just 1 and 3 mainly, it'll likely 
improve their experience dramatically, even if they don't choose to do 
the whole separate pan instances, huge cache, download-to-cache, then go 
thru and save, thing.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman