From: Duncan
Subject: [Pan-users] Re: Big XML files... (was Re: Re: Better processing of very large groups?)
Date: Sat, 4 Jul 2009 15:37:51 +0000 (UTC)
User-agent: Pan/0.133 (House of Butterflies)
Ron Johnson <address@hidden> posted
address@hidden, excerpted below, on Fri, 03 Jul 2009 21:56:36
-0500:
> Also (and maybe because I'm a DBA), this problem just *screams* for
> SQLite and a database in the "First Normal Form".
[ OK, this is a very long post, I know (tho I haven't counted the lines,
200? 250? More? I'll let pan show me that when I post and download it).
But reading it and following even a few of the included tips should
vastly improve your pan experience. =:^) Following all of them... well,
that's up to you, but it works well for me! ]
Actually, before the C++ rewrite (the original was C coded) and the
changes that allowed pan to scale to millions of headers/overviews per
group from 100k, Charles' plan was, for quite some time, to eventually
switch to just that, an sqlite backend.
I don't know why he didn't, except that during the 3-ish years in which
pan seemed to be abandoned (we later learned he used at least part of
that time to do the rewrite), several others (K. Haley I believe being
one of them) began to experiment with pan, and some of those folks were
database folks (I'm not sure if K. Haley is one of /them/). By the time
Charles announced the C++ rewrite (aka new-pan, what we use now), there
had actually been some preliminary numbers posted to the pan-devel list.
I think that by using some of the data management techniques that
Charles /did/ use in new-pan, he got it to scale "reasonably". Now, it
/does/ work when you throw even several million headers at it, with
memory use scaling accordingly. Before, 100k headers was bad, and above
200k, pan would literally sit there for days, not really increasing
memory usage too badly, but just not getting anywhere -- it simply didn't
scale at all above 200k headers or so, memory or no memory. The new-pan
numbers probably looked reasonably close to the preliminary database
numbers as well -- at least close enough that he judged the switch not
worth the trouble, given the clear benefit of plain text files.
But, meanwhile, for those dealing with those huge groups, there are some
usage patterns that work rather better than others, and thus some usage
patterns that users should avoid in the large groups, if they want a
reasonably working pan.
Tip #1, and the most important, particularly since pan is a GNOME family app and as
many Ubuntu users can attest, PAN AND THE GNOME ASSISTIVE TECHNOLOGIES
APPLET DO NOT GET ALONG WELL AT ALL!!! When that applet is running, it
apparently polls /something/ often enough to keep pan from making
efficient progress at header sorting, in particular. What might
otherwise take 30 seconds or maybe two minutes (still long enough), ends
up taking half an hour... two hours... more... So if you're running
that, do yourself a favor and at LEAST shut it off when running pan.
Either that, or switch to something other than pan, as the two simply
don't get along. For more details, see the list archives.
With that out of the way...
The key to working /with/ pan on large groups, not fighting it, thus
making the problem worse, is to understand what takes it the most time,
and do what you can, including changing your behavior where necessary, to
minimize that issue. (Yeah, I know, but the alternatives are to simply
wait pan out, as at least it /does/ continue making progress now, unlike
with the old code, or to switch to something other than pan. If you'd
rather do the latter than change your behavior while using pan, well,
there /are/ other solutions out there, tho none quite like pan.)
So here's a bit of a peek behind the curtain, explaining in plain English
a bit of what pan's actually doing...
What takes pan the longest time (and uses the most memory too, I believe)
isn't the actual downloading of either messages or headers/overviews, but
sorting those overviews/headers, plugging new ones in at the correct
location in the thread or multipart message as necessary, doing the
subject and author string manipulations that help it keep a reasonable
handle on memory, etc.
It saves that threaded list on exit (of the group or pan itself), so it
doesn't have to rethread existing overviews when it comes back to them,
but when it starts up, once that list gets above a certain size, it still
does enough disk churning verifying the list, and checking what's in
cache so the little cached icon displays correctly, that it gets painful
on a cold cache. (Once the data's all in cache, unless it's flushed, pan
starts up quite fast.)
Here's how I know the effect of that. When it starts, pan checks the
PAN_HOME environment variable to see where its config is, defaulting to
~/.pan2 if the variable isn't set. (I'm not sure what it does if the
variable is set but the contents aren't a sane path.) I take advantage
of that to run several separate pan instances, each with its own config.
On my text group instance, I set no-expire on the overviews/headers, and
expanded the cache (the setting for that isn't in the GUI,
so it's a direct config file edit, preferences.xml) from the default 10
MB to a couple gig, so I could save a decent history. I have posts going
back a couple years in several groups, and on some of the gmane list2news
list archive groups, I have the entire group history as it appears at
gmane. Thus, I have quite a number of overviews/headers archived, but
(for my text instance) they're all text groups, so it's only... half a
gig or so of actual message cache.
Loading that text instance of pan, cold cache, takes probably a couple or
three minutes of disk thrashing -- and that's on a 4-spindle RAID-6, so
it goes MUCH faster than it would on a typical single-spindle pan storage
dir machine. Of course, as I continue to accumulate message overviews
and history, that load time continues to increase. =:^( But once pan is
loaded and thus the cache hot, I can quit pan and restart close enough to
instantly that I don't notice the delay.
As a result, and here's tip #2, I load pan (the text instance) with my
KDE session and keep it running more or less constantly, as long as I'm
in X. I have 8 gigs RAM, so it's no big deal there, and if I do
something that flushes cache, with pan running, I don't lose all of it,
at least, so while it might take a few extra seconds to start up, it's
not like it is from a cold cache.
So tip #2 is, if your header/overview list and cache are large enough that the
pan start time is getting uncomfortably long, consider starting it with
your desktop session, letting it load while you do other things. Then
it'll be loaded when you get to it. Even if you then quit pan, as long
as it hasn't been quit for too long and the cache flushed, it'll restart
far faster, since most of that data will still be in cache. But it's
generally far more effective to keep pan running while doing anything
disk cache intensive, than it is to quit pan and restart it afterward.
This is because pan doesn't take so much memory once all that data is
loaded -- it's the loading from disk that's a pain.
It should be noted that a good portion of this time, however, would be
avoidable, if I (1) hadn't fiddled with the default 10 MB cache, and (2)
had the overview/header expiry set to something more "reasonable".
That's tip #3, then. There's a tradeoff between saved headers/overviews
(and to a lesser degree message cache, but even with a default 10MB
message cache, loading the headers from cold-disk-cache takes time) and
from-disk load time. For binaries especially, once you've processed
them, you don't tend to need the headers any more, so I STRONGLY
recommend a reasonably short expiry, and for even more effective control
of the problem, DELETING MESSAGES (not simply letting them mark-as-read
and expire naturally) AS YOU ARE DONE WITH THEM. Of course, as I said,
that's really more workable with binary groups than with text, as often,
you want to keep text around for a while. But you can still set the
expiry as short as you can reasonably manage for text groups, which
should be all it'll affect on general purpose text/binary instances if
you use the delete binaries immediately when done rule, because they
should already be deleted by the time the expiry comes round.
This #3 is in fact probably the most critical (other than #1) for active
binary users, especially on servers such as Giganews, with such high
retention. If you start actively deleting headers/overviews for binaries
when you are done, and set expiry (which will now affect text-only, since
you've deleted the binaries) as short as possible, say two weeks, you
WILL notice a difference! Here, we're talking startup time, but as we'll
see, it affects overview/header update time as well.
OK, time to explain a bit more about pan's processing. Once it has an
existing list of threaded messages, when it updates headers/overviews, it
takes a bit of time to plug the new ones into the appropriate place in
the existing list. Obviously, the larger the existing list and the more
new ones that came in in the update, the longer this sorting process is
going to take.
That's where tip #3 affects update as well. If your existing header/
overview list is shorter, because you manually deleted the ones you were
done with, pan's processing time will be shorter as well. Thus, it does
NOT pay to keep a list of already processed binary group headers/
overviews around between sessions (incompletes that you're waiting for
completion of being an exception), as that just complicates pan's job,
making it take longer to do that processing than it has to. Again,
delete messages (headers/overviews) in the binary groups as soon as
you're done saving off the binaries and otherwise processing them. It
makes a HUGE difference!
Tip #4. For high volume binary groups, or on high retention servers, for
ALL binary groups, when you first browse them, DO NOT DOWNLOAD ALL
HEADERS/OVERVIEWS AT ONCE. Unfortunately, pan has a get the latest N
days/number-of-headers option, but not a get the oldest N option. Thus,
if you're wanting to go back quite some time, get the N latest, process
what you can (thus in accord with tip #3, deleting the ones you're
finished with), then get the next N latest, process them (again deleting
what you're finished with), until you've gone back as far as you wish or
hit the retention limit. Like #3, the reason here is simple. Keep the
number of overviews pan has to deal with at one time to something
reasonable.
Tip #5. The implications of #3 and 4 should be clear enough. Don't let
unseen messages in a binary group build up unnecessarily between visits.
Just because the server you use has the retention to let you visit a busy
binary group every couple of weeks, doesn't mean you're going to be
making it easy on pan -- and thus on yourself waiting for pan -- if you
wait two weeks between visits. Every day is nice, tho of course there
will be days you'll be doing other stuff and don't get to it. But for
the busy groups, do at least try to get to them twice a week or so, and
if they're indeed that busy, expect a bit of extra trouble if you're
waiting even that long.
It follows then, that if it has been a while since you visited a group,
and you know it's a busy group, you may find the incremental approach of
tip #4 useful to avoid having pan taking such huge bites at once.
Tip #6. This one isn't directly related to the above or to this problem,
but it's generally useful and helps with this problem. It's simple
enough. Remember that changing groups triggers pan's save group state
functionality, as does quitting and restarting pan, but that takes longer
and is more hassle. Thus, when processing large groups either text or
binary, it can be wise to periodically switch to a different group and
back, just so pan saves the state of where you were, and if pan or the
system crashes for some reason, you'll only lose track of the read and
deleted messages back to the last time you switched groups. When you're
processing thousands of overviews, having pan or the system crash and
lose state on a couple thousand overviews worth of work isn't fun, so
avoid it, by switching out of and back into the group every 200-500
overviews worth (numbers that seem to work well for me).
As you'll note, I mentioned that pan loses delete state. When you delete
a message, it deletes the message itself in cache immediately, but again,
doesn't update the group state until you switch groups. If you crash
before that, the overviews/headers will show up again (but without the
cached messages) as undeleted and probably unread (unless you'd read
them, switched out of the group and back, then deleted them, in which
case they'll show up as read since pan had that state stored when you
switched out and back in).
Tip #7 follows both from #6, and as a consequence of 3-5. Turn off pan's
get-new-headers-when options (under preferences, behavior tab, groups).
In particular, you don't want it auto-fetching new headers/overviews when
you switched out and back into a group just so pan would update its disk-
saved state. However, I've also found that it works better if you let
pan startup, then switch to a group and manually get new headers, then
when pan's finished with that, switch to the next group... etc. Again,
don't give pan too many things to do at once and it works better.
(Fortunately for those of us using them, it does seem to cope reasonably
well with multiple servers, since it keeps only one common threaded list,
not one per server.)
Tip #8. Again, this is a general pan tip. Don't use the mark-entire-
group-read functionality, either in preferences (when leaving group, when
exiting pan) or manually. Due to the way modern servers work (new posts
can come in numbered below the group's sequential high-water-mark), this
is broken on many of them and you'll miss posts as a result. It seems to
have other somewhat unpredictable but generally undesired effects as
well. Just don't use it, and avoid them.
Instead, when you are done with a group, you can select-all (headers),
and use the mark read, or delete (tip #3 again) on them. One caveat with
this has to do with ignored and otherwise view-filtered posts. Since
they're not displayed, select-all won't select them, and they won't be
marked as read or deleted. For that reason, I keep all the match scores
options enabled in the view, header pane submenu, and depend on the color-
coding in the scores column to alert me to score, including ignored. For
groups with many ignored messages, however, it may be easier to either
leave the match ignored off until the end, or to sort by score
(unthreading if necessary) and deal with them first.
Tip #9. This one helps to counteract the negative effects of tips #4 and
5. You can use pan's command-line options to tell it to fetch headers
and quit: pan headers:group.name (as revealed in the help text,
pan --help; I don't actually use this one myself, and you may need
--no-gui too). You can then create a script that fetches all the headers from all
your groups for you, and use cron or other scheduler to run it
periodically, say every hour or two, or just once, say an hour before you
get home from work. While that'll accumulate headers to some extent
negating the previous tips, it'll be automated and you won't have to wait
for pan to do that sorting, as it'll already be done when you get there.
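As a rough sketch of such a script, the Python below builds one command per group in the "pan headers:group.name" form quoted above and runs them one after another, so pan only has one job at a time. The group names are made up, and the command form (including --no-gui) is taken from this post rather than tested against any particular pan version, so check pan --help first.

```python
import shlex
import subprocess


def header_fetch_commands(groups, pan_binary="pan", no_gui=True):
    """Build one 'pan headers:<group>' argv list per group name.

    The 'headers:<group>' form and '--no-gui' come from the post above;
    they are not verified here against any specific pan release.
    """
    commands = []
    for group in groups:
        argv = [pan_binary]
        if no_gui:
            argv.append("--no-gui")
        argv.append("headers:" + group)
        commands.append(argv)
    return commands


def as_shell_lines(commands):
    """Render the argv lists as shell lines, e.g. to paste into a cron script."""
    return [shlex.join(argv) for argv in commands]


def run_sequentially(commands):
    """Run the commands one at a time, waiting for each to finish."""
    return [subprocess.run(argv).returncode for argv in commands]


# Hypothetical group names, for illustration only.
EXAMPLE_GROUPS = ["alt.example.one", "alt.example.two"]
```

Calling run_sequentially(header_fetch_commands(EXAMPLE_GROUPS)) from a file that cron runs every hour or two would then do the fetching unattended. Set PAN_HOME in that script's environment if you use separate pan instances.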
Unfortunately, pan does not yet have a similar command line (or
preferences) option to let you auto-download the messages themselves.
There's discussion of adding the feature, based on the score category (so
you could download only watched messages, for instance), and Charles was
the one who actually mentioned that, so he's definitely thinking about
it, but it hasn't been implemented yet.
OK, that's the main tips, tho some more optional usage-style ones
follow. As I said, #1 is most important for those running GNOME, as
pan's hardly workable if that assistive technologies thing is running.
#3 is most critical after that, with #6 and 7 being low-cost-big-effect
tips. Follow them all, and I'm quite sure you'll see a marked
improvement, especially if you were doing all of them differently, before.
Now for the optional, usage style related ones.
Tip #I. As mentioned, it's possible to set up multiple independent pan
"instances", with separate settings, cache, everything. What I did here
is create a few simple pan starter scripts (bash), calling them pan.bin,
pan.text, and pan.test, the first two obviously for text and binary
instances, the latter to use when I'm "just browsing", since pan doesn't
fully erase group history when you delete messages and unsubscribe, and I
can manually blow it away much easier when I don't have to worry about
blowing away regular group history at the same time. Each session script
can simply set and export the PAN_HOME environmental variable pointing to
its separate data (and config) dir, before starting pan. (Here, I do a
couple other things as well, like set the gtkrc locations using a
different config var, since I use kde and that's not always set
correctly, and I HATE the default color theme pan comes up with if it
doesn't get those settings.)
For shared settings files, such as my scorefile and the accels.txt
keyboard shortcut config, symlinks work wonderfully, and I only have the
one common config file to worry about for all three. Otherwise, the
separate instances use the files in their respective data dirs.
As pertaining to pan efficiency, Tip #I is useful because it allows me to
keep separate text and binary settings, using those most efficient (or
that I simply prefer) for each. As you'll note in the tips below, that
does help.
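My own starter scripts are bash, but the same idea can be sketched in Python: set PAN_HOME to a per-instance directory, optionally symlink the shared files into it, then start pan. The base directory and instance names are placeholders, and the helpers assume pan is on the PATH and honors PAN_HOME as described above.

```python
import os
import subprocess
from pathlib import Path


def instance_env(name, base_dir, environ=None):
    """Return a copy of the environment with PAN_HOME set for one instance."""
    env = dict(os.environ if environ is None else environ)
    env["PAN_HOME"] = str(Path(base_dir).expanduser() / name)
    return env


def link_shared_file(shared_file, instance_dir):
    """Symlink a shared settings file (e.g. accels.txt) into an instance dir."""
    instance_dir = Path(instance_dir)
    instance_dir.mkdir(parents=True, exist_ok=True)
    link = instance_dir / Path(shared_file).name
    # Leave any existing file or link alone rather than overwrite it.
    if not link.exists() and not link.is_symlink():
        link.symlink_to(Path(shared_file).resolve())
    return link


def launch_instance(name, base_dir, pan_binary="pan"):
    """Start pan with PAN_HOME pointing at <base_dir>/<name>."""
    env = instance_env(name, base_dir)
    Path(env["PAN_HOME"]).mkdir(parents=True, exist_ok=True)
    return subprocess.Popen([pan_binary], env=env)
```

With that, launch_instance("text", "~/pan-instances") and launch_instance("bin", "~/pan-instances") would correspond to the pan.text and pan.bin scripts, each with its own settings, headers and cache.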
Tip #II. Again as mentioned, it's possible to change the default 10 MB
message cache size. I already mentioned that I keep the text instance
message cache at several gigs, and set no-expire for the servers, altho
that would interfere with efficient binary processing. Below I'll
explain the way I handle binaries. The setting is in the preferences.xml
file, in the data dir, as set above or ~/.pan2/ by default. The setting
is (the 5120 value being for my text instance):
<int name='cache-size-megs' value='5120'/>
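If you'd rather not hand-edit the file, here is a minimal Python sketch that changes that one value. It assumes only what the line above shows (an int element with name='cache-size-megs'); the rest of the file layout is not assumed, and adding the entry under the root element when it is missing is a guess. It keeps a .bak copy, and it is presumably safest to run it while pan is not running, since pan may rewrite its preferences on exit.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def set_cache_size_megs(prefs_path, megs):
    """Set cache-size-megs in a pan preferences.xml file.

    Returns the previous value as a string, or None if there was no entry.
    """
    prefs_path = Path(prefs_path)
    # Keep a backup copy next to the original before changing anything.
    shutil.copy2(prefs_path, prefs_path.with_name(prefs_path.name + ".bak"))

    tree = ET.parse(str(prefs_path))
    root = tree.getroot()
    previous = None
    for node in root.iter("int"):
        if node.get("name") == "cache-size-megs":
            previous = node.get("value")
            node.set("value", str(megs))
            break
    else:
        # Assumption: a missing entry can simply be added under the root.
        ET.SubElement(root, "int", {"name": "cache-size-megs", "value": str(megs)})

    tree.write(str(prefs_path), encoding="utf-8", xml_declaration=True)
    return previous
```

For example, set_cache_size_megs(Path.home() / ".pan2" / "preferences.xml", 5120) would apply the text-instance value shown above to the default data dir.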
Tip #III. In combination with the multiple instances and custom cache
size of tips #I and II, how I actually deal with binaries is a bit
different than outlined in tips 3-5. I set a very large cache, actually
a dedicated binary message cache partition, 12 gigs, with the cache-size
option set accordingly, and do the following:
Instead of using the normal download function which automatically saves
the files and then deletes the cached-messages (but not the headers/
overviews, those are marked read), I prefer the download to cache
function. What I'll do is download the headers/overviews, sort thru
them, deleting what I know I don't want, sometimes downloading a sample
here and there of stuff I'm not sure about, thus allowing me to delete
the entire series without actually downloading it, if I don't like it,
then select-all (or do it with a reasonable size group of overviews at a
time, if I want to split it into several jobs so I can start working on
the first one before the others are done) and download to cache.
Then I go do something else, maybe go to sleep or to work, or play a
computer game, or catch up with my text instance. Whatever. Anyway, I
come back to the binary pan instance later, after it has grabbed
everything and stashed it in cache. (Obviously, this won't work with the
10 MB default cache size as after it hits that 10 MB, it'll be deleting
them as fast as it downloads them! So this only works in combination
with tip #II.)
Then, everything's already downloaded and local, so working with it is
pretty fast! I then go thru and do my sorting, saving what I want,
deleting the messages and headers/overviews as I'm done with them.
This works far better for me than the download-and-auto-save, because
using the download-and-auto-save functionality, everything has to be
saved to an intermediate directory, losing the post context in the
process. When I then go to the intermediate directory, I have the
filenames, but that's it, no who posted it, no date posted, no additional
information that might have been in the subject line, etc. It gets all
mixed up, and besides that, it's awful easy to just keep downloading to
the intermediate dir, without actually going thru and doing the final
processing, with the intermediate dir thus growing and growing, until one
gives up and moves everything off to an unsorted dir somewhere, and
starts over. Pretty soon one has unsorted1, unsorted2...
But by downloading to cache, then working with everything already local,
I can select series and save them all to their final location directly.
As I do so, since I'm working from pan itself, I still have all that
extra message metadata, who posted it, when, what they said in the
subject, etc, if I want to use any of that information in deciding where
I'm going to save to, or if I want to create a text file there with
additional information. All that would be lost if I used the auto-save
functionality and was trying to sort out the jumble of files that
ultimately ends up there.
Typically what I'll do is set up the downloads for all my usual binary
groups, then do whatever. The message cache thus must be big enough to
contain all the downloaded messages from all groups. When I come back, I
can start working thru them, deleting the messages and their headers as
I'm done with them.
When I'm totally done, I shut down pan (well, the binary instance), and
manually delete the message cache itself. Then the next time I start the
binary pan instance, it's starting with an entirely clean cache. Because
I've deleted the headers/overviews as I went as well, pan doesn't have
but the few I left as incomplete still around to try to properly thread
new messages into, when I restart. All it has is the individual article
numbers that it has already seen (and that I deleted) as tracked in the
newsrc files for each server.
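The manual cache wipe can be scripted too. The sketch below removes every regular file directly inside a directory you pass in and reports how much it freed. Where the message cache actually lives inside the data dir is not stated in this post, so the path is left to the caller, and it should only be run after that pan instance has been shut down.

```python
from pathlib import Path


def clear_message_cache(cache_dir):
    """Delete the regular files directly inside cache_dir.

    Returns (files_removed, bytes_freed). Subdirectories and symlinks are
    left alone; the caller must point this at the right directory.
    """
    cache_dir = Path(cache_dir)
    removed = 0
    freed = 0
    for entry in cache_dir.iterdir():
        if entry.is_file() and not entry.is_symlink():
            freed += entry.stat().st_size
            entry.unlink()
            removed += 1
    return removed, freed
```

The header/overview files and the newsrc files are untouched by this, which is the point: pan still knows which article numbers it has already seen.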
With a clean cache, and no or only the partially complete headers/
overviews to worry about, even with a million or two headers coming down
in an update, pan performance stays MUCH faster than it would be if it
were trying to plug that million or two headers into an existing thread
structure of 10 or 20 million headers! It still takes a bit of time, but
given the number of headers, that's entirely reasonable.
It's also worth noting that doing it this way, pan's not trying to
download the messages, and decode and save the binaries, both at the same
time. It downloads them to cache only, then later, I come back and do
the decode and saving bit. This makes both steps individually faster,
since neither pan nor the slow disk is having to try to deal with both at
the same time.
So there's actually a number of benefits to doing it this way. As
mentioned, I still have access to the post metadata when I'm trying to
sort the binaries into their final location. That's pretty nice on its
own. But it also means pan is far more efficient at processing things,
since it doesn't have a huge buildup of cruft. Third, I can set it up
and let pan do its downloading while I do something else, and when I do
come back and deal with it, it's all local, thus much faster to access.
Finally, when pan's downloading, that's all it's doing, it's not trying
to decode and save at the same time. And when it's decoding and saving,
it's not downloading at the same time. Well, except for those few
samples I download individually, before I set it to work on the big batch
download.
The negative is that the encoded messages take up more room than just the
binaries do. With yEnc, it's only 5% or so, so that's not too bad, but
UUE and MIME/Base64 are both 33% overhead, so you need a bit over four
gigs of cache to store only 3 gigs of actual binary files. But disk
space is cheap these days...
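To put rough numbers on that, here's a small Python sketch of the arithmetic, using the overhead figures quoted above (about 5% for yEnc, 33% for UUE and Base64). They are approximations, and message headers add a little more on top.

```python
# Approximate encoding overhead, as fractions, from the figures above.
ENCODING_OVERHEAD = {"yenc": 0.05, "uue": 0.33, "base64": 0.33}


def cache_needed_gb(binary_gb, encoding):
    """Cache space needed to hold binary_gb of decoded files, still encoded."""
    return binary_gb * (1.0 + ENCODING_OVERHEAD[encoding])


def binaries_that_fit_gb(cache_gb, encoding):
    """Decoded size of the binaries that fit in a cache of cache_gb."""
    return cache_gb / (1.0 + ENCODING_OVERHEAD[encoding])
```

So cache_needed_gb(3, "base64") comes out at about 4 gigs, matching the figure above, and binaries_that_fit_gb(12, "yenc") says a 12 gig cache partition holds a bit over 11 gigs of yEnc-posted binaries.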
Still, while that's the way that works best for me, it's obviously not
everyone's style, or pan would default to downloading to cache, instead
of the download and save default it currently has. But that's why I
listed these three tips separately and marked them as distinctly
optional. It does work well, but it's not for everybody. Meanwhile, if
people just use tips 1-9, or even just 1 and 3 mainly, it'll likely
improve their experience dramatically, even if they don't choose to do
the whole separate pan instances, huge cache, download-to-cache, then go
thru and save, thing.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman