From: Sebastian Hammer
Subject: [Koha-zebra] Re: Import Speed
Date: Thu, 02 Mar 2006 14:07:04 -0500
User-agent: Mozilla Thunderbird 1.0.7 (Macintosh/20050923)
Mike Taylor wrote:
>> Date: Thu, 02 Mar 2006 11:05:44 -0500
>> From: Sebastian Hammer <address@hidden>
>>
>> Importing records one at a time when first building a database, or when doing a batch update that is a substantial percentage of the size of the database, is not a good idea. The software has no way to optimize the layout of the index files, so for each record update, things get shuffled around, resulting in very sluggish update performance and a less-than-ideal layout inside the index files.
>
> Sure, but ...
>
>> It would be highly advisable to do at least the initial import from the command-line. I think it would make a lot of sense if this could be done well from the protocol, but AFAIK, the extended service interface at the moment only allows you to insert one record at a time.
>
> But -- ?? What magic does the command-line import have access to that ZOOM update doesn't? Clearly it's using some kind of in-memory caching to hugely reduce the frequency of disk-writes, but why shouldn't that also be used when doing a ZOOM update? Isn't that (part of) the purpose of delaying the "commit" call? If not, then we need to add
>
>     $conn->option("updateCacheSize" => 100*1024*1024)
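(For concreteness, the one-record-at-a-time path we are both talking about looks roughly like this in ZOOM-Perl -- just a sketch based on my reading of the ZOOM::Package interface, not Koha's actual code; the host, database and variable names are made up and error handling is left out:

    use ZOOM;

    my $conn = new ZOOM::Connection("localhost:9999/mydb");

    foreach my $rec (@marcxml_records) {
        my $p = $conn->package();
        $p->option(action => "specialUpdate");   # insert-or-replace
        $p->option(record => $rec);
        $p->send("update");                      # one record per round trip
        $p->destroy();
    }

    my $c = $conn->package();
    $c->send("commit");                          # flush the shadow files
    $c->destroy();

Keep that picture in mind for what follows.)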
'commit' has nearly nothing to do with it.

When you insert a new record, it is scanned, and keys are extracted from the record according to the indexing rules found in the .abs file. These keys are written to the disk, sorted, and then merged into the index. If you update a thousand records at the same time, in a single operation, then all of those keys are extracted and written into the indexes in a single pass, which is much more efficient than doing it in a thousand passes. When you do a first-time update of a thousand records, things are even more efficient because the merging can be dropped entirely -- keys are written to the indexes just about as fast as the disk can eat them, with minimum seeks.
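(Concretely, that single-pass load is what the command-line tool gives you -- something along the lines of

    zebraidx -c zebra.cfg update /path/to/records
    zebraidx -c zebra.cfg commit

with the exact flags depending on your configuration, and the commit step only needed if shadow registers are enabled: one key-extraction pass, one sort, one mostly-sequential write into the registers.)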
Doing multiple, single updates between commits doesn't help here. Keys are still extracted from the records and merged into the indexes once for each individual update operation. The shadow files merely record and defer the physical 'write' operations so they can be executed later, independently of the 'read' operations. Update a thousand records one at a time between commits and you just end up with some horribly complex shadow files, and the system will probably have to work pretty hard to do the commit step.
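(To be clear about what the shadow files are: they're just the area named by the "shadow" line in zebra.cfg, e.g. something like

    shadow: /var/zebra/shadow:100M

and a commit copies the deferred writes from there into the live registers. The shadow mechanism defers the physical writes; it does not batch up the key extraction and merging.)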
It would be much better if we had a new stage between the updating of records and the commit... something to allow us to transfer a large number of records (preferably more than one per operation, to cut down on the round-trip traffic), THEN index them, THEN commit the changes. Then we'd be able to do remote updates as efficiently as we can do them locally.
Something like:

1. Update, update, update, update ...
2. Index
3. Commit

Etc. The problem, as far as I can tell, is that you can only transfer records one at a time in the present extended services system, and they're extracted and indexed one at a time. This is a fine way to update, well, one record at a time, but it's just about the worst way possible to update 1000 records at a time.
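(If we did grow that extra stage, the client side might end up looking something like this -- purely hypothetical, reusing $conn and @marcxml_records from the sketch above; the "index" task type below does not exist in the extended services today:

    my $p = $conn->package();
    foreach my $rec (@marcxml_records) {
        $p->option(action => "recordInsert");
        $p->option(record => $rec);
        $p->send("update");      # 1. transfer only (hypothetical semantics)
    }
    $p->send("index");           # 2. extract and merge all keys in one pass (does not exist yet)
    $p->send("commit");          # 3. flush the shadow files, as today
    $p->destroy();

Even better would be letting a single "update" carry a batch of records, to cut down on the round trips as well.)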
It shouldn't be hard to implement either -- all the hard work has already been done, the logic just needs to be implemented differently.
Mike, you can speak to Adam about this over lunch if you get a chance. It is possible that I misrepresent what happens -- but this reflects my understanding.
--Seb
--
Sebastian Hammer, Index Data
address@hidden   www.indexdata.com   Ph: (603) 209-6853