From: Sebastian Hammer
Subject: [Koha-zebra] Re: Import Speed
Date: Thu, 02 Mar 2006 14:07:04 -0500
User-agent: Mozilla Thunderbird 1.0.7 (Macintosh/20050923)
Mike Taylor wrote:
>> Date: Thu, 02 Mar 2006 11:05:44 -0500
>> From: Sebastian Hammer <address@hidden>
>>
>> Importing records one at a time when first building a database, or when doing a batch update that is a substantial percentage of the size of the database, is not a good idea. The software has no way to optimize the layout of the index files, so for each record update, things get shuffled around, resulting in very sluggish update performance and a less-than-ideal layout inside the index files.
>
> Sure, but ...
>
>> It would be highly advisable to do at least the initial import from the command-line. I think it would make a lot of sense if this could be done well from the protocol, but AFAIK, the extended service interface at the moment only allows you to insert one record at a time.
>
> But -- ?? What magic does the command-line import have access to that ZOOM update doesn't? Clearly it's using some kind of in-memory caching to hugely reduce the frequency of disk-writes, but why shouldn't that also be used when doing a ZOOM update? Isn't that (part of) the purpose of delaying the "commit" call? If not, then we need to add
>
>     $conn->option("updateCacheSize" => 100*1024*1024)
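(For concreteness, the one-record-at-a-time path we are both talking about looks roughly like this in ZOOM-Perl -- just a sketch based on my reading of the ZOOM::Package interface, not Koha's actual code; the host, database and variable names are made up and error handling is left out:

    use ZOOM;

    my $conn = new ZOOM::Connection("localhost:9999/mydb");

    foreach my $rec (@marcxml_records) {
        my $p = $conn->package();
        $p->option(action => "specialUpdate");   # insert-or-replace
        $p->option(record => $rec);
        $p->send("update");                      # one record per round trip
        $p->destroy();
    }

    my $c = $conn->package();
    $c->send("commit");                          # flush the shadow files
    $c->destroy();

Keep that picture in mind for what follows.)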
'commit' has nearly nothing to do with it.

When you insert a new record, it is scanned, and keys are extracted from the record according to the indexing rules found in the .abs file. These keys are written to the disk, sorted, and then merged into the index. If you update a thousand records at the same time, in a single operation, then all of those keys are extracted and written into the indexes in a single pass, which is much more efficient than doing it in a thousand passes. When you do a first-time update of a thousand records, things are even more efficient because the merging can be dropped entirely -- keys are written to the indexes just about as fast as the disk can eat them, with minimum seeks.
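(Concretely, that single-pass load is what the command-line tool gives you -- something along the lines of

    zebraidx -c zebra.cfg update /path/to/records
    zebraidx -c zebra.cfg commit

with the exact flags depending on your configuration, and the commit step only needed if shadow registers are enabled: one key-extraction pass, one sort, one mostly-sequential write into the registers.)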
Doing multiple, single updates between commits doesn't help here. Keys are still extracted from the records and merged into the indexes once for each individual update operation. The shadow files merely record and defer the physical 'write' operations so they can be executed later, independently of the 'read' operations. Update a thousand records one at a time between commits and you just end up with some horribly complex shadow files, and the system will probably have to work pretty hard to do the commit step.
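(To be clear about what the shadow files are: they're just the area named by the "shadow" line in zebra.cfg, e.g. something like

    shadow: /var/zebra/shadow:100M

and a commit copies the deferred writes from there into the live registers. The shadow mechanism defers the physical writes; it does not batch up the key extraction and merging.)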
It would be much better if we had a new stage between the updating of records and the commit... something to allow us to transfer a large number of records (preferably more than one per operation, to cut down on the round-trip traffic), THEN index them, THEN commit the changes. Then we'd be able to do remote updates as efficiently as we can do them locally.
Something like:

1. Update, update, update, update ...
2. Index
3. Commit

Etc. The problem, as far as I can tell, is that you can only transfer records one at a time in the present extended services system, and they're extracted and indexed one at a time. This is a fine way to update, well, one record at a time, but it's just about the worst way possible to update 1000 records at a time.
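(If we did grow that extra stage, the client side might end up looking something like this -- purely hypothetical, reusing $conn and @marcxml_records from the sketch above; the "index" task type below does not exist in the extended services today:

    my $p = $conn->package();
    foreach my $rec (@marcxml_records) {
        $p->option(action => "recordInsert");
        $p->option(record => $rec);
        $p->send("update");      # 1. transfer only (hypothetical semantics)
    }
    $p->send("index");           # 2. extract and merge all keys in one pass (does not exist yet)
    $p->send("commit");          # 3. flush the shadow files, as today
    $p->destroy();

Even better would be letting a single "update" carry a batch of records, to cut down on the round trips as well.)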
It shouldn't be hard to implement either -- all the hard work has already been done, the logic just needs to be implemented differently.
Mike, you can speak to Adam about this over lunch if you get a chance. It is possible that I misrepresent what happens -- but this reflects my understanding.
--Seb
--
Sebastian Hammer, Index Data
address@hidden   www.indexdata.com   Ph: (603) 209-6853