[Koha-zebra] Re: Import Speed
From: Sebastian Hammer
Subject: [Koha-zebra] Re: Import Speed
Date: Thu, 02 Mar 2006 11:05:44 -0500
User-agent: Mozilla Thunderbird 1.0.7 (Macintosh/20050923)
Joshua,
Importing records one at a time when first building a database, or when
doing a batch update that is a substantial percentage of the size of the
database, is not a good idea. The software has no way to optimize the
layout of the index files, so for each record update, things get
shuffled around, resulting in very sluggish update performance and a
less-than-ideal layout inside the index files.
It would be highly advisable to do at least the initial import from the
command-line. I think it would make a lot of sense if this could be done
well from the protocol, but AFAIK, the extended service interface at the
moment only allows you to insert one record at a time.
> Can we just process the raw MARC? Why did we choose the '.xml'
> storage method in Zebra and is it a good choice? Would '.sgml' or
> '.marc' be a better choice (because we could batch import directly
> instead of '.xml's one-at-a-time)? Could we somehow use '.marc' for
> the import and then switch to '.xml'?
That's a good question. You use .xml because extended services only work
with XML. It *may* be possible to ingest records from the command-line
as grs.marcxml (which reads MARC records and renders them internally as
MARCXML), then do subsequent updates as XML, doing the conversion on the
client side. I say *may*, because I haven't tried that, but I think it'd
be worth a shot, and the experiment should be easy to run:
1: Start with a sample of MARC records
2: Build the initial index like so:
   % zebraidx init
   % zebraidx -f 10 -n -t grs.marcxml update recordfile
   (-n disables the shadow system for this update)
This should run pleasantly fast compared to what you see now.
3: Try to update some records as MARCXML.
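
For step 3, the client-side conversion could look roughly like the sketch below. It is an illustration, not code from this thread: the function name marc_to_marcxml is hypothetical, and it assumes well-formed ISO 2709 records whose field data is UTF-8 (MARC-8 data would need a character-set translation first). A real deployment would more likely rely on an established MARC library.

```python
import xml.etree.ElementTree as ET

FIELD_END = b"\x1e"   # ISO 2709 field terminator
SUBFIELD = b"\x1f"    # ISO 2709 subfield delimiter
MARCXML_NS = "http://www.loc.gov/MARC21/slim"


def marc_to_marcxml(raw: bytes) -> str:
    """Convert one binary MARC (ISO 2709) record to a MARCXML string.

    Hypothetical helper: assumes a well-formed record with UTF-8 data.
    """
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])        # offset where the field data begins
    directory = raw[24:base - 1]     # 12-byte entries, terminator excluded

    record = ET.Element("record", xmlns=MARCXML_NS)
    ET.SubElement(record, "leader").text = leader

    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = raw[base + start:base + start + length].rstrip(FIELD_END)

        if tag.startswith("00"):
            # Control fields carry plain data, no indicators or subfields.
            field = ET.SubElement(record, "controlfield", tag=tag)
            field.text = data.decode("utf-8")
        else:
            # Data fields: two indicator characters, then the subfields.
            parts = data.split(SUBFIELD)
            ind = parts[0].decode("ascii")
            field = ET.SubElement(record, "datafield", tag=tag,
                                  ind1=ind[0], ind2=ind[1])
            for part in parts[1:]:
                if not part:
                    continue  # skip empty chunks from stray delimiters
                sub = ET.SubElement(field, "subfield", code=chr(part[0]))
                sub.text = part[1:].decode("utf-8")

    return ET.tostring(record, encoding="unicode")
```

The returned string is the kind of MARCXML document that would then be submitted as a single-record update, while the bulk of the data still goes in through zebraidx.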
--Seb
> Any suggestions on how to handle the connection in a more efficient
> way?
Cheers,
--
Sebastian Hammer, Index Data
address@hidden www.indexdata.com
Ph: (603) 209-6853