[Koha-devel] switching from marc_words to zebra [LONG]
From: Paul POULAIN
Subject: [Koha-devel] switching from marc_words to zebra [LONG]
Date: Mon Jul 4 11:10:06 2005
User-agent: Mozilla Thunderbird 1.0.2 (X11/20050317)
As discussed with Joshua on IRC, here are my views on how to move from
the 2.2 MySQL-based DB to the 3.0 Zebra-based DB:
2.2 structure :
===============
In Koha 2.2, there are a lot of tables to manage biblios:
* biblio / items / biblioitems / additionalauthors / bibliosubtitle /
bibliosubject : they contain the data in a "decoded" form, i.e. not
depending on the MARC flavour. A title is called "biblio.title", not
NNN$x, where NNN is 245 in MARC21 and 200 in UNIMARC! The primary key
for a biblio is biblio.biblionumber.
* marc_biblio : a table that contains only a little information:
- biblionumber (biblio PK)
- bibid (MARC PK; a design mistake I made, for sure)
- frameworkcode (used to know which cataloguing form Koha must use; see
marc_*_structure below)
* marc_subfield_table : this table contains the MARC data, one line for
each subfield, with "order" columns to keep track of MARC field &
subfield order (to be sure repeated fields are retrieved in the correct
order).
* marc_word : the -huge- table that is the index for searches. The
structure works correctly with small to medium data sizes, but 50,000
complete MARC biblios is about the upper limit the system can handle.
* marc_*_structure (* being tag & subfield) : the table where the
library defines how its MARC works. For each field/subfield it contains
a lot of information: what the field/subfield contains ("1st statement
of responsibility"), where to put it in a given framework (in the MARC
editor), whether the value must be copied to the "decoded" part of the
DB (200$f => biblio.author in UNIMARC), and what kind of constraint
applies during typing (for example, showing a list of possible values).
2.2 DB API
==========
The DB API is located in C4/Biblio.pm for biblio/items management &
C4/SearchMarc.pm for search tools.
The key phrase here is: "heavy use of MARC::Record".
All biblios are stored in a MARC::Record object, as are items. Just be
warned that all item information must be in the same MARC tag, so each
item's MARC::Record contains only one MARC::Field.
In UNIMARC it's usually the 995 field, and in MARC21 the 952. (It can be
anything in Koha, but all item info must be in the same field, and this
field must contain only item info.)
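To make the "one field per item record" convention concrete, here is a
minimal sketch with MARC::Record (the 995 tag follows the UNIMARC habit
mentioned above; the subfield letters and values are purely illustrative,
the real mapping is configurable in Koha):

```perl
use strict;
use warnings;
use MARC::Record;
use MARC::Field;

# One item = one MARC::Record holding a single MARC::Field (995 here).
my $item = MARC::Record->new();
$item->append_fields(
    MARC::Field->new('995', ' ', ' ',
        b => 'MAIN',          # homebranch (subfield letters are assumptions)
        k => 'PQ 2668 .O8',   # item callnumber
    )
);
print $item->field('995')->subfield('k'), "\n";   # PQ 2668 .O8
```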
Biblio.pm :
------------
All record creation/modification is done through the NEWxxxyyyy subs.
perldoc C4/Biblio.pm will give you some information on how it works.
In a few words: when adding a biblio, Koha calls NEWnewbiblio with a
MARC::Record.
NEWnewbiblio calls MARCnewbiblio to handle MARC storage, then
MARCmarc2koha to build the structure for the non-MARC part of the DB,
then calls OLDnewbiblio to create the non-MARC (decoded) part of the
biblio.
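As a toy illustration of what the MARCmarc2koha step does (the real
mapping lives in the marc_*_structure tables; the hard-coded UNIMARC-style
mapping below is an assumption for the example):

```perl
use strict;
use warnings;
use MARC::Record;
use MARC::Field;

# tag+subfield -> decoded column name (illustrative; Koha reads this
# mapping from marc_*_structure, not from a hash).
my %map = ( '200a' => 'title', '200f' => 'author' );

my $record = MARC::Record->new();
$record->append_fields(
    MARC::Field->new('200', ' ', ' ', a => 'Candide', f => 'Voltaire'));

my %koha;
for my $key (keys %map) {
    my ($tag, $sub) = ($key =~ /^(\d{3})(.)$/);
    my $val = $record->subfield($tag, $sub);
    $koha{ $map{$key} } = $val if defined $val;
}
print "$koha{title} / $koha{author}\n";   # Candide / Voltaire
```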
SearchMarc.pm :
---------------
The sub that does the search is catalogsearch. The sub parameters :
$dbh, => DB handler
$tags, => array with MARC tags/subfields : ['200a','600f']
$and_or, => array with 'and' or 'or' for each search term
$excluding, => array with 'not' for each search term that must be excluded
$operator, => =, <=, >=, ..., contains, start
$value, => the value to search. Can have a * or % at end of each word
$offset, => the offset in the complete list, to return only needed info
$length, => the number of results to return
$orderby, => how to order the search
$desc_or_asc, => order asc or desc
$sqlstring => an alternate sqlstring that can replace
tags/and_or/excluding/operator
catalogsearch retrieves a list of bibids, then, for each bibid to
return, finds the interesting values (the values shown in the result
list). This includes the complete item status (available, issued, not
for loan...).
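For example, here is how a caller would assemble those parallel arrays
for a query like "200$a contains 'perl' AND 200$f starts with 'wall'"
(UNIMARC tags; the actual call is commented out since it needs a live DB
handle, and the return values shown are an assumption):

```perl
use strict;
use warnings;

# Parallel arrays, one entry per search term, as described above.
my $tags      = ['200a', '200f'];        # MARC tag+subfield per term
my $and_or    = ['and',  'and'];
my $excluding = ['',     ''];            # no excluded terms
my $operator  = ['contains', 'start'];
my $value     = ['perl', 'wall'];

# my ($results, $total) = C4::SearchMarc::catalogsearch(
#     $dbh, $tags, $and_or, $excluding, $operator, $value,
#     0, 20,                 # offset, length
#     'biblio.title', 'ASC'  # orderby, desc_or_asc
# );
print scalar(@$tags), " search terms\n";   # 2 search terms
```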
move to ZEBRA
=============
DB structure : biblio handling
------------------------------
I think we can remove marc_biblio, marc_subfield_table, and marc_word
(of course).
marc_biblio contains only one important piece of information, the
framework (used to know which cataloguing form Koha must use in the MARC
editor). It can be moved to the biblio table.
marc_subfield_table contains the MARC data. We could either store it in
raw iso2709 format in the biblio table, or only in Zebra. I suspect it's
better to store it twice (in Zebra AND in the biblio table). When you do
a search (in Zebra), you enter a query and get a list of results. This
list can be built with the data returned by Zebra. Then the user clicks
on a given biblio to see the detail.
Here, we can read the raw MARC from the biblio table, and thus avoid a
needless Zebra call (at the price of an SQL query, but it's based on the
primary key, so as fast as possible).
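The round trip for that "raw iso2709 in the biblio table" idea is
straightforward with MARC::Record: as_usmarc() to serialize on save,
new_from_usmarc() to rebuild on the detail page (a minimal sketch,
with an illustrative record):

```perl
use strict;
use warnings;
use MARC::Record;
use MARC::Field;

my $record = MARC::Record->new();
$record->leader('00000nam a22000007a 4500');
$record->append_fields(
    MARC::Field->new('200', ' ', ' ', a => 'Candide'));

my $iso2709 = $record->as_usmarc();        # what would go in the biblio table
my $back    = MARC::Record->new_from_usmarc($iso2709);
print $back->subfield('200', 'a'), "\n";   # Candide
```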
marc_word is completely dropped, as search is done with Zebra.
DB structure : items handling
-----------------------------
Item info can be stored in the same structure as for biblios: save the
raw item MARC data in the items table.
Koha <=> Zebra
--------------
It should really not be a pain to move to Zebra with this structure:
every call with a MARC::Record (the NEWxxxyyyy subs) manages the storing
of the MARC::Record in the marc_* tables. We could replace this code
with a Zebra insert/update, using biblio.biblionumber as the primary key.
How to manage biblios and items? My idea here would be to store the
biblio + all item information in Zebra, using a full MARC::Record that
contains both biblio and items.
When NEWnewitem (or NEWmoditem) is called, the full biblio MARC::Record
is rebuilt from the biblio MARC::Record and all item MARC::Records, and
updated in Zebra. It can be a little CPU-consuming to update Zebra every
time an item is modified, but it should not be too much: in libraries,
biblios & items don't change that often.
So we would have :
NEWnewbiblio :
* create biblio/biblioitems table entries (including the MARC record in
raw format)
* create the Zebra entry, with the provided Perl API.
NEWnewitem :
* create the items entry (including the MARC record in raw format)
* read the biblio MARC record & previously existing items // append the
new item // update the Zebra entry with the provided Perl API.
NEWmodbiblio :
* modify the biblio entry (in biblio/biblioitems table)
* read the full MARC record (including items) // update Zebra entry
NEWmoditem :
* modify the item entry (in items table)
* read the full MARC record (including items) // update Zebra entry
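The "rebuild then push" step above could look like this: append each
item's 995 field to a clone of the biblio record, then send the result
to Zebra. The record-merging part runs as-is; the ZOOM extended-services
update at the end is commented out and its option names are assumptions,
to be checked against the actual Zebra Perl API:

```perl
use strict;
use warnings;
use MARC::Record;
use MARC::Field;

# Biblio record plus two item records (one 995 field each, as above).
my $biblio = MARC::Record->new();
$biblio->append_fields(MARC::Field->new('200', ' ', ' ', a => 'Candide'));

my @item_fields = (
    MARC::Field->new('995', ' ', ' ', k => 'PQ-001'),
    MARC::Field->new('995', ' ', ' ', k => 'PQ-002'),
);

# Rebuild the full record without touching the stored biblio-only record.
my $full = $biblio->clone();
$full->append_fields(@item_fields);

my @items_in_full = $full->field('995');
print scalar(@items_in_full), " items in the full record\n";

# Then update Zebra (ZOOM extended services; names/options assumed):
# use ZOOM;
# my $conn = ZOOM::Connection->new('localhost:9999/biblios');
# my $p = $conn->package();
# $p->option(action => 'specialUpdate');
# $p->option(record => $full->as_usmarc());
# $p->send('update');
# $p->send('commit');
```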
Note that this relies on Zebra returning results in iso2709. We could
use XML or other fancier possibilities, but Koha makes heavy use of
MARC::Record, so we don't need to reinvent the wheel.
What is great with Zebra is that we can index iso2709 data but show
users whatever we want (including XML). So Koha internals can be
whatever ;-)
The MARC Editor
===============
Some users think the Koha MARC editor could be improved. The best
solution would be, imho, to provide an API so a library can use an
external MARC editor if it prefers.
However, some libraries are happy with what exists, so the MARC editor
should be kept (& improved where possible), and the marc_*_structure
tables are still needed. Some fields could probably be removed, as they
are related to search (like seealso) and will be handled by the Zebra
config file. This still has to be investigated.
For libraries that prefer an external MARC editor, we could create a
webservice, where the user does an HTTP request with iso2709 data and
the requested operation as parameters.
This should be quite easy to do (the problem being to know how the
external software can handle this; if someone has an idea or experience
on this, feel free to post here ;-) )
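The core of such a webservice, once the HTTP layer has extracted the
parameters, could be a small dispatch sub like this (a rough sketch:
the sub name, parameter names, and return convention are all assumptions,
not an existing Koha API):

```perl
use strict;
use warnings;
use MARC::Record;

# Takes the requested operation and the raw iso2709 payload; returns an
# HTTP-style status code and message. Real dispatch to NEWnewbiblio /
# NEWmodbiblio would happen where the comment is.
sub handle_marc_request {
    my ($op, $raw_iso2709) = @_;
    return (400, 'missing record') unless $raw_iso2709;
    my $record = eval { MARC::Record->new_from_usmarc($raw_iso2709) };
    return (400, 'unparsable record') unless $record;
    # ... dispatch on $op to NEWnewbiblio / NEWmodbiblio / etc. ...
    return (200, 'OK: ' . ($op || 'noop'));
}

my ($code, $msg) = handle_marc_request('newbiblio', undef);
print "$code $msg\n";   # 400 missing record
```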
Data search
===========
I won't speak a lot about search, as someone else has taken the ball on
this ;-) I just think SearchMarc.pm should be deeply modified! As all
the information will be in Zebra, it can use only the Zebra search API.
One question remains:
in a biblio/item, the item status (issued, transferred, returned,
reserved, waiting...) changes quite often. So is it better to save the
status in the Zebra DB, and thus update the Zebra entry (biblio+items)
every time an item status is modified, or is it better to keep this
information only in the items/reserve/issues tables & read it from MySQL
every time it's needed?
An open question that the Zebra guys can probably answer. NPL has, for
example, 600,000 issues per year (and hopefully 600,000 returns ;-) ),
plus some (how many?) reserves, branch transfers...
The authority problem
=====================
Authorities have to be linked to the biblios that use them, so that when
an authority is modified, all biblios using it are automatically
modified (script misc/merge_authority.pl in Koha cvs & 2.2.x).
To keep track of the link, Koha uses a $9 local subfield. In UNIMARC,
the $3 can also be used for this. I don't know if something equivalent
to $3 exists in MARC21 (could not find information on
http://www.loc.gov/marc/).
Many scripts make heavy use of the marc_subfield_table $9 data. For
example, when you find an authority in the authority module, you get the
number of biblios using this authority. This number is calculated with
an SQL query on the $9 subfield.
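The 2.2 count looks roughly like this (the column names are from the 2.2
schema as described above, and should be double-checked; the DBI calls
are commented out since they need a live Koha database):

```perl
use strict;
use warnings;

# Count biblios whose $9 subfield points at a given authority number.
# Column names (bibid, subfieldcode, subfieldvalue) are assumptions
# based on the 2.2 marc_subfield_table layout.
my $sql = q{
    SELECT COUNT(DISTINCT bibid)
      FROM marc_subfield_table
     WHERE subfieldcode  = '9'
       AND subfieldvalue = ?
};

# use DBI;
# my $dbh   = DBI->connect('dbi:mysql:Koha', $user, $pass);
# my ($cnt) = $dbh->selectrow_array($sql, undef, $authid);
print "query ready\n";
```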
To handle this with Zebra, we have 2 solutions:
- create a table with just the link (biblionumber / authority number)
that we could query
- query Zebra for the exact $9 subfield value
I don't know Zebra well enough to be sure of the best way to do it. Any
suggestion/experience is welcome.
The authority problem (another one...)
======================================
Authorities are MARC::Records too... (without items)
So they also have auth_structure & auth_word & all the info that biblios
have (except the items level, as there are no "authority" items).
So we could imagine having 2 Zebra databases: one for biblios and one
for authorities.
Everything previously in this mail can be copied here. That's something
we could investigate after moving MARC biblios to Zebra, as we would
have more experience with the tool.
"Trivial" querying
==================
Someone may ask: "why should we keep the biblio/biblioitems/items
tables, as everything is in Zebra?"
First, as Koha is multi-MARC, remember that it's very complex to know
what a "title" is just from a MARC record.
The same guy will ask: "yes, but with Biblio/MARCmarc2koha, you can
transform your MARC::Record into a semantically meaningful hash".
My answer to this: yes, but without those tables, SQL-querying the
database would be completely impossible for developers, as we could not
know in MySQL "whether the authors were filled by bulkmarcimport", or
"whether the itemcallnumber was correctly modified for item #158763".
That's a second reason to keep those tables in MySQL.
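The itemcallnumber check above, for instance, stays a one-line query as
long as the decoded tables exist (column names from the 2.2 schema; the
DBI calls are commented out since they need a live Koha database):

```perl
use strict;
use warnings;

# Ad-hoc developer check: what is the callnumber of item #158763?
# Impossible in pure Zebra without decoding MARC; trivial in MySQL.
my $sql = q{
    SELECT itemcallnumber
      FROM items
     WHERE itemnumber = ?
};

# use DBI;
# my $dbh          = DBI->connect('dbi:mysql:Koha', $user, $pass);
# my ($callnumber) = $dbh->selectrow_array($sql, undef, 158763);
print "query ready\n";
```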
--
Paul POULAIN
Independent consultant in free software
French-language coordinator of Koha (free ILS, http://www.koha-fr.org)