[Koha-devel] inverted list Proof of Concept [quest for search]


From: Paul POULAIN
Subject: [Koha-devel] inverted list Proof of Concept [quest for search]
Date: Fri May 27 03:01:35 2005
User-agent: Mozilla Thunderbird 1.0 (X11/20041206)

Hi,

After some chat with Kados, I've spent some time writing a proof of concept for my inverted list idea. It's not perfect, but it works quite well. I committed it to CVS a few minutes ago (files misc/build_marc_Tword.pl and C4::SearchMarcTest.pm).

NPL should run it on their DB, which is really huge, and report how it performs.

How it works :
===============
* create the table marc_Tword with the following structure :
CREATE TABLE `marc_Tword` (
  `word` varchar(80) NOT NULL default '',
  `usedin` text NOT NULL,
  `tagsubfield` varchar(4) NOT NULL default '',
  PRIMARY KEY  (`word`,`tagsubfield`)
) TYPE=MyISAM;
* open a console and export PERL5LIB and KOHA_CONF as usual.
* fill this table with misc/build_marc_Tword.pl. Warning: this script uses a very memory-hungry but very fast method to fill the table: it builds everything in memory, then writes everything in one go. Another method is provided (commented out), but it is about 100 times slower (really!)
* open opac-search.pl and replace
        use C4::SearchMarc;
by
        use C4::SearchMarcTest;
As the API hasn't changed, it will work immediately.
* go to opac-search (advanced search) and search for whatever you want. It should work fine (except for "Any word", which is not implemented for the moment)
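
The in-memory build step described above can be sketched roughly as follows. This is only an illustration in Python, not the Perl of build_marc_Tword.pl: the input row layout (bibid, tagsubfield, value), the tokenizer, the contents of IGNORE_LIST and the comma-separated encoding of `usedin` are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop words; the real script keeps its own %ignore_list.
IGNORE_LIST = {"the", "a", "of", "and"}


def build_inverted_list(rows):
    """Build {(word, tagsubfield): sorted list of bibids} entirely in memory.

    rows is an iterable of (bibid, tagsubfield, value) tuples, an assumed
    simplification of what marc_subfield_table holds.
    """
    index = defaultdict(set)
    for bibid, tagsubfield, value in rows:
        for word in re.findall(r"\w+", value.lower()):
            if word not in IGNORE_LIST:
                index[(word, tagsubfield)].add(bibid)
    return {key: sorted(bibids) for key, bibids in index.items()}


def to_table_rows(index):
    """Flatten the index into (word, usedin, tagsubfield) rows, one per key.

    usedin is encoded here as a comma-separated string of bibids (assumed).
    """
    return [
        (word, ",".join(str(b) for b in bibids), tagsubfield)
        for (word, tagsubfield), bibids in sorted(index.items())
    ]


# Tiny made-up sample; "200a" stands for a title subfield.
rows = [
    (1, "200a", "Social change in the city"),
    (2, "200a", "The social contract"),
    (3, "200a", "Change management"),
]
index = build_inverted_list(rows)
table_rows = to_table_rows(index)
```

Everything is accumulated in a dictionary first and only flattened into rows at the end, which is why this approach trades memory for speed: each (word, tagsubfield) pair is written once instead of being updated for every occurrence.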

LIMITS :
==========
* build_marc_Tword has problems with extended characters (mainly accented ones), so don't be alarmed if you get SQL errors. They are not a problem for a POC.
* search results are always ordered by title, whatever sort order you choose.
* search only handles WORDA and WORDB, not yet WORDA or WORDB, or WORDA except WORDB. Given the structure, those searches should cost exactly the same as the others.
You can test WORDA or WORDB very easily: in SearchMarcTest, at line 240,
replace
        @result = keys %intersect;
by
        @result = keys %union;
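
The boolean combinations mentioned above map directly onto set operations on the lists of biblio numbers. A minimal sketch in Python (the function and parameter names are illustrative, not those of SearchMarcTest, and the `usedin` values are assumed to be already decoded into lists of numbers):

```python
def combine(list_a, list_b, operator="and"):
    """Combine two lists of biblio numbers with a boolean operator.

    "and" keeps numbers present in both lists (intersection),
    "or" keeps numbers present in either list (union),
    "except" keeps numbers of list_a that are absent from list_b.
    """
    a, b = set(list_a), set(list_b)
    if operator == "and":
        result = a & b
    elif operator == "or":
        result = a | b
    elif operator == "except":
        result = a - b
    else:
        raise ValueError("unknown operator: %r" % operator)
    return sorted(result)


# Made-up biblio numbers for two words.
worda = [1, 2, 4]
wordb = [1, 3, 4, 5]
both = combine(worda, wordb, "and")
either = combine(worda, wordb, "or")
only_a = combine(worda, wordb, "except")
```

Each operator is a single pass over the two lists, which is consistent with the expectation that OR and EXCEPT searches cost about the same as AND.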

Some infos on perfs/size :
============================
On my largest DB (1,900,000 rows in marc_subfield_table), the build_marc_Tword.pl script requires 60s for the 1st SQL read, then 120s to read the whole table, then another 100s to write the table. It takes up to 350-400 MB of RAM, and at the end the marc_Tword table has 245,000 rows, for 100 MB of disk space. Note that we could probably lower this number with a more thorough %ignore_list.

Tests :
========
A search on a title word (socia*, with a *) that returns 4500 results completes in less than 2 seconds (with the MySQL cache empty).
A search on another word (chan*) that returns 590 results takes the same time.
A search on both words (socia* chan*) returns 446 results and takes the same time. Doing the same test on a 2.2 install is VERY different: my SCSI hard disk makes a lot of noise and the result for socia* chan* takes almost 20 seconds to appear.

Testing with an OR (chan* OR socia*) returns 4930 results, in probably less than 2 seconds.
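
For illustration, a truncated search such as socia* chan* could be resolved against a table shaped like marc_Tword along these lines. This is a sketch in Python with an in-memory SQLite database rather than Koha's MySQL and Perl; the use of LIKE for truncation, the comma-separated `usedin` encoding, the helper names and the sample rows are all assumptions, not a description of what SearchMarcTest actually does.

```python
import sqlite3


def lookup(conn, term, tagsubfield):
    """Return the set of bibids for one term; a trailing * means prefix match.

    The prefix is passed to LIKE as-is, so % or _ in a term are not escaped.
    """
    if term.endswith("*"):
        sql = "SELECT usedin FROM marc_Tword WHERE word LIKE ? AND tagsubfield = ?"
        params = (term[:-1] + "%", tagsubfield)
    else:
        sql = "SELECT usedin FROM marc_Tword WHERE word = ? AND tagsubfield = ?"
        params = (term, tagsubfield)
    bibids = set()
    for (usedin,) in conn.execute(sql, params):
        # usedin is assumed to be a comma-separated list of biblio numbers.
        bibids.update(int(b) for b in usedin.split(",") if b)
    return bibids


def search_all(conn, terms, tagsubfield):
    """AND search: keep only the bibids that match every term."""
    sets = [lookup(conn, term, tagsubfield) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE marc_Tword ("
    " word TEXT NOT NULL, usedin TEXT NOT NULL, tagsubfield TEXT NOT NULL,"
    " PRIMARY KEY (word, tagsubfield))"
)
# Made-up sample rows.
conn.executemany(
    "INSERT INTO marc_Tword (word, usedin, tagsubfield) VALUES (?, ?, ?)",
    [
        ("social", "1,2", "200a"),
        ("socialism", "4", "200a"),
        ("change", "1,3", "200a"),
        ("channel", "4,5", "200a"),
    ],
)
hits = search_all(conn, ["socia*", "chan*"], "200a")
```

With this layout, each search term costs one indexed read of a few rows, however many biblios contain the word; the remaining work is set arithmetic in memory.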

(Those results could probably be improved a little, as there are a lot of SQL queries to get item number & status & biblio subtitle.)

Note that in the logs, you'll get an entry for every sql->execute (with the SQL executed). You can see that most queries are not about the search itself, but about the item number & status.

--
Paul POULAIN
Independent free software consultant
French-speaking coordinator for Koha (free ILS, http://www.koha-fr.org)


