Re: [Koha-zebra] Koha Zebra Searching Report (from NPL)
From: Sebastian Hammer
Subject: Re: [Koha-zebra] Koha Zebra Searching Report (from NPL)
Date: Wed, 22 Mar 2006 22:43:40 -0500
User-agent: Mozilla Thunderbird 1.0.7 (Macintosh/20050923)
Joshua Ferraro wrote:
On Wed, Mar 22, 2006 at 08:28:26PM -0500, Sebastian Hammer wrote:
Can't do XOR today. I suppose it would be a possible new feature, but
I've frankly never heard of it in an ILS.. can a XOR b be mapped to
(a OR b) NOT (a AND b) ? or am I just showing my fading math skills to
ill effect, here?
Yep, that's the correct mapping. Voyager's where NPL originally
saw the XOR function.
Ok. It can be faked in the front-end then, or implemented deeper in the
guts of Zebra.
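A front-end rewrite of this kind could look roughly like the sketch below. It assumes the query is handed to Zebra as PQF, where @not is the binary and-not operator; the function names are made up for illustration and multi-word terms would need quoting.

```python
from itertools import product


def xor_to_pqf(a: str, b: str) -> str:
    """Rewrite 'a XOR b' as (a OR b) NOT (a AND b) in PQF prefix notation."""
    return f"@not @or {a} {b} @and {a} {b}"


def xor_via_mapping(a: bool, b: bool) -> bool:
    """Boolean form of the same mapping, used to check it against real XOR."""
    return (a or b) and not (a and b)


# The mapping agrees with XOR on every combination of inputs.
assert all(
    xor_via_mapping(a, b) == (a != b)
    for a, b in product([False, True], repeat=2)
)

print(xor_to_pqf("joyce", "ulysses"))
```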
Why do you see yourself limited to Bib-1? Within Koha, you can do
whatever you want -- specifically extend Bib-1 into the 8000-range
(IIRC) for local USE attributes or define a private set.
Right, I was just hoping there was some way to map it to bib-1 as
I assume that would be useful in cross-domain searching. If not we
can certainly do a locally defined attribute or set.
I think beyond what's in the Bath profile or the US national profile,
you have little hope of interoperable search.. in my experience,
cross-domain searching still entails the need to do query-mapping
independently per target or for groups of targets with similar
characteristics. I use the CCL parser that's available through the YAZ
ZOOM API, and include a reference to a set of mapping directives as part
of the configuration for each target.. that allows you to get pretty far
towards an interoperable-feeling search with a minimum of code.
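The per-target mapping idea can be illustrated with a small sketch. This is not the YAZ CCL parser or its configuration format, just a plain-Python stand-in: the target names and the `to_pqf` helper are hypothetical, and the only real values are the Bib-1 use attributes 4 (title), 1003 (author) and 1016 (any).

```python
# Hypothetical per-target tables: search qualifier -> Bib-1 attribute prefix.
TARGET_MAPPINGS = {
    # A target that supports separate title and author indexes.
    "full_target": {
        "ti": "@attr 1=4",
        "au": "@attr 1=1003",
        "any": "@attr 1=1016",
    },
    # A more limited target that only offers a general keyword index.
    "keyword_only_target": {
        "any": "@attr 1=1016",
    },
}


def to_pqf(target: str, qualifier: str, term: str) -> str:
    """Build a PQF query for one target from a qualifier and a search term."""
    mapping = TARGET_MAPPINGS[target]
    # Fall back to the target's general index when a qualifier is unsupported.
    attrs = mapping.get(qualifier, mapping["any"])
    return f'{attrs} "{term}"'


print(to_pqf("full_target", "ti", "ulysses"))
print(to_pqf("keyword_only_target", "ti", "ulysses"))
```

The same user query ends up as a different PQF string per target, which is the whole point of keeping the mapping in configuration rather than in code.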
This would, I believe, require new development. It's possible that one
of the experimental ranking algorithms that are included might provide
better results for these people, but I *think* that boosting the score
for one field in a ranked keyword search would require an extension to
the index structure.
I've looked high and low for documentation on the ranking algorithms in
Zebra but haven't found much more than a few sentences in the official
docs and some list messages ...
It isn't documented beyond what's in the code, AFAIK.
AUTHOR SEARCHING
Again, the current relevance ranking doesn't quite cut it. A good
example is a relevance ranked author search on "James Joyce". Some
records sneak into high relevance because they have multiple authors
with names like "James Henry" and "Paul Joyce" (take "Bob the Builder"
in the NPL database as an example).
It might be worth checking whether one of the custom ranking algos did
better on this.. you can look in the NEWS file for instructions on how to
enable them.
Will do.
relevance ranking
should account for proximity and use that as the highest ranking
consideration to ensure that a search on "James Joyce" returns all the
books by "James Joyce" first. Also, they requested that the default
ranking secondarily sort the items by date as well because they often
are asked to find the 'latest' book by so and so. We concluded that
the copyright date stored in the 008 is probably the only date
normalized enough to use for sorting though I'm not sure if zebra can
use that for sorting.
It could with the XSLT index rules of Zebra 1.4.
Cool, and are there docs on that somewhere? :-)
There will be by the time Zebra 1.4 is released. For now, it's
pre-release stuff. However, the CVS version of Zebra contains an example
setup under examples/alvis-oai/conf. I think for really gnarly indexing
schemes, this is probably the wave of the future, since it's pretty much
infinitely flexible. It should also be pretty easy to perl-map one of
the existing ABS files into this format.
Same thing. I don't know how hard it would be to add a score for
proximity.. that data is at least in the index structure, but I've no
idea how hard it would be to fit into the code. We can ask the Zebra
wranglers what it would entail if you're interested.
Yes, please do, we're very interested in that particular one.
Ok.
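To make the requested behaviour concrete, here is a toy sketch of the ordering NPL describes. It is not how Zebra ranks anything: it uses "all query terms occur in the same author field" as a crude stand-in for proximity, breaks ties by the number of matching terms, and then by year, newest first. The `Record` type and the sample records are invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    title: str
    authors: List[str]
    year: int


def tokens(text: str) -> set:
    """Lower-cased word tokens of a field value."""
    return set(re.findall(r"\w+", text.lower()))


def rank_key(record: Record, query: str):
    terms = tokens(query)
    # True when one single author field contains every query term.
    same_field = any(terms <= tokens(author) for author in record.authors)
    # Number of query terms found anywhere across the author fields.
    all_tokens = set().union(*(tokens(a) for a in record.authors))
    term_hits = len(terms & all_tokens)
    return (same_field, term_hits, record.year)


def rank(records: List[Record], query: str) -> List[Record]:
    """Sort by same-field match first, then term hits, then newest year."""
    return sorted(records, key=lambda r: rank_key(r, query), reverse=True)


records = [
    Record("Multi-author record", ["Henry, James", "Joyce, Paul"], 2005),
    Record("Dubliners", ["Joyce, James"], 1914),
    Record("Ulysses", ["Joyce, James"], 1922),
]
for record in rank(records, "James Joyce"):
    print(record.year, record.title)
```

The multi-author record matches both terms, but never in one field, so it sorts below both Joyce titles despite its later date.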
SUBJECT HEADING SEARCH
NPL would like to see a demonstration of a 'Subject Heading' search
using authorities generated from the data to compile a list of
authoritative headings (which would be compiled from multiple fields
within a given subject tag such as $650$a$v$x, etc.). So I think
to do this right we'd need to look at putting our authority records
in Zebra as well.
Hmm. Not sure I fully grok the requirement here.. you seem to suggest
both constructing a specific index key based on a concatenation of
multiple fields (easy in the XSLT indexing rules of 1.4, not compatible
with the 'melm' directive).
I'm unclear about the differences between 'elm' and 'melm'. The docs
seem to indicate that they are the same...
They are actually described as being quite different, but I can see how
the nature of the difference could be more clear.
The 'elm' directive is the original.. its parameter structure is based
on the way that Z39.50 abstract record models were typically represented
in the old days.. hence the weird ordering of elements, etc. It also has
the limitation that you can't address attributes, because the old Z39.50
record model didn't have attributes. The xelm directive was introduced
to fix that.. it allows you to express tag paths in the XPath style, and
to address attributes, either in [predicates] or directly, for indexing.
The usmarc.abs file that comes with Zebra assumes that records were
ingested in ISO2709 using the record type grs.marc.<absfilename>. The
grs.marc input filter actually generates an internal abstract structure
which is incompatible with MARCXML.. it looks more like
<245><11><a>content</a></11></245>. When MARCXML came along it became
clear that it'd be nicer to work with that.. so the grs.marcxml input
filter was introduced to parse ISO2709 records and map them internally to
MARCXML. Of course, if you're starting with MARCXML, you can just use
grs.xml with the same effect.
But now the old usmarc.abs file won't work anymore, because MARCXML is
all about attributes for field names and subfield codes, and the 'elm'
directive can't handle that... in fact, to index 245$a, you'd have to
write something like
xelm /*/datafield[@tag='245']/subfield[@code='a'] title
At some point, we got a bit of money from the LoC to develop a simple
set of Bath level 0 indexing rules for Zebra.. I started working on
that, but got so fed up with the syntax above that I rebelled and
implemented the 'melm' directive (and it takes a lot for me to touch the
innards of Zebra, in my old days), so instead of the above, I could write
melm 245$a title
Which is totally equivalent to the above, but nice and to the point..
however, none of these mechanisms allows you to construct phrase indexes
that span multiple subfields.. and they don't allow you to do cool stuff
like extract a date from the guts of 008... in fact, there are lots of
situations where you'd like to do some form of massaging on the input
before processing. In the past, I would sometimes translate MARC records
to an ASCII-line based format, and use the magic of the regexp input
filters (http://www.indexdata.com/zebra/doc/record-model.tkl#id2530050)
to massage the data at index/retrieval time... because I can write Tcl
code in the input filters to do stuff to the data, the sky is the
limit.. but, because I have to write Tcl code to accomplish anything, I
become sad and gray-haired. So when I build applications on Zebra these
days, I am more likely to do some form of preprocessing of the records
in Perl or similar BEFORE feeding them to Zebra.. not very satisfying,
but it brings home the bacon.
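A preprocessing step of that sort might look like the following sketch, written in Python rather than Perl. It assumes the records have already been parsed into plain strings and (code, value) pairs; the helper names and the " -- " separator are illustrative choices. The one fixed point is that Date 1 sits in character positions 07-10 of the 008 field.

```python
import re


def date1_from_008(field_008: str):
    """Return Date 1 (character positions 07-10 of the 008) as an int, or None."""
    candidate = field_008[7:11]
    # Dates with unknown digits (e.g. '19uu') are not usable as a sort key.
    return int(candidate) if re.fullmatch(r"\d{4}", candidate) else None


def subject_heading(subfields):
    """Join the subfields of one 650 field into a single phrase key."""
    return " -- ".join(value.strip(" .") for _code, value in subfields)


# A 40-character 008 whose Date 1 is 1922 (remaining positions left blank).
sample_008 = "060322s1922".ljust(40)
sample_650 = [("a", "Ireland"), ("x", "Social life and customs"), ("v", "Fiction.")]

print(date1_from_008(sample_008))
print(subject_heading(sample_650))
```

The extracted year and the combined heading could then be written into extra elements of the record before it is handed to Zebra for indexing.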
Well, in Zebra 1.4, XSLT comes to the rescue, in a way that only XSLT
can do it, with lots of angle brackets and much verbosity.... for
instance, in an XSLT index filter,
melm 245$a title:w
becomes
<xsl:template
match="marc:record/marc:datafield[@tag='245']/marc:subfield[@code='a']">
<z:index name="title" type="w">
<xsl:value-of select="."/>
</z:index>
</xsl:template>
Eek.
But of course the magic of that is that you could put just about
anything you could possibly imagine instead of that simple
<xsl:value-of> in the middle... using substring() to extract a date from
008, a code from the leader, combining subfields, doing math, looking
stuff up in supporting tables, etc... the sky is the limit, and I'd
prefer this to programming in Tcl anytime. And of course, if you want a
more compact configuration file, you could write something like
<koha:melm field="245$a" index="title:w"/>
and use XSLT to map that into the diatribe above before sending it to
Zebra.. we might even offer some options like that as part of the
software down the road. In addition to the stylesheet which maps records
to 'index documents' like above, Zebra 1.4 can be configured to support
multiple retrieval schemas (i.e. DC, MODS, MARCXML), simply by providing
stylesheets for each desired schema -- the translation is done on the
fly when records are retrieved.
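The compact-to-verbose expansion could also be done outside XSLT. The sketch below does it in Python instead, as a swapped-in illustration: it turns a 'melm'-style line into the corresponding template text. It assumes the standard MARCXML element names (datafield and subfield, with tag and code attributes), assumes the xsl, marc and z prefixes are declared in the surrounding stylesheet, and only handles specs of the form tag$code name:type. The function name is hypothetical.

```python
def melm_to_xslt(spec: str) -> str:
    """Expand a spec such as '245$a title:w' into an XSLT index template."""
    path, index = spec.split()
    tag, code = path.split("$")
    name, index_type = index.split(":")
    # XPath matching one subfield of one MARCXML datafield.
    match = (
        f"marc:record/marc:datafield[@tag='{tag}']"
        f"/marc:subfield[@code='{code}']"
    )
    return (
        f'<xsl:template match="{match}">\n'
        f'  <z:index name="{name}" type="{index_type}">\n'
        f'    <xsl:value-of select="."/>\n'
        f"  </z:index>\n"
        f"</xsl:template>"
    )


print(melm_to_xslt("245$a title:w"))
```

A list of such lines could be expanded in a loop and wrapped in a stylesheet element, which keeps the configuration as short as the old .abs files while still producing the verbose form.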
--Sebastian
Thanks!
--
Sebastian Hammer, Index Data
address@hidden www.indexdata.com
Ph: (603) 209-6853