[Koha-devel] Re: MARC character encoding


From: Ed Summers
Subject: [Koha-devel] Re: MARC character encoding
Date: Wed Jan 15 08:35:07 2003
User-agent: Mutt/1.3.28i

Hi Paul:

On Wed, Jan 15, 2003 at 03:20:47PM +0100, paul POULAIN wrote:
> Could someone explain how to translate the "MARC21" charset to a more 
> convenient one (and which is more convenient ?)
> Same question for UNIMARC (which is ISO646 if my docs are right)

If we lived in a perfect world we would all be using Unicode (UTF8)
since it covers so many of the world's scripts [1]. Unfortunately the
world is not perfect. MARC has been around longer than Unicode, so the
MARC-8 character encoding was created to allow non-Latin scripts to live
in MARC records. I guess the world has bigger problems than character encodings
(Mr George Bush comes to mind), but I'll leave that particular problem
alone :)

I wasn't aware that UNIMARC had defined a different standard for
character encoding. Isn't ISO646 just a synonym for ASCII? [2] Which docs 
describe the character sets used in UNIMARC?

> I tried MARC-Charset, which seems to translate from "MARC21" to UNICODE, 
> but i don't know what to do with my unicode ;-(

Yes, MARC::Charset is an implementation of the MARC-8 ==> Unicode
(UTF-8) mappings published by the Library of Congress. [3] In MARC-8
there is a special way of 'escaping' to other character sets (Hebrew,
Cyrillic, East-Asian, etc). 

> I tried :
>    my $charset = MARC::Charset->new();
>    print $charset->to_utf8($unimarc->as_formatted())."\n";
> where $unimarc is a MARC::Record containing a MARC21 record converted to 
> UNIMARC (hope everybody understand this : i mapped marc21 fields to 
> unimarc ones)

The to_utf8() method will take a string of characters (encoded in
MARC-8) and convert them to Unicode (UTF8). Initially I wanted to do
this so that MARC records could be expressed as XML with UTF8 encoding.

You mapped all the UNIMARC fields to MARC fields!?! I was under the
impression that this was quite a big undertaking to do completely. Is
your code currently checked into CVS? Having a UNIMARC filter in
MARC::Record (MARC::File::UNIMARC) has been a long term goal. Maybe we
could roll this work into the MARC::Record package?

I'm not sure how you want to handle character encoding in Koha. Does
MySQL properly handle Unicode (UTF8)? If it does I think long term
Koha should probably attempt to store everything in UTF8. Luckily UTF8
is backwards compatible with plain vanilla ASCII. Even if MySQL handles
storing UTF8, it would take some research to make sure DBD::mysql also
does.
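
To illustrate the backwards-compatibility point, here is a minimal sketch
using the Encode module. The sample string is made up for illustration; it
only shows that text limited to the ASCII range has exactly the same bytes
once encoded as UTF8, and says nothing about how MySQL or DBD::mysql behave.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: a string containing only plain ASCII characters.
my $ascii = "Koha";

# Encode the characters as UTF-8 bytes.
my $utf8 = encode('UTF-8', $ascii);

# For ASCII-only text the UTF-8 bytes are the same as the original bytes.
my $same = ($utf8 eq $ascii) ? 1 : 0;
print $same ? "identical bytes\n" : "different bytes\n";
```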

Otherwise, there needs to be some global config option that determines
Koha's character set encoding. There is the Encode [4] module (standard
with Perl 5.8.0) which could handle translating between a variety of
character encodings.
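
As a rough sketch of what that translation could look like, the snippet
below uses Encode's decode() and encode() functions to turn ISO-8859-1
bytes into UTF8 bytes. The input string and the choice of ISO-8859-1 as
the source encoding are assumptions for illustration only, not anything
Koha does today.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the bytes of "cafe" with an acute accent on the "e",
# in ISO-8859-1 (Latin-1), where that letter is the single byte 0xE9.
my $latin1_bytes = "caf\xE9";

# Step 1: decode the source bytes into a Perl character string.
my $chars = decode('iso-8859-1', $latin1_bytes);

# Step 2: encode those characters as UTF-8 bytes for storage or output.
my $utf8_bytes = encode('UTF-8', $chars);

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($utf8_bytes);
```

The accented letter takes one byte in ISO-8859-1 but two bytes (0xC3 0xA9)
in UTF8, so the same four characters occupy five bytes after conversion.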

Hope this helps more than it hurts :)

//Ed

[1] http://www.unicode.org
[2] http://czyborra.com/charsets/iso646.html
[3] http://www.loc.gov/marc/specifications/specchartables.html
[4] http://search.cpan.org/author/DANKOGAI/Encode-1.84/

-- 

% perl -MData::Dumper -e "print Dumper($me)"
$VAR1 = {
          'WEB' => 'http://www.inkdroid.org',
          'NAME' => 'Ed Summers',
          'AIM' => 'inkdroid',
          'EMAIL' => 'address@hidden'
        };
