Re: [Koha-devel] a charmed letter about characters....


From: Galen Charlton
Subject: Re: [Koha-devel] a charmed letter about characters....
Date: Thu, 31 Jan 2008 16:54:09 -0600

Hi,

On 1/31/08, Galen Charlton <address@hidden> wrote:
> As it happens, at this very moment I am working on some patches to
> improve character set conversion, including adding support for
> converting Latin-1 MARC records to UTF-8 from the command-line import
> jobs.  I should have something for you to test later today or
> tomorrow.

This patch (against the current 3.0 tip) is now available for review
at http://manage-gmc.dev.kohalibrary.com/patches/charset

This introduces a new module, C4::Charset, to centralize code required for MARC
character conversion in Koha.  From the commit message:

"IMPORTANT - refactor MARC character set handling

Created a new module, C4::Charset, to centralize code
for converting MARC records to UTF8.  This module has three
exported functions:

* IsStringUTF8ish - determine if scalar contains a string in UTF8
* MarcToUTF8Record - convert MARC blob or MARC::Record to UTF8
* SetMarcUnicodeFlag - set appropriate MARC21 or UNIMARC field to
  indicate that record is in UTF-8.

Design points of this module include:

* No dependencies on other C4 modules, making it easier to add
  more test cases
* All character conversion code in one place
* Single entry point for doing a character conversion on a
  MARC record
* Capture of errors and warnings produced by Text::Iconv
  and MARC::Charset
* Start of support for guessing the source character set of
  a MARC record.

Several functions were moved from other scripts
or modules to C4::Charset:

* C4::Koha->FixEncoding (expanded and renamed
  MarcToUTF8Record)
* C4::Koha->char_decode5426
* fMARC8ToUTF8 from bulkmarcimport.pl (renamed
  _marc_marc8_to_utf8)

Several batch jobs were adjusted to use MarcToUTF8Record instead of
FixEncoding."
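
To make the first and third exported functions a little more concrete,
here is a rough Python sketch. It is not the Perl code in the patch. The
names is_utf8ish and set_marc21_unicode_flag are made up for this
example, and I am assuming that "UTF8ish" simply means "decodes cleanly
as UTF-8". The MARC21 part relies only on leader position 09, where 'a'
marks a record as UCS/Unicode; the UNIMARC case is left out.

```python
def is_utf8ish(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (assumed meaning of 'UTF8ish')."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def set_marc21_unicode_flag(leader: str) -> str:
    """Return a copy of a 24-character MARC21 leader with position 09 set to 'a'."""
    if len(leader) != 24:
        raise ValueError("MARC21 leader must be exactly 24 characters")
    # Leader/09 is the character coding scheme: blank = MARC-8, 'a' = UCS/Unicode.
    return leader[:9] + "a" + leader[10:]
```

Note that pure ASCII input passes the check as well, since ASCII is a
subset of UTF-8, so a check like this cannot by itself tell an unconverted
ASCII-only MARC-8 record from a UTF-8 one.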

As one of the effects of this patch, when the source character set of
a MARC record is not known (as is currently the case with
bulkmarcimport), MarcToUTF8Record will try converting the record from
MARC-8 and, if that results in errors, from Latin-1.  However, I also
intend to add an option to bulkmarcimport to explicitly specify the
source encoding.
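
The fallback order described above can be sketched roughly as follows.
This is Python for illustration only, not the Perl in C4::Charset. The
function convert_with_fallback and both converters are hypothetical; in
particular, strict_ascii is only a stand-in that fails on high bytes and
does NOT implement MARC-8.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Converter = Callable[[bytes], str]


def convert_with_fallback(
    raw: bytes, candidates: Sequence[Tuple[str, Converter]]
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try each (name, converter) in order; return (text, name_used, errors)."""
    errors: List[str] = []
    for name, convert in candidates:
        try:
            return convert(raw), name, errors
        except (UnicodeError, ValueError) as exc:
            # Record the failure and move on to the next candidate encoding.
            errors.append(f"{name}: {exc}")
    return None, None, errors


def strict_ascii(raw: bytes) -> str:
    """Stand-in for a MARC-8 converter (NOT real MARC-8): rejects bytes >= 0x80."""
    return raw.decode("ascii")


def from_latin1(raw: bytes) -> str:
    """Latin-1 maps every byte to a character, so this never raises."""
    return raw.decode("latin-1")


text, used, errors = convert_with_fallback(
    b"caf\xe9", [("marc8-stand-in", strict_ascii), ("latin-1", from_latin1)]
)
```

Because a Latin-1 decode accepts any byte sequence, putting it last
means the chain always produces something, which is one reason an
explicit source-encoding option is still worth having.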

I will add more test cases as I debug the module, so if any of you
run into problems with character conversion, please file bugs or send
me samples of the MARC records in question.

Regards,

Galen
-- 
Galen Charlton
Koha Application Developer
LibLime
address@hidden
p: 1-888-564-2457 x709



