Re: [Koha-devel] a charmed letter about characters....
From: Galen Charlton
Subject: Re: [Koha-devel] a charmed letter about characters....
Date: Thu, 31 Jan 2008 16:54:09 -0600
Hi,
On 1/31/08, Galen Charlton <address@hidden> wrote:
> As it happens, at this very moment I am working on some patches to
> improve character set conversion, including adding support for
> converting Latin-1 MARC records to UTF-8 from the command-line import
> jobs. I should have something for you to test later today or
> tomorrow.
This patch (against the current 3.0 tip) is now available for review
at http://manage-gmc.dev.kohalibrary.com/patches/charset
The patch introduces a new module, C4::Charset, to centralize the code
required for MARC character conversion in Koha. From the commit message:
"IMPORTANT - refactor MARC character set handling
Created a new module, C4::Charset, to centralize code
for converting MARC records to UTF8. This module has three
exported functions:
* IsStringUTF8ish - determine if scalar contains a string in UTF8
* MarcToUTF8Record - convert MARC blob or MARC::Record to UTF8
* SetMarcUnicodeFlag - set appropriate MARC21 or UNIMARC field to
indicate that record is in UTF-8.
Design points of this module include:
* No dependencies on other C4 modules, making it easier to add
more test cases
* All character conversion code in one place
* Single entry point for doing a character conversion on a
MARC record
* Capture of errors and warnings produced by Text::Iconv
and MARC::Charset
* Start of support for guessing the source character set of
a MARC record.
Several functions were moved from other scripts
or modules to C4::Charset:
* C4::Koha->FixEncoding (expanded and renamed
MarcToUTF8Record)
* C4::Koha->char_decode5426
* fMARC8ToUTF8 from bulkmarcimport.pl (renamed
_marc_marc8_to_utf8)
Several batch jobs were adjusted to use MarcToUTF8Record instead of
FixEncoding."
One effect of this patch: when the source character set of a MARC
record is not known (as is the case with bulkmarcimport at present),
MarcToUTF8Record will first try converting the record from MARC-8 and,
if that produces errors, from Latin-1. I also intend to add an option
to bulkmarcimport to specify the source encoding explicitly.
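
To make the fallback order concrete, here is a rough, language-neutral
sketch in Python. It is not the code from C4::Charset: the function
names, the return shape, and the ASCII-only stand-in for a MARC-8
decoder are all assumptions made for illustration (Python's standard
library has no MARC-8 codec). It only shows the idea of trying
candidate encodings in order and collecting the errors along the way.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Decoder = Callable[[bytes], str]


def is_utf8ish(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def decode_marc8_stand_in(data: bytes) -> str:
    """Hypothetical placeholder for a real MARC-8 decoder.

    It accepts plain ASCII only and raises on anything else, which is
    enough to demonstrate the fallback path.
    """
    return data.decode("ascii")


def decode_latin1(data: bytes) -> str:
    """Latin-1 maps every byte to a character, so this never fails."""
    return data.decode("latin-1")


def to_utf8_with_fallback(
    data: bytes, candidates: Sequence[Tuple[str, Decoder]]
) -> Tuple[Optional[bytes], Optional[str], List[str]]:
    """Try each (name, decoder) in order.

    Returns (utf8_bytes, name_of_decoder_used, errors_collected).
    If every candidate fails, the first two items are None.
    """
    errors: List[str] = []
    for name, decoder in candidates:
        try:
            text = decoder(data)
        except ValueError as exc:  # UnicodeError is a ValueError subclass
            errors.append(f"{name}: {exc}")
            continue
        return text.encode("utf-8"), name, errors
    return None, None, errors


CANDIDATES = [
    ("marc8-stand-in", decode_marc8_stand_in),
    ("latin-1", decode_latin1),
]
```

Since Latin-1 accepts any byte sequence, it only makes sense as the
last candidate; a wrong guess there yields mojibake rather than an
error, which is one reason an explicit source-encoding option is worth
having.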
I will add more test cases as I debug the module, so if any of you
run into problems with character conversion, please file bugs or send
me samples of the MARC records in question.
Regards,
Galen
--
Galen Charlton
Koha Application Developer
LibLime
address@hidden
p: 1-888-564-2457 x709