aspell-user
Re: [Aspell-user] aspell-<LANG>: Invalid UTF-8 sequence at position...


From: Martin Swift
Subject: Re: [Aspell-user] aspell-<LANG>: Invalid UTF-8 sequence at position...
Date: Sat, 3 Mar 2007 21:16:08 +0900
User-agent: Mutt/1.5.13 (2006-08-11)

On Sat, Mar 03, 2007 at 04:29:15AM -0700, Kevin Atkinson wrote:
> The word list is likely in iso-8859-1 but Aspell expects it in utf-8. 

Indeed:

  # file de*
  de_affix.dat:   ISO-8859 text
  de_AT.multi:    ASCII text
  de_AT-only.cwl: data
  de_CH.multi:    ASCII text
  de_CH-only.cwl: data
  de-common.cwl:  data
  de.dat:         ASCII text
  de_DE.multi:    ASCII text
  de_DE-only.cwl: data
  de.multi:       ASCII text
  de_phonet.dat:  ISO-8859 English text
  deutsch.alias:  ASCII text
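
To make the mismatch concrete, here is a minimal Python sketch of why ISO-8859-1 data trips a UTF-8 reader. It is only an illustration, not anything Aspell itself runs; the helper name `first_invalid_utf8_offset` and the sample word are my own assumptions.

```python
# Illustration only: a hypothetical helper, not part of Aspell.
# "M\u00fcller" is "Mueller" written with u-umlaut; in ISO-8859-1 the
# umlaut is the single byte 0xFC, which can never start a UTF-8 sequence.
LATIN1_SAMPLE = "M\u00fcller".encode("iso-8859-1")


def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return err.start
    return None


if __name__ == "__main__":
    # The Latin-1 bytes are rejected as UTF-8 ...
    print(first_invalid_utf8_offset(LATIN1_SAMPLE))
    # ... but decode cleanly when the right encoding is named.
    print(LATIN1_SAMPLE.decode("iso-8859-1"))
```

A word list that is pure ASCII passes either way, which is presumably why only entries with accented letters trigger the error.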

> Your locale settings _should_ not have an effect here.  What does have an 
> effect is the setting in the language data file "de.dat", in particular 
> "data-encoding".  See
>   http://aspell.net/man-html/The-Language-Data-File.html

From that page:

  data-encoding

    The encoding the language data files are expected to be in as well
    as the default encoding to use when saving the personal
    dictionaries. It can be either `utf-8' or any of the 8-bit
    encoding that Aspell supports. If not set, then it defaults to
    charset.

I hope not to offend, but I found that paragraph a little terse.

 * Should it be: "The encoding *of* the language data files"?
 * "are expected to be in as well as..." Expected to be in what?
 * Should it be: "as well as the default encoding *used* when saving"

Does this mean that aspell expects the word lists to have the same
charset as the machine? Isn't that a little odd?

de.dat sets 'charset' as iso-8859-1:

  # cat de.dat 
  # Generated with Aspell Dicts "proc" script version 0.50.1
  name de
  charset iso-8859-1
  soundslike de
  affix      de

Does aspell not use this to determine the charset? If not, /shouldn't/
it?
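
Read literally, the manual text quoted above describes a simple fallback: use `data-encoding` if it is set, otherwise `charset`. A minimal Python sketch of that reading follows. The parser is a simplification I am assuming for illustration (whitespace-separated key and value, `#` lines skipped); it is not Aspell's actual code, and the names `parse_dat` and `effective_data_encoding` are hypothetical.

```python
# Sketch of the fallback the manual describes; NOT Aspell's real parser.
DE_DAT = """\
# Generated with Aspell Dicts "proc" script version 0.50.1
name de
charset iso-8859-1
soundslike de
affix      de
"""


def parse_dat(text):
    """Parse 'key value' lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def effective_data_encoding(settings):
    """Return data-encoding if present, else fall back to charset."""
    return settings.get("data-encoding", settings.get("charset"))


if __name__ == "__main__":
    print(effective_data_encoding(parse_dat(DE_DAT)))
```

Under that reading, the de.dat above would give iso-8859-1 as the expected encoding of the word lists, which is what makes the UTF-8 error surprising to me.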

I just tried

  /usr/bin/prezip-bin -d < de-common.cwl \
    | /usr/bin/aspell --lang=de create --encoding=iso8859-1 master ./de-common.rws

which completed without any errors, producing de-common.rws. As it is
quite late here in Japan, I don't have any more time tonight to work
on this.

A couple of questions:

  Is this going to conflict with my machine's character encoding, or
has aspell created an rws file for a utf-8 system?

  Is the machine character encoding check a feature? Since one might
attempt to install the same wordlist on machines with different
character encodings, it really seems prone to failure.

-- 
✌



