classpath

Performing Encoding Discovery


From: Craig Combs
Subject: Performing Encoding Discovery
Date: Fri, 9 Sep 2005 11:50:18 -0700 (PDT)

Does anybody have a better way of doing this?
 
I would like to search a set of documents, but the documents can be UTF-8, Shift_JIS, or US-ASCII. To index the files correctly I need to know the encoding of each file so that I can convert it to Unicode: search queries are converted to Unicode, so the query and the indexed document text must end up in the same encoding.
 
My only thought is to take a sample of bytes and then convert the bytes for each language to build a probability matrix of the encodings. I was planning to use the default encoding of the system, and the languages defined for the directory for stemming and stop-word analysis, as hints to the encoding.
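
As a rough illustration of the sampling idea (not something gnu.java.nio provides): a simpler stand-in for a full probability matrix is to run the sample through a strict decoder for each candidate charset and keep the ones that decode without error. The class name EncodingGuess, the method plausibleCharsets, and the candidate order are assumptions made for this sketch; only the java.nio.charset calls are standard API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

public class EncodingGuess {
    // Hypothetical candidate list; the most restrictive charset comes first.
    static final String[] CANDIDATES = { "US-ASCII", "UTF-8", "Shift_JIS" };

    /** Returns the candidate charsets that decode the sample without error. */
    public static List<String> plausibleCharsets(byte[] sample) {
        List<String> result = new ArrayList<String>();
        for (String name : CANDIDATES) {
            if (!Charset.isSupported(name)) {
                continue; // this runtime does not ship the charset
            }
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            CharsetDecoder decoder = Charset.forName(name).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(sample));
                result.add(name);
            } catch (CharacterCodingException e) {
                // The sample is not valid in this charset; rule it out.
            }
        }
        return result;
    }
}
```

This only rules encodings out. Several candidates can survive (pure ASCII text is valid in all three), so the hints mentioned above would still be needed to pick among them. A sample cut in the middle of a multi-byte character is also reported as malformed, so the sample boundary needs some care.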
 
Does gnu.java.nio have any functions for giving back probable charsets of a byte stream?  Otherwise I suppose I'll just write my own.
 
-Craig

Meskauskas Audrius <address@hidden> wrote:

Craig Combs wrote:

> Two questions:
>
> 1) If I read this correctly, the classes in the package will determine
> the encoding of the input and convert it to an encoding that I
> specify, without my needing to know the original encoding of the file? Of
> course, that depends on a match being found; if none is found, I assume it
> throws an exception or defaults to the system encoding.

The FileReader reads using the platform's default encoding. If you need to
read a file using some other charset, read from

new InputStreamReader(new FileInputStream(myFile), Charset.forName("my
charset") ).
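
A minimal sketch of reading a whole file with an explicitly named charset, for illustration only. InputStreamReader wraps a byte stream, so the file is opened with FileInputStream here. The class name CharsetFileRead and the method readAll are hypothetical; the java.io and java.nio.charset calls are standard API.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class CharsetFileRead {
    /** Reads the entire file, decoding its bytes with the named charset. */
    public static String readAll(File file, String charsetName) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file),
                                      Charset.forName(charsetName)));
        try {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            // read() returns -1 at end of stream
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return text.toString();
        } finally {
            reader.close(); // also closes the underlying FileInputStream
        }
    }
}
```

Note that a reader built this way replaces malformed input with a substitution character rather than throwing, so it does not by itself tell you whether the charset name was the right one.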

> 2) Can Classpath be included in a library or application, say a search
> engine, without making the search engine GPL? I'm using Lucene and
> would like to keep it under the Apache license and not the GPL. Can someone
> clarify a little more what an independent module is?

When GNU Classpath is used unmodified as the core class library for a
program written in the java programming language it does not affect the
licensing for distributing this program directly.

See http://www.gnu.org/software/classpath/license.html for details.

Regards
Audrius.





