Does anybody have a better way of doing this?
I would like to search a set of documents, but the documents can be
UTF-8, Shift-JIS, or US-ASCII. To index the files correctly I need to
know each file's encoding so I can convert its contents to Unicode,
because query strings are converted to Unicode and a query must be
encoded the same way the document was indexed.
My only thought is to take a sampling of bytes and then try to convert
those bytes under each candidate encoding, building a probability
matrix for the encoding. I was also planning to use the system's
default encoding, and the languages configured for the directory's
stemming and stop-word analysis, as hints to the encoding.
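(As a starting point, here is a minimal sketch of the validity-based half of that idea, using only the standard java.nio.charset API rather than anything in gnu.java.nio. It tries each candidate charset with a decoder set to REPORT errors and returns the first one that decodes the sample cleanly. The class name, candidate list, and ordering are my own assumptions; a clean decode is evidence, not proof, so ordering from strictest to loosest matters since US-ASCII is a subset of the other two.)

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;

public class CharsetGuesser {
    // Hypothetical candidate list: the three encodings from the post,
    // ordered strictest-first so a pure-ASCII sample reports US-ASCII.
    private static final String[] CANDIDATES = { "US-ASCII", "UTF-8", "Shift_JIS" };

    /**
     * Returns the first candidate charset that decodes the sampled bytes
     * without a malformed-input or unmappable-character error, or null
     * if none of them does.
     */
    public static Charset guess(byte[] sample) {
        for (String name : CANDIDATES) {
            CharsetDecoder decoder = Charset.forName(name).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(sample));
                return Charset.forName(name);
            } catch (CharacterCodingException e) {
                // Sample is not valid in this charset; try the next one.
            }
        }
        return null;
    }
}
```

This only rules encodings out; to rank the survivors you would still layer the byte-frequency probabilities (and the directory-language hints) on top of it.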
Does gnu.java.nio have any functions for giving back probable charsets
of a byte stream? Otherwise I suppose I'll just write my own.
-Craig