classpath

Re: Performing Encoding Discovery


From: Meskauskas Audrius
Subject: Re: Performing Encoding Discovery
Date: Fri, 09 Sep 2005 21:11:22 +0200
User-agent: Mozilla Thunderbird 1.0.2 (Windows/20050317)

The idea is impressive, but automatic encoding discovery may fall outside the boundaries of the standard Java API and hence outside the scope of this project.

Craig Combs wrote:

Does anybody have a better way of doing this?

I would like to search a set of documents, but the documents can be UTF-8, Shift_JIS, or US-ASCII. To index the files correctly I need to know the encoding of each file so I can convert it to Unicode, because search queries are converted to Unicode and the query must be encoded the same way the document was indexed.

My only thought is to take a sampling of bytes and then convert the bytes in each language to create a probability matrix of the encoding. I was planning on using the default encoding of the system, and the languages defined for use in the directory for stemming and stop-word analysis, as hints to the encoding.

Does gnu.java.nio have any functions for giving back probable charsets of a byte stream? Otherwise I suppose I'll just write my own.

-Craig
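
As a rough illustration of the byte-sampling idea, here is a minimal sketch that uses only the standard java.nio.charset API. It does not build a probability matrix; it only filters out candidate charsets under which the sample is not a valid byte sequence. The class name EncodingGuesser and the method plausibleCharsets are made up for this example and are not part of any library. Note that a sample cut off in the middle of a multi-byte character will be rejected, and that pure ASCII text is valid under all three encodings mentioned above, so this is only a first filter.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of GNU Classpath or the JDK.
public class EncodingGuesser {

    /**
     * Returns the candidate charset names, in the order given, under which
     * the sample decodes without any malformed or unmappable input.
     * Candidates that the runtime does not support are skipped.
     */
    public static List<String> plausibleCharsets(byte[] sample, String... candidates) {
        List<String> plausible = new ArrayList<>();
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue; // e.g. Shift_JIS may be missing on a minimal runtime
            }
            // REPORT makes the decoder throw instead of silently
            // substituting replacement characters for bad bytes.
            CharsetDecoder decoder = Charset.forName(name).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(sample));
                plausible.add(name);
            } catch (CharacterCodingException e) {
                // The sample is not valid in this charset; rule it out.
            }
        }
        return plausible;
    }
}
```

Calling EncodingGuesser.plausibleCharsets(bytes, "US-ASCII", "UTF-8", "Shift_JIS") returns the subset of those names that survive strict decoding. Listing the most restrictive charset first lets the caller take the first survivor as a guess. Ties between survivors would still need the kind of statistical or language-based hints described above.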
