Does anybody have a better way of doing this?
I would like to search a set of documents, but the documents can be
UTF-8, Shift-JIS, or US-ASCII. To index the files correctly I need to
know each file's encoding so I can convert its contents to Unicode,
because query strings are converted to Unicode and a query must be
encoded the same way the document was indexed.
My only thought is to take a sampling of bytes and then try to convert
those bytes under each candidate encoding, building a probability
matrix for the encoding. I was also planning to use the system's
default encoding, and the languages configured for the directory's
stemming and stop-word analysis, as hints to the encoding.
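(As a starting point, here is a minimal sketch of the validity-based half of that idea, using only the standard java.nio.charset API rather than anything in gnu.java.nio. It tries each candidate charset with a decoder set to REPORT errors and returns the first one that decodes the sample cleanly. The class name, candidate list, and ordering are my own assumptions; a clean decode is evidence, not proof, so ordering from strictest to loosest matters since US-ASCII is a subset of the other two.)

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;

public class CharsetGuesser {
    // Hypothetical candidate list: the three encodings from the post,
    // ordered strictest-first so a pure-ASCII sample reports US-ASCII.
    private static final String[] CANDIDATES = { "US-ASCII", "UTF-8", "Shift_JIS" };

    /**
     * Returns the first candidate charset that decodes the sampled bytes
     * without a malformed-input or unmappable-character error, or null
     * if none of them does.
     */
    public static Charset guess(byte[] sample) {
        for (String name : CANDIDATES) {
            CharsetDecoder decoder = Charset.forName(name).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(sample));
                return Charset.forName(name);
            } catch (CharacterCodingException e) {
                // Sample is not valid in this charset; try the next one.
            }
        }
        return null;
    }
}
```

This only rules encodings out; to rank the survivors you would still layer the byte-frequency probabilities (and the directory-language hints) on top of it.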
Does gnu.java.nio have any functions for giving back probable charsets
of a byte stream? Otherwise I suppose I'll just write my own.
-Craig