Re: Performing Encoding Discovery


From: Craig Combs
Subject: Re: Performing Encoding Discovery
Date: Fri, 9 Sep 2005 13:51:28 -0700 (PDT)

It's not outside the boundaries of Java, but what is the best algorithm for making the best "guess" at the encoding when you don't know it?  I doubt it could ever be 100% accurate.  I'm assuming that once a program is developed and deployed, a given user will only have a subset of encoded documents, not every encoding in the world, which helps narrow the possibilities.  If you can narrow it down to 3 candidate encodings, even a blind guess gives you a 33% chance of picking the right one.  From there: look at the default system encoding first, since you can assume the majority of documents will use it.  Next, check for UTF-8 (this should always be done).  Then take a sampling buffer of bytes and match it against common byte patterns for the languages in play.  Finally, weight the results and let some rules-based engine decide which encoding to use.  What would be even nicer is to be able to cross-reference against past mistakes so the discovery process can be tuned over time.  I think this is very doable but definitely needs some thought.
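
Just to make that ordering concrete, here is a rough sketch of the first pass using only the standard java.nio.charset API (the class name, method name, and candidate list are made up for illustration; this only validates that the bytes decode cleanly, it doesn't do the pattern weighting):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.LinkedHashSet;
import java.util.Set;

public class EncodingGuess {
    // Returns the first candidate under which the sample decodes without
    // error, or null if none do.  Insertion order encodes the priority:
    // system default first, then UTF-8, then the user's known subset.
    public static Charset firstCleanDecode(byte[] sample, Charset[] userSubset) {
        Set<Charset> candidates = new LinkedHashSet<Charset>();
        candidates.add(Charset.defaultCharset());   // majority assumption
        candidates.add(Charset.forName("UTF-8"));   // always check UTF-8
        for (int i = 0; i < userSubset.length; i++)
            candidates.add(userSubset[i]);

        for (Charset cs : candidates) {
            CharsetDecoder dec = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                dec.decode(ByteBuffer.wrap(sample)); // throws on bad bytes
                return cs;
            } catch (CharacterCodingException e) {
                // not this one; fall through to the next candidate
            }
        }
        return null; // nothing decoded cleanly; fall back to weighting
    }
}

One caveat: single-byte charsets such as ISO-8859-1 decode any byte sequence without error, so a clean decode is a necessary condition, not a sufficient one; the byte-pattern weighting still has to break ties.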
 
I thought I could use gnu.java.io along with Lucene to achieve this.  The sampling of bytes would have to be some form of search, and Lucene already performs weighting.  My guess is you would take the sampled bytes and search them for the stop words that are normally removed from a search in a given language, and then the candidate whose search produces the highest score determines the encoding.
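
A minimal sketch of that scoring idea, without Lucene (the class and method names are invented, and the stop-word lists are assumed to be supplied per language; a real version would use Lucene's per-language analyzers instead of raw substring counting):

import java.io.UnsupportedEncodingException;

public class StopWordScore {
    // Decodes the sample under the candidate encoding and counts hits
    // against that language's stop words; the (encoding, language) pair
    // with the highest count wins.
    public static int score(byte[] sample, String encoding, String[] stopWords) {
        String text;
        try {
            text = new String(sample, encoding);
        } catch (UnsupportedEncodingException e) {
            return -1; // candidate not available on this VM
        }
        int hits = 0;
        for (String word : stopWords) {
            int from = 0;
            while ((from = text.indexOf(word, from)) != -1) {
                hits++;
                from += word.length();
            }
        }
        return hits;
    }
}

So you might compare, say, score(bytes, "SHIFT_JIS", japaneseStopWords) against score(bytes, "UTF-8", englishStopWords) and take the highest.  A real version should match on token boundaries (which Lucene's analyzers handle) rather than counting raw substring hits, since short stop words will otherwise match inside longer words.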
 
Well, thank you for the input.
 
-Craig Combs

Meskauskas Audrius <address@hidden> wrote:
The idea is impressive, but automated discovery of encodings may be
something outside the boundaries of the standard Java API, and hence
outside the scope of this project.

Craig Combs wrote:

> Does anybody have a better way of doing this?
>
> I would like to search a set of documents, but the documents can be
> UTF-8, SHIFT-JIS, or US-ASCII. In order to index the files
> correctly, I need to know the encoding of each file so I can convert it
> to Unicode, because query searches are converted to Unicode and the query
> must be encoded the same way the document was indexed.
>
> My only thought is to take a sampling of bytes and then convert the
> bytes under each candidate language to create a probability matrix for
> the encoding. I was planning on using the default encoding of the
> system, plus the languages defined for use in the directory for stemming
> and stop-word analysis, as hints to the encoding.
>
> Does gnu.java.nio have any functions for returning the probable charsets
> of a byte stream? Otherwise I suppose I'll just write my own.
>
> -Craig
>

