This article looks at Java character encoding challenges and how they can be tackled.
The situation with Java character encoding
Some time ago I wrote about a situation we face at Lingohub every day: whenever a user uploads a resource file or uses our GitHub & Bitbucket integration to import one, we have to determine the correct character encoding.
We always receive a byte stream, nothing more, nothing less. So how should we be able to apply the correct charset (UTF-8, UTF-16LE, UTF-16BE, ISO-8859-1) to transform these bytes into meaningful characters?
But hey, wait! What do you mean "nothing more, nothing less"? You have the file extension, so you can actually derive the encoding, right?
I have a saying:
There are two problems we as computer engineers have not solved by 2014:
- making it easy to connect your laptop to a projector :)
- character encoding
The general solution for Java character encoding
At Lingohub we had to find a solution for this situation. Our import must handle *all* resource files regardless of the encoding used. As mentioned above, there is no evidence that reveals the encoding with 100% certainty; however, we found an approach that works quite well:
- We import the file in binary form
- The importer applies a candidate encoding to the byte sequence
- If decoding fails, we try the next encoding...
- We repeat until the conversion succeeds
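The loop above can be sketched with `java.nio.charset.CharsetDecoder`. This is a minimal illustration, not the actual Lingohub implementation; the class name, method name, and candidate ordering are my assumptions:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {

    // Candidate charsets, tried in order. Note that ISO-8859-1 maps every
    // possible byte to a character, so as the last entry it acts as a
    // catch-all fallback that can never fail.
    private static final Charset[] CANDIDATES = {
        StandardCharsets.UTF_8,
        StandardCharsets.UTF_16LE,
        StandardCharsets.UTF_16BE,
        StandardCharsets.ISO_8859_1
    };

    // Returns the content decoded with the first charset that accepts the
    // whole byte sequence, or null if none matches.
    public static String decodeWithFirstMatch(byte[] bytes) {
        for (Charset charset : CANDIDATES) {
            CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                return decoder.decode(ByteBuffer.wrap(bytes)).toString();
            } catch (CharacterCodingException e) {
                // The bytes are invalid for this charset -- try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] bytes = "key=wert ä".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeWithFirstMatch(bytes)); // prints "key=wert ä"
    }
}
```

Because the decoder is configured to REPORT errors, an invalid byte sequence throws a `CharacterCodingException` instead of being silently patched over, which is exactly what lets the loop advance to the next candidate.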
The Java way (aka Java character encoding for winners)
We have implemented this approach in Java in a single class. The code is shared here: EnsureEncoding.java.
The important part can be seen here:
As you can see, we used the option CodingErrorAction.REPORT for java.nio.charset.CharsetDecoder. With it, the decoder fails with an exception whenever a byte sequence cannot be decoded with the charset under test.
Other options are:
- IGNORE - silently skips malformed byte sequences
- REPLACE - replaces malformed byte sequences with a defined replacement character
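The difference between the three error actions is easy to demonstrate on a byte that can never occur in well-formed UTF-8. This is a small illustration I wrote for this article, not part of the original snippet:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ErrorActionDemo {

    // Decodes the given bytes as UTF-8, applying the given error action to
    // both malformed input and unmappable characters.
    static String decode(byte[] bytes, CodingErrorAction action)
            throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(action)
            .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws Exception {
        // 0xFF never appears in well-formed UTF-8.
        byte[] bad = {'a', (byte) 0xFF, 'b'};

        System.out.println(decode(bad, CodingErrorAction.IGNORE));  // prints "ab"
        System.out.println(decode(bad, CodingErrorAction.REPLACE)); // prints "a\uFFFDb"

        try {
            decode(bad, CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            // Only REPORT surfaces the problem -- which is what the
            // detection loop relies on.
            System.out.println("REPORT threw: " + e);
        }
    }
}
```

IGNORE and REPLACE always "succeed", so they are useless for detection; only REPORT tells you that the charset guess was wrong.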
But what happens in line 19? As I mentioned above, this approach can never be 100% correct. Problems start when you check against UTF-16: almost every two- or four-byte sequence maps to some UTF-16 character, so a UTF-8 byte stream can often be decoded as UTF-16 without ever hitting an invalid byte sequence.
Fortunately, in this case the decoded characters usually represent garbage, so we can check the resulting string against known patterns, e.g.:
- "=" sign in key/value based resource files
- "<?xml" in xml based files
To sum it up: we chose this approach to Java character encoding over existing solutions like juniversalchardet because those implementations return only their single best match. If that guess is wrong, you are stuck with it and cannot change your strategy after inspecting the result. Re-checking whether the content actually makes sense is sometimes crucial.
If you like our approach, please feel free to contact us. If you want to contribute to this code snippet, let us know and we will create a GitHub project ... and please enjoy this video. If you want to see some of the Java we have put into practice, give Lingohub a try.