CEO and founder of Lingohub. Envisioning a multilingual digital world. Email me if you have questions about how Lingohub can help you take your products global.

Helmut | Product Updates

Do you speak UTF-8?

“@mperham: You have a problem, your data is in latin1 so you think: ‘I’ll convert to UTF8!’ Now you have � problems.” cc @kingshy_g

Every one of us coders who has dealt with encodings has felt that pain, haven’t you?

During my career as a developer I was quite lucky: I rarely had to deal with encodings.
And when I did, well, it wasn’t my fault. The provider of the data had chosen some exotic encoding (not knowing any better). But I was the one in charge of solving the problem!

Actually, for me the whole encoding issue feels like a never-ending Y2K bug.
We have proper encodings nowadays, but we computer people have not managed to bring this topic to a close.

When reading different resource file formats, LingoHub has to deal with this subject:

  • Java resource bundles are stored in ISO-8859-1, with \uXXXX escapes (UTF-16 code units) for everything outside that range
  • iOS strings are stored in UTF-16 (sometimes you have to guess the byte order: little or big endian)
  • XML: encoding="UTF-8". Good idea! But this could be a lie (e.g. after copy/paste)
  • some other formats have no defined encoding, nor do you get any metadata that gives you that information. So your application has to know which encoding to expect
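
To make the first three bullets a bit more concrete, here is a minimal Python sketch. It is an illustration only, not LingoHub's actual code: the function names (decode_java_properties, decode_utf16, fits_declared_encoding) and the little-endian fallback are assumptions made up for this example, and the escape handling is deliberately simplified.

```python
import codecs
import re

# Matches a Java-style \uXXXX escape. Simplification: an escaped
# backslash in front of "u" (as in "\\u0041") is not treated specially.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_java_properties(raw: bytes) -> str:
    """Decode ISO-8859-1 bytes and expand \\uXXXX escapes."""
    # ISO-8859-1 maps every byte to a code point, so this never fails.
    text = raw.decode("iso-8859-1")
    text = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # The escapes are UTF-16 code units, so two of them may form a
    # surrogate pair. A round trip through UTF-16 joins such pairs.
    # (An unpaired surrogate makes the decode step raise an error.)
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


def decode_utf16(raw: bytes, fallback: str = "utf-16-le") -> str:
    """Decode UTF-16, using the BOM if present, else a guessed byte order."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic "utf-16" codec reads and strips the BOM itself.
        return raw.decode("utf-16")
    # No BOM: this is the guessing part. The fallback is an assumption.
    return raw.decode(fallback)


def fits_declared_encoding(raw: bytes, declared: str) -> bool:
    """Return False if the bytes cannot be decoded as the declared encoding."""
    try:
        raw.decode(declared)  # strict error handling by default
        return True
    except UnicodeDecodeError:
        return False
```

Note that fits_declared_encoding only catches lies that produce invalid byte sequences. A file declared as ISO-8859-1 will always pass, whatever it really contains.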

OK, OK. This was just a rant and won’t give you any solutions.
I will stop here for today and pick this topic up in a series of posts, to give you some ideas of how we solved some of our issues in the encoding domain.