Ruby and Encoding

This is the second post in my series "Do you speak UTF-8?". It follows the first article on this topic.

This article covers how Ruby 1.9 handles encoding internally and which tooling it provides for encoding issues. Prior to Ruby 1.9, a String was just a sequence of bytes. Calling size() returned the length of this byte array, not the character count. Since Ruby 1.9, every String stores its Encoding along with that byte array.

The following example shows the difference. You can ask a String for its encoding, size() returns the actual character count, and bytesize() returns the number of bytes. Comparing the codepoints with the bytes shows how the two representations differ.

[gist id=2818443]
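As a minimal sketch of these methods, assuming a UTF-8 string: the helper name `string_facts` and the sample word are illustrative choices, not part of the original example. The string is written with `\u` escapes so that it is UTF-8 regardless of how the source file is saved.

```ruby
# Hypothetical helper: collects the encoding-related facts about a String.
def string_facts(text)
  {
    encoding:   text.encoding,
    size:       text.size,        # number of characters
    bytesize:   text.bytesize,    # number of bytes
    codepoints: text.codepoints,  # one Integer per character
    bytes:      text.bytes        # one Integer per byte
  }
end

greeting = "Gr\u00FC\u00DFe"      # "Grüße"
facts = string_facts(greeting)

puts facts[:encoding]             # UTF-8
p facts[:size]                    # 5
p facts[:bytesize]                # 7
p facts[:codepoints]              # [71, 114, 252, 223, 101]
p facts[:bytes]                   # [71, 114, 195, 188, 195, 159, 101]
```

The two non-ASCII characters each occupy two bytes in UTF-8, which is why the byte count is larger than the character count.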

So this looks fine. But what do you do if you want to change the encoding of a String, e.g. to write it to a file in a specific format? To change the encoding of a String you can use String#encode. Be aware that there are some pitfalls, though. Strings with different encodings cannot always be concatenated. Concatenation is only guaranteed to work while the Strings contain nothing but 7-bit ASCII characters (byte values up to 127). Once both Strings contain non-ASCII characters in different encodings, the concatenation raises an Encoding::CompatibilityError.
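The behaviour described above can be sketched roughly as follows. The helper `concat_or_error` is a hypothetical name introduced for illustration, and ISO-8859-1 is just an example target encoding.

```ruby
# Hypothetical helper: returns the concatenated String, or the error class
# if Ruby refuses to combine the two encodings.
def concat_or_error(left, right)
  left + right
rescue Encoding::CompatibilityError => e
  e.class
end

utf8   = "Gr\u00FC\u00DFe"           # "Grüße" as UTF-8 (7 bytes)
latin1 = utf8.encode("ISO-8859-1")   # same characters, transcoded (5 bytes)

puts latin1.encoding                 # ISO-8859-1
p latin1.bytes                       # [71, 114, 252, 223, 101]

# Both operands contain non-ASCII characters in different encodings.
puts concat_or_error(utf8, latin1)   # Encoding::CompatibilityError

# An ASCII-only operand can still be appended to the UTF-8 String.
puts concat_or_error(utf8, "abc".encode("ISO-8859-1")).encoding  # UTF-8

# Transcoding to a common encoding first makes the concatenation safe.
p concat_or_error(utf8, latin1.encode("UTF-8")) == utf8 * 2      # true
```

Transcoding everything to one encoding before combining Strings avoids the error entirely.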

As mentioned in the previous post, text files do not carry that meta information. No information about the encoding is stored along with the file. The same restriction applies to source files. You can see the effect if you use a text editor to change the encoding of a source file that contains non-ASCII strings.

Text editors are not the only programs that have to know the encoding of strings in order to store the correct byte sequence on disk. The Ruby interpreter also has to know how to decode the bytes it reads. To add this meta information, Ruby 1.9 lets you specify the encoding in a comment on the first line of the source file:

[gist id=2815831]
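A minimal sketch of such a magic comment, assuming the file itself is saved as UTF-8. The helper `source_encoding` is a hypothetical name; `__ENCODING__` is the keyword Ruby provides for the encoding of the current source file.

```ruby
# encoding: utf-8
# The magic comment above has to be the first line of the file
# (or the second line if the file starts with a shebang).

# Hypothetical helper: reports the encoding Ruby assumed for this source file.
def source_encoding
  __ENCODING__
end

puts source_encoding      # UTF-8
puts "Grüße".encoding     # UTF-8, string literals take the source encoding
```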

Speaking of files: how do I specify the encoding of a file that I want to read from disk? File.open() lets you specify the encoding that is used to interpret the file data. In this example the string that is read carries the encoding specified at open().

[gist id=2831484]

The second part of the above code example shows that you can also specify an internal encoding for the file operation. The string that is read is then transcoded to that encoding.
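Both variants can be sketched in a self-contained way by first writing a small ISO-8859-1 file. The helper `read_with_mode`, the temp file name, and the sample bytes are assumptions made for this illustration.

```ruby
require "tempfile"

# Hypothetical helper: reads the whole file using the given mode string.
def read_with_mode(path, mode)
  File.open(path, mode) { |file| file.read }
end

# Write "Grüße" as raw ISO-8859-1 bytes (one byte per character).
sample = Tempfile.new("latin1-demo")
sample.binmode
sample.write([71, 114, 252, 223, 101].pack("C*"))
sample.close

# External encoding only: the bytes are tagged as ISO-8859-1, not converted.
tagged = read_with_mode(sample.path, "r:iso-8859-1")
puts tagged.encoding      # ISO-8859-1
p tagged.bytesize         # 5

# External and internal encoding: the data is transcoded to UTF-8 on read.
transcoded = read_with_mode(sample.path, "r:iso-8859-1:utf-8")
puts transcoded.encoding  # UTF-8
p transcoded.bytesize     # 7

sample.unlink
```

The mode string follows the pattern "r:external:internal", so the second call hands back a UTF-8 String even though the file on disk is unchanged.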

Doing this makes life much easier. For most applications it is best practice to work with just one internal encoding, preferably UTF-8.
