Ruby and Encoding

Helmut Juskewycz

CEO & Founder of LingoHub

Last updated

5/30/2012

This is the second post in my series “Do you speak UTF-8?”, following the first article on this topic.

This article covers how Ruby 1.9 handles encoding internally and which tools it provides for dealing with encoding issues. Prior to Ruby 1.9, a String was just a sequence of bytes: calling size() returned the length of that byte array, not the character count. In Ruby 1.9, the encoding is stored along with the byte array.

As the following example shows, you can ask a String for its encoding. size() returns the actual character count, while bytesize() gives you the number of bytes. You can see how these two representations differ by comparing the codepoints with the bytes.

[gist id=2818443]
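
A minimal sketch of these methods, using a sample string chosen here for illustration. The `\u` escapes guarantee that the literal is UTF-8 regardless of the source file's encoding:

```ruby
# "Grüße": the characters "ü" (U+00FC) and "ß" (U+00DF) each need two bytes in UTF-8.
str = "Gr\u00FC\u00DFe"

str.encoding         # => #<Encoding:UTF-8>
str.size             # => 5 (characters)
str.bytesize         # => 7 (bytes)

# Codepoints describe characters, bytes describe the stored representation.
str.codepoints.to_a  # => [71, 114, 252, 223, 101]
str.bytes.to_a       # => [71, 114, 195, 188, 195, 159, 101]
```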

So far this looks fine. But what if you want to change the encoding of a String, e.g. to write it to a file in a specific format? To change the encoding of a String you can use String#encode.
But be aware that there are some pitfalls. Strings with different encodings cannot always be concatenated. The problem appears as soon as characters outside the 7-bit ASCII range (byte values greater than 127) are involved; pure ASCII content in ASCII-compatible encodings can usually still be combined. A disallowed concatenation raises an Encoding::CompatibilityError.
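
The following sketch illustrates both points with ISO-8859-1 as an arbitrarily chosen target encoding. The variable names and sample strings are made up for this illustration:

```ruby
utf8   = "caf\u00E9"                # "café", UTF-8 because of the \u escape
latin1 = utf8.encode("ISO-8859-1")  # same characters, different bytes

utf8.bytes.to_a    # => [99, 97, 102, 195, 169]
latin1.bytes.to_a  # => [99, 97, 102, 233]

# An ASCII-only string can still be appended to the ISO-8859-1 string.
mixed = latin1 + "abc"
mixed.encoding     # => #<Encoding:ISO-8859-1>

# Both operands contain a non-ASCII character in different encodings,
# so the concatenation raises Encoding::CompatibilityError.
error = nil
begin
  utf8 + latin1
rescue Encoding::CompatibilityError => e
  error = e
end
```

If you need to combine such strings, transcode one of them first, for example `utf8 + latin1.encode("UTF-8")`.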

As mentioned in the previous post, text files do not carry that meta information: no information about the encoding is stored along with the file. The same restriction applies to source files. You can see the effect if you use a text editor to change the encoding of a source file that contains non-ASCII strings.

But text editors are not the only ones that need to know the encoding in order to store the correct byte sequence on disk. The Ruby interpreter also has to know how to decode the bytes it reads. To supply this meta information, Ruby 1.9 lets you specify the encoding in a comment on the first line of the source file:

[gist id=2815831]
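
As a rough illustration, a source file with such a comment could look like this. The sample string is an assumption, and the file itself has to be saved as UTF-8 for the comment to be truthful:

```ruby
# encoding: utf-8
# The comment above tells the interpreter how to decode this file's bytes.
greeting = "Grüße"

greeting.encoding  # => #<Encoding:UTF-8>
__ENCODING__       # => #<Encoding:UTF-8> (the encoding of the current source file)
```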

Speaking of files: how do you specify the encoding of a file that you want to read from disk? File.open() lets you pass the encoding that is used to interpret the file data. In this example, the string that is read will have the encoding specified in open().

[gist id=2831484]

The second part of the code example above shows that you can also specify an internal encoding for the file operation. The string that is read is then transcoded to that encoding.
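
A self-contained sketch of both variants, assuming a temporary ISO-8859-1 file that is created just for this demonstration:

```ruby
require "tempfile"

# Create a sample file containing "café" encoded as ISO-8859-1 (4 bytes).
file = Tempfile.new("encoding-demo")
file.binmode
file.write("caf\u00E9".encode("ISO-8859-1"))
file.close

# External encoding only: the bytes are left untouched and the
# resulting string is tagged as ISO-8859-1.
external = File.open(file.path, "r:iso-8859-1") { |f| f.read }
external.encoding  # => #<Encoding:ISO-8859-1>
external.bytesize  # => 4

# External and internal encoding: the data is transcoded to UTF-8 while reading.
internal = File.open(file.path, "r:iso-8859-1:utf-8") { |f| f.read }
internal.encoding  # => #<Encoding:UTF-8>
internal.bytesize  # => 5

file.unlink
```

The mode string follows the pattern `"r:external:internal"`; the same settings can also be passed as the `:external_encoding` and `:internal_encoding` options.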

Doing this makes life much easier. For most applications it is best practice to work with just one internal encoding, preferably UTF-8.
