LingoHub Academy


What is UTF-8?

UTF-8 (Unicode Transformation Format, 8-bit) is a character encoding that represents each character as a sequence of one to four 8-bit bytes.

Characters here means letters, digits, punctuation and special symbols (currency signs, mathematical symbols, emoji and so on).

UTF-8 overtook all other encodings on the web in 2008 and has been the dominant encoding (of any kind, not just among Unicode encodings) since 2009. It passed 60% of web pages in 2012 and, as of March 2020, accounts for 95.0% of all web pages.

Character Encoding

Character encoding matters to anyone who plans to create content in languages that go beyond the basic Latin alphabet. It determines whether the characters in a text can be read correctly by human users, machines and search engines.

Computers store characters as single bytes or groups of bytes. A character encoding is the mapping between characters and those bytes, so that stored bytes are turned back into the right characters when they are displayed.

It’s important to differentiate between fonts and character encoding. Even with the proper character encoding in place, the selected font might lack a glyph for a character and show placeholder symbols instead, such as empty squares or question marks.
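As a minimal sketch of this mapping, the Python snippet below encodes a short string to bytes and decodes it again, once with the right encoding and once with the wrong one. The sample word and the variable names are illustrative assumptions.

```python
# Illustrative sample text; any string would do.
text = "café"

# Encoding turns characters into bytes. The accented letter needs two bytes.
encoded = text.encode("utf-8")
print(len(text), len(encoded))  # 4 characters, 5 bytes

# Decoding with the same encoding restores the original text.
restored = encoded.decode("utf-8")

# Decoding the same bytes with a different encoding produces garbled text.
garbled = encoded.decode("iso-8859-1")
print(restored, garbled)
```

The bytes never change between the two decode calls; only the interpretation does, which is why a wrong encoding declaration leads to unreadable text.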

ASCII basics

In the beginning there was ASCII - American Standard Code for Information Interchange. ASCII is a character encoding standard for digital communication. It covers the letters of the English alphabet, digits, punctuation and a set of control characters.

However, as the Internet expanded beyond English, billions of users needed more than basic Latin characters to access relevant content, and ASCII was succeeded by newer character encodings.

It’s helpful to know that the first 128 code points (unique numbers for each character) of ASCII, ISO-8859 and UTF-8 are identical. While ISO-8859 covers most languages that use the Latin script, with separate parts for scripts such as Cyrillic, Greek, Arabic and Hebrew, UTF-8 goes further and covers most living languages and writing systems in the world.
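The overlap between these encodings can be checked with a short Python sketch. It uses ISO-8859-1 as one representative part of the ISO-8859 family (an assumption made for illustration), and the variable names are hypothetical.

```python
# Code points 0-127 produce the same single byte in all three encodings.
for code_point in range(128):
    ch = chr(code_point)
    assert ch.encode("ascii") == ch.encode("iso-8859-1") == ch.encode("utf-8")

# Above 127 the encodings diverge.
e_acute = "é"  # U+00E9
latin1_form = e_acute.encode("iso-8859-1")  # one byte: b'\xe9'
utf8_form = e_acute.encode("utf-8")         # two bytes: b'\xc3\xa9'

# ASCII cannot represent this character at all.
try:
    e_acute.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(latin1_form, utf8_form, ascii_ok)
```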

If you haven’t thought about your target audience (current and future), this is the time to do so.


What is UTF-8?

UTF-8, short for Unicode Transformation Format (8-bit), extends ASCII: it keeps the ASCII characters unchanged and encodes every Unicode code point in one to four bytes.

Structure of UTF-8:

  • One byte: The first 128 characters (corresponding to the ASCII characters).

  • Two bytes: The following 1,920 characters, which cover the remainder of almost all Latin-script alphabets as well as Greek, Cyrillic, Hebrew and Arabic.

  • Three bytes: The rest of the most commonly used characters, including most Chinese, Japanese and Korean characters.

  • Four bytes: Less common characters, including historic scripts, some mathematical symbols and emoji.
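The byte counts above follow directly from the code point ranges. The sketch below is one way to check this in Python; `utf8_length` and the sample characters are hypothetical names and choices made for illustration.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a given Unicode code point."""
    if code_point < 0x80:      # first 128 code points (ASCII)
        return 1
    if code_point < 0x800:     # the following 1,920 code points
        return 2
    if code_point < 0x10000:   # rest of the Basic Multilingual Plane
        return 3
    return 4                   # supplementary planes (e.g. emoji)


# Illustrative sample characters, one or more per byte length.
samples = {
    "A": "Latin letter (ASCII)",
    "é": "accented Latin letter",
    "Ж": "Cyrillic letter",
    "中": "Chinese character",
    "😀": "emoji",
}

byte_lengths = {}
for ch, label in samples.items():
    byte_lengths[ch] = len(ch.encode("utf-8"))
    # The computed length should agree with Python's own encoder.
    assert byte_lengths[ch] == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} {label}: {byte_lengths[ch]} byte(s)")
```

The two-byte range holds 0x800 - 0x80 = 1,920 code points, which matches the figure in the list.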

Is UTF-8 compatible with ASCII?

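The UTF-8 codes for the standard ASCII characters are identical to their ASCII codes, so any valid ASCII text is also valid UTF-8. This makes UTF-8 ideal for backward compatibility with existing ASCII text. However, keep in mind that UTF-16 does not share this property: it uses at least two bytes per character, so it is not byte-compatible with either ASCII or UTF-8.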
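A small Python sketch makes the difference visible by encoding the same ASCII-only text three ways. The sample text and variable names are assumptions for illustration; `utf-16-le` is used so that no byte order mark is added.

```python
ascii_text = "Hello"

as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")
as_utf16 = ascii_text.encode("utf-16-le")  # little-endian, no byte order mark

# UTF-8 output is byte-for-byte identical to ASCII for ASCII-only text.
utf8_matches_ascii = as_utf8 == as_ascii

# UTF-16 uses two bytes per character here, so the bytes differ.
utf16_matches_ascii = as_utf16 == as_ascii

print(as_utf8, as_utf16)
print(utf8_matches_ascii, utf16_matches_ascii)
```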

In general, UTF-8 dominates the web and has been the recommended encoding since HTML5.

Why is this relevant for you?

Since an HTML page can use only one encoding, UTF-8 is the best choice. It supports many languages and allows a mix of different languages on a single page.

Since all ASCII characters have the same bytes in UTF-8, backward compatibility is easy as well.
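To illustrate mixing languages on one page, the Python sketch below builds a tiny HTML document that declares UTF-8 and contains text in several scripts. The page content and variable names are made up for this example; ISO-8859-1 stands in for a legacy single-byte encoding.

```python
# Illustrative text mixing Latin, Greek, Cyrillic and Japanese.
mixed = "English, Ελληνικά, Русский, 日本語"

# A minimal page that declares its encoding as UTF-8.
html = (
    '<!DOCTYPE html><html><head><meta charset="utf-8">'
    f"<title>Demo</title></head><body><p>{mixed}</p></body></html>"
)

# UTF-8 can encode the whole page and decode it back unchanged.
page_bytes = html.encode("utf-8")
round_trip = page_bytes.decode("utf-8")

# A legacy single-byte encoding cannot hold all of these scripts at once.
try:
    html.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

print(round_trip == html, fits_latin1)
```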
