“@mperham: You have a problem, your data is in latin1 so you think: "I'll convert to UTF8!" Now you have � problems.” cc @kingshy_g
Every one of us coders who has dealt with encodings has felt that pain. Haven't you?
Character Encoding
Since 2009, UTF-8 has been the dominant encoding (of any kind, not just of Unicode encodings) for the web, and as of March 2020, it accounts for 95.0% of all web pages. It became the most common single encoding in 2008 and passed 60% of the web in 2012.
Character encoding matters to anyone planning to create content in languages that go beyond the basic Latin alphabet. Computers store text as bytes, one or several per character. A character encoding is the mapping that says which bytes stand for which character, so that the text stays readable for human users, machines, and search engines. When bytes are interpreted with the wrong encoding, the characters come out garbled.
It’s important to differentiate between fonts and character encoding - even with the proper character encoding on your hands, the selected font might lack a glyph for a character and show a placeholder instead, such as an empty square or a question mark.
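As a minimal sketch of the idea, the Python snippet below encodes one string under two encodings and then decodes the bytes with the wrong one. The sample word is only an illustration; any text with non-ASCII characters behaves in a similar way.

```python
# The same text maps to different bytes depending on the encoding.
text = "café"

utf8_bytes = text.encode("utf-8")      # é becomes two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # é becomes one byte: 0xE9

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Decoding UTF-8 bytes as Latin-1 raises no error, it just garbles the text.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # cafÃ©

# Decoding Latin-1 bytes as UTF-8 is invalid; with errors="replace"
# the bad byte turns into U+FFFD, the replacement character.
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced)  # caf�
```

The last line is where the � in the tweet above comes from.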
ASCII basics
In the beginning, there was ASCII - American Standard Code for Information Interchange. ASCII is a character encoding standard for digital communication. It is a 7-bit standard with 128 code points: control characters, punctuation, digits, and the unaccented letters of the English alphabet.
However, as the Internet expanded beyond the English language, billions of users had little use for Latin characters when accessing relevant content, and ASCII was succeeded by newer character encodings. It’s helpful to know that the first 128 code points (unique numbers for each character) of ASCII, ISO-8859, and UTF-8 are identical. Although the ISO-8859 family covers most languages that use the Latin script, with separate parts for scripts such as Cyrillic, Arabic, Greek, and Hebrew, UTF-8 goes further and covers most living languages and writing scripts worldwide.
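To see which byte values these encodings actually share, here is a short Python sketch. It relies only on the codecs that ship with Python; the byte 0xE9 is an arbitrary example value.

```python
# Bytes 0x00-0x7F decode to the same characters in all three encodings.
ascii_range = bytes(range(128))
assert ascii_range.decode("ascii") == ascii_range.decode("iso-8859-1")
assert ascii_range.decode("ascii") == ascii_range.decode("utf-8")

# Above 0x7F the encodings diverge: one byte, three different readings.
byte = bytes([0xE9])
print(byte.decode("iso-8859-1"))  # é in ISO-8859-1 (Latin-1)
print(byte.decode("iso-8859-5"))  # a Cyrillic letter in ISO-8859-5
print("é".encode("utf-8"))        # b'\xc3\xa9': UTF-8 needs two bytes for é

# The Unicode code point of é equals its ISO-8859-1 byte value,
# but its UTF-8 byte sequence is different.
assert ord("é") == 0xE9
```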
If you haven’t thought about your target audience (current and future), this is the time to do so.
What is UTF-8?
UTF-8 (Unicode Transformation Format, 8-bit) is a variable-length encoding of Unicode that is backward compatible with ASCII. UTF-8 encodes each code point in one to four bytes.
Structure of UTF-8:
- One byte: The first 128 code points (U+0000 to U+007F), which correspond to the ASCII characters.
- Two bytes: The following 1,920 code points (U+0080 to U+07FF), which cover the remainder of almost all Latin-script alphabets as well as Hebrew, Arabic, Cyrillic, and Greek.
- Three bytes: The rest of the Basic Multilingual Plane (U+0800 to U+FFFF), which includes most characters for languages such as Chinese, Japanese, and Korean.
- Four bytes: The supplementary planes (U+10000 to U+10FFFF), which include historic scripts, some mathematical symbols, and emoji.
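The structure above can be sketched as a small Python function that predicts the encoded length from the code point and checks it against Python's own encoder. The name `utf8_length` and the sample characters are illustrative, not part of any library.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a Unicode code point.

    Illustrative helper; it does not reject surrogates (U+D800 to U+DFFF),
    which are valid code points but cannot be encoded on their own.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1  # ASCII range
    if code_point <= 0x7FF:
        return 2  # the following 1,920 code points
    if code_point <= 0xFFFF:
        return 3  # rest of the Basic Multilingual Plane
    return 4      # supplementary planes


samples = ["A", "é", "я", "中", "😀"]
for ch in samples:
    encoded = ch.encode("utf-8")
    # The prediction must agree with the real encoder.
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```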
Is UTF-8 compatible with ASCII?
UTF-8 encodes the standard ASCII characters with exactly the same single bytes as ASCII itself. This makes UTF-8 ideal for backward compatibility with existing ASCII text. However, keep in mind that UTF-16 is not byte-compatible with ASCII or UTF-8, because it uses at least two bytes for every character.
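A quick way to check this is to encode a pure-ASCII string under each encoding and compare the bytes. The sketch below uses Python's built-in codecs; the sample string is arbitrary, and `utf-16-le` is chosen so that no byte order mark is added.

```python
text = "Hello, world!"

# Pure ASCII text yields identical bytes under ASCII and UTF-8.
assert text.encode("ascii") == text.encode("utf-8")

# UTF-16 spends two bytes even on ASCII characters, so the bytes differ.
utf16_bytes = text.encode("utf-16-le")
print(len(text.encode("utf-8")))  # 13
print(len(utf16_bytes))           # 26
print(utf16_bytes[:4])            # b'H\x00e\x00'
```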
In general, UTF-8 dominates the web and has been the recommended encoding since HTML5.
Why is this relevant for you?
Since an HTML page can only be in one encoding, UTF-8 is the sensible choice. It supports many languages and allows a mix of different languages on a page.
Since all ASCII characters keep the same byte values in UTF-8, backward compatibility is easy as well.
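As a rough illustration, the Python sketch below builds a small page that mixes several scripts, declares its encoding with `<meta charset="utf-8">`, and shows that a single-byte legacy encoding cannot hold the same content. The page text is a made-up example.

```python
# A hypothetical page mixing several scripts in one document.
page = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Greetings</title></head>
<body>
<p>Hello / Привет / Γεια / 你好</p>
</body>
</html>
"""

# One encoding covers every script on the page and round-trips cleanly.
data = page.encode("utf-8")
assert data.decode("utf-8") == page

# A single-byte legacy encoding cannot represent the whole page.
try:
    page.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    print("ISO-8859-1 cannot encode:", repr(page[exc.start]))
```

When saving such a page to disk, the bytes written should match the declared charset, for example by opening the file with `encoding="utf-8"`.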