Unicode and UTF-8

Website Migration Handbook

The terms "Unicode" and "UTF-8" are sometimes used interchangeably (incorrectly), so I thought it would be useful to describe UTF-8 and the code points of Unicode. In Unicode, each character is represented by a code point between 0 and 10FFFF (hexadecimal), so that's 1,114,112 possible code points, although not all are used. A Unicode code point is written as "U+" followed by its hexadecimal value. For instance, the letter A is U+0041 and the Chinese character for "center", 中, is U+4E2D.

UTF-8 is a way of encoding those 1,114,112 possible code points as a sequence of bytes. In UTF-8, "A" is the same as the ASCII A, or 0x41. UTF-8 is especially good when most of your text is going to be Latin-based, since all normal ASCII characters are represented in just one byte. But UTF-8 can also use multiple bytes for a single character. For instance, the Chinese character mentioned above (U+4E2D) is represented in UTF-8 as the three bytes 0xE4 0xB8 0xAD.

UTF-16 always uses at least 2 bytes per character (16 bits, hence the "16" in UTF-16). If the majority of your site(s) are going to be in non-Latin languages, it may make sense to use UTF-16. Why? Some characters, for instance U+4E2D above, are represented in just two bytes in UTF-16 (0x4E2D) instead of three in UTF-8.
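The difference between a code point and its encoded bytes is easy to check for yourself. The following is a minimal sketch in Python 3; the helper name `describe` is made up for this illustration, and "utf-16-be" (big-endian, no byte order mark) is chosen only so that the UTF-16 bytes print in the same order as the code point.

```python
def describe(ch: str) -> dict:
    """Return the code point and the UTF-8 / UTF-16 bytes of one character.

    `describe` is a hypothetical helper, not part of any library.
    """
    return {
        # ord() gives the Unicode code point as an integer.
        "code_point": f"U+{ord(ch):04X}",
        # str.encode() turns the character into bytes in the given encoding.
        "utf8": ch.encode("utf-8").hex().upper(),
        # Big-endian UTF-16 without a byte order mark.
        "utf16": ch.encode("utf-16-be").hex().upper(),
    }


for ch in ("A", "中"):
    info = describe(ch)
    print(ch, info["code_point"], "UTF-8:", info["utf8"], "UTF-16:", info["utf16"])

# A U+0041 UTF-8: 41 UTF-16: 0041
# 中 U+4E2D UTF-8: E4B8AD UTF-16: 4E2D
```

The output shows the trade-off described above: "A" takes one byte in UTF-8 but two in UTF-16, while 中 takes three bytes in UTF-8 but two in UTF-16.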

Last updated 14 August 2015 (first published 04 November 2007)