Intro

Unicode is the best and hopefully last character set.

ASCII uses 7 bits (usually stored in an 8-bit byte) to encode 128 characters (x0-x7F) and was fine for basic English. Character sets (like windows-1252 or ISO Latin 1) that fully utilize the 8th bit can encode up to 256 characters (x0-xFF) and are sufficient for many European languages. However, one byte is insufficient for languages that need more than 256 characters or when dealing with multiple languages. This is where Unicode comes in.

  • The Unicode codespace covers 1,114,112 code points (ranging from x0 to x10FFFF) divided into 17 planes (0-16). Code points 0-65,535 (x0-xFFFF) make up plane 0, the Basic Multilingual Plane (BMP), equivalent to UCS-2.
  • The Unicode code points are implemented by several character encodings:
    • Unicode transformation formats (UTF). Variants include: UTF-1, UTF-7, UTF-8 (the best! Linux), UTF-EBCDIC, UTF-16 (Windows, Java), UTF-32.
    • UCS transformation formats. Variants include: UCS-2 (an obsolete subset of UTF-16), UCS-4 (equivalent to UTF-32).
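
The plane arithmetic above can be sketched in a few lines of Python. This is only an illustration: the names plane_of and in_bmp are made up for this example and are not part of any standard library.

```python
# Sketch: locate a code point within the Unicode codespace.
# plane_of and in_bmp are hypothetical helper names for this example.
MAX_CODE_POINT = 0x10FFFF   # last code point of plane 16
PLANE_SIZE = 0x10000        # 65,536 code points per plane

def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that contains code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode codespace")
    return code_point // PLANE_SIZE

def in_bmp(code_point: int) -> bool:
    """True if code_point lies in plane 0, the BMP."""
    return plane_of(code_point) == 0

print(plane_of(ord("A")))                   # 0
print(plane_of(0x1F600))                    # 1
print((MAX_CODE_POINT + 1) // PLANE_SIZE)   # 17
```

The last line confirms that 1,114,112 code points split evenly into 17 planes.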

The Unicode Consortium [unicode.org] works on the "Unicode Project" or the "Unicode Standard". The ISO/IEC works on the Universal Character Set (UCS), aka ISO/IEC 10646. Both "cooperate" to make Unicode. UCS focuses on the code points, while Unicode also works on issues like collation, normalization of forms, and the bidirectional algorithm for right-to-left scripts such as Arabic and Hebrew. In general, Unicode is more comprehensive than UCS (ISO 10646).

All Unicode versions since 2.0 are compatible: only new characters will be added, and no existing characters will be removed or renamed in the future.

  • Unicode 1 (1991-10)
  • Unicode 1.1 (1993-06). ISO/IEC 10646-1:1993. Formation of UCS-2.
  • Unicode 2 (1996-07). ISO/IEC 10646-1:1993.
  • Unicode 3 (1999-09). ISO/IEC 10646-1:2000.
  • Unicode 4 (2003-04). ISO/IEC 10646:2003. UCS-4.
  • Unicode 5 (2006-07)
  • Unicode 6 (2010-10). ISO/IEC 10646:2012.
  • Unicode 7 (2014-06)

As an aside, most of the code points are for complete characters (EG: A = U+0041) or pre-composed characters (EG: Ä = U+00C4). However, some of the code points are for combining characters, which work similarly to dead keys (See Shortcuts) or non-spacing accent keys on a typewriter. EG: Ä = A + a combining diaeresis = U+0041 U+0308. The combining diacritical marks are U+0300 to U+036F.

  • In Windows Character Map, first Select your character, then enter the hexadecimal Unicode code point in "Go to Unicode".
  • In HTML simply follow your character by the encoded accent as a numeric character reference. EG: œ&#769; results in: œ́.
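
The difference between the pre-composed and the combining form of Ä can be checked with Python's standard unicodedata module. This is a small sketch; the variable names are only illustrative.

```python
import unicodedata

precomposed = "\u00C4"   # Ä as a single pre-composed code point
combined = "A\u0308"     # A followed by a combining diaeresis

# The two strings look alike but are different code point sequences.
print(precomposed == combined)                                   # False

# NFC composes where possible, NFD decomposes.
print(unicodedata.normalize("NFC", combined) == precomposed)     # True
print(unicodedata.normalize("NFD", precomposed) == combined)     # True

# Combining marks report a nonzero canonical combining class.
print(unicodedata.combining("\u0308") != 0)                      # True
```

Normalizing both sides to the same form before comparing strings avoids surprises when text arrives from different sources.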

Code Blocks Of Code Points

UCS-2 (Universal Character Set in 2 octets; covers the BMP, aka Plane 0) uses 16 bits (x0 to xFFFF = 0 to 65,535) for the code points.

UCS-4 uses 31 bits (x0 to x7FFFFFFF = 0 to 2,147,483,647) for code points but is not as popular. UCS-2 is part of ISO 10646.

Each Unicode (UCS-2) code point is usually referenced by its hexadecimal value, and is put in this format: U+4DigitHexValue. EG: U+00A5 is the code point for the Yen symbol (¥). Related code elements form groups called scripts. The number space set aside for a script is called its code block. Code blocks usually start at some nice round hexadecimal number. EG:

  • U+0000 through U+007F is known as Basic Latin. It fits in the first 7 bits of a byte and is identical to ASCII.
  • U+0080 through U+00FF is the Latin-1 Supplement. It uses the 8th bit and holds the remainder of the ISO Latin 1 character set.
  • U+0100 through U+017F is the Latin Extended-A.
  • U+0180 through U+024F is the Latin Extended-B.
  • U+0250 through U+02AF is the IPA Extensions.
  • etc.
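
As a rough illustration, the U+ notation and the code blocks listed above can be modeled with a small lookup table. The table below holds only the five blocks from the list, and the helper names u_notation and block_of are invented for this sketch.

```python
# Partial block table: only the five blocks named in the list above.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
]

def u_notation(char: str) -> str:
    """Format a character's code point as U+ followed by 4+ hex digits."""
    return f"U+{ord(char):04X}"

def block_of(char: str) -> str:
    """Return the block name for char, if it is in the partial table."""
    cp = ord(char)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return "not in this partial table"

yen = "\u00A5"  # the Yen symbol
print(u_notation(yen), block_of(yen))   # U+00A5 Latin-1 Supplement
```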

Here is approximately how the two bytes are distributed:

General Scripts
|       Symbols
|       |   CJK Auxiliary                          Compatibility
|       |   |   CJK Ideographs                 Private Use     |
|       |   |   |                                         |    |
V       V   V   V                                         V    V
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|***|*  |** |***|***|***|***|***|***|   |   |   |   |   | ##|##*|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
00  10  20  30  40  50  60  70  80  90  A0  B0  C0  D0  E0  F0  FE

+---+             +---+           +---+
|***| Allocated   |   | Reserved  |###| Private Use
+---+             +---+           +---+

Unicode Encoding

To review character set terminology:

  • code elements are real world characters
  • code points are numbers that map to code elements
  • encoding schemes are different systems of representing code points as bits

Unicode has additional complexity:

  • Some real world characters are made up of combinations of code elements. EG: The German umlaut character Ä ("Latin capital letter A with diaeresis") can either be represented by the precomposed UCS code U+00C4, or alternatively by the combination of a normal "Latin capital letter A" followed by a "combining diaeresis": U+0041 U+0308. This produces more complicated encoding schemes.
  • 16 bit (let alone 31 bit) character sets are difficult to use when many systems assume 7 or 8 bits are used. This led to the development of various encoding schemes called UTFs (UCS/Unicode Transformation Formats). A UTF is an algorithmic mapping from every Unicode scalar value to a unique byte sequence.

A BOM (Byte Order Mark) may be inserted at the beginning of data streams. The Unicode character U+FEFF, aka ZWNBSP (Zero Width No-Break Space), is used as the BOM. Simply encode the character in the appropriate encoding scheme. When the FE and FF bytes are read as ISO Latin 1 or windows-1252 instead, they are displayed as þ and ÿ, respectively. There are several reasons for using a BOM:

  1. ZWNBSP is very rare in non-Unicode files, so its mere presence indicates a Unicode file.
  2. To indicate the byte order of the file (big-endian or little-endian).
  3. To indicate the encoding scheme of the Unicode file.
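
The three uses above can be sketched as a small BOM sniffer. This is a minimal example, not a full encoding detector: sniff_bom is a made-up name, and the BOM constants come from Python's standard codecs module.

```python
import codecs

# Check longer signatures first: the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom("\ufeffhi".encode("utf-8")))       # UTF-8
print(sniff_bom("\ufeffhi".encode("utf-16-be")))   # UTF-16BE
print(sniff_bom(b"plain ASCII"))                   # None
```

One caveat: UTF-16LE data that starts with a BOM followed by U+0000 is indistinguishable from a UTF-32LE BOM, so a sniffer like this is a heuristic.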

Unicode has the following encoding schemes:

  • SCSU. Standard Compression Scheme for Unicode.
  • UTF-1. This was fiddled around with for a bit, but it has been dropped from ISO 10646.
  • UTF-2. Obsolete name for UTF-8.
  • UTF-7. Aka: csUnicode11UTF7; unicode-1-1-utf-7; x-unicode-2-0-utf-7; 65000; RFC-2152. Encodes UCS-2 with octets using only 7 bits out of a byte, i.e. first bit of every byte is always 0. The optional BOM is 2B 2F 76 38 2D.
  • UTF-8. Aka: unicode-1-1-utf-8; unicode-2-0-utf-8; x-unicode-2-0-utf-8; 65001.
    • Encodes UCS-4 with 1-6 octets using 8 bits out of each octet. If a document consists of mostly ASCII characters then a UTF-8 encoded version of it will be around the size of an ASCII encoded version of it. This has become a popular character set for the web.
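    • UTF-8 is defined in ISO 10646-1:2000 Annex D and also described in RFC 2279 (since obsoleted by RFC 3629, which restricts UTF-8 to 1-4 octets, enough for the Unicode codespace up to U+10FFFF) as well as section 3.8 of the Unicode 3.0 standard.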
    • UTF-8 has the following properties:
      • UCS characters U+0000 to U+007F are ASCII characters and will be encoded as a single byte with a value between 0x00 and 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
      • All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
        • The first byte of a multi-byte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow in the sequence.
        • Each additional byte in a multi-byte sequence is in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
      • The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
      • The sorting order of big-endian UCS-4 byte strings is preserved.
      • The optional BOM is EF BB BF. A BOM is optional with UTF-8 because UTF-8 has a fixed byte order. When displayed on the system viewing this page, the EF BB BF bytes are displayed as ï » ¿.
      • NOTE: ISO Latin 1 code points that are non-ASCII are represented with 2 bytes using UTF-8, as opposed to the 1 byte used by iso-8859-1 or win1252 encoding.
    • The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character. Note that the first byte indicates how many bytes are used to represent the character.
      UCS-4 Code Points       | UTF-8 Bytes/Octets                                    | # of Free Bits   | # Code Points Expressible | Note
      U-00000000 - U-0000007F | 0xxxxxxx                                              | 7                | 2^7 = 128                 | ASCII
      U-00000080 - U-000007FF | 110xxxxx 10xxxxxx                                     | 5+6 = 11         | 2^11 = 2,048              | ISO Latin 1 and more
      U-00000800 - U-0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx                            | 4+6+6 = 16       | 2^16 = 65,536             | Max of 3 bytes covers UCS-2
      U-00010000 - U-001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx                   | 3+6+6+6 = 21     | 2^21 = 2,097,152          |
      U-00200000 - U-03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx          | 2+6+6+6+6 = 26   | 2^26 = 67,108,864         |
      U-04000000 - U-7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | 1+6+6+6+6+6 = 31 | 2^31 = 2,147,483,648      | Max of 6 bytes covers UCS-4
  • UTF-16. Encodes UCS-2 and a subset of UCS-4 by using 2 or 4 bytes per character. Characters in the BMP take 2 bytes. Surrogate pairs encode the characters beyond the BMP by designating 1,024 code points (xD800-xDBFF) as the high (first) half of a pair which can be combined with 1,024 code points (xDC00-xDFFF), the low (second) half. Whereas before you had 2^16 = 65,536 possibilities, now you have (2^16)-(1,024*2)+(1,024^2) = 1,112,064 possibilities! UTF-16 and its variations are used by Java and Windows.
    • UTF-16BE. This is what Microsoft Notepad refers to as "Unicode Big-Endian". Microsoft FrontPage sets this as charset=unicodeFFFE (opposite of the actual BOM!). The required BOM is FE FF (þ ÿ).
    • UTF-16LE. This is what Microsoft Notepad refers to as "Unicode". Microsoft FrontPage sets this as charset=unicode. The required BOM is FF FE (ÿ þ).
  • UTF-32. Encodes UCS-4 by using 4 bytes per character. UTF-32 and its variations are used by some variants of Unix.
    • UTF-32BE. The required BOM is 00 00 FE FF.
    • UTF-32LE. The required BOM is FF FE 00 00.
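
The bit patterns in the UTF-8 table above can be turned into a small hand-rolled encoder. This sketch assumes the modern limit of U+10FFFF, so it covers only the 1- to 4-byte rows and skips the 5- and 6-byte rows. The name utf8_encode is made up for the example; in practice Python's built-in str.encode("utf-8") does this job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the UTF-8 bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x41).hex())   # 41
print(utf8_encode(0xC4).hex())   # c384
```

The second line shows the NOTE above in action: Ä is one byte (C4) in ISO Latin 1 but two bytes in UTF-8.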

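The surrogate pair arithmetic described for UTF-16 above can be sketched the same way. The helper names to_surrogates and from_surrogates are assumptions for this example only.

```python
def to_surrogates(cp: int):
    """Split a code point beyond the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need surrogates")
    offset = cp - 0x10000               # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")          # D83D DE00
print(2**16 - 2 * 1024 + 1024**2)       # 1112064
```

The second print repeats the count from the UTF-16 entry: 65,536 minus the 2,048 surrogate code points plus the 1,048,576 pairs.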

GeorgeHernandez.com. Some rights reserved.