Intro

Unicode is the best and hopefully last character set.

ASCII uses 7 bits (usually stored in an 8-bit byte) to encode 128 characters (x00-x7F) and was fine for basic English. Character sets that fully utilize the 8th bit can encode up to 256 characters (adding x80-xFF), which is sufficient for many European languages. However, it can be onerous in a multi-language scenario to deal with multiple character sets. This is where Unicode comes in.
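To see why juggling several 8-bit character sets is onerous, here is a minimal Python sketch that decodes one byte under three different sets. The three codecs (latin-1, cp1251, iso8859-7) are sample character sets picked for illustration; they are not named in the text above.

```python
# The same byte value means a different character in each 8-bit character set.
raw = b"\xe4"

as_latin1 = raw.decode("latin-1")    # Western European
as_cp1251 = raw.decode("cp1251")     # Cyrillic (Windows)
as_greek = raw.decode("iso8859-7")   # Greek

for name, ch in [("latin-1", as_latin1), ("cp1251", as_cp1251), ("iso8859-7", as_greek)]:
    # Show the byte, the character it becomes, and that character's Unicode code point.
    print(f"{name:10} 0x{raw[0]:02X} -> {ch} (U+{ord(ch):04X})")
```

Without knowing which character set a file was written in, the byte xE4 is ambiguous. Each of the three results has its own distinct Unicode code point.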

Unicode is an MBCS (Multi-Byte Character Set) that uses 2-4 bytes worth of possible code points. Unicode is developed by the Unicode Consortium (unicode.org). Unicode is the sensible international and cross-platform character set.

ISO 10646 defines the UCS (Universal Character Set) used by Unicode. UCS was designed to be a superset of all other character sets. ISO 10646-1 was first published in 1993. ISO 10646-2 was published in 2001 and added characters outside of the BMP (Basic Multilingual Plane).

Unicode 1.1 corresponds to ISO 10646-1:1993, Unicode 3.0 corresponds to ISO 10646-1:2000, and Unicode 3.2 adds ISO 10646-2:2001. All Unicode versions since 2.0 are compatible: only new characters will be added, and no existing characters will be removed or renamed in the future. In general, Unicode is more comprehensive than ISO 10646.

As an aside, most of the code points are for complete characters (EG: A = U+0041) or pre-composed characters (EG: Ä = U+00C4). However, some of the code points are for combining characters, which work similarly to dead keys (See Shortcuts) or non-spacing accent keys on a typewriter. EG: Ä = A + a combining diaeresis = U+0041 U+0308.
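The two spellings of Ä can be compared directly. The following is a minimal Python sketch using the standard unicodedata module. Normalization (the NFC and NFD forms) is not covered in the text above and is brought in here only to show how the two spellings relate.

```python
import unicodedata

precomposed = "\u00C4"   # Ä as a single pre-composed code point
combining = "A\u0308"    # A followed by a combining diaeresis

# The two strings look alike but are different code point sequences.
print(precomposed == combining)           # False
print(len(precomposed), len(combining))   # 1 2

# NFC composes where a pre-composed form exists; NFD decomposes.
composed = unicodedata.normalize("NFC", combining)
decomposed = unicodedata.normalize("NFD", precomposed)
print(composed == precomposed)    # True
print(decomposed == combining)    # True
```

Comparing or searching text usually works better after normalizing both sides to the same form.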

Microsoft started supporting Unicode with Windows 2000 and SQL Server 7. Microsoft defaults to UCS-2 (in practice, UTF-16LE). UTF-8 is probably the best Unicode encoding scheme for Linux and Unix.
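As a rough illustration of how the two encoding schemes differ on the wire, this Python sketch encodes the same characters both ways. The sample characters are arbitrary choices.

```python
# Encode the same characters as UTF-16LE and as UTF-8 and compare the bytes.
samples = ["A", "\u00C4"]   # plain ASCII letter, and Ä

encoded = {}
for ch in samples:
    utf16le = ch.encode("utf-16-le")
    utf8 = ch.encode("utf-8")
    encoded[ch] = (utf16le, utf8)
    print(f"U+{ord(ch):04X}  UTF-16LE: {utf16le.hex(' ')}  UTF-8: {utf8.hex(' ')}")
```

In UTF-8 the ASCII letter stays a single byte with its ASCII value, which is one reason the scheme fits well with byte-oriented Unix tools. UTF-16LE spends two bytes on each of these characters, low byte first.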

Code Blocks Of Code Points

UCS-2 (Universal Character Set, aka BMP or Plane 0) uses 16 bits (x0 to xFFFF = 0 to 65,535) for the code points. UCS-2 was formed when Unicode Standard 1.1 and ISO 10646-1 merged together.

UCS-4 uses 31 bits (x0 to x7FFFFFFF = 0 to 2,147,483,647) for code points but is not as popular. UCS-2 is part of ISO 10646-2.

Each Unicode (UCS-2) code point is usually referenced by its hexadecimal value in this format: U+4DigitHexValue. EG: U+00A5 is the code point for the Yen symbol (¥). Related code elements form groups called scripts. The number space set aside for a script is called its code block. Code blocks usually start at some nice round hexadecimal number. EG: the Cyrillic block starts at U+0400.
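The U+ notation is easy to produce and parse programmatically. Below is a minimal Python sketch; the helper names u_notation and from_u_notation are hypothetical, made up for this illustration.

```python
def u_notation(ch: str) -> str:
    """Format one character as U+ followed by at least four hex digits."""
    return f"U+{ord(ch):04X}"


def from_u_notation(label: str) -> str:
    """Turn a label such as 'U+00A5' back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a U+ label: {label!r}")
    return chr(int(label[2:], 16))


print(u_notation("\u00A5"))        # U+00A5
print(from_u_notation("U+00A5"))   # ¥
```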

Here is approximately how the two bytes are distributed:

General Scripts
|       Symbols
|       |   CJK Auxiliary                          Compatibility
|       |   |   CJK Ideographs                 Private Use     |
|       |   |   |                                         |    |
V       V   V   V                                         V    V
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|***|*  |** |***|***|***|***|***|***|   |   |   |   |   | ##|##*|
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
00  10  20  30  40  50  60  70  80  90  A0  B0  C0  D0  E0  F0  FE

+---+             +---+           +---+
|***| Allocated   |   | Reserved  |###| Private Use
+---+             +---+           +---+

Unicode Encoding

To review character set terminology:

Unicode has additional complexity:

A BOM (Byte Order Mark) may be inserted at the beginning of data streams. The Unicode character U+FEFF, aka ZWNBSP (Zero Width Non-Breaking SPace), is used as the BOM. To produce the BOM bytes, simply encode this character in the appropriate encoding scheme. When viewed in an 8-bit character set such as ISO 8859-1, the FE and FF bytes are displayed as þ and ÿ, respectively. There are several reasons for using a BOM:

  1. The encoded ZWNBSP is very rare in non-Unicode files, so its mere presence indicates a Unicode file.
  2. To indicate the byte order of the file (big-endian or little-endian).
  3. To indicate the encoding scheme of the Unicode file.
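The three uses listed above can be sketched in Python. This is an illustration under stated assumptions, not a definitive detector: the function name sniff_bom is hypothetical, and the UTF-32 entries are an assumption since UTF-32 is not discussed above. The BOM constants come from the standard codecs module.

```python
import codecs

# Encoding U+FEFF in a given scheme yields that scheme's BOM bytes.
bom = "\ufeff"
print(bom.encode("utf-8").hex(" "))       # ef bb bf
print(bom.encode("utf-16-be").hex(" "))   # fe ff
print(bom.encode("utf-16-le").hex(" "))   # ff fe

# Longer BOMs are checked first, because the UTF-32LE BOM
# begins with the same two bytes as the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom_bytes, name in _BOMS:
        if data.startswith(bom_bytes):
            return name, len(bom_bytes)
    return None, 0


encoding, skip = sniff_bom(b"\xff\xfeA\x00")
print(encoding, skip)   # utf-16-le 2
```

A caller would decode data[skip:] with the returned encoding. Data with no BOM still needs another way to determine its encoding.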

Unicode has the following encoding schemes:

Page Modified: (Hand noted: 2007-07-28 02:27:45Z) (Auto noted: 2010-06-22 15:05:29Z)