Intro

Character sets are standards that map a specific set of real-world characters (code elements) to specific numbers (code points); a character encoding (or encoding scheme) then stores those code points as bytes. Older character sets corresponded to the characters commonly used in a specific language. Unicode is the modern character set, which tries to encompass most languages. A collation is a set of rules for comparing (esp. sorting) characters in a character set or database. EG:

  • The Yen code element is ¥.
  • The Yen's code point in different code point systems:
    • ASCII: out of range.
    • Windows ANSI 1252: 165 = 0xA5.
    • ISO Latin-1 (iso-8859-1): 165 = 0xA5.
    • Unicode/UCS-2: 165 = 0xA5 = U+00A5.
    • Unicode/UCS-4: 165 = 0xA5 = U+000000A5.
  • The Yen is character encoded this way in different encoding schemes:
    • ASCII: out of range.
    • Windows ANSI 1252: 0xA5 = 10100101.
    • ISO Latin-1 (iso-8859-1): 0xA5 = 10100101.
    • UTF-8: 11000010 10100101. UTF-8 rocks in Linux.
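    • UTF-16 (big-endian byte order): 00000000 10100101. Microsoft Windows uses UTF-16LE, which stores the same two bytes in the opposite order.
    • UTF-32 (big-endian byte order): 00000000 00000000 00000000 10100101.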
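The byte patterns in the list above can be checked with a short Python sketch. The helper names (`bits`, `encoded`) are made up for this illustration, and the codec names are the ones Python happens to use for these encodings.

```python
# Encode the yen sign (U+00A5) with several codecs and show the resulting bytes.
YEN = "\u00a5"  # the yen sign


def bits(data: bytes) -> str:
    """Return the bytes as space-separated groups of 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)


# Python codec names for the encodings discussed above.
ENCODINGS = ["cp1252", "latin-1", "utf-8", "utf-16-be", "utf-16-le", "utf-32-be"]

encoded = {name: YEN.encode(name) for name in ENCODINGS}

for name, data in encoded.items():
    print(f"{name:10} {bits(data)}")

# ASCII has no code point for the yen sign, so encoding fails.
try:
    YEN.encode("ascii")
except UnicodeEncodeError:
    print("ascii      out of range")
```

Note how utf-16-be and utf-16-le contain the same two bytes in opposite order.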

Note that a user-perceived character (an abstract character) might actually take multiple code points to represent. EG: The abstract character ǵ may be represented with 1 code point (U+01F5) or 2 code points (U+0067, U+0301).
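As a minimal sketch of the ǵ example, Python's standard `unicodedata` module can convert between the 1 code point and 2 code point forms. The variable names are only for illustration.

```python
import unicodedata

composed = "\u01f5"      # g with acute as a single code point
decomposed = "g\u0301"   # g (U+0067) followed by combining acute accent (U+0301)

# The two strings look the same on screen but are not equal code point by code point.
print(composed == decomposed, len(composed), len(decomposed))

# Normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # decompose fully

print([f"U+{ord(c):04X}" for c in nfc])
print([f"U+{ord(c):04X}" for c in nfd])
```

Comparing strings after normalizing both to the same form avoids treating the two spellings as different text.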

A document consists of binary data. A document can explicitly state the character set for itself or parts of itself (see Web Character Sets), but a viewer, like a browser, can also "override" that and decode the digital data using its own choice of character encoding. EG: These two bytes (xCE xB1) will be displayed differently by different browsers depending upon the "Character Encoding" chosen for the page by the viewer. In Firefox, go to the menu for View > Character Encoding. In Internet Explorer, go to the menu for View > Encoding.

  • Digital data:
    • 1100111010110001. In binary.
    • 206 177. In decimal.
    • CE B1. In hexadecimal.
  • Displayed as:
    • 캱. The raw hexadecimal attempted as a single HTML NCR (Numeric Character Reference), &#xCEB1;, which is the Hangul syllable U+CEB1.
    • Capital I with circumflex followed by a plus-minus sign (Î±) in Western European (ISO; iso-8859-1; Windows; windows-1252).
    • Greek alpha (α) in Multilingual (utf-8).
    • Chinese qia4 in Traditional Chinese (big5).
    • Chinese wei3 in Simplified Chinese (gb2312).
    • Japanese ryuu in Japanese (euc-jp).
    • Korean gwan in Korean (euc-kr).
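The same experiment can be run without a browser. The sketch below decodes the two bytes with several Python codecs; the codec names are Python's spellings, assumed here to correspond to the browser menu entries above, and `errors="replace"` is used so that an unmapped byte pair would show up as a replacement character instead of raising an exception.

```python
RAW = bytes([0xCE, 0xB1])  # the two bytes from the example above

# Python codec names, assumed to match the browser encodings listed above.
CODECS = ["latin-1", "cp1252", "utf-8", "big5", "gb2312", "euc_jp", "euc_kr"]

decoded = {name: RAW.decode(name, errors="replace") for name in CODECS}

for name, text in decoded.items():
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{name:8} {text}  ({points})")
```

The bytes never change; only the table used to interpret them does.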

Once a set of bits has been mapped to a code point, that code point can be mapped to its code element character, and then software can present that character and apply different fonts, styles, and sizes. EG:

In the ASCII character map, the code element J has a code point of 74 (x4A). This code point is encoded as the binary number 1001010 (which is x4A in hexadecimal). The binary code can then be interpreted by programs such as browsers and presented to users:

Formatting Applied Example
Font Comic J
Font Wingdings J (normally a smiley face)
Style Underlined J
Style Italic J
Size HTML Size 6 J
Size HTML Size 1 J
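The first half of the J example (code element to code point to bits) can be reproduced with a few lines of Python. This is only a sketch of the lookup; fonts, styles, and sizes are applied later by whatever program displays the character.

```python
char = "J"

code_point = ord(char)             # code element -> code point
as_hex = f"{code_point:x}"         # code point in hexadecimal
as_binary = f"{code_point:b}"      # code point in binary (7 significant bits)
as_bytes = char.encode("ascii")    # code point -> encoded byte

print(code_point, as_hex, as_binary, as_bytes)

# Going the other way: bits -> code point -> code element.
round_trip = chr(int(as_binary, 2))
print(round_trip)
```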

You may also want to see my articles on Typography.

Choosing a character set is important, especially if you have to deal with international code, multiple platforms, or databases. EG:

When SQL Server is installed, a Sort Order ID must be set that is based on a character code. If you try to restore a database to that installation, then it must have the same Sort Order ID. Otherwise you may have to rebuild the database, using something like rebuildm.exe.

Major Character Sets

Here are some of the major character sets:

  • ASCII (American Standard Code for Information Interchange, aka Standard ASCII; plain text, ISO-646). A SBCS (Single Byte Character Set) that uses 7 bits of a byte (x0 to x7F = 0 to 127) to make 128 code points (since 2^7 = 128). The 8th bit was used for parity checking or for anything an application wanted to do with it. Used by Linux/Unix.
  • High ASCII are SBCSs that use the 8th bit to encode additional characters above the ASCII range. They use all 8 bits of a byte (0-255; x0-FF) to add 128 more code points (since 2^8 = 256). There are different character sets (aka code pages) for different uses, languages, or language groups. Many of the character sets are supersets of ASCII.
    • OEM character sets (Original Equipment Manufacturer). The 8th bit often contained characters for line drawings as a carryover from pre-GUI (Graphical User Interface) days. Used by DOS, OS/2, floppy disks, and the FAT system (File Allocation Table).
    • ISO character sets (International Organization for Standardization). There are several ISO character sets but ISO Latin 1 (iso-8859-1) is probably the most prevalent.
    • ANSI character sets (American National Standards Institute). An ANSI character set refers to ASCII and various code pages that Microsoft submitted to ANSI/ISO but that were not approved. So it covers pairs like (Windows-1252 vs. ISO-8859-1) or (Windows-1250 vs. ISO-8859-2). Used by the Windows OS.
  • Unicode, aka UCS (Universal Character Set). A MBCS (Multi-Byte Character Set) with 2 to 4 bytes' worth of possible code points. The code points may be encoded using 1 to 6 bytes per character, depending on the encoding scheme (modern UTF-8 uses at most 4). Unicode is the sensible international and cross-platform character set. Used by Windows NT/2000 and Linux/Unix.
    • Windows Glyph List 4 (WGL4) [W]. Aka the Pan-European character set. The WGL4 are 652 Unicode characters that are guaranteed to display on all major platforms of Microsoft Windows. It consists primarily of code pages 1252, 1250, 1251, 1253, 1254, 1257, and 437.
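To illustrate the variable number of bytes per character, here is a small sketch that measures how many bytes a few sample code points take in UTF-8 and UTF-16. The sample characters are arbitrary picks for this example.

```python
# Sample code points chosen to land in each UTF-8 length class.
SAMPLES = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "YEN SIGN": "\u00a5",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001f600",
}

utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in SAMPLES.items()}
utf16_lengths = {name: len(ch.encode("utf-16-le")) for name, ch in SAMPLES.items()}

for name, ch in SAMPLES.items():
    print(f"U+{ord(ch):06X} {name:24} utf-8: {utf8_lengths[name]}  utf-16: {utf16_lengths[name]}")
```

ASCII characters stay at 1 byte in UTF-8, which is why plain ASCII files are already valid UTF-8.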

Here is a summary table of the major character sets:

Character Set Bits Max Decimal Max Hexadecimal
ASCII 7/8 127 7F
High ASCII 8 255 FF
Unicode, UCS-2 16 65,535 FFFF
UCS-4 31 2,147,483,647 7FFF FFFF
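The arithmetic behind the table is simple: n bits give 2^n code points, numbered from 0 up to 2^n - 1. A short sketch, with the bit widths taken from the table and the function name `code_space` invented for the example:

```python
# Bit widths from the summary table.
BIT_WIDTHS = {"ASCII": 7, "High ASCII": 8, "UCS-2": 16, "UCS-4": 31}


def code_space(bits: int) -> tuple[int, int]:
    """Return (number of code points, highest code point) for a bit width."""
    count = 2 ** bits
    return count, count - 1


for name, bits in BIT_WIDTHS.items():
    count, highest = code_space(bits)
    print(f"{name:10} {bits:2} bits  {count:>13,} code points  highest = {highest:,} = {highest:X}")
```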

Here are some languages and the ISO and Windows character sets they use. (And all of them can use Unicode.)

Language charset
Afrikaans (af) iso-8859-1, windows-1252
Albanian (sq) iso-8859-1, windows-1252
Arabic (ar) iso-8859-6, windows-1256
Basque (eu) iso-8859-1, windows-1252
Bulgarian (bg) iso-8859-5
Byelorussian (be) iso-8859-5
Catalan (ca) iso-8859-1, windows-1252
Croatian (hr) iso-8859-2, windows-1250
Czech (cs) iso-8859-2, windows-1250
Danish (da) iso-8859-1, windows-1252
Dutch (nl) iso-8859-1, windows-1252
English (en) iso-8859-1, windows-1252
Esperanto (eo) iso-8859-3
Estonian (et) iso-8859-15
Faroese (fo) iso-8859-1, windows-1252
Finnish (fi) iso-8859-1, windows-1252
French (fr) iso-8859-1, windows-1252
Galician (gl) iso-8859-1, windows-1252
German (de) iso-8859-1, windows-1252 or 1250
Greek (el) iso-8859-7
Hebrew (iw) iso-8859-8
Hungarian (hu) iso-8859-2, windows-1250
Icelandic (is) iso-8859-1, windows-1252
Inuit (Eskimo) languages iso-8859-10
Irish (ga) iso-8859-1, windows-1252
Italian (it) iso-8859-1, windows-1252
Japanese (ja) shift_jis, iso-2022-jp, euc-jp
Latvian (lv) iso-8859-13, windows-1257
Lithuanian (lt) iso-8859-13, windows-1257
Macedonian (mk) iso-8859-5
Maltese (mt) iso-8859-3*
Norwegian (no) iso-8859-1, windows-1252
Polish (pl) iso-8859-2, windows-1250
Portuguese (pt) iso-8859-1, windows-1252
Romanian (ro) iso-8859-2, windows-1250
Russian (ru) koi8-r, iso-8859-5
Scottish Gaelic (gd) iso-8859-1, windows-1252
Serbian (sr) iso-8859-5
Slovak (sk) iso-8859-2, windows-1250
Slovenian (sl) iso-8859-2, windows-1250
Spanish (es) iso-8859-1, windows-1252
Swedish (sv) iso-8859-1, windows-1252
Turkish (tr) iso-8859-9, windows-1254
Ukrainian (uk) iso-8859-5

Different companies refer to a character set by an identifier, often called a code page number. Here are some of the available code page identifiers used by Windows.

Identifier Meaning OEM/ANSI Comment
037 EBCDIC   Used in mainframes, esp. IBM.
437 MS-DOS United States OEM IBM DOS and OS/2. IBM PC Extended Character Set; Extended ASCII; High ASCII; 437 U.S. English.
500 EBCDIC "500V1"    
708 Arabic (ASMO 708) OEM  
709 Arabic (ASMO 449+, BCON V4) OEM  
710 Arabic (Transparent Arabic) OEM  
720 Arabic (Transparent ASMO) OEM  
737 Greek (formerly 437G) OEM  
775 Baltic OEM  
850 MS-DOS Multilingual (Latin I) OEM Standard MS DOS. 850 Multilingual.
852 MS-DOS Slavic (Latin II) OEM  
855 IBM Cyrillic/Russian OEM  
857 IBM Turkish OEM  
858 Multilingual. Like 850 but with the Euro symbol OEM  
860 MS-DOS Portuguese OEM  
861 MS-DOS Icelandic OEM  
862 Hebrew OEM  
863 MS-DOS Canadian-French OEM  
864 Arabic OEM  
865 MS-DOS Nordic OEM  
866 MS-DOS Cyrillic/Russian OEM  
869 IBM Modern Greek OEM  
874 Thai OEM/ANSI  
875 EBCDIC    
932 Japanese OEM/ANSI Double byte.
936 GBK, Chinese (PRC, Singapore; Simplified) OEM/ANSI Double byte.
949 Korean OEM/ANSI Double byte.
950 Chinese (Taiwan; Hong Kong SAR, PRC; Traditional) OEM/ANSI Double byte. Most common variant: Chinese Traditional (Big5); big5; cn-big5; csbig5; x-x-big5;
1026 EBCDIC    
1200 Unicode (BMP of ISO 10646); UCS-2LE; Unicode little-endian ANSI Windows NT/2000 and HTML. ISO-10646; UCS.
1201 UCS-2BE; Unicode big-endian ANSI  
1250 Windows 3.1 Eastern European ANSI  
1251 Windows 3.1 Cyrillic ANSI  
1252 Windows 3.1 US (ANSI) ANSI Windows 3.x/9x, Macs, and HTML.
"ANSI" comes in two versions (the difference is found at decimal 128-159 (hexadecimal 80-9F)):
  • Windows ANSI. Western European (Windows); windows-1252; US/Western European; Western.
  • ISO Latin 1 ANSI. Western European (ISO); iso-8859-1; ANSI_X3.4-1968; ANSI_X3.4-1986; ascii; cp367; cp819; csASCII; IBM367; ibm819; iso-ir-100; iso-ir-6; ISO646-US; iso8859-1; ISO_646.irv:1991; iso_8859-1; iso_8859-1:1987; latin1; us; us-ascii; x-ansi; iso-latin-1.
1253 Windows 3.1 Greek ANSI  
1254 Windows 3.1 Turkish ANSI  
1255 Hebrew ANSI  
1256 Arabic ANSI  
1257 Baltic ANSI  
1258 Vietnamese    
1361 Korean (Johab) OEM  
10000 Macintosh Roman   Used by classic Mac OS. Covers most of the iso-8859-1 characters, but everything after the ASCII range is in a different order. x-mac-roman.
10001 Macintosh Japanese    
10006 Macintosh Greek I    
10007 Macintosh Cyrillic    
10029 Macintosh Latin 2, Central European    
10079 Macintosh Icelandic    
10081 Macintosh Turkish   x-mac-turkish.
65000 UTF-7 Unicode   utf-7; csUnicode11UTF7, unicode-1-1-utf-7, x-unicode-2-0-utf-7
65001 UTF-8 Unicode   The best! utf-8, unicode-1-1-utf-8, unicode-2-0-utf-8, x-unicode-2-0-utf-8.

Note that an alias in bold is the preferred charset ID for the HTML <meta> tag:

<meta http-equiv="Content-Type" content="text/html; charset=characterSet">

<!-- Used to explicitly state the character set used.
    Examples of characterSet include windows-1252, iso-8859-1, and utf-8. -->

EOLs and EOFs

Systems save EOLs (End Of Lines) and EOFs (End Of Files) differently.

  • Systems like Linux, Unix-like, Multics, BeOS, Amiga. The "correct" way.
    • EOL: LF. Aka Line Feed; ^j; NL; New Line; \n; 0x0A; Chr(10), U+000A.
    • EOF: No particular character.
  • Systems like Mac OS (up to 9), Commodore, TRS-80, Apple II, OS-9.
    • EOL: CR. Aka Carriage Return; ^m; \r; 0x0D; Chr(13), U+000D.
    • EOF: No particular character.
  • Systems like Windows, MS-DOS, OS/2. A carryover from ye olde days when computer terminals were more physical than virtual and the extra characters were used to buy time for the machine to physically relocate to the next line.
    • EOL: CR and LF.
    • EOF: SUB. Aka Substitute; ^z; Chr(26).
  • Other line terminator-like characters:
    • FF: Form Feed, U+000C, \f, New Page, NP, 0x0C, ^l.
    • NEL: Next Line, U+0085.
    • LS: Line Separator, U+2028.
    • PS: Paragraph Separator, U+2029.
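When text moves between these systems, the usual fix is to normalize the line endings. Here is a small sketch; `normalize_eol` is a name invented for the example, and it only handles the three EOL conventions listed above. Python's built-in `str.splitlines` also recognizes the other terminator-like characters (FF, NEL, LS, PS).

```python
def normalize_eol(text: str) -> str:
    """Convert CR LF (Windows) and bare CR (old Mac OS) line endings to LF."""
    # Replace CR LF first so the pair does not turn into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "unix\nwindows\r\nold mac\rend"
print(repr(normalize_eol(mixed)))

# str.splitlines splits on all three conventions without normalizing first.
print(mixed.splitlines())

# It also splits on FF (U+000C), NEL (U+0085), LS (U+2028), and PS (U+2029).
others = "a\x0cb\x85c\u2028d\u2029e"
print(others.splitlines())
```

When reading binary files, watch out for a trailing Chr(26) left behind by old DOS tools, since nothing in the code above strips it.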


GeorgeHernandez.com. Some rights reserved.