This page explores how character sets pertain to web sites and how to administer and choose character sets for web pages. See also Storing HTML in Databases.

Character Set Terminology

Before discussing character sets as they pertain to web pages, a few terms must be clarified:

  • Character sets are standards that map specific real world characters (code elements) to specific numbers (code points) and then encode those code points as bytes (character encoding or encoding schemes). EG: ANSI and Unicode. The physical file of a page is encoded in bytes according to a character set. The HTML of a page can also specify which character set to use with a meta tag of this syntax in the <head> tag of the page: <meta http-equiv="Content-Type" content="text/html; charset=CharacterSet">. Note that emails would use something like content-type: text/plain; charset="utf-8".
  • Real world language. An actual language used by people when they write and talk to each other. EG: English, Spanish, and German.
  • Spelling language (aka page language in MS FrontPage parlance). Identifies a real world language used for spell checking (during development) and search engines (for viewing the page by users). The spelling language can be specified:
    • For the whole page with a meta tag of this syntax in the <head> tag of the HTML: <meta http-equiv="Content-Language" content="SpellingLanguage">.
    • For portions of a page by encasing a portion in a <span> tag with this syntax: <span lang="SpellingLanguage">.
  • Keyboard language. Identifies a real world language so keys on a keyboard can be correctly mapped to a character set. When an OS is installed, it detects the language specific keyboard and determines the corresponding keyboard language. However some keyboards work well with multiple real world languages, so it may be worth it to switch between different keyboard languages.
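The split between characters, code points, and bytes described above can be made concrete with a small Python sketch. The sample character (é) and the list of encodings are just illustrative choices, not anything prescribed by this page.

```python
# One character has one code point, but different encodings turn that
# code point into different byte sequences.
ch = "é"                       # illustrative sample character
code_point = ord(ch)           # the number assigned to the character

encodings = ["iso-8859-1", "windows-1252", "utf-8", "utf-16-le"]
encoded = {name: ch.encode(name) for name in encodings}

print(f"code point: {code_point} = U+{code_point:04X}")
for name, raw in encoded.items():
    print(f"{name:>12}: {raw.hex()}")
```

The same code point comes out as one byte in the two 8-bit encodings and as two bytes in utf-8 and utf-16, which is why a page has to declare which encoding its bytes use.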

Administering Character Sets

In a single-language situation, the language settings are set once and forgotten. However, in a multi-language situation, the settings may have to be customized as pages are made.

Language settings can be applied at several levels, each for its own reasons:

  • For an entire web site:
    • Decide which real world language will be displayed in message boxes returned from the server. These message boxes include error messages and search failures. EG: In MS FrontPage, this is accessed with the Tools menu > Web Settings option > Language tab > Server Message Language box.
    • Decide the character set used by default for encoding new pages. The files are saved with that encoding scheme and the Content-Type <meta> tag is entered in the HTML. EG: In MS FrontPage, this is accessed with the Tools menu > Web Settings option > Language tab > Default Page Encoding box.
      • If "None" is chosen, then new pages are written using the keyboard language.
      • If a keyboard language is specified, then FP will use it unless your keyboard is incompatible. You also have the option of making FP use your specified keyboard language regardless of the keyboard.
      • Note that FrontPage allows you to select "US/Western European". This encodes new pages as Windows ANSI and inserts <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> into the <head> tag of the new pages. Microsoft should give you the option to default new pages to ISO Latin 1 (iso-8859-1), but they only offer "US/Western European", which is actually windows-1252! At least they give you some Unicode options: Unicode (UTF-16LE), Unicode big endian (UTF-16BE), or UTF-8. This is useful for later versions of Windows, which can actually make pages encoded in Unicode. My personal opinion is that most pages should use utf-8.
    • Decide the spelling language used by default for new pages. EG: In MS FrontPage, this is accessed with Tools menu > Page Options selection > General tab > Default Spelling Language box. If "None" is selected then FrontPage spell checks new pages based on the keyboard language. It's a goofy feature because it does not automatically enter the Content-Language <meta> tag into the HTML.
  • For an individual page. Note that settings for individual pages override settings applied to the entire web site.
    • Decide the character set used by that page. EG: In MS FrontPage, this is accessed with page right-click, Page Properties selection, Language tab.
      • If you want to start saving a page in a new character set, select it in the Save The Document As box. This also enters the Content-Type <meta> tag in the HTML.
      • If FrontPage displays characters incorrectly for a given page, then FrontPage has assumed the page uses a different character set than its actual character set. Select the actual character set in the Reload The Document As box. You will have the option to cancel this.
    • Decide the spelling language used by that page. This is partially implemented by inserting the appropriate <meta> tag in the HTML but is also handled by your development tool. EG: In MS FrontPage, this is accessed with right-clicking the page > Page Properties selection > Language tab > Page Language section > Mark Document As. This feature will insert the Content-Language <meta> tag in the HTML. It will also (very sneakily) enter the GENERATOR and ProgId <meta> tags in the HTML.
  • For sections of text on an individual page.
    • Specify the spelling language used by portions of a page by encasing a portion in a <span> tag with this syntax: <span lang="SpellingLanguage">. EG: In MS FrontPage, this is done by selecting the text > Tools menu > Set Language option.

When a web document reaches a user, the browser sequentially checks the following to decide which character set it will use to translate the bytes received into characters:

  1. The HTTP content type returned by the server.
  2. The character set specified by the Content-Type <meta> tag in the <head> of the page.
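The two-step check above can be sketched in Python. This is only a rough illustration: the function name pick_charset, the regular expressions, the 1024-byte scan window, and the windows-1252 fallback are all assumptions made for this example, and real browsers apply more elaborate rules.

```python
import re

def pick_charset(http_content_type, html_bytes, fallback="windows-1252"):
    """Hypothetical helper: choose a charset the way the list above describes."""
    # Step 1: the charset parameter of the HTTP Content-Type header, if any.
    if http_content_type:
        m = re.search(r'charset\s*=\s*"?([\w.:-]+)', http_content_type, re.I)
        if m:
            return m.group(1).lower()
    # Step 2: a charset named in a <meta> tag near the top of the page.
    # The tag itself is plain ASCII, so a lossy ASCII decode is enough to find it.
    head = html_bytes[:1024].decode("ascii", errors="replace")
    m = re.search(r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    # Assumption: fall back to a default when neither source names a charset.
    return fallback

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=windows-1252"></head></html>')
from_header = pick_charset("text/html; charset=UTF-8", page)
from_meta = pick_charset("text/html", page)
print(from_header, from_meta)
```

When the server header and the <meta> tag disagree, the header wins in this sketch, matching the order of the list.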

Choosing Character Sets

When deciding on a character set, take the following into account:

  1. If a site or page is written predominantly in a particular real world language, choose the appropriate character set. EG: iso-8859-3 is an 8-bit character set that covers Esperanto.
  2. If a site or page has multiple languages with considerably different characters, choose Unicode (probably utf-8).
  3. If a page has a few characters that are not part of the character set for the current page, then each of those characters can be entered as its equivalent NCR (Numeric Character Reference) or CER (Character Entity Reference). CERs use symbolic names so that authors need not remember code points. EG: For the character a with a grave accent (à), the NCR is either &#224; or &#xE0;, while the CER is &agrave;.
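Point 3 can be tried out with Python's standard library. This is a minimal sketch, assuming a page saved as iso-8859-1 and a sample string chosen for illustration; the xmlcharrefreplace error handler turns any character the target character set lacks into a decimal NCR.

```python
import html

text = "naïve €"   # ï exists in iso-8859-1, the euro sign does not

# Characters missing from the target character set become NCRs.
page_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(page_bytes)

# A reader decodes the bytes, then resolves the references.
restored = html.unescape(page_bytes.decode("iso-8859-1"))

# All three spellings of a-grave from the example above resolve alike.
agrave_forms = [html.unescape(s) for s in ("&#224;", "&#xE0;", "&agrave;")]
print(restored, agrave_forms)
```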

One would think that defaulting to windows-1252 would be fine because it covers all the printable characters of iso-8859-1, adds more printable characters at code points 128-159 (x80-x9F), and you can use NCR or CER for Unicode characters. However, those extra characters are a trap. EG: Entering the euro sign on your system (€) will enter the byte for 128 (0x80). If the viewer has a different character set than windows-1252, then that byte may be interpreted as something altogether different.
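The euro sign problem is easy to reproduce. The sketch below decodes the same windows-1252 byte under a few other character sets; the particular alternatives (iso-8859-1, windows-1251, utf-8) are just examples picked for illustration.

```python
euro_byte = "€".encode("windows-1252")        # the single byte 0x80

# Same byte, different character sets, different results.
as_latin1 = euro_byte.decode("iso-8859-1")    # an invisible control character
as_cyrillic = euro_byte.decode("windows-1251")  # a Cyrillic letter

try:
    euro_byte.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False                           # 0x80 alone is not valid utf-8

print(euro_byte, repr(as_latin1), as_cyrillic, utf8_ok)
```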

One would also think that specifying iso-8859-1 as the character set would be fine. However the future is globalization and that means using a Unicode character set. A particular problem with this is that although Unicode is a super set of iso-8859-1 for code points, the character encoding may be different! EG: For the Yen character (¥), both iso-8859-1 and Unicode have the same code point of 165 = 0xA5. However iso-8859-1 encodes the Yen as a byte of 165 = 0xA5 = 10100101, while utf-8 would encode it as 11000010 10100101 = C2 A5. If a utf-8 reader came across the iso-8859-1 byte for the Yen, it could not parse it, because in utf-8 a byte of the form 10xxxxxx is only valid as a continuation of a multi-byte sequence.
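The Yen example can be checked directly in Python. This is a small sketch; the helper name bits is made up for this example.

```python
yen = "¥"
latin1 = yen.encode("iso-8859-1")   # one byte
utf8 = yen.encode("utf-8")          # two bytes

def bits(raw):
    """Hypothetical helper: show each byte as 8 binary digits."""
    return " ".join(f"{b:08b}" for b in raw)

print(ord(yen), bits(latin1), bits(utf8))

# A strict utf-8 reader rejects the lone iso-8859-1 byte...
try:
    latin1.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...and a lenient one substitutes U+FFFD, the replacement character.
lenient = latin1.decode("utf-8", errors="replace")
```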

My opinion is that all pages should use utf-8 as the character set.

  • Systems that make heavy usage of non-ASCII characters should enter non-ASCII characters (those greater than 127 = 0x7F which also includes the latter iso-8859-1 characters) as genuine utf-8 bytes.
  • Systems that make rare or light usage of non-ASCII characters should enter non-ASCII characters as NCR or CER (preferably the former). The neat thing about this solution is that your files will consist entirely of extremely reliable ASCII bytes!
  • utf-8 is a fabulous encoding choice for Unix/Linux because Unix/Linux files are largely ASCII.
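The "entirely ASCII bytes" approach from the second point can be sketched as follows. The sample text and the list of character sets are illustrative assumptions; the idea is that a file containing only ASCII bytes reads the same no matter which of these character sets the reader assumes.

```python
import html

text = "Price: 100 ¥ (about 1 €)"   # illustrative sample text

# Store every non-ASCII character as an NCR, leaving pure ASCII bytes.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)

# The bytes decode identically under several common character sets.
charsets = ("ascii", "iso-8859-1", "windows-1252", "utf-8")
views = {name: ascii_bytes.decode(name) for name in charsets}
all_same = len(set(views.values())) == 1

# Resolving the references recovers the original text.
restored = html.unescape(views["utf-8"])
print(all_same, restored)
```

The cost is file size and readability of the source, which is why this suits pages with only occasional non-ASCII characters.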

Character References

Characters are often encoded in plain text documents such as HTML and XML using a Character Reference, especially characters that are difficult to enter via a keyboard or that have particular meaning in the syntax. Character references are either NCRs (Numeric Character References) or CERs (Character Entity References). CERs use symbolic names so that authors need not remember code points. EG: For the Yen character (¥), the NCR is either &#165; or &#xA5;, while the CER is &yen;.

HTML has 252 CERs, while XHTML has 253 CERs, because &apos; (') is defined in XML and XHTML but not in HTML. XML has 5 CERs or Predefined CERs (PCERs). For XML and XHTML, the 5 PCERs must be used outside of the tags/elements/entities. Here are the 5 PCERs:

  • " = &quot; = &#x0022; = &#34; = QUOTATION MARK.
  • & = &amp; = &#x0026; = &#38; = AMPERSAND. This one is often missed in URLs. EG: It should be http://fake.com?x=1&amp;y=2 instead of http://fake.com?x=1&y=2.
  • ' = &apos; = &#x0027; = &#39; = APOSTROPHE. Possibly a SQL string issue too. As of 2005/2007, MSIE does not recognize &apos; but does recognize &#39;. View the next few words in IE and other browsers: Don't with '; Don't with &apos;; Don't with &#39;.
  • < = &lt; = &#x003C; = &#60; = LESS-THAN SIGN
  • > = &gt; = &#x003E; = &#62; = GREATER-THAN SIGN
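For what it is worth, Python's standard html.escape function handles these five characters. The sketch below uses made-up sample strings; note that html.escape writes the apostrophe as the NCR &#x27; rather than as &apos;, which sidesteps the MSIE issue mentioned above.

```python
import html

raw = "a < b & \"c\" > 'd'"          # illustrative sample text
escaped = html.escape(raw, quote=True)
print(escaped)

# The URL case from the list above: the ampersand must be escaped
# when the URL is written into HTML.
url = "http://fake.com?x=1&y=2"
escaped_url = html.escape(url)
print(escaped_url)

# Unescaping restores the original text.
round_trip = html.unescape(escaped)
```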

See also CERs in HTML [w3.org/TR/html401/sgml/entities.html], SYMBOL Characters and Glyphs [w3.org/Math/characters/html/symbol.html], and List of XML and HTML character entity references [W].

CERs are case sensitive. EG:

  • Upper Case Sigma (Σ) has a CER of &Sigma;.
  • Lower Case Sigma (σ) has a CER of &sigma;.
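The case sensitivity can be confirmed with the entity table in Python's standard library (html.entities.name2codepoint, which maps HTML 4 entity names to code points). This is just a quick check, not part of any HTML tooling discussed on this page.

```python
from html.entities import name2codepoint

upper = name2codepoint["Sigma"]   # code point of the upper case letter
lower = name2codepoint["sigma"]   # code point of the lower case letter

print(upper, chr(upper))
print(lower, chr(lower))
```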


GeorgeHernandez.com. Some rights reserved.