Part I: wandering through unicode, legacy fonts, and browsers

In the beginning was ASCII, at seven bits. And it was good, until someone noticed a few missing characters. In this way, ASCII with eight bits was born. But alas! There were even more characters to be represented. And thus began the exodus in search of ways to show these missing characters.

And the way split and commingled between encoding and rendering. After all, since no encoding existed for some characters, those attempting to render them on the screen made up their own encodings as they went. And things continued in this sorry mess for years, until Unicode was born…

Of course, at this point all sorts of programs (mailers, forums, word processors, internet browsers, and even programming languages) are in the process of being updated to understand Unicode. It’s a giant bear of a mess along many fronts. I do believe Unicode is the way forward; the essential problem is that it’s being introduced now instead of at, say, the point ASCII originally came on the scene. But no matter, things will sort themselves out. Over years and years, but hey.

So what, exactly, am I talking about? In order to represent the alphabet, two things are needed. The first is a kind of representation that says which character we’re talking about. So for example (decimal) 65 is used to represent the letter ‘A’. The second is whatever is used to render ‘A’. It could be an A in one typeface, or an A in another, or an A in a third. These two things are the encoding and the rendering of a given character. Conceptually these two properties of a character should be distinct. In practice, of course, it’s not always been that cleanly handled, and there are some issues where the lines are legitimately blurred.

The problem was, of course, that the original encoding table (ASCII) was much too limited to handle languages other than English. To address this, a number of ISO 8859 variants were developed to cover additional characters such as ß or ñ and other marks and symbols such as © and £. However, since rendering (or typography) was not considered in these representations, a number of legacy fonts appeared that used additional proprietary (and conflicting) encodings for information not covered in the standards. And in all of this, languages that did not even use the Latin alphabet (such as Greek, Russian, Arabic, Hebrew, Japanese, Chinese, and so on and on) were definitely ill-represented overall. In most cases there are several possible representations and standards to use, which results in a nightmare for anyone trying to represent extended or other character sets in programs that make use of them. (Which potentially includes any program that ever tries to communicate with its user in anything other than international symbols, but I digress.)
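To see why the competing tables are a nightmare, here is a small Python sketch that reads the same single byte through three of the ISO 8859 tables. The byte value 0xF1 is just an example picked for illustration.

```python
# One byte value, read through three different ISO 8859 tables.
raw = bytes([0xF1])

as_latin1 = raw.decode("iso-8859-1")    # Western European table
as_greek = raw.decode("iso-8859-7")     # Greek table
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic table

for label, ch in [
    ("iso-8859-1", as_latin1),
    ("iso-8859-7", as_greek),
    ("iso-8859-5", as_cyrillic),
]:
    # Show the character and the Unicode code point it corresponds to.
    print(label, ch, f"U+{ord(ch):04X}")
```

Unless the reader knows which table the writer had in mind, the byte by itself is ambiguous.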

Unicode

The idea behind Unicode is to create one giant encoding standard for all of this (leaving the typography alone and up to whatever rendering a particular program wishes to use, or whatever font set the user has installed). Sounds simple enough, although even this idea is fraught with complexity and inconsistencies. For example, Unicode absorbed many of the original encoding standards in order to ensure backward compatibility, and it has made inconsistent decisions about how to include other characters. However, the underlying concept is sound, and if it takes another twenty years to refine it, the end result should still be better than the cacophony there is now.
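One small illustration of that inherited baggage, sketched with Python's standard unicodedata module: an accented Greek omicron exists at two separate code points, and Unicode itself treats them as equivalent once the text is normalized. The choice of this particular pair is mine, as an example.

```python
import unicodedata

oxia = "\u1f79"    # from the Greek Extended (polytonic) block
tonos = "\u03cc"   # from the basic Greek block

oxia_name = unicodedata.name(oxia)
tonos_name = unicodedata.name(tonos)
print(oxia_name)
print(tonos_name)

# As raw code points the two differ...
same_code_point = oxia == tonos
# ...but canonical normalization (NFC) folds the first into the second.
same_after_nfc = unicodedata.normalize("NFC", oxia) == tonos
print(same_code_point, same_after_nfc)
```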

Enough with the background. I want to discuss most of this in the context of browser rendition, since this is a good deal of what I work on anyhow.

Let’s take a quick look at this:

[Image: the Greek word πρόθυμος]

I have it up here as an image to guarantee that everyone here can see this word. (I should note, by the way, that I picked this word out at random, but I do not know ancient Greek. When I looked it up, it turns out it’s an adjective meaning “eager to be of service”, which I find vastly amusing.)

In Unicode, I could encode this as follows:
πρόθυμος
I am using Numeric Character References (NCRs) here instead of raw UTF-8, mostly because IE6 expects that representation, although in theory either should be acceptable, and Firefox, Safari, and Opera are certainly all happy with either. The codes can be found here and here.

In any case the above should render similar to the image above:
πρόθυμος (NCR)
I could also put it down like this:
πρόθυμος (direct UTF-8)
but Internet Explorer may not render this correctly.
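For anyone who wants to produce the NCR form without looking up each code by hand, here is a minimal Python sketch. The helper name to_ncr is made up for this example, and the word is spelled with escapes on the assumption that the accented omicron is U+03CC; polytonic text may use U+1F79 instead, which would change that one number.

```python
import html

def to_ncr(text):
    """Spell out every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)

# Assumption: the accented omicron here is U+03CC (omicron with tonos).
word = "\u03c0\u03c1\u03cc\u03b8\u03c5\u03bc\u03bf\u03c2"

ncr_form = to_ncr(word)
print(ncr_form)

# A browser turns the references back into characters when it parses the
# page; html.unescape from the standard library does the same thing.
round_trip = html.unescape(ncr_form)

# The direct UTF-8 form is the same characters stored as raw bytes.
utf8_bytes = word.encode("utf-8")
print(len(ncr_form), "characters of markup versus", len(utf8_bytes), "bytes of UTF-8")
```

The NCR form is pure ASCII, which is why it survives even when the page encoding is misdeclared.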

Now if it does not, there are several things to check.

The first will be whether or not the page itself is correctly set up for viewing Unicode. Since I’m the one running this show, of course it is. There needs to be a declaration at the top something like this:

<meta
    http-equiv="Content-Type"
    content="text/html; charset=UTF-8"
/>

where “UTF-8” tells the browser that the text that follows is Unicode, encoded as UTF-8. By the way, it’s a good habit to start using this in web pages: UTF-8 is byte-for-byte identical to ASCII for the original 128 characters, so plain ASCII pages will not break, and they can easily expand to cover characters found only in Unicode. (Pages saved in one of the legacy 8-bit encodings do need to be converted first, since their accented characters are stored as different bytes.) I’m going to leave alone the entire issue of byte representation that’s actually tucked away here, with UTF-16 and UTF-32 lurking around the corner.
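A quick Python sketch of what the UTF-8 label does and does not guarantee. Plain ASCII text has exactly the same bytes under UTF-8, while a character from an 8-bit legacy table such as ISO-8859-1 is stored differently, so a legacy page has to be converted rather than merely relabeled. The sample strings are arbitrary.

```python
# Plain ASCII text is stored identically under ASCII and UTF-8.
ascii_text = "plain old text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A character beyond ASCII is stored differently in Latin-1 and UTF-8.
eszett = "\u00df"  # the letter ß
latin1_bytes = eszett.encode("iso-8859-1")   # one byte
utf8_bytes = eszett.encode("utf-8")          # two bytes

# Labelling Latin-1 bytes as UTF-8 does not work: the decoder rejects them.
try:
    latin1_bytes.decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False

print(same_bytes, latin1_bytes, utf8_bytes, misread_ok)
```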

Since I can eliminate that as a problem right off the bat, the next consideration is the browser being used to view this. All the modern browsers (meaning Firefox (all OS), Opera (all OS), Safari, Internet Explorer 6 (latest patches), Netscape 7.2 and later, and derivatives) should understand this. Some of the older ones may not. In particular, IE on the Mac never contained Unicode support. In many cases, setting several menu options may be necessary to enable the support; check the documentation for the particular browser for “character encoding”. For IE, check the “user-defined” options: View->Encoding needs to be set to user-defined, and then in the Internet Options, a suitable font needs to be selected for the “user-defined” font (e.g., not the Latin, etc. fonts).

Different browsers support Unicode at different levels. For example, on IE6, it’s not only necessary to enable the character encoding support, but also to do certain registry edits for both the browser and the operating system (IE7 appears to contain more support for Unicode, fortunately). Firefox and Opera only need to be informed of a compatible font. Safari actually breaks down the “Unicode” fonts into the different regions, on the (very reasonable) assumption that one might use different Unicode fonts for different languages and not some “universal” Unicode font. So in Safari, setting the Unicode “Greek” to the correct font will allow the above to display.

A good font for the above display, and the one I would recommend, is Cardo. There are several others out there, including Arial Unicode MS, but I do not recommend this or any of the older ones, as their support extends only to version 2.0 of the Unicode standard, and we’re well past that now. I’ll return to this point later on. For now, I’m going to detour briefly into legacy fonts, just to illustrate why they were such a bad idea.

Legacy Fonts

I’m not even going to render them on this page :-). Actually, I cannot render them in this page, because they require different meta declarations, i.e.:

<META http-equiv="Content-Type"
          content="text/html; charset=ISO-8859-1"
/>

For polytonic Greek, there are several well-known fonts out there, including WinGreek, GreekKeys, and Ismini. The first thing to note, right off the bat, is that some of these are actually OS dependent. The second is that they can be broken, even within their original scope. Take Ismini, for example. ί is supposed to be represented as ¼ in this font. But regardless of the browser, regardless of proper installation of the font, and regardless of proper encapsulation within a <font face="Ismini"> element (well, I’m staying away from the content/presentation arguments in css/html; enough principles are violated as it is ;-) ), it still displays as the Latin-1 character ¼. GreekKeys currently supports only the Macintosh; there was a Windows version offered several years back, but support for that has been discontinued. In its defense, though, I should note that a Unicode version is now offered.

In any case, the representations for the above word in these fonts, respectively, are

<font face="Greek Old Face 98">prÒqumoj</font>
<font face="Athenian">prñyumow</font>
<font face="Ismini">pr¿uymoq</font>

Note the need to explicitly switch font face in order to obtain the results I want. Unless the entire page will be rendered in one of these fonts, I have to switch around as needed. Unicode simplifies all of this, and theoretically, given sufficient support (a Unicode keyboard, Unicode display in the editing software, etc.), I would not even need to use the numeric character references for specific characters. However, as I cannot guarantee all visitors to this site would be similarly equipped, this is a fallback for now (and indeed, since neither I nor anyone else will ever have a keyboard that includes all the possible alphabets and symbols for non-alphabet languages, this kind of workaround will always be needed anyway).
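What each legacy font really amounts to is a private lookup table from Latin-1 characters to Greek glyphs. The Python sketch below makes that explicit. The tables are partial and inferred only from the three samples above (one word each), so they cover just these eight letters; LEGACY_TABLES and legacy_to_unicode are names made up for this example, and the accented omicron is assumed to be U+03CC.

```python
# Assumption: the accented omicron is U+03CC (omicron with tonos).
OMICRON_ACUTE = "\u03cc"

# Partial tables, inferred from the single sample word per font.
# \u00d2 is "Ò", \u00f1 is "ñ", \u00bf is "¿".
LEGACY_TABLES = {
    "Greek Old Face 98": {
        "p": "π", "r": "ρ", "\u00d2": OMICRON_ACUTE, "q": "θ",
        "u": "υ", "m": "μ", "o": "ο", "j": "ς",
    },
    "Athenian": {
        "p": "π", "r": "ρ", "\u00f1": OMICRON_ACUTE, "y": "θ",
        "u": "υ", "m": "μ", "o": "ο", "w": "ς",
    },
    "Ismini": {
        "p": "π", "r": "ρ", "\u00bf": OMICRON_ACUTE, "u": "θ",
        "y": "υ", "m": "μ", "o": "ο", "q": "ς",
    },
}

def legacy_to_unicode(text, font_name):
    """Map legacy-font text to Unicode; unknown characters pass through."""
    table = str.maketrans(LEGACY_TABLES[font_name])
    return text.translate(table)

samples = {
    "Greek Old Face 98": "pr\u00d2qumoj",
    "Athenian": "pr\u00f1yumow",
    "Ismini": "pr\u00bfuymoq",
}
converted = {font: legacy_to_unicode(s, font) for font, s in samples.items()}
for font, word in converted.items():
    print(font, "->", word)
```

Three different strings, three different tables, one word. Note that "u" and "y" even swap meanings between Athenian and Ismini, which is exactly the kind of conflict a single shared encoding removes.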

Interestingly, some of the newer browsers have trouble of their own with legacy fonts. For example, Firefox had trouble with the style font-family method of switching between fonts, but was able to display correctly once it was changed to inline font face notation.
So
<font face="Greek Old Face 98">prÒqumoj</font>
worked but <span style="font-family: Greek Old Face 98;">prÒqumoj</span> did not. (This may not be a browser bug: CSS 2.1 requires an unquoted family name to be a sequence of identifiers, and 98 is not an identifier, so quoting the name as 'Greek Old Face 98' may be all that is needed.) Trying to get this kind of stuff to display among all the browser/operating system combos out there can really take quite a bit of detective work.

Minor note on WordPress

This is getting lengthy and I still have more stuff I want to discuss, so I’m going to wrap up here and finish the rest in the next post. But I thought I’d point out that somewhere along the line, in the course of editing and re-editing this post, the NCRs I was using to illustrate the various Unicode words above kept getting converted into their UTF-8 representations. I wanted NCRs for several reasons, partly because they allow me to look up each code in the Unicode manual, but also because some browsers require NCRs. I’m not sure whether it happened internally in WordPress or through some combination of cut and paste in either Windows XP or Ubuntu, but it’s a classic example of the annoyances that are still present when dealing with “exotic” character sets.


1 Comment »

  1. Brian Layman said,

    July 25, 2006 @ 7:29 am

    Thank you for this - it was much more informative than the typical “You should use UTF-8” one liner on the subject. It’s nice to see all this in context.

    I’m looking forward to the next post…
