Part II: wandering through unicode, legacy fonts, and browsers
Precomposed versus Combining
In the course of putting together the character assignments (called code points) in Unicode, a number of decisions had to be made regarding existing encodings, particularly well-known and/or well-established ones. In some cases, even though the Unicode Consortium has a particular policy regarding some encode vs render issues, there are inconsistent inclusions due to this grandfathering of previously established encodings and (to be quite honest) outright mistakes on the part of the Consortium. The question of precomposed characters versus combining characters is a classic one.
As a simple example, let’s take the Greek letter alpha with an acute accent: ά. Should this be represented as a single precomposed code point (U+03AC) or as the base letter followed by a combining accent (U+03B1 U+0301)? (See here and here for the code charts.) In other words, should a single fixed code point be used to represent something that’s really a combination of a letter and an accent, or should there be a code point for the letter and a code point for a combining accent which is then combined with the previous letter?
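To make the two options concrete, here is a minimal Python sketch using the standard `unicodedata` module. The variable names and the `code_points` helper are mine, purely for illustration; the code points are the ones for alpha with an acute accent (tonos) and for alpha plus the general combining acute accent.

```python
import unicodedata

precomposed = "\u03ac"      # one code point: alpha with tonos (acute)
combining = "\u03b1\u0301"  # two code points: alpha + combining acute accent

def code_points(text):
    """Return the code points of text in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(code_points(precomposed))  # ['U+03AC']
print(code_points(combining))    # ['U+03B1', 'U+0301']

# The two strings are different sequences of code points...
print(precomposed == combining)  # False

# ...but Unicode treats them as canonically equivalent: composing (NFC)
# or decomposing (NFD) converts one form into the other.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combining)  # True
```

So both representations are legitimate, and a program that compares text has to normalize first if it wants them to match.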
My two cents’ worth is that since the concept of an accent applies to more than one character, it is an independent concept, and thus combining is the way to go. It is also more economical: with this approach, adding an accent to any other character requires only one additional code point in the charts. Implementing it the other way means that for each character that might be accented, a second code point must be reserved, and since that cannot possibly be comprehensive, it will by nature be restricted to “legal” or “existing” combinations at the very least. So why are both present? That’s the grandfather clause at work, since the concept of combining characters postdates the establishment of many of the old encodings. But philosophically, the Unicode Consortium’s wise enough to agree with me.
Now suppose I need to represent some manuscript online. The texts may themselves contain “mistakes” which are intended to be reproduced as is. For example, in (ancient) Greek, an epsilon will never be combined with a circumflex due to the rules of the language. But perhaps an idiosyncratic author did so anyway, or a manuscript is badly marked up, etc. If the only encoding at hand for ancient Greek contained only the “legal” stuff, I’d be out of luck if I needed to show this. Given this, the use of combining diacritics clearly makes every sense. While there is no precomposed character for epsilon-with-circumflex, I could still use an epsilon followed by a combining circumflex (U+03B5 U+0342) to represent this: ε͂ …
But wait! That’s not quite the end of the story! Not all browsers are good about combining diacritics. It’s actually something of an art form, and the positioning will depend on whether the base letter is small or capital, and on whether there are other diacritics also being combined with it. Frankly, most browsers don’t cope so well. So to get a usable display in as many cases as possible, I find it worth scanning ahead in the text to find all the diacritics associated with a letter and checking for the existence of a precomposed character before falling back to building it up from combining characters.
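The scan-ahead idea can be sketched in a few lines of Python. This is only one possible approach, not a description of any particular tool: it assumes that Unicode's NFC normalization (via the standard `unicodedata` module) is an acceptable way to look up a precomposed character, and the function names `split_clusters` and `prefer_precomposed` are hypothetical.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # A non-zero canonical combining class marks a combining diacritic.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def prefer_precomposed(text):
    """Use a precomposed character where one exists, else keep the sequence."""
    out = []
    for cluster in split_clusters(text):
        composed = unicodedata.normalize("NFC", cluster)
        # Only substitute when the whole cluster collapses to one code point.
        out.append(composed if len(composed) == 1 else cluster)
    return "".join(out)

# Alpha + rough breathing + acute + iota subscript has a precomposed form.
alpha = prefer_precomposed("\u03b1\u0314\u0301\u0345")
print(len(alpha), f"U+{ord(alpha):04X}")  # 1 U+1F85

# Epsilon + circumflex (perispomeni) has none, so the sequence is kept.
epsilon = prefer_precomposed("\u03b5\u0342")
print(len(epsilon))  # 2
```

Keeping the original sequence when no single precomposed character exists means the “illegal” combinations still reach the browser untouched.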
Test Case
Let’s take an interesting example here: [image]
That is an alpha with a rough breathing, an acute accent, and an iota subscript.
If the alpha is uppercased, it looks like this: [image]
Notice how the positioning of the iota subscript changes when alpha is capital, and both the breathing and the accent sidle a little to the side to get out of the base letter’s way.
I’m going to demonstrate how this pair of letters looks in precomposed versus combining notation in assorted browsers and operating systems. First of all, here’s the table I used to build up my examples. The first column has the precomposed character for each of the above images. The second has the base character plus the combining diacritics, in “proper” order. This order is particular to each language; in this case it is supposed to be breathings, accents, iotas, and that is what drives the order you see in the final representation: the breathing will always be to the left of an accent, and so on.
The third column has the diacritics in reversed order. This is a very interesting situation that I don’t believe has been assessed well enough yet. I don’t even know that it should be: just as it’s reasonable to expect words to make sense only if their letters are ordered properly, it could be just as reasonable to expect diacritics to be listed in proper order. On the other hand, it might be reasonable for some programs (especially a word processor) to order them properly; these are computers and can spare a few more CPU cycles, after all. I found it very interesting how badly handled this third column was, so I included it for interest’s sake.
| Precomposed | Combining, proper order | Combining, reversed order |
| ᾅ | ᾅ | ᾴ̔ |
| ᾍ | ᾍ | Ά̔ͅ |
There are several things to note here. First of all, particular diacritics, depending on the language of course, have particular positions. Some combine below, some along the top, some to either side and some as “overstrikes.” In addition, if there are multiple diacritics that are positioned similarly there is usually an order of precedence. Finally, if the letter is capitalised, that often affects the placement of the diacritic. In the images above (and in the table, if your browser is working correctly) the iota subscript moves to the side of the capital alpha.
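For what it’s worth, Unicode’s own normalization rules give a partial answer to the ordering question. The sketch below rebuilds the lowercase sequences from the description of the table (the exact sequences are my reconstruction, and the variable names are illustrative) and uses Python’s standard `unicodedata` module to show each mark’s canonical combining class and what normalization does with the reversed order.

```python
import unicodedata

ROUGH = "\u0314"  # rough breathing (combining reversed comma above)
ACUTE = "\u0301"  # combining acute accent
IOTA = "\u0345"   # iota subscript (combining ypogegrammeni)

proper = "\u03b1" + ROUGH + ACUTE + IOTA          # breathing, accent, iota
reversed_order = "\u03b1" + IOTA + ACUTE + ROUGH  # iota, accent, breathing

# Marks that sit in different positions have different combining classes.
for mark in (ROUGH, ACUTE, IOTA):
    print(f"U+{ord(mark):04X}", unicodedata.combining(mark))
# U+0314 230
# U+0301 230
# U+0345 240

def code_points(text):
    """Return the code points of text in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Normalization sorts marks by class, so the iota subscript moves to the end.
# The breathing and accent share a class, so their relative order is kept.
nfd_reversed = unicodedata.normalize("NFD", reversed_order)
print(code_points(nfd_reversed))  # ['U+03B1', 'U+0301', 'U+0314', 'U+0345']

nfc_proper = unicodedata.normalize("NFC", proper)
nfc_reversed = unicodedata.normalize("NFC", reversed_order)
print(code_points(nfc_proper))    # ['U+1F85']
print(code_points(nfc_reversed))  # ['U+1FB4', 'U+0314']
```

In other words, normalization will tidy the iota subscript into place, but it treats “accent then breathing” and “breathing then accent” as genuinely different text, so the reversed sequence never collapses to the precomposed character.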
First up on the chopping block: Internet Explorer (IE6 and IE7):
| | IE6 (SP2) | IE7 (beta 3) |
| Windows XP SP2 w/Cardo | [screenshot] | [screenshot] |
| Windows XP SP2 w/regedits & w/Cardo | [screenshot] | [screenshot] |
| Vista (beta 2) w/Cardo | n/a | [screenshot] |
I will admit I was surprised at how well Internet Explorer did here when it’s usually so miserable at dealing with odd things, especially internationalization issues. However, do notice that it’s clear MS didn’t consider the case where the combining characters might not be “in order”. In the third column, the diacritics along the top haven’t been properly spaced from each other, nor are they shifted slightly in the capital letter version. And the iota fails to adjust to the capital letter.
Moving along to Firefox and Opera, we have the following table:
| | Firefox (1.5) | Opera (9) |
| Windows XP SP2 w/Cardo | [screenshot] | [screenshot] |
| Windows XP SP2 w/regedits & w/Cardo | [screenshot] | [screenshot] |
| Vista (beta 2) w/Cardo | [screenshot] | [screenshot] |
| Linux (Ubuntu 6.06) w/Cardo | [screenshot] | [screenshot] |
| Mac 10.4.7 w/Cardo | [screenshot] | [screenshot] |
What I find very interesting is how the registry edits for Windows XP improve both Firefox and Opera’s ability to display combining characters properly. However, this ability immediately disappears in the tricky third column.
I’m surprised by Opera’s complete failure to try combining the characters on the Mac, especially since it did combining in both Windows and Linux. Note that it simply listed each after the base letter in the order given in the table.
Opera lets you define which font you want to use for the ‘Greek and Coptic’ and the ‘Greek Extended’ blocks. Now, I’ve set Cardo as my default font for everything, so that should override anything that’s there. “Should” being the operative word. Here’s the problem: I go to Preferences –> Web Pages –> Normal Font and I select Cardo there. However, this doesn’t seem to affect the display of Greek text in Opera on the Mac. This is odd, since it does affect the display of Greek in Opera on Linux and Windows. I can instead change the display of Greek text in Opera on the Mac by going through Preferences –> Advanced –> Fonts –> International Fonts and there selecting ‘Greek’ as the ‘writing system’. However, Cardo is not listed here. And since ‘Greek’ is actually Opera’s name for the ‘Greek and Coptic’ set, to make sure I can render all of Polytonic Greek correctly I also have to select ‘Extended Greek’ as the writing system. Cardo is not listed there either. Why is it not listed there when it’s clearly installed and is available system-wide? I do not know.
Let’s check the last two browsers of interest: Konqueror and Safari:
| | Konqueror (3.5.2) | Safari (2.0.4) |
| Linux (Ubuntu 6.06) w/Cardo | [screenshot] | n/a |
| Mac OS 10.4.7 | n/a | [screenshot] |
I’m giving Konqueror bonus points for the creativity it shows in vertically stacking the rough breathing and accent. Unfortunately, it’s completely invalid on technical merits. The iota subscript isn’t even underneath the base letter, either. Konqueror needs to completely fix this aspect of its rendering engine. Surprisingly, though (and I say that because Konqueror and Safari both share the KHTML rendering engine), Safari comes out a winner here, rendering the combining characters correctly.
So what I see here is IE6 (shockingly enough), IE7, and Safari handling the combining characters the best. I say that because all three are able to render the test cases properly without any outside modifications. Firefox’s correct rendering on Windows with regedits, but not on any other operating system nor on Windows without regedits, leads me to believe the credit lies not with the browser but with whatever effect the registry edits have on Windows. Still, it did a decent job, and it handled the reordered combining letters no worse than the properly ordered combining letters. Opera was neck and neck with Firefox for all the same reasons, but fell behind for its odd handling of the test cases on the Mac.
Konqueror and Safari wind up being the most puzzling since they use the same rendering engine. In theory, then, if one works, so should the other. I may try to rustle up other versions of Konqueror on other Linux distributions and see if anything else might be going on.
Microsoft’s Registry Edits
Microsoft details up to three registry edits that are necessary to set up Windows NT, 2K and XP for Unicode functionality. (Vista comes already set up, and does seem to actually work.) The first edit is the most important. It enables Uniscribe support, which is what Windows applications use in order to be able to render Unicode characters. Uniscribe is responsible for rendering input text, for substituting character variations according to context, and for ordering displayed text based on text flow direction.
The second registry edit adds support for the supplementary plane characters in IE (which I will cover in the next post). The third adds the ability (in Win XP only) to specify a default font for supplementary plane characters.
Presumably the first edit is what enables Firefox and Opera to handle combining diacritics. I’m at a loss to explain IE6’s behavior, unless a fairly recent patch to it enables the same thing. Unfortunately, I don’t have an earlier unpatched IE6 to test that theory out with (if I’m correct, such an earlier version of IE6 would need this registry edit to display properly). Installing certain language packs will also install the Uniscribe module “behind the scenes,” but as I had the same language packs before and after adding the registry edits, that doesn’t explain this either.
The full instructions for doing the registry edits may be found here.
Excellent Resources
For much more informative commentary, check these pages out:
- Benefits of the Unicode™ Character Standard
- Character Sets And Code Pages At The Push Of A Button
- More about the character concept (the entire article is excellent as well)
- Combining Diacritical Marks



















