Unicode Ate My Brain

by Smarasderagd

lightly edited by John Cowan

What you are about to read is a slightly edited set of messages I wrote on a local bulletin board system. Be warned that my understanding of Unicode and UTF-8 is cursory at best, so don't rely on anything you read here.


So, you want to hear about Unicode, do you?

Har har. You fool. FOOLS. You are all just doomed to hear me recount the story of the Great Character Set Internationalization Wars. Which aren't over.

Long long ago in the predawn mists of time (i.e. some time last century) a numerical code was devised for the transmittal of information between computers. And it was called the American Standard Code for Information Interchange, abbreviated ASCII. It had many excellent tendencies. The digits from 0 to 9 had codes which could be translated into their numerical values almost effortlessly. Most (but not all) of the characters available from the standard typewriter keyboard were in it. Later it was extended to encompass characters many typewriters lacked. But we shall consider this another time.
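To make the digit trick concrete, here is a minimal Python sketch (the name digit_value is made up for illustration). Because the codes for 0 through 9 sit in one consecutive run starting at 48, a single subtraction, or masking off the high four bits, recovers the numerical value.

```python
def digit_value(ch):
    """Numeric value of an ASCII digit character (hypothetical helper)."""
    # '0' is code 48 (0x30) and '9' is code 57 (0x39), with no gaps between.
    return ord(ch) - ord("0")


digits = "0123456789"
values = [digit_value(c) for c in digits]

# The same values fall out of keeping only the low four bits of each code.
low_nibbles = [ord(c) & 0x0F for c in digits]
```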

Time passed, relentlessly, and, well, America neglected to conquer the rest of the world. Germany was partitioned, but German still flourished. And along with it those pesky umlauts, and that weird mutant B-looking thing that is actually two s's stuck together. No problem: to represent the umlaut's modification of a vowel, add an e. And you can always write "ss" instead of that B-like freak.

France had been ravaged, but French flourished. Along with the acute, grave, and circumflex accents, and the cedilla, at which point many of the proponents of ASCII began to feel the first faint stirrings of doubt. Sure, you can represent the accents by putting the quote, backquote, and caret after the accented letter, although there was potential confusion between a quote's use as an accent or as an apostrophe. But that cedilla -- there's nothing in ASCII that looks like it at all, unless you want to go completely off the beam and nominate the comma -- but let's not get into that, or we'll be forced to dwell on the distressing European tendency to use the comma as a decimal point, and to avoid using the period in representations of numbers entirely.

Finland had suffered Finlandization, but Finnish still flourished, and at this point most ASCII lovers threw up their hands in dismay, while the Finns reluctantly appropriated some of the less used characters of ASCII to represent their own set of accented characters.

More and more, ASCII began to look a little, well, cramped.

So here things stood with ASCII, mostly effective, but getting rather worn around the edges, when ...

China. Japan. Islam. Russia. Korea. India. Africa.

At which point it became clear that while there was probably a need for an International Standard Code for Information Interchange, ASCII was not it. Something new was needed. Something new was invented.

Several new things, in fact.

And that was the problem.

Because while certainly Latin-1, ISO-this and wide characters and multi-face character sets which reserved a chunk of space for each language and so forth and so on were all well and good, there remained the question as to which one would rule the world. Latin-1 wasn't up to it. It was a noble and thoughtful effort at enlarging ASCII to take in all of the European languages, but there was no way it would represent any of the three separate character sets used by Japanese, or the unearthly welter of ideograms in even the revised and simplified written language of (ahem) continental China.

[Practically every country had its very own character set, sometimes more than one, and how to handle more than one was dictated by the eeeevil ISO 2022, which mixed all the existing character sets into One Big Jumble, with lots of escape sequences to switch between them. Heroic things were done with 2022, even to writing a version of emacs that understood it. But it was just too complicated, too clumsy to use, and too verbose. --JC]

There was a solution, considered by some to be an abomination, but which held great promise.

Its name was Han unification.

Han unification can trace its origin back to the strenuous efforts made to bring written Chinese into something resembling the modern world. Boiled down to essentials, when applied to the vast welter of international characters as a whole, it means this:

If it looks the same, it is the same.

Or alternatively:

A difference that makes no difference is no difference.

And thus was Unicode born. Folding multiple character sets unrelentingly together, it managed to squeeze a serviceable international character set into 16 bits. [Now with occasional deviations into the world of 21 bits. --JC] Data managers everywhere (well, not really, but figuratively speaking) were somewhat relieved. They "only" had to get a 4 gig drive to replace each 2 gig drive. But you had to wonder if it might be possible to shrink things down even more than that, just possibly? And what about byte order, mhm? What about that?
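For the curious, both worries (size and byte order) show up in a few lines of Python. This is only a sketch: it uses Python's built-in utf-16-be and utf-16-le codecs as a stand-in for 16-bit Unicode, which is an assumption of the example and not something the original messages used.

```python
text = "ASCII"

as_ascii = text.encode("ascii")      # one byte per character
big = text.encode("utf-16-be")       # two bytes per character, high byte first
little = text.encode("utf-16-le")    # two bytes per character, low byte first

# Plain ASCII text doubles in size once every character takes 16 bits.
doubled = len(big) == 2 * len(as_ascii)

# The same letter 'A' arrives as two different byte pairs depending on
# which order the sender chose.
first_big = big[:2]        # b"\x00A"
first_little = little[:2]  # b"A\x00"
```

A reader that guesses the wrong order sees a different character entirely, which is why the byte order question could not just be waved away.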

Enter UTF-8. Its goal: provide as smooth and unruffled a transition from ASCII to Unicode as possible, while dealing with the byte order problem, and allowing programs to scan character streams without guesswork. Its goal: achieved.

The 128 ASCII codes are the same as they always were. If the high bit of a byte is set, though, this is a trigger. The high bit pattern in it (and in the succeeding bytes) determines whether this is a 2- or 3-byte sequence (or a 4-byte one, for characters beyond 16 bits), and whether a given byte starts the sequence or continues it. Since every byte of a multi-byte sequence has its high bit set, and the first byte has a different high bit pattern from the ones that follow, a program can tell by examining the high bits of an individual byte whether it's at the start of a multi-byte sequence or in the middle of one. Sound unimportant? Heh. I guess you've never had to go backwards in a string before.
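Here is a small Python sketch of that high-bit inspection, including the going-backwards trick. The function names classify and previous_char_start are invented for this illustration, and the code makes no attempt to reject malformed input. It is meant to show the idea, not to be a real decoder.

```python
def classify(byte):
    """Say what role a single byte plays, judging only by its high bits."""
    if (byte & 0x80) == 0x00:
        return "ascii"          # 0xxxxxxx: a plain one-byte character
    if (byte & 0xC0) == 0x80:
        return "continuation"   # 10xxxxxx: second or later byte of a sequence
    if (byte & 0xE0) == 0xC0:
        return "lead-2"         # 110xxxxx: starts a 2-byte sequence
    if (byte & 0xF0) == 0xE0:
        return "lead-3"         # 1110xxxx: starts a 3-byte sequence
    if (byte & 0xF8) == 0xF0:
        return "lead-4"         # 11110xxx: starts a 4-byte sequence
    return "invalid"


def previous_char_start(data, i):
    """Step backwards from byte index i to where the previous character starts."""
    i -= 1
    # Skip over continuation bytes until we hit an ASCII or lead byte.
    while i > 0 and classify(data[i]) == "continuation":
        i -= 1
    return i


data = "aé♖".encode("utf-8")            # 1-byte, 2-byte, and 3-byte characters
roles = [classify(b) for b in data]
last_start = previous_char_start(data, len(data))
last_char = data[last_start:].decode("utf-8")
```

Walking backwards from the end of the buffer lands on the first byte of the rook without decoding anything that came before it. No guesswork is needed, because a continuation byte can never be mistaken for the start of a character.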

The other crucial advantage is that zero still means end of string, since a zero byte can only appear as a single-byte character, and never as part of a multi-byte sequence. This has the pleasant effect that naive C programs can treat UTF-8 streams as streams of ordinary 8-bit stuff, without worrying about the Unicode aspect if they don't want to.
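A rough way to see this for yourself, sketched in Python rather than C: c_strlen below is a made-up imitation of what C's strlen does, written only for this illustration. It assumes the buffer really does end in a zero byte.

```python
def c_strlen(buf):
    """Count bytes up to the first zero byte, imitating C's strlen.

    Assumes buf contains a terminating zero byte somewhere.
    """
    n = 0
    while buf[n] != 0:
        n += 1
    return n


text = "I ♡ Unicode"
encoded = text.encode("utf-8")
buf = encoded + b"\x00"                 # NUL-terminated, as a C program sees it

byte_length = c_strlen(buf)             # counts bytes, not characters
char_length = len(text)

# No zero byte hides inside the UTF-8 text, so the scan stops only at the end.
utf8_has_zero = 0 in encoded

# The 16-bit encoding is another story: even a plain 'I' carries a zero byte.
utf16_has_zero = 0 in "I".encode("utf-16-be")
```

The naive scan finds the true end of the string. It just reports a length in bytes (the heart takes three of them), which is exactly the "ordinary 8-bit stuff" view described above.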

My hovercraft is full of eight-bit eels.

The interminable saga of Unicode and UTF-8 is almost at a conclusion. After years of high-falutin' international jockeying, here's where I come in.

I'd gotten ahold of a bunch of X Windows programs [born on the mutant operating system called "Plan 9 from Bell Labs" --JC] which supported UTF-8 and Unicode. An editor (called sam) and a command line window (called 9term). I also had a shell that could tolerate all this high-bit crud without having a gagging fit (it was called rc). And there the matter rested.

For you see, the story's not quite over. There's still the matter of input, which anyone seriously wanting to use Unicode would have brought up by now. How do you enter one of a set of thousands of characters with a 101-key keyboard? Could the common ones be a little easier to type, please?

I'd had the set of Unicode-capable programs for quite a while, without actually making use of Unicode (they each had sterling qualities which made them useful in their own right) for the simple reason that I had no idea how to cause these wacky characters to appear. Plus the Unicode font set was fscking huge, which is something you might expect of a font set that includes all the ideograms in standard Japanese, plus hiragana, katakana, and Greek and Cyrillic and ... well anyway.

Well, one fateful day, I was sitting at my console (at home, as it happened, but I have the same environment set up at work) when the thought occurred to me, faintly but persistently:

"Maybe there's a manual page for this."

I'd seen some passing description of sam's mouse interface in its manual page, but in general I found I had better luck clicking the mouse and seeing what happened. Working this way means your life is full of pleasant and/or embarrassing surprises, such as the time I accidentally found out how the string searching facility in 9term works, which increased its usefulness tenfold, and prompted intense chagrin at not having found it earlier, particularly since it was mentioned in the manual page. Still and all though, I'd come to distrust the documentation.

But still. "Maybe there's a manual page for this."

So I looked. And there was. How embarrassing. You could enter every character in any European language with ridiculous ease. And other stuff with only slightly less ease. So I tore off to work and socked the whole Unicode font into place, and soon enough, despite all the warnings from friends and family (actually, no one knew I was doing this, so I got no warning, but "all" includes "none" among its possibilities, so there) I became a hopeless glyph junkie -- chess pieces in my font. I delved into the theory of partitions to come up with a way to show them in every possible order. Well, I exaggerate slightly, and it would have been a lot easier if I hadn't been stupefied with caffeine at the time.

This little romance of many significant bits is nearly done. Prepare yourself, as I confess the final madness.

After I'd played around a while with the panoply of wacky characters available, I became gradually aware of just how pathetically wimpy Unix is when confronted with Evil, Nasty Characters With Their High Bit Set. ls refused to believe in them. vi spit them out. Even emacs kept its distance, hedging every one with a backslash and an octal code. [These tools have improved a bit since then. --JC] So few places where a forlorn multi-byte sequence could find peace, acceptance...

Converting the heathens abroad was too large a task. I contented myself with making my own revolution. ls was against me. Very well. The file system would have to go its own way. No other editor would display this freakish stuff. No problem, I only ever used sam, anyway.

So I made my statement where I could, in defined shell functions. And named them, not after chess pieces. The names were themselves chess pieces. ♖ (the white rook) roams over the source tree, reconfiguring and updating, putting links in place, preparing the way. ♘ (the white knight) moves carefully and justly, if a little crookedly, compiling one file at a time. Fast and dangerous, ♞ (the black knight) sweeps across the network, compiling on all the machines at once -- just be sure you don't get caught.

So for now I am in my private paradise. Although the walls are too high to see outside, and I worry that everything may collapse on me, I can say:

I ♡ Unicode.

The End.

Smarasderagd's original version can be found at http://www.vex.net/~smarry/unicode-story.html