&headache;
- mezzoblue | Glyphs, which demonstrates how troubled the Unicode character set is at the present time. For example, my Win2k Pro machine at work showed only a partial set in every browser, with Opera missing only about three. However, the really odd thing was that Firefox/Netscape 8 showed a different version of the glyphs.
- The Trouble With EM 'n EN (and Other Shady Characters): A List Apart. But you've already read that one, right? It's a nice roundup of the different dashes, hyphens and other typographically correct characters.
- A Simple Character Entity Chart | evolt.org
- A to Z Index of Unicode Characters: starting with 'A'. Yep. The lot. You can also use the Unicode Character Search or view Unicode Character Categories.
- HTML Codes - Table of ascii characters and symbols, a simple guide to basic characters (the most commonly used ones, anyway).
- This is where the magical link should go. You know, the one that clearly explains which character encoding you should use in your web pages and why. When I find it, I'll let you know :) I've been told both "use iso-8859-1" and "use utf-8" but when I ask for more details people shuffle their feet and look into the middle distance. Well ok, so there are some exceptions: Quick guide to UTF-8 - Anne's Weblog about Markup & Style makes a pretty good case for using utf-8.
In short... character set encoding and special character display remain a pain in the butt.
Update: this t-shirt says it all (Mac version also available).


It is true that there is still a requirement for some black magic in getting character encodings right. The things I have seen go wrong most often:
JAWS is apparently not able to handle utf-8. So if I want to write my mail in Spanish to people who are using older systems, I either have to drop accents (which causes genuine ambiguity in my text) or switch that mail to iso-8859-1. If I want to write in Hungarian, Greek, Russian, etc., that is of course a problem, since iso-8859-1 does not cover those alphabets. So I set my environment up in utf-8 by default and live with swapping for some email.
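To make that trade-off concrete, here is a minimal Python sketch. It is not from the original comment; the sample words and the helper names `fits_latin1` and `strip_accents` are illustrative assumptions. It checks which strings iso-8859-1 can encode and shows how dropping accents makes two different Spanish words identical.

```python
import unicodedata

def fits_latin1(text):
    """Return True if every character in text exists in iso-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

def strip_accents(text):
    """Drop combining marks after canonical decomposition (lossy)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Illustrative sample words, one per language mentioned above.
samples = {
    "Spanish": "papá",
    "Hungarian": "erdő",
    "Greek": "αβγ",
    "Russian": "да",
}

for language, word in samples.items():
    print(language, "fits iso-8859-1:", fits_latin1(word))

# "papá" (dad) and "papa" (potato) become indistinguishable without the accent.
print(strip_accents("papá") == "papa")
```

Only the Spanish sample survives the switch to iso-8859-1, which is why a utf-8 default with occasional swapping is a reasonable compromise.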
IRC servers accept what is sent, and don't have a formal way of negotiating character sets or even saying what they accept. Same problem as mail, effectively.
The big one I have found is servers not being set up correctly to know what they are serving. There is a tutorial on configuring character-encoding in Apache that W3C's internationalisation group produced as part of their extensive collection of materials on character encoding. I find most of these documents are actually far more friendly and readable than the average W3C document. But there are some genuine issues out there.
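One way to see what a server claims to be serving is to look at the `charset` parameter of its `Content-Type` header. The following is a rough sketch using only the Python standard library; the function name `declared_charset` and the example header values are assumptions made for illustration, not anything from the W3C tutorial.

```python
from email.message import Message

def declared_charset(content_type_value):
    """Return the lowercased charset named in a Content-Type header value,
    or None when the header does not declare one."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_content_charset()

# Hypothetical header values, as a server might send them.
print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/html"))
```

A `None` result means the server leaves the browser to guess, which is exactly the misconfiguration described above.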
What a smart system would do is auto-detect the encoding (most modern systems do in fact manage this at least to some extent). But yes, there is more work to be done...
-- chaals
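The auto-detection mentioned in that comment can be approximated very crudely: try utf-8 first, and fall back to iso-8859-1 if the bytes are not valid utf-8. This is a minimal sketch under that assumption, not a description of how any particular browser or mail client does it, and `decode_guess` is a hypothetical name.

```python
from typing import Tuple

def decode_guess(data: bytes) -> Tuple[str, str]:
    """Decode bytes as utf-8 if they are valid utf-8, else as iso-8859-1.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a character, so this never fails,
        # even when the real encoding was something else entirely.
        return data.decode("iso-8859-1"), "iso-8859-1"

print(decode_guess("señor".encode("utf-8")))
print(decode_guess("señor".encode("iso-8859-1")))
```

The fallback always "succeeds", so a wrong guess shows up as garbled text and not as an error. That is part of why there is more work to be done.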