
dashing into trouble - why html comments break in firefox

[Or, When Minutiae Attack!]

There's a perennial question that pops up on email lists and from other developers. The question goes something like this:

"This page works fine in everything except Firefox and I can't tell why... it's showing an HTML comment as raw code..."

the problem

This problem usually boils down to a quirk in the way Firefox handles HTML comments. Most browsers only treat --> as a closing comment in HTML. However, Firefox also treats any instance of -- as a closing comment.

So, if you have a comment with two or more adjacent hyphens, you're in trouble. Both of these are out:

<!-- --------------- Blah --------------- -->

<!--
<p>
Blah blah blah -- then blah.</p>
-->

Firefox will display the comment as raw code, instead of hiding the comment and its contents. The late Netscape Navigator did this too - in fact I first saw the problem in NN4, back in 2000.

the solution

The solution is simple, even if it's not always convenient: don't put adjacent hyphens inside an HTML comment. That's fine and dandy unless you have content authors who have a habit of using two hyphens instead of an em dash!

In any case, you either need to remove the adjacent hyphens, or if it's appropriate you can convert them to more correct characters.

If your page's content includes double hyphens and you're not allowed to modify it, then you're not going to be able to comment blocks of it out. Yes, that is indeed annoying.
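If you'd rather hunt these down before Firefox finds them for you, a rough sketch like the following can help. It's Python, the function names are made up for this example, and it assumes comments run from <!-- to the first --> (the way most browsers read them). It's a quick regex pass, not a real HTML parser.

```python
import re

# A whole comment as most browsers read it: "<!--" up to the first "-->".
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def risky_comments(html):
    """Return the bodies of comments that contain two or more adjacent hyphens."""
    return [m.group(1) for m in COMMENT.finditer(html) if "--" in m.group(1)]


def replace_hyphen_runs(html, replacement="\u2014"):
    """Swap each run of adjacent hyphens inside a comment for `replacement`.

    The default replacement is an em dash; text outside comments is untouched.
    """
    def fix(match):
        return "<!--" + re.sub(r"-{2,}", replacement, match.group(1)) + "-->"

    return COMMENT.sub(fix, html)


page = "<!-- fine --><p>a -- b</p><!-- Blah blah -- then blah. -->"
print(risky_comments(page))
print(replace_hyphen_runs(page))
```

The double hyphen in the paragraph survives because only comment bodies are rewritten. Whether an em dash is the right replacement depends on the content; a decorative row of hyphens is probably better swapped for some other character entirely.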

is firefox wrong?

Technically, very technically, Firefox is right. The HTML 4 specification defines "--" as both the comment open and the comment close delimiter, while "<!" and ">" are the markup declaration delimiters. From the spec:

White space is not permitted between the markup declaration open delimiter("<!") and the comment open delimiter ("--"), but is permitted between the comment close delimiter ("--") and the markup declaration close delimiter (">"). A common error is to include a string of hyphens ("---") within a comment. Authors should avoid putting two or more adjacent hyphens inside comments.

Information that appears between comments has no special meaning (e.g., character references are not interpreted as such).

Note that comments are markup.

...so, Firefox is not "wrong". It's just following the spec to the letter (or hyphen, as the case may be). The other browsers have gone with a more human-friendly interpretation of the spec.

so are all the other browsers wrong?

I don't think the other browsers are definitively wrong either. They still comply with a different interpretation. Personally I read the specification the same way: "any instance of -- followed by > or whitespace and > is a closing comment".

According to this interpretation, "-->" and "--  >" are valid closing comments, but "-- blah" is not. It's also a reasonably logical approach to say that since a closing comment should be "-->", then the browser should ignore anything which is not "-->". That's pretty much what a comment is there for, after all - to ignore stuff.
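To make the two readings concrete, here's a small Python sketch. Both functions are hypothetical and only model the behaviour as I've described it in this post, not any browser's actual parser. They assume the input contains one complete comment.

```python
import re


def comment_body_strict(markup):
    """Strict reading: the first "--" after "<!--" closes the comment."""
    start = markup.index("<!--") + 4
    end = markup.index("--", start)
    return markup[start:end]


def comment_body_lenient(markup):
    """Lenient reading: only "--", optional whitespace, then ">" closes it."""
    start = markup.index("<!--") + 4
    close = re.compile(r"--\s*>").search(markup, start)
    return markup[start:close.start()]


sample = "<!-- Blah blah -- then blah. -->"
print(repr(comment_body_strict(sample)))   # ' Blah blah '
print(repr(comment_body_lenient(sample)))  # ' Blah blah -- then blah. '
```

Under the strict reading the comment stops at the first double hyphen, so everything after it is left over to show up on the page. Under the lenient reading the whole lot stays hidden.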

is the spec wrong?

Well the spec can't really be "wrong" I guess. It is what it is. But in this case I think the specification is a bit illogical:

  • I don't see the sense in random exceptions to rules - why specify vagueness? "this is the closing comment tag, except when it's not".
  • Why allow whitespace in the closing comment, when it's not allowed in the opening comment? If whitespace was prohibited the only valid interpretation would have been the complete "-->" and the whole problem goes away.
  • The "common error" note just confuses the issue. It does not say why including adjacent hyphens is an error, nor does it define any specific error handling method.
  • The double-hyphen approach can't be a technical requirement for producing a rendering engine, since the other browsers are able to restrict closing comments specifically to "-->".
  • It's impractical to expect that commented content will never contain multiple adjacent hyphens. Sure, we can live without hyphens in comments; but we shouldn't dictate content based on markup (no matter how small a detail it is).
  • And finally... it's irritating, so I'm going to be cranky at the spec ;)

Besides that, as the HTML5 spec notes, HTML has always been implemented by browsers as a language in its own right. There's no need to slavishly follow anything else. The more human interpretation could just as easily have been specified.

But then, comments appear to be a blind spot in W3C specs - how I curse the lack of a single-line comment in CSS... but I digress.

will html5 clear it up?

No. HTML5 actually makes things a little harder to remember, by documenting (hat tip zcorpan) the restriction that a comment can't end with a hyphen either:

Comments must start with the four character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS (<!--). Following this sequence, the comment may have text, with the additional restriction that the text must not contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a U+002D HYPHEN-MINUS (-) character. Finally, the comment must be ended by the three character sequence U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN (-->).

What's the saying about laws and sausages?
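For what it's worth, the draft wording quoted above boils down to a very small check. This is a Python sketch with a made-up function name, covering only the rules as quoted here and nothing else in the spec.

```python
def follows_quoted_rules(comment):
    """Check a whole comment, delimiters included, against the quoted rules."""
    # "<!--" plus "-->" is seven characters; anything shorter means the
    # opening and closing sequences overlap (e.g. "<!-->").
    if len(comment) < 7:
        return False
    if not (comment.startswith("<!--") and comment.endswith("-->")):
        return False
    text = comment[4:-3]
    # No two consecutive hyphens in the text, and no hyphen at the end of it.
    return "--" not in text and not text.endswith("-")


for candidate in ("<!-- fine -->", "<!-- a -- b -->", "<!-- trailing --->"):
    print(candidate, follows_quoted_rules(candidate))
```

Only the first candidate passes: the second contains adjacent hyphens, and the third has text that ends with a hyphen.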

conclusion

Perhaps there's some deep-seated syntactical reason for the double hyphen approach. The HTML4 spec's warning about "a common error" hints at some underlying logic without actually revealing it. Maybe it's related to the formatting of DTD comments. Maybe it was just a Netscape quirk which got turned into a standard. Maybe it's totally random.

It seems unlikely that Mozilla is going to change this particular detail, particularly as it's not a "bug" to follow the specifications; and besides, Firefox 3 Beta 2 still has the quirk.

Still, Firefox does correct omitted background colours by defaulting to white - so they're not above "helpful" rendering tricks. So maybe there's some hope on that front.

But in the meantime, no matter what we think we just have to live with it. It's yet another bit of web development minutiae to file away in your head, for the day you see it happen.


what i want from a new markup spec

So it has come to pass that the W3C has decided to take the WHATWG's HTML5 on board. It will form the basis of the W3C's HTML5. The goal is to have a public draft by June - yes, this year. Given that the spec now has to endure the full process of the W3C we'll see how that goes.

Anyway, this got me to thinking: what do I really want from a new markup specification? I've talked about this before but I realised that there's a difference between what I want and what I actually hope for :)

Ultimately it comes down to quite a small subset of the overall picture - the things I genuinely wish for in daily life. There are a few elements I'd like to see created or simply supported consistently by browsers.

basics

These are the basics, the minimum additions to fill in some blanks left by HTML 4.01.

  • An extensible, contextual heading/section system
  • A way to associate a CAPTION (or LABEL) with images and lists
  • Footnotes (which are really endnotes on the web)

It's a short list, since the reality is that the lack of decent CSS support impacts on my daily life far more than the limitations of markup. Frankly most developers out there still haven't mastered the semantics of HTML 4.01 so it's not like adding more elements will stop people making tag soup.

Meanwhile, semantics geeks like me will keep searching for the secrets of semantic alchemy with compounds and microformats. Where the markup is deficient we have ways of adding more meaning.

Although this is not an addition to a spec... I'd like to see real support for OBJECT so (amongst other things) we can replace images with the complex explanatory content required for complex graphics. Since certain popular browsers can't cope with this element, we still essentially don't have it.

headings

On the topic of headings, HTML5 does not do what I want since it still relies on H1-H6. I gather the HEADER element is meant to do some kind of section marking but frankly on a first reading it doesn't make a heck of a lot of sense. It certainly doesn't introduce any obvious practical benefit.

XHTML2's H and SECTION system is exactly what I want. I regularly wish I could write a code fragment with a heading, without having to know the heading rank. With the H/SECTION system, I could just define the fragment as a section and know that the heading rank will be sorted out in-situ.

If you maintain a small, stable site, headings may not have ever been an issue. But if you have ever maintained the code base for a very large site, you're probably nodding your head ;)

Even for a small blog headings are a problem. In your average blog the top two heading ranks are probably handled by the site template and CMS; but subheadings in actual posts have to be written in directly with heading tags. So you're probably inserting H3 tags right into your content. Too bad if you later want to change the post pages to have the post title as the H1 - then you'd have a jump from H1 to H3. You either have to stick with the original structure; or you have what I consider an invalid heading rank jump.

Consider the same blog, with H/SECTION... you can adjust the structure around the post as much as you like - it doesn't matter. The sections and corresponding heading ranks take care of themselves.
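Here's a rough Python sketch of what "take care of themselves" could mean in practice. It is not how XHTML2 defines anything; it just assumes a heading's rank is one more than the number of sections wrapped around it. The `heading_ranks` function and its `base_depth` parameter (how many sections the site template has already opened around the fragment) are names I've invented for the example, and it leans on Python's forgiving HTML parser rather than a proper XML one.

```python
from html.parser import HTMLParser


class HeadingRanks(HTMLParser):
    """Collect (rank, text) for each <h>, ranking by <section> nesting depth."""

    def __init__(self, base_depth=0):
        super().__init__()
        self.depth = base_depth  # sections currently open around us
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self.depth += 1
        elif tag == "h":
            self.in_heading = True
            self.headings.append([1 + self.depth, ""])

    def handle_endtag(self, tag):
        if tag == "section":
            self.depth -= 1
        elif tag == "h":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.headings[-1][1] += data


def heading_ranks(fragment, base_depth=0):
    parser = HeadingRanks(base_depth)
    parser.feed(fragment)
    parser.close()
    return [(rank, text) for rank, text in parser.headings]


post = "<h>Post title</h><p>...</p><section><h>Subheading</h><p>...</p></section>"
print(heading_ranks(post))                # post page: title is the top heading
print(heading_ranks(post, base_depth=1))  # index page: one section deeper
```

The post markup is identical in both calls; only the surrounding context changes, and the ranks shift to match without anyone editing the content.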

Headings aren't glamorous. They're not uber-funky AJAX-friendly form inputs which will sparkle in the sun and inspire dancing in the street. They are bread and butter elements which we use every day. HTML has never made them easy to work with, so like it or not they would be a killer app for a markup spec.

exclusions

In addition to what I do want, I think it's important to think about what a spec excludes. I think it's high time for specs to stop weakly deprecating things and flat-out remove them. I'd kill off the semantically neutral and visual-design-based elements - FONT, B, I, S, U etc... and definitely no get-out-responsibility-free cards for WYSIWYG editors!

The spec should just have them treated and rendered the same as SPAN. They're all semantically meaningless and can be replaced either with CSS or semantically-meaningful elements.

I should note that by my reading, WHATWG's HTML5 deals with B and I by creating semantic meaning for them. While that approach has some merit, I doubt the majority of developers will alter their usage according to the new semantics so those elements' usage will just be incorrect for new reasons. If everyone out there was to adopt the new semantics, I'd probably support the approach :)

wish list

These are things I want, but in the balance of things they're not the first things I'd argue to have included. That's the basics list :)

  • A dedicated caption or group label for sets of radio buttons - FIELDSET and LEGEND don't really work for long descriptions.
  • A drag-and-drop form input which is also keyboard accessible - keystroke/click to pick and keystroke/click to drop. Drag and drop is a useful paradigm but the possible solutions at the moment are not much good for keyboard or screen reader users.
  • An element to enclose extra info for assistive technology users, something a little like NOSCRIPT. Having to use CSS tricks to hide assistive content creates a clash between content and style; not to mention putting your content at risk of Google blacklisting. An element named something like ASSIST could be ignored by search engines and enabled by assistive tech like screen readers. [Note - this is a pretty sketchy idea, no doubt there are all sorts of practical issues. I'm not saying it's perfect. It's just that we need some legit way to give extra info to users who need it, without getting blacklisted from Google. A dedicated element might be the way to go - although proper support for OBJECT would help an awful lot with accessibility it still won't help the search engine issues.]

Another short list. I wouldn't say no to specific elements for navigation, but I don't think they would really fix problems. Accessibility basics give way to usability issues - if your navigation is hard to distinguish from content, it's more of a usability issue than a markup issue.

HTML5 has elements for navigation, document content, header, footer etc... I'm not a huge fan of the naming system but I can see the potential benefits. Still, such elements aren't really priorities for me. I'm still going to give users skip links, and Google has no plan to reward semantics. If - and it's an if - screen readers were to make use of these new semantic elements then I'd probably use them. But screen readers lag behind and users often can't afford the latest versions, so we're going to be using skip links for a while yet.

all i want for christmas...

So basically what I want from a new spec is a few basics that were missed the last time around. I'm not actually hanging out for bells and whistles, although HTML 5 seems full of them and no doubt we'll happily use them.

Has reality lowered my expectations? Perhaps. Will I be glad of some kind of update - something, anything - after all these years? Almost certainly. Remember it has been more than seven years since XHTML 1.0 became a recommendation. That's 70 web years - a long time between updates.

After all that time it seems that most developers had lost faith in the W3C. Taking on HTML 5 seems like the only rational way forward and it was probably the only thing the W3C could do to regain a little bit of relevance in the world of markup. The browser makers certainly seemed to have jumped ship to WHATWG's HTML 5, or were quietly preparing to do so.

When I first heard of the WHATWG I thought it was unnecessary - maybe even a little irresponsible - to break away from the W3C. Many years later I'm glad they did.

So anyway with a June deadline, here's hoping we have a new HTML spec in time for Christmas. Santa... I'll be a good boy, I promise.


thoughts about html

So, there's a coordinated call for feedback on the WHATWG's activities. There's a lot to cover in the call to action, so I'll just start with some thoughts about HTML...

I haven't read the WHATWG HTML 5 and Forms 2 specs "properly", so much as skimmed them. Forgive me, they are big specs with draft status from an as-yet unrecognised group. I don't read W3C specs for fun either ;) So this is mostly off the top of my head, you'll have to excuse me if something is already covered and I've missed it.

Headings and sections

I rather like the XHTML 2 version of headings and sections, as opposed to HTML 5's current system which seems to inherit all the problems of HTML 4 and none of the advantages of XHTML 2.

  • Why limit things to just six heading levels?
  • Why not declare hn as an extensible set of headings?
  • Why use specific headings if you're using sections - just set a heading for each section and let nesting take care of the rest.

I'm not a fan of the W3C's specific example though, since I feel that each section should start immediately with a heading. I'd like to see the paragraphs that open a section ahead of its first heading removed. But otherwise this system seems simple and elegant to me (although maybe I'm just weird - I'm aware that's a possibility!):

<body>
<h>This is a top level heading</h>
<p>....</p>
<section>
    <p>....</p>
    <h>This is a second-level heading</h>
    <p>....</p>
    <h>This is another second-level heading</h>
    <p>....</p>
</section>
<section>
    <p>....</p>
    <h>This is another second-level heading</h>
    <p>....</p>
    <section>
        <h>This is a third-level heading</h>
        <p>....</p>
    </section>
</section>
</body>

In anticipation of the argument "documents shouldn't be so big they need more than six levels", I'll simply suggest you go and convince all the world's lawyers and legislators then get back to me :) Besides, it's entirely possible to have more than six levels in a short document that would not be suitable for presentation in multiple web pages.

Better lists

I think <ol>, <ul> and <dl> should all have a <caption> element or a way to explicitly associate a heading. We're grouping information together after all, I think it makes sense to be able to explicitly state what the grouping is all about. It's one of the really useful things you can do with tables.

I also think ordered lists need more sophisticated numbering systems - we should not have to resort to CSS or use invalid code! e.g. we should be able to start an <ol> from, say, 11, because 1-10 were on another page. I'm specifically thinking of search results which are commonly split into multiple pages, yet each page should not restart the list count. Currently it's only valid to set the value of each <li>, which is absurd - so the HTML 5 spec's .
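For the search results case the arithmetic is trivial, which is what makes the restriction so annoying. A minimal Python sketch, assuming a start attribute on <ol> is available to you (it's deprecated in HTML 4.01, so it only validates against the Transitional DTD); the `results_list` function and the ten-per-page default are made up for the example.

```python
from html import escape


def results_list(items, page, per_page=10):
    """Render one page of results as an <ol> that continues the overall count."""
    start = (page - 1) * per_page + 1  # page 1 -> 1, page 2 -> 11, ...
    lines = [f'<ol start="{start}">']
    lines += [f"  <li>{escape(item)}</li>" for item in items]
    lines.append("</ol>")
    return "\n".join(lines)


print(results_list(["Eleventh result", "Twelfth result"], page=2))
```

Setting the count once on the list is all a paged result set needs; nothing has to be written onto the individual items.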

Labels for radio button groups

I don't think HTML 4.01 provides a satisfactory method of labelling/captioning a group of radio buttons. Each radio button gets a label; but really the group needs something to describe the purpose of the set of inputs.

You can use a <fieldset> + <legend> combination for short descriptions, but it feels like a hack (not to mention the practicalities of hacking CSS to get browsers to display long legends!).

Captions for images

I'm not quite sure how this could be approached; but I think a visible caption for images would make sense. Hidden text could then be more akin to longdesc than alt. The <object> element provides an excellent model for alternate content, but not a caption.

The cite attribute

While this is ok, I do wonder at the requirement for a URI. How do I choose a URI to cite Shakespeare for example? What one single URI makes sense? Plus long experience shows us that URIs don't live forever - who remembers to check their cite URIs?

So why not an attribute for the name of the person and an attribute for the title of the work they are being quoted from? Sure, there's potential for ambiguity, but don't try to tell me a URI could not lead to a document which talks about ten John Smiths.

<p creator="covenant" work="we want revolution" cite="http://www.google.com.au/search?&q=%22we+want+revolution%22+covenant+lyrics">we want revolution<br />
constant evolution<br />
start your engines blow your fuses<br />
burn the bridges for the future<br />
this is our solution</p>

The <cite> element

<cite> doesn't make any sense to me either, since there's no explicit association with a quote. Take the example from the HTML 5 draft:

<p><q>This is correct!</q>, said <cite>Ian</cite>.</p>

So long as there is only one Q/CITE pair in the entire document, we're ok. After that, we're just guessing - and while a human might guess fairly well, an indexing system has no grasp of human context. So, perhaps a for attribute is in order:

<p><q id="ians-assertation">This is correct!</q>, said <cite for="ians-assertation">Ian</cite>.</p>
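To show why the explicit link matters to a machine, here's a Python sketch of what an indexing system could do with it. Bear in mind that for on <cite> is only my suggestion above, not part of any spec, and the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class QuotePairs(HTMLParser):
    """Collect <q id="..."> text and <cite for="..."> text, keyed by the id."""

    def __init__(self):
        super().__init__()
        self.quotes = {}    # id value -> quoted text
        self.sources = {}   # for value -> cited text
        self._store = None  # dict currently receiving text, if any
        self._key = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "q" and "id" in attrs:
            self._store, self._key = self.quotes, attrs["id"]
        elif tag == "cite" and "for" in attrs:
            self._store, self._key = self.sources, attrs["for"]
        else:
            return
        self._store[self._key] = ""

    def handle_endtag(self, tag):
        if tag in ("q", "cite"):
            self._store = None

    def handle_data(self, data):
        if self._store is not None:
            self._store[self._key] += data


def pair_quotes(markup):
    """Return (quote, source) pairs; source is None when no cite points at it."""
    parser = QuotePairs()
    parser.feed(markup)
    parser.close()
    return [(text, parser.sources.get(key)) for key, text in parser.quotes.items()]


markup = (
    '<p><q id="ians-assertation">This is correct!</q>, '
    'said <cite for="ians-assertation">Ian</cite>.</p>'
    '<p><q id="reply">Not so!</q>, said <cite for="reply">Ben</cite>.</p>'
)
print(pair_quotes(markup))
```

With two quotes in the document there's no guessing involved: each one is matched to its source by id, regardless of where the cite happens to sit in the text.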

The <iframe> element

Why keep <iframe> in HTML 5 when the spec also includes <object>? Straight question. From a quick read, <object> seems to take care of everything that <iframe> can offer.

More...

The HTML 5 spec includes quite a few all-new elements such as <nav>, <x>, <m> and <progress>. Some are relatively logical, but others like <progress> just seem very odd to me. A progress bar is not a permanent content item, it's a temporary state. However I'll save real discussion of these elements for another day.

So what do you think? Join the discussion!


about

Web development and standards, as seen by Ben Buchanan.
