Tim Berners-Lee
Date: 2008-05-19, last change: $Date: 2019/05/20 21:36:35 $
Status: Final. Thanks to many TAG members for feedback in 2008. This was written as a document for input to a W3C Advisory Committee meeting. It should be read as a comment on the situation at a particular moment when W3C had to decide between incompatible paths for HTML. A copy of the document was put in the Design Issues series in 2019, following a suggestion by Henry Thompson, as it was not previously public.

Up to Design Issues


HTML and XML

W3C AC meeting, 2008-05-19

The goal of this document is to investigate the possibility, over time, of healing the rift between the HTML5 and XML technologies, to achieve interoperability between software and markup which are currently on two sides of the fork.

The method is to try to understand the motivations of the various positions and address those at source, not to use them to decide that a particular fork is "right".

The content of this essay is accumulated from many sources. It was given in large part as a talk to the May 2008 W3C Advisory Committee meeting, posing a series of questions about future directions for HTML. Discussion of this topic is directed to the W3C TAG list, www-tag@w3.org (archive).

Introduction

The development of Web technology advances at different speeds, on different fronts, and at different times. Occasionally it seems that some strategic thinking is necessary in order to ensure that the system as a whole will continue to work well and evolve smoothly. This is one of those times.

The fork

The purpose of this essay is not to detail the history, but let me start by summarizing quickly to set the context. HTML is the most widely deployed document format, by a long way, in the history of computing. XML, too, is very successful, being a framework for many formats, public and private, in many different applications. As a simplification of the original SGML, on which HTML was based, XML allows code to be lighter and faster than SGML systems, and makes life easier for developers. We have seen in recent years a hiatus in the development of HTML, followed by a more recent surge along two branches. One branch of HTML, XHTML, which switched from using SGML to using XML, provided various new features and used the XML namespaces extensibility system, but was not widely deployed in the dominant Web browser, Internet Explorer. Another branch, HTML5, has been specified with the explicit goal of describing exactly the rather contorted behavior existing browsers implement to handle the legacy of Web pages found in practice on the Web, as well as introducing a different set of new features (video tags, etc.). While it provides for an optional XML serialization, HTML5 does not in general use XML, and specifically does not use XML namespaces. Below we unravel the separate criticisms of XML and of XML namespaces.

The existence of the fork is a serious problem, both because a fork in standards is fundamentally costly for the whole community going forward, and because of the technological problems which are highlighted in the issues which each branch has with the other branch.

Arguments for cleaning up

Now, there may be extreme versions of the HTML5-fork style which maintain that everything is fine, and that the mess is just life; we will have to live with liberal parsers forever, and that is the only realistic approach. However, not only is the code stack horrible to maintain, but pages that are not well formed are hard to maintain, process, and reuse.

Also, there is a whole world of XML-based software in the enterprise, some of it SOAP-based services, some of it more document-oriented, whose developers could not imagine for a moment deviating from the XML path by allowing this sort of liberalness, as systems would just stop.

Can we assume that the HTML Web and the XML enterprise systems will be non-interoperable worlds? Possibly, but with a constant cost whenever attempts are made to move data from one to the other, to embed some HTML product description into an order, for example. The boundary will never be clear, as in fact there is an overlap. Some suggest making a version of SVG which is in HTML5 (liberal) format, while others use XML engines to process SVG. People embed HTML in RSS and Atom feeds and in RDF (using RDF's XMLLiteral datatype), and RDF parsers do not have embedded HTML5 parsers, so the embedded HTML has to be well-formed. And so on.

To continue to promote messy code on the Web is to create problems and pain later on. To promote clean XML is a current pain for real users which they will not put up with. How can we escape from this? To understand possible paths forward let us look at how the language is typically extended, on each fork.

Centralized HTML extensibility

The HTML community has not embraced URI-based extensibility. In fact, decentralized extensibility is not a general goal to many. This is not surprising. HTML is the most widely deployed data format in history by a long way. Every Web browser is expected to be able to handle it. Its evolution is a form of ongoing negotiation between users, Web developers, and browser developers (and their management). The HTML language itself has a unique place among other languages. The model of a large number of small overlapping communities, which was the target of the RDF design, does not apply to the HTML language.

It is not surprising that requiring each HTML document to start with a namespace declaration irks those for whom the whole world is HTML. When everyone is deemed to know the HTML spec, why have it vectored to by the XHTML namespace and the namespaces specification?

Decentralized extensibility allows new modules to be added to a language by third parties, but why bother when the modules which are generally proposed for addition to HTML, such as SVG, MathML and XForms, can be counted on the fingers of one hand? In this case the HTML design authority can simply add new modules themselves. If extensions are needed, then they can just be added to the specification. The list of modules can be made available to everyone, as all systems are expected to be programmed with an inbuilt knowledge of the HTML spec.

While browser plug-ins can be dynamically downloaded from the net, in general HTML extensibility from one level to the next has not been done in that way at all. Using the foundational rule that browsers have from the beginning ignored tags they did not understand, new HTML tags have been added in a calculated way so as to hopefully maximize the benefit and minimize the damage to the community as a whole. Historically, the HTML working group did not make any commitment that the meaning of tags would not change over time, only that change would be made as responsibly as possible.

Decentralized extensibility

Let us investigate the philosophy, now, of the XML branch.

In a world in which there are very many XML-based technologies, and many, many groups needing to create new ones and extend old ones, a major motivating requirement has been decentralized extensibility. This is the requirement for a group to be able to define the terms involved in the new technology without having to get an audience with, and the agreement of, a central committee. (Examples of centralized extensibility include the Dewey decimal system, the Library of Congress cataloging system, and the international phone number space.)

URI-based extensibility

In the Web environment, decentralized extensibility can be done using HTTP URIs. Basically this means that any group which can lay claim to some (normally HTTP) URI space can pick a URI for a new feature, without having to go through any centralized clearing house (other than the domain name system). It also means that the namespace URI can be used to give pointers to developers, or (with very persistent caching) to machines, that are willing to learn about the new features.
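
For illustration, here is a minimal sketch of the idea (the stock namespace URI and element names are made up; the geo namespace is the W3C Basic Geo vocabulary): a group coins a namespace under a domain it controls and mixes in terms from an independently developed vocabulary, with no central clearing house involved.

<!-- A hypothetical group's own namespace, coined under its own domain -->
<stock xmlns="http://example.org/2008/stock"
       xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <item id="widget-42">
    <!-- Terms from a separately maintained vocabulary mix in without collision -->
    <geo:lat>51.477</geo:lat>
    <geo:long>-0.001</geo:long>
  </item>
</stock>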

Historically, namespaces grounded in URIs were actually a requirement on XML Namespaces imposed by RDF, which was developed in parallel with XML. RDF is aimed at a multitude of communities all independently agreeing on different though connected sets of terms, and then being able to merge the datasets which use those terms. URI-based extensibility has been very successful in the world of RDF itself, as many ontologies have been developed without central coordination by the RDF working group, which indeed closed long ago. One might argue that arbitrary non-RDF XML applications cannot use URI-based extensibility in the same way, as they do not have the very powerful "ignore triples you don't understand" model of RDF, but a counterexample would be the use of independent namespace-qualified tag names in SOAP messages, headers and content. Another example would be the EXSLT group, which uses namespaces to extend XSLT.

Follow-your-nose principle

The use of HTTP URIs for extensibility is not just a question of allocating names unambiguously. The fact that HTTP URIs have ownership means that there is a responsible authority who can be traced and called upon to explain what a term is supposed to be for and how it relates to other terms.

In fact, as we use HTTP URIs, one can in real time look up that information. Although, for the sake of the servers, the looking up of a namespace document should be viewed as an installation process with a permanent cache, a machine can pick up information at run time which will allow a system to usefully process a vocabulary which it has not encountered before. This again is much more developed in the RDF world, where ontologies can contain enough information for a new user interface to be created on the fly.

The follow-your-nose principle, then, allows a form of bootstrapping. Like any bootstrap, though, it needs a base to start from. In this case, there is a core set of specifications which a client has to understand in order to do the bootstrapping. Examples of these core specs are Ethernet, TCP and IP, DNS, HTTP, and the Internet Content Type (also known as MIME type) registry.

By one model, a content type of text/html in a HTTP response indicates an HTML document. A content type of application/xhtml+xml indicates an XHTML document.

By another model, a content type of application/xml indicates an XML document, and if, within such a document, namespaces are used for the document element, then the XHTML namespace URI (http://www.w3.org/1999/xhtml) within it indicates an XHTML document.
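
A sketch of the two models side by side, using the content types and namespace URI just mentioned (document bodies trimmed):

<!-- Model one: the content type alone identifies the language. -->
<!-- Served with: Content-Type: application/xhtml+xml -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>An XHTML page</title></head>
  <body><p>Hello</p></body>
</html>

<!-- Model two: served generically with Content-Type: application/xml; -->
<!-- the namespace on the document element identifies it as XHTML. -->
<html xmlns="http://www.w3.org/1999/xhtml">
...
</html>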

Recent controversies

Recently, controversies have arisen as various groups have attempted to create new feature sets suitable for adding to HTML and similar languages. One of these is ARIA, which allows a Web page to be annotated to explain the user interface function of various elements, and another is RDFa, which allows a Web page to be annotated to explain the meaning of various elements and add more data. Each of these technologies, like many other technologies one can imagine, works by adding new attributes (and sometimes elements) to the markup.

In ARIA, about 30 new attributes are added. In the XML fork, in one design, these were added using an aria namespace, as, say, aria:foo, while in the HTML5 fork, they were added as aria-foo in the HTML namespace. The arguments about these choices were fairly long and complex, and involved for example discussions of what exactly legacy browsers would do with the DOM in each case. The users of the spec are not just document writers, but also those who write scripts to access and interpret the attributes. In any event, it seemed there was no way one could write the same thing in both languages, at either the markup or the script level.
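
A sketch of the two spellings (the aria namespace URI below is a placeholder, not the one actually proposed; role and aria-checked are the names the HTML5 fork adopted):

<!-- XML fork style: a namespace-qualified attribute -->
<span xmlns:aria="http://example.org/aria#"
      role="checkbox" aria:checked="true">Subscribe to updates</span>

<!-- HTML5 fork style: a hyphenated name, no namespace machinery -->
<span role="checkbox" aria-checked="true">Subscribe to updates</span>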

In RDFa (derived from "RDF in attributes"), the requirement was to add new attributes to allow semantics to be given for embedded HTML data. The GRDDL specification, an existing recommendation for pointing to a transform script which extracts RDF from a document, is a possible point of leverage in the follow-your-nose story, if one takes GRDDL as being, for Semantic Web clients, part of the bootstrap core functionality.
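
A minimal RDFa sketch, using the Dublin Core vocabulary for illustration (the about URI is made up):

<p xmlns:dc="http://purl.org/dc/elements/1.1/"
   about="http://example.org/books/weaving">
  <!-- The property attributes give the text machine-readable meaning -->
  The title is <span property="dc:title">Weaving the Web</span>,
  by <span property="dc:creator">Tim Berners-Lee</span>.
</p>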

In the XML fork, extensibility is achieved using namespaces, but in the non-XML fork, there are a number of less obvious options, which include the addition of all new attributes to the HTML world, as though they were in the HTML5 spec. In this case, the social question is: can a group just announce that it is adding attributes to the HTML namespace*, or does it have to get them put there, or at least agreed, by the HTML design authority? In the normal world of standards, the latter is the rule, as each specification needs, it is felt, a coordinating body. In the HTML world, though, the introduction of new tags by vendors, and of new attribute values (such as rel="nofollow"), is often done without such coordination; the 'marketplace' decides which tags live and which don't, and the low probability of collision replaces the use of clearing houses for new names and values. [fn1]

In practice, then, ARIA and RDFa have proposed to add new attributes (and/or elements) to HTML, deeming them to be added by dint of the existence of the new specifications, seeing whether they get adopted by a community of readers and writers once specified, and seeing whether they appeal to those involved in the mainstream HTML language's evolution as worth either inclusion or reference.

So, can we just use a different model for HTML, because of its special place among languages?

While these are two recent examples, one soon discovers many examples of the development or integration of new technologies in this area:

The SVG community has made a very modular specification intended to be mixed with other markup languages, originally using namespaces.

The Mobile HTML specs have used XHTML very cleanly, and XHTML has been integrated with SVG in some cases, following the XML fork. SOAP systems enclose all kinds of XML in their payload, and can include XHTML within that where textual data is present in a remote service invocation or response.

Meanwhile, by contrast, suggestions have surfaced that SVG should be integrated into HTML5 simply by pouring the SVG tags into the HTML specification, using no explicit extensibility controls at all.

So it is impossible to draw a line around HTML as a special case isolating it from the mass of different communities developing their individual applications. So what can be done?

Scale free space

The Web is, as I have mentioned before, composed of many different communities of different sizes, and is often seen to have scale-free properties. That is, for example, there is no 'typical' number of inbound links to a page; the distribution follows a power law. This is partly a measured phenomenon of the Web; it is also a phenomenon which occurs in many other systems, and I have an unproved hunch that it represents a form of optimal arrangement for society to function effectively. It may be the optimum tradeoff between the ungainliness but great interoperability of a central language and the agility of small communities using a Babel tower of different languages.

It is a characteristic feature of such scale-free systems that they have one leading player, closely followed in the popularity ranks by other players in decreasing popularity.

In the case of vocabularies on the Web, we have HTML at the largest scale, in which tags are just tags and everyone is supposed to know them. One could argue that SVG actually belongs at this level and should be, and will be, as widely deployed as HTML.

At the next level we have languages which are not HTML but still address the needs of very large communities. SVG, MathML and ARIA are examples.

There are many medium-sized communities. The FaceBook Markup Language (FBML) is an example of a vocabulary proposed by one website, though a significant site. Atom feeds for various things can be considered at this level. Also, enterprise systems include many, many XML namespaces which are developed, for example, in SOAP-based applications.

Continuing on (roughly) down the scale we get to vocabularies for protein scientists and history museums, for scout troops and bird fanciers; we get vocabularies invented for today's experiment in a lab, for the import of a particular spreadsheet and so on.

It is reasonable for us to not just sit back and admire the scale-free nature of the space, but to actively engineer for it. What does this mean in this case? It means that we should engineer the system with an understanding that HTML is a dominant language (at the moment) used by a very large community of individuals, but with an understanding that there are many other communities, many other languages and specifications, and that these often have to be able to connect with the HTML architecture.

I would like to investigate the possibility of us deliberately designing ourselves a system which is optimal, in that it addresses the needs of all parties, and brings the two branches of the fork into the same space, so that there is a continuum of extensibility. We start by looking at the issues and problems that arise when attempting to use XML and namespaces as the basis for the HTML5 fork.

XML Issues

So what are some of the issues with XML which drive the HTML5 fork away from becoming closer to the XML fork?

Issue                                                        | Motivation
It is a pain to have to add quotes around attributes         | Ease of use
It is a pain to have to spell the entire tag in the end tag  | Ease of use
Parsers must stop on error                                   | Unfriendly, impractical
Namespace URIs take too much space                           | Impractical
Non-nested begin/end tags have to be accommodated            | Legacy tag soup

At the top are the ones which one could imagine being cured by a redesign of XML. To the bottom are the things which I would resist changing in HTML. In the middle are areas where one could imagine some compromise.

One fundamental difference of philosophy between the forks has been the attitude to deviations from the specification. In the past, people making Web pages have made many deviations from the specifications, so long as they worked. The result is a legacy of Web pages which have all sorts of errors. It has been essential in the market for browsers that they work with these pages. The approach taken in HTML5 has been to document the behavior of these browsers, so that everyone knows what it is. The goal is that all old pages still work, but there can now be a well-defined algorithm and a test suite, instead of a heap of connected kludges implemented separately at great cost by each browser maker. This world is, then, very liberal in what Web page writers are allowed to do and in what client software has to accept.
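
For illustration, a made-up fragment of the kind of legacy markup at stake; an HTML5-style parser recovers a sensible DOM from it, while a strict XML parser rejects it outright:

<p>This paragraph is never closed
<b><i>Mis-nested emphasis</b></i>
<img src=logo.gif alt=Logo>
<li>A list item with no surrounding list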

The initial approach taken in the XHTML fork was very different: it was completely conservative. Recognizing that the situation which had arisen with legacy HTML was a big mess, XHTML started anew. A new content type was allocated for XHTML. The XML specification required that any processor deliver no results if the input was not well-formed XML. The idea was for XHTML to start a new branch of clean content which would eventually outgrow the old, and which would be a platform for much cleaner growth, with namespace-based extensibility, and the addition of SVG, MathML, XForms, and enterprise-specific extensions in a well-defined way. Organizations and individuals who have adopted XHTML are often vocal in their praise for the benefits which they experience, but this has evidently not led to any substantial inroads into the dominance of HTML on the general public Web.

Robustness Principle

The Internet specifications, since RFC 793, have been developed with guidance from the principle that one should be conservative in what one generates, but liberal in what one accepts. This is often a useful maxim when writing a program to send or receive messages, and when there is an area of the spec open to interpretation. So one would send lines of limited length but accept lines of any length; always send the same case as in the examples, but accept either upper or lower case; and so on.

This maxim works when two programs are communicating with short-lived messages, and when there is feedback between engineers when a system doesn't work. It has not worked so well on the Web, because Web page designers in fact paid no heed to being conservative. They were not in general engineers who had read the spec at all, but random people copying each other's Web pages and seeing what worked when they modified them. Further, Web pages have a long, hopefully very long, lifetime. Once a Web page is out there with badly nested tags, it is out there for good. So on the Web, there are some page creators who are no longer present, and others who are around and are open to feedback and to new languages. Should the robustness principle be used or, if not, what?

Incentives

To look at a system which includes people, one must study the incentives for those people. Suppose there is, on the one hand (and on the X axis) a certain effort which a Web page author puts into the writing of a Web page, to eliminate various levels of error, and on the other hand (and on the Y axis) a reward given, in part, in terms of the quality of the rendered Web page on the range of clients perceived to be of interest.

[Figure: a big step]

In the case, shown above, of the conservative, XML fork browser, the page must be completely correct or nothing is rendered. The writer who has an almost perfect page is motivated to fix it, but the writer who has a page with several errors is not, as there will be no noticeable reward for incremental improvement. It is not very surprising that the majority of Web users whose pages would have started off near the left of the graph did not make it to the right when serving their code as XHTML.

[Figure: another big step]

Some errors we may consider hopeless even in HTML, in that no useful recovery seems possible for them. In the case of the liberal browser (above), the reward for a hopeless page is zero, but for a page with any other level of errors, it in fact is rendered completely by the browsers. Therefore, a writer whose page is hopeless is motivated to clean it up a little bit. But the writers of pages which have other levels of error are not motivated to clean them up at all.

So while the liberal and conservative forks have very different philosophies, they share one thing: They do not motivate the writer of a Web page to progressively improve their offering.

Bringing the fork together

The solution, as I see it, is to look at the motivating slope and fix it. When the user is provided with incremental rewards,

[Figure: a slope is better, giving a motivating slope]

then he or she will move, hopefully, up the slope.


What does this mean?

It means distinguishing more than the two possible outcomes of success and failure. We need to make a slope, so we need different levels.

It means recognizing all the errors as errors, but also allotting them an importance level, so that users can concentrate on fixing the more important ones, or perhaps the ones which give the best improvement per effort ratio.

There has been push-back against the idea of showing error indicators on Web pages, because no browser maker wanted to be the one giving a sub-optimal user experience. This can change in several ways. It can change because user attitudes change. Al Gore points out repeatedly that we need to clean up the planet. People understand when we have to do some cleaning up. A browser which does not have these features would be seen as irresponsible in this context.

So step one is to have a tool bar which slides down when a page has errors, giving a rating to the page out of 100, and allowing drill-down by interested users. It is true that most users are not interested and are not able to do something about a random site they visit. However, they might still be interested in the fact that the site is not clean. People who buy a business may be interested in knowing whether that business pollutes the planet. Similarly, people may be interested in knowing whether the HTML that they publish or the sites that they visit are polluting the Web.

Another possibility is to allow users to specify which Web sites they are connected with. Anyone involved in the production of a Web site (up to the CEO and board for the company!) should be able to put that site into the list of sites for which they want more detailed feedback.

Changing the browser

We can also be smarter. We can make it so much easier for people to do the right thing.

The classic way the Web spreads is by the "View Source effect". You like someone's Web page, you do a View Source operation in the browser, and then you copy it and paste it into your own Web page. This is the way Web technology has spread, and also of course the way all those problems have spread. Suppose, whenever I look at the source of a page, I see a cleaned-up version? Suppose it is impossible (or very difficult) to actually see the original source without it being heavily marked with the places where it has syntactic errors? Suppose that if I copy it to the clipboard, I get the cleaned-up version? Suppose this applies to "Save As" too? The code to clean up a Web page is not that big by today's standards. There are many implementations, Dave Raggett's tidy being a well-established one, now also found in Marc Gueury's HTML Validator Firefox extension.

(One way to do it, of course, is simply to re-serialize the DOM tree of the page as loaded. This loses the formatting, which in general is a disadvantage, particularly when one needs to compare versions of source files, or use source code control systems which do so.)

It wouldn't have to be perfect. It would have to move a page substantially along the curve toward the clean end of the spectrum.

There are some things which browser manufacturers could do right now, which could in fact change the ecosystem of developers and pages so that in a year or two a significant number of new pages would be produced cleanly, and in a few more years, as the new content starts to dominate, the majority of the pages you see on the Web would be clean.

We are not talking here about a switch to application/xhtml+xml, but about continuing to use the MIME type text/html and progressively improving the content we produce so that it becomes cleaner.

Would that be a good idea, and what exactly would it mean?

Well, in fact, if everything were XML, some people might regard it as actually less useful than the current HTML, when it comes to quotes around attributes. So this forces us to look at whether XML could itself change, to meet HTML somewhere between their current positions. I would suggest that some of the things we would have put on the slope, some of the cleanliness goals, we simply remove and declare to be non-goals. But to do that, we have to change XML.

Changing validators

It turns out that the opinion of the W3C validator carries a large amount of clout in the community. Specifications such as microformats and ARIA have been affected very much by what can be done without breaking validation. Now, the validator to date has been a DTD-based validator, so it checks that the document conforms to a given grammar. It requires, to be happy with the page, a DTD declaration which specifies what grammar the author of the page thought the page was written with.

DTD validation will not allow the normal forms of XML extension, the addition of new elements and attributes. This is very ironic in a way. The "X" in "XML" is for "Extensible". The whole point is that an application written in XML can be extended by adding new element and attribute types. With namespaces, these elements and attributes become grounded in the Web, and URI space provides a way of avoiding any collision.
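
For example (the ex: namespace and attribute are made up), the following document is well-formed and namespace-correct, yet a validator checking it against the XHTML 1.0 DTD will flag the extension attribute as an error, because the DTD has no way of saying "attributes from other namespaces are allowed here":

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ex="http://example.org/terms#">
  <body>
    <!-- An attribute from an independent vocabulary -->
    <p ex:rating="5">A paragraph carrying an extension attribute.</p>
  </body>
</html>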

In this vision of a way forward, validators, or perhaps one should say page checkers (as "validate" is a word claimed by XML for DTD validation), should give a grade to a page, judging it on several counts at various levels: at the error level, at a warning level, and at an informative level.

In fact, it may be that the browser, now a computing platform of some power, is the best platform for a page checker in the future. It is possible that the same code could be deployed both in a third-party server-side checker harness and in a client-side checker.

Changing XML syntax

The arguments against changing XML are very strong. Its single great value is its single common specification, its stability. It isn't perfect, but it is common across so many different applications. The attempt to create an XML 1.1, which tried merely to introduce a few new Unicode characters, failed.

The arguments for changing it are that the alternative could be worse: it could be that the HTML5-style syntax, with errors of all kinds being completely ignored, propagates first into SVG, then RSS, then RDF, and then SOAP. The entire stack would have to be built so as to be able to do HTML5-style error recovery, with the special knowledge of various HTML tags that comes with it. Even if you aren't using HTML itself, you would still have to use that parser.

What would be changed? It would be recommended that parsers recover from errors where they can, and indicate all errors above a certain level of seriousness to the user.

Now everyone I have spoken to about this has their own list of things they would like to change in XML, if we were to do it, so deciding what goes in would be an interesting communal decision. Here is a list of some things which have come up. Some are better ideas than others, in my humble opinion.

(See Tim Bray 2002, Norm Walsh on XML 2.0 in 2004 and 2008, ...)

Let's go through these in order to clarify what we are talking about.

Optional Quoting of Attributes

The quoting of attribute values I have already mentioned. The quotes in SGML were not necessary. When SGML was simplified to XML, the quotes were made mandatory. This simplified the parser, but it complicated life for writers, and required more keystrokes, disk space and bandwidth. It also made the source more difficult to read by increasing clutter.
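
A small sketch of the difference; under the relaxation, the unquoted form would be acceptable when the value is a single simple token (the stylesheet name is just for illustration):

<!-- XML today: quotes required -->
<link rel="stylesheet" type="text/css" href="style.css" />

<!-- With optional quoting: fewer keystrokes, less clutter -->
<link rel=stylesheet type=text/css href=style.css />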

Implied namespace

The implied namespace idea comes from a consideration of the follow-your-nose argument above. If an HTML document is delivered with a Content-Type which labels it as HTML, then why on earth does this information have, in XHTML, to be repeated in the document as the namespace on the root element? It is a waste of space and an imposition on the user. However, whether the page has an explicit namespace or not, I would like to be able to parse it and look for elements in the DOM using the XHTML namespace. So I would like all HTML elements to be deemed to be in the XHTML namespace. This is, I think, actually a sensible change to the architecture, namely that:

  1. With XML-based content, the MIME type registry contains an implied namespace; for text/html, this is the XHTML namespace
  2. The XML parser interface is extended to include an extra parameter, the implicit namespace

(Note that while this is in effect a default for the namespace, the term "default namespace" already means the namespace for elements with no prefix, so we cannot use that term for this concept: the namespace to assume when there is no namespace declaration at all.)

This would make SVG documents smaller as well, and who knows what else. It could be useful for cutting down the transmission time for small XHTML documents to mobile devices, and so on.

What about mixed documents? Well, the HTML MIME type could be registered so that the implicit default namespace is XHTML, but also so that there is an implicit s: prefix for SVG and m: for math, and so on. A machine-readable list could be made centrally available, changed occasionally, and downloaded at install time (not run time, to save the servers!) by XML2 parsers.
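
A sketch of what such a mixed document might look like under this scheme; the s: prefix is supplied by the text/html registration rather than declared in the document (this is hypothetical syntax, not anything defined today):

<html>
<head><title>Implied prefixes</title></head>
<body>
  <p>A circle, using the implied s: prefix for SVG:</p>
  <!-- No xmlns:s declaration needed; the registry supplies it -->
  <s:svg width=100 height=100>
    <s:circle cx=50 cy=50 r=40 />
  </s:svg>
</body>
</html>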

Switching namespaces

The fact that documents have a well defined meaning as grounded in the Web traces back to the terms being defined using URIs. This does not, however, mean that URIs have to be embedded in their full glory at every step. Namespaces already serve as an abbreviation system. Now we could add a sort of chaining within documents. For example, one could define an <svg> element in the HTML namespace which would have the implicit effect of switching namespaces to the SVG namespace.

I am not sure that this is a good idea, as it makes the off-line information needed more complex: one would have to have a way of specifying this in a schema language of some sort, and it would be impossible to parse the document correctly without that schema document. But it could be a valuable avenue to explore if push-back against namespaces continues.
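
A sketch of how such switching might read (again hypothetical; the rule that <svg> switches to the SVG namespace would have to live in a schema or registry outside the document):

<p>Some text, in the HTML namespace.</p>
<svg width="100" height="100">
  <!-- By the chaining rule, everything inside <svg> is implicitly
       in the SVG namespace, with no xmlns declaration -->
  <circle cx="50" cy="50" r="40" />
</svg>
<p>Back in the HTML namespace.</p>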

Remove DTDs

The DTD syntax within XML is a historical artifact. It was part of SGML, used for defining grammars of SGML applications, but not itself using SGML syntax. The DTD language was kept in XML as at the time there was nothing to replace it. Since then, DTDs have been joined by XML Schema, Relax-NG, and other languages for specifying constraints for applications of XML. Meanwhile, DTDs have fallen behind in that they do not naturally accommodate namespaces. A large amount of infrastructure has been constructed around them in the XHTML fork's HTML Modularization spec.

The main reason for keeping DTDs in XML systems has been that they are needed for defining entities, and specifically character entities. The solution to this, I would suggest, is to define a namespace of tools to do this in XML. One could even take part of the xml: namespace.
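
For context, this is the sort of thing that would go away: an internal DTD subset kept around only to declare an entity (the entity here is made up; note that numeric character references such as &#233; already work without any DTD). What the replacement namespace of declaration tools would look like is left open here.

<!DOCTYPE html [
  <!ENTITY company "Example Corporation">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p>Copyright &company;. An accented letter: &#233;</p></body>
</html>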

This would also mean that one would lose the feature of default values for attributes and fixed values for attributes. These are a strange feature of the language in many ways. They create an unfortunate difference between a raw infoset and a post-validation infoset. They allow the DTD designer to say "even if you didn't put this attribute in, you still meant it". Of course, the semantics of the application language can always be defined to have defaults, even when they are not provided by a DTD processing step.

There have of course been many discussions of this topic over the years.

Processing Instructions

Processing Instructions (PIs) are a strange corner of the XML specification which could be removed to advantage. PIs provide a form of "machine-readable comments" which sit between code (normal markup and text) and comments (which should be completely ignored in the application semantics).

Where one is tempted to use a PI, one should instead use a namespace to add an attribute, for example to the root element. That allows one to have many levels of hint to different possible processors and interpreters about different things. After all, why have three levels when you can have n? (In fact, in RDF, I would often recommend that comments be left as rdfs:comment statements, so that they are preserved in processing and enlighten people reusing the data in completely different contexts.)

PIs are a kludge; the question of what goes inside them, and what it means, is left essentially undefined.
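
To illustrate, the common xml-stylesheet PI, and an attribute-based alternative of the kind suggested above (the style: namespace and attribute name are made up for the sake of the example):

<!-- Today: a processing instruction, outside the markup proper -->
<?xml-stylesheet type="text/css" href="style.css"?>
<doc>...</doc>

<!-- The alternative: an ordinary namespaced attribute on the root element -->
<doc xmlns:style="http://example.org/2008/styling#"
     style:stylesheet="style.css">...</doc>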

Close tag abbreviation

A commonly suggested shortcut, while we are discussing shortcuts, is to allow the closing tag </foo> to be given as </>. I understand this was a question of debate in the original XML design, and did not get in at the time. It is less self-documenting and less robust in the face of certain errors, but it can save a lot of space for enterprise applications where tag names can become very long. Also, for machine-generated code where operator error is not a problem and indenting can be done automatically, it clearly cuts down on the size of the file.
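
A sketch of the shorthand (not legal in XML 1.0 today; the element names are invented to show the effect on long enterprise-style tags):

<!-- Full close tags, as XML requires today -->
<customerOrderLineItem>
  <quantityOrdered>3</quantityOrdered>
</customerOrderLineItem>

<!-- With close tag abbreviation -->
<customerOrderLineItem>
  <quantityOrdered>3</>
</>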

Multiple root elements, or mixed content as an XML document*

A characteristic which XML does not currently have is the ability to concatenate two valid XML documents and get a new, bigger XML document. This property would have its uses. To make it possible, one could allow mixed content (a mixture of elements and text) at the outermost level, which would bring several advantages.
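
A sketch: two self-contained fragments and some loose text, concatenated. Today this is not a well-formed XML document, because there is no single root element; under the proposal it would be one document.

<h1>Minutes, first meeting</h1>
<p>Discussion of the fork.</p>
Some stray text between the fragments.
<h1>Minutes, second meeting</h1>
<p>Discussion of namespaces.</p>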

What source could look like

The markup for a page which is currently in XML will typically start like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC
 "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <link href="../../../People/Berners-Lee/general.css"
  rel="stylesheet"  type="text/css" />
</head>
...

in the future, when served as text/html, could simply look like this:

<html>
<head>
 <link href=../../../People/Berners-Lee/general.css
  rel=stylesheet type=text/css />
</head>
...

and be considered perfectly valid XML2. It could be parsed by general XML2 applications, which would be passed the implicit namespace that comes from a content-type lookup table. To HTML authors, the only non-HTML thing they would have to do is remember the /> on the end of the link tag. So there is in the end a compromise between the forks, but one in which everyone can do most of what they want. So this may be a better place to be. Is it worth trying to get there?

Costs of change

Changes to XML syntax would of course be a very major step. It would break a level of stability in the XML specification which has been one of its major advantages. It would potentially affect a very large number of parsers.

On the other hand, it would only affect the parsers. The XML data model is not changed by the surface changes to the syntax. XML1 files would be valid XML2, so serializers would not have to change. Languages such as XPath, XSLT and XQuery, which are defined on the data model, would not change. However, just changing XML parsers would be a very dramatic step for the industry. It would leave behind many programs whose development has stopped.

But then again, if the alternative is that all systems have two parsers, one for HTML5-like data and one for XML, that is a huge cost too.

What about the cost of change to browsers?

Browsers currently have very many ways of treating web pages, to adapt to the different forms of the language out there on the web. In one sense, a merged-fork track would be another variation, one chosen to be more stable in the long term. The tricks to recognize particular types of old content will presumably still be necessary into the future.

Changes to the browsers to bring them toward a common DOM for HTML and XML are also going to be significant. To a certain extent, perhaps one could allow the namespaced API calls to follow the XML+Namespaces model, and the non-namespaced calls to follow the HTML model. The complications are too great to go into here.

Conclusion

Future developers will not only use the languages we define today, they will build on them to make new, more sophisticated ones. The cleaner the systems we develop, the easier that will be. The HTML and SVG document models, for example, are powerful user interface libraries, and exciting new applications are being built on top of them. The difficulty involved in dealing with the different APIs of the different forks does not help.

We, the Web technology community at large, have a duty to lead the technology toward cleaner engineering solutions. While we should retain an ability to read old web pages, we should move the community of producers (both hand-coders and authoring tools) so that newly produced web pages become progressively cleaner.

To do that, we have to understand the motivations of website developers and browser writers and server administrators. We have to understand how changes to the software and the specifications can tweak the way people behave. We can also set new community goals and a new community attitude about unclean Web pages, so long as at the same time we move the goal of cleanliness to make it less irksome.

The direction outlined here involves quite a lot of work. It means developing new parsers, page checkers and browsers which encourage cleanliness. It means cleaning up authoring tools. It involves solving many intricate technical details of how these Web pages look to a script in the DOM. But the alternative -- the current forked track -- will be a lot of work too. Keeping both forks maintained with separate, diverging code stacks. Writing scripts which explicitly check whether they are in an HTML5 or XHTML environment every few lines. Developing increasingly complex new extension methods for HTML5 to emulate namespaces. As the future unrolls, porting new developments, like the <video> tag, from HTML5 to XHTML, and new developments, like RDFa, from XHTML to HTML. Or putting up with the burden of continual re-invention of new functionality in quite incompatible ways, on both sides of the stack.

We need to set ourselves goals of merging the forks, with some give on each side. We need to switch from strictly liberal and strictly conservative attitudes to one in which progressively cleaner pages are considered progressively better. We need to adopt an attitude that we are going to clean up the Web just as we sometimes need to clean a bedroom -- or a planet.

Grouchy Robustness Principle

Be conservative in what you produce. Be liberal in what you accept, but complain about any deviations from the spec in a way that helps and motivates the producers to adhere to it better.

Tim Berners-Lee

Original August 2008, made public May 2019


Footnotes etc

This is a deliverable of TAG issue 145.

This is $Id: HTML-XML.html,v 1.5 2019/05/20 21:36:35 timbl Exp $

1. (There is a parallel with adding them all to the XHTML namespace, but this is unfortunately not a precise one, because attributes without explicit namespaces are deemed in the NS spec to be in no namespace, rather than to be in the namespace of the element. The fact that the XHTML <a> element has an @href attribute, for example, does not mean that there is an attribute xhtml:href which one could consider mixing into other languages. Some, including me, regard this as a bug in the namespace specification.)

2. (Footnote: At the schema level there are issues too. There is not space to go into those here, but to oversimplify, DTDs are broken by design as they actually don't use XML syntax; XML Schema got complicated; RelaxNG is a competing standard, but still needs NVDL to enable mixed namespaces. There is no simple way to say the fundamental statements for connecting two languages such as "An SVG circle can go anywhere an HTML IMG can go".)

3. (Thanks to Norm Walsh for the multiple root element suggestion)

DW: Perhaps the biggest single obstacle is the document.write() method, which binds code and markup much too intimately to allow them to evolve separately. It is too close to self-modifying code. Ironically, it is often used to provide a compact declarative form (document.write('<p><a href="../"><b>here</b>?</a></p>')) as an alternative to a sequence of method calls to build up the same thing. This is easier to write, and easier to read. If it were compiled into a data object (see E4X) this would be clean coding. As it is, the intricacies of when document.write() inserts what into what stream end up defining huge amounts of how code is written, and allow one to do all kinds of non-obvious things.


Up to Design Issues

Tim BL