Position paper on the W3C XML binary interchange

alex@research.canon.com.au


Motivation

Canon Information Systems Research Australia is actively involved with the W3C SVG working group. Part of our work has included development of the SVG Print recommendation. One of the major issues exposed during the development of this draft was the handling of binary data referenced by an SVG document.

In many cases, an SVG document will reference external files such as images. In such cases, an inline encoding such as 'base64', whilst adequate, introduces inefficiencies.

The efficiency impacts have traditionally been seen as bandwidth problems; however, this appears to be a simplification. The bandwidth problems appear to be secondary when compared with the encoding and decoding costs associated with 'base64' transmission of binary data.
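As a rough illustration of the two costs discussed above, the following Python sketch reports the size expansion and the encode/decode time of 'base64' for an arbitrary binary payload. The helper name, the random payload and its size are assumptions made for this example; timings vary by machine and implementation and are not measurements from our SVG work.

```python
import base64
import os
import time


def base64_cost(payload: bytes) -> dict:
    """Return size and timing figures for one base64 round trip."""
    t0 = time.perf_counter()
    encoded = base64.b64encode(payload)
    t1 = time.perf_counter()
    decoded = base64.b64decode(encoded)
    t2 = time.perf_counter()
    assert decoded == payload  # the encoding is lossless
    return {
        "raw_bytes": len(payload),
        "encoded_bytes": len(encoded),
        "expansion": len(encoded) / len(payload),
        "encode_seconds": t1 - t0,
        "decode_seconds": t2 - t1,
    }


if __name__ == "__main__":
    # Hypothetical stand-in for a referenced image: 3 MB of random bytes.
    report = base64_cost(os.urandom(3_000_000))
    for key, value in report.items():
        print(f"{key}: {value}")
```

Every 3 input bytes become 4 output characters, so the encoded form is about one third larger before any line breaks are added. The timing fields show the additional work a sender and receiver perform that a direct binary transfer would avoid.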

Achieving an efficient method of encoding binary data within XML infosets is a worthwhile goal, as it would reduce the computational burden of transmitting and interpreting packaged XML documents.

Existing work

As part of the W3C SVG Working Group developing SVG technology, we have discussed the issues surrounding encoding of binary data and the interaction with XML documents.

One of the conclusions of our work is that the binary encoding of data is beyond the scope of the charter of the SVG Working Group, and as such, should not be addressed.

Our organisation is involved with other standards organisations. In some of those activities, we have proposed binary encoding schemes for XML which are of merit. We would be privileged to discuss these existing proposals for consideration with respect to a general XML encoding mechanism for binary transmission.

Major goals for binary XML encoding

Much of the discussion of binary encoding of XML revolves around bandwidth and its associated costs.

Our observation is that bandwidth is easily addressed. The problem of encode/decode time is, however, dominant. If the target device can be easily connected to the source device, there is no bandwidth concern. The use of 'base64' encoding, on the other hand, introduces what we believe to be excessive overhead for the transmission of binary data.

The 'base64' encoding system was designed for internet mail transport. One of the main design criteria was to allow the transmission of email attachments across any mail transport in existence at the time, including X.400, BITNET, ARPAnet, etc. The constraints of these limited transport systems were extremely valid at the time of MIME development, but are largely irrelevant in the current context of XML documents being transmitted via 'http' gateways on the internet.

In light of the design criteria for 'base64' encoding and the MIME/internet email protocols, we feel that the text-only paradigm of XML documents requires a fresh look at the possibilities for bandwidth maximisation utilising more modern encoding techniques.

Documents studied

In our studies, we have looked at many XML based documents. These include SVG documents and other XML based imaging formats.

Our observation on SVG files which reference external 'JPEG' image files is that the encode/decode time for referenced image data is significant when compared with the overall render time for the image data itself.

A more efficient encoding for pure binary image data would be useful.

Another observation is that a number of schemes have been proposed for the binary encoding of raw XML documents. We have proposed a binary XML encoding scheme which is efficient and is robust in the face of changing XML schema definitions.

There are existing XML encoding schemes which use the XML schema as a grammar to guide the encoding process and so improve compression.

Some of these schemes suffer upward compatibility difficulties if the XML schema is modified to accommodate future changes to the grammar.

We have a mature XML compression scheme that does not suffer from such limitations. This is one potential solution which could be considered for a general XML encoding mechanism.

Application areas

As our focus is digital imaging, the application areas of interest to us range from digital photo albums for the end user through to archival storage of electronic documents for corporate document management.

The application areas for binary XML more generally are far-ranging.

Internationalisation considerations

As participants in other standards groups, we are well placed to understand the implications of any encoding schemes which have internationalisation considerations.

Our understanding is that most binary data can be encoded without any problems. When it comes to character data, it is important to understand the existing environment in computer systems.

Use of Unicode, character encodings, and associated concepts is fundamental to developing a usable solution to the problem of binary encoding of XML data.

Any encoding scheme we would propose would be a blind end-to-end transmission scheme that would preserve the original data. No loss of information should be tolerated, and as such, internationalisation and accessibility would never be compromised.

Gzip, or not to gzip - that is the question

Our work with SVG indicates to us that 'gzip' is of no help when bundling an SVG document with referenced image data, such as JPEG.
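The effect can be reproduced with a short Python sketch. It uses pseudo-random bytes as a stand-in for JPEG data (an assumption: entropy-coded image data behaves similarly under 'gzip', but real files will differ in detail) and a synthetic, repetitive SVG fragment as the XML part. All names and sizes are illustrative.

```python
import gzip
import random


def gzip_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(gzip.compress(data)) / len(data)


# Synthetic SVG markup: repetitive text, which gzip handles well.
svg_text = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    + '<rect x="10" y="10" width="100" height="50"/>' * 2000
    + "</svg>"
).encode("utf-8")

# Stand-in for already-compressed image data (assumption, not a real JPEG).
rng = random.Random(0)
image_like = bytes(rng.getrandbits(8) for _ in range(100_000))

if __name__ == "__main__":
    print(f"markup ratio:     {gzip_ratio(svg_text):.3f}")
    print(f"image-like ratio: {gzip_ratio(image_like):.3f}")
    print(f"bundle ratio:     {gzip_ratio(svg_text + image_like):.3f}")
```

The markup shrinks substantially while the image-like payload does not, so applying 'gzip' to the whole bundle saves little whenever the image data dominates the package.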

An encoding scheme which addresses the needs of advanced compression formats such as JPEG, JPEG2000, MPEG-4, etc. is desirable.

An encoding scheme for XML is also desirable. We have observed that schemes other than 'gzip' can produce better encoding. Such schemes may be useful for consideration, if the tradeoffs appear favourable.

Despite the inherent problems in raw XML document encoding, we see referenced content encoding being a significant concern.

Raw XML compression

We have developed an encoding scheme for XML which is quite general.

One of the main features is that schema changes do not require any changes to the compression scheme.

Changes in the base XML grammar (via schema, RelaxNG, etc.) do not require any special treatment for encode or decode.

Our XML encoding work is purely general and provides the framework for encoding general XML without the need for special modifications in the case of schema update.

Random access, Streaming, Dynamic Update.

In proposing any binary encoding of a textual format, one should consider the greater body of research into compressed file systems, of which many papers have been produced.

In the XML case, the streaming case is of high importance. In the web context, or any similar context, fast delivery of bandwidth-limited documents seems to be the priority. People click away from a page that takes too long to load, and as such, streaming and progressive rendering are crucial.

Facilitating progressive rendering via streaming support is essential.

It has been noted that interleaving image data with XML file data using techniques as espoused by RFC 3030 and RFC 3391 can be advantageous in a streamed delivery environment.
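To make the idea concrete, the sketch below interleaves chunks of XML text and image data in a single stream using a simple length-prefixed frame. The frame layout (one-byte stream id, four-byte big-endian length, payload) and all names are hypothetical; this is not the wire format of either RFC, only an illustration of chunked interleaving that carries binary data without re-encoding it.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple

HEADER = struct.Struct(">BI")  # stream id (1 byte), payload length (4 bytes)
XML_STREAM, IMAGE_STREAM = 0, 1  # hypothetical stream identifiers


def write_frame(out: BinaryIO, stream_id: int, payload: bytes) -> None:
    """Write one frame: header followed by the raw, unencoded payload."""
    out.write(HEADER.pack(stream_id, len(payload)))
    out.write(payload)


def read_frames(src: BinaryIO) -> Iterator[Tuple[int, bytes]]:
    """Yield (stream id, payload) pairs until the stream is exhausted."""
    while True:
        header = src.read(HEADER.size)
        if not header:
            return
        if len(header) < HEADER.size:
            raise ValueError("truncated frame header")
        stream_id, length = HEADER.unpack(header)
        payload = src.read(length)
        if len(payload) < length:
            raise ValueError("truncated frame payload")
        yield stream_id, payload


def interleave(xml: bytes, image: bytes, chunk: int = 4096) -> bytes:
    """Alternate XML and image chunks so a receiver can start on both early."""
    out = io.BytesIO()
    offsets = {XML_STREAM: 0, IMAGE_STREAM: 0}
    sources = {XML_STREAM: xml, IMAGE_STREAM: image}
    while any(offsets[s] < len(sources[s]) for s in sources):
        for stream_id, data in sources.items():
            start = offsets[stream_id]
            if start < len(data):
                write_frame(out, stream_id, data[start:start + chunk])
                offsets[stream_id] = start + chunk
    return out.getvalue()
```

A receiver can pass XML frames to an incremental parser and image frames to a decoder as they arrive. The framing overhead in this sketch is five bytes per chunk, compared with roughly one third of the payload for 'base64'.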

It may be useful to consider the design decisions made with regard to the aforementioned RFCs to ascertain whether they are relevant to the more general question of encoding XML data streams.