W3C Architecture Domain XML

Report From the W3C Workshop on Binary Interchange of XML Information Item Sets

24th, 25th and 26th September, 2003, Santa Clara, California, USA

  1. Introduction
  2. Conclusions
  3. Minutes
  4. Position Papers

Nearby: Call for participation

Introduction

Section 1.1 of the Extensible Markup Language (XML) gives as a design goal that "Terseness in XML markup is of minimal importance." The Standard Generalized Markup Language (SGML), of which XML is a profile, has a number of features intended to reduce typing when humans are entering markup directly, or to reduce file sizes, but these features were not included in XML.

The resulting XML specification gave us a highly regular language, but one that can use a considerable amount of bandwidth to transmit in any quantity. Furthermore, although parsing has been greatly simplified in terms of code complexity and run-time requirements, larger data streams necessarily entail greater I/O activity, and this can be significant in some applications.

There has been a steadily increasing demand to find ways to transmit pre-parsed XML documents and Schema-defined objects, in such a way that embedded, low-memory and/or low bandwidth devices can make use of an interoperable, accessible, internationalised, standard representation for structured information, yet without the overhead of parsing an XML text stream.

Multiple separate experimenters have reported significant savings in bandwidth, memory usage and CPU consumption using (for example) an ASN.1-based representation of XML documents. Others have claimed that gzip is adequate.
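The gzip claim is easy to probe: generic DEFLATE compression pays a fixed framing cost and needs redundancy to exploit, so it helps large repetitive documents far more than small messages. A minimal illustration in Python (the sample payloads are invented for this sketch):

```python
import gzip

# Invented sample payloads: a tiny message and a large, repetitive document.
small = b"<env><id>42</id><ok>true</ok></env>"
large = b"<row><id>1</id><name>example</name></row>" * 5000

small_gz = gzip.compress(small)
large_gz = gzip.compress(large)

# gzip's ~18 bytes of framing plus DEFLATE overhead can exceed any savings
# on a short message, while highly redundant markup compresses very well.
print(len(small), len(small_gz))
print(len(large), len(large_gz))
```

On typical runs the small message grows after compression while the large one shrinks by well over an order of magnitude, which is consistent with both camps' observations.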

Advantages of a binary representation of a pre-parsed stream of Information Items (as defined by the XML Infoset) might include:

  1. It would not be restricted to a single schema or vocabulary, and hence could be interoperable between vocabularies;
  2. It would not be restricted to a single application or hardware device, and hence could be interoperable between implementations;
  3. Improved network efficiency and reduced storage needs: compression techniques that make use of domain-specific knowledge often do better than more generic compression;
  4. Sending pre-parsed data could reduce the complexity of applications, and may facilitate creation of simpler internal data structures.
  5. Web Services may need more efficiency, and a pre-parsed binary transmission format may help people to continue to work with Web Services rather than to explore proprietary interfaces.

One potential and very serious disadvantage is that one might lose the View Source Principle which has helped the Web to spread.

In September 2003, the W3C ran a Workshop, hosted by Sun Microsystems in Santa Clara, California, USA, to study methods to compress XML documents, comparing Infoset-level representations with other methods, in order to determine whether a W3C Working Group might be chartered to produce an interoperable specification for such a transmission format.

Conclusions

The Workshop concluded that the W3C should do further work in this area, but that the work should be of an investigative nature, gathering requirements and use cases and preparing a cost/benefit analysis; only after such work could there be any consideration of whether it would be productive for W3C to attempt to define a format or method for non-textual interchange of XML.

See also Next Steps below for the conclusions as they were stated at the end of the Workshop.

Minutes

Wednesday 24th September, 2003

The scribe for recording comments, questions and discussion for the first day was Chris Lilley (W3C).

  1. David Orchard gave a presentation from BEA Systems.

    Anish Karmarkar, Oracle: if there is no single solution that works for all cases, would you prefer one, or none, or multiple solutions?

    David Orchard, BEA: prefer zero to two or more

    Steve Williams, HPT: Clarify point about research needed

    David Orchard, BEA: Not inventing something new - lots of solutions out there. Carefully analyze the problem to be solved, and pick a good one. If no existing solution works, probably no new one would either.

  2. Rick Marshall could not be present at the Workshop. The Chair (Liam Quin, W3C) read Rick Marshall's paper [PDF version].

    Jim Trezzo, AgileDelta: Moore's law is fine, but does not apply to batteries.

    John Schneider, AgileDelta: Energy conservation does not follow it either

    John Schneider later expanded this as follows: Yes. Once we unplug a device from the wall, we have to remember the basic laws of physics. Work takes energy and every byte read takes work.

    Margaret Green, Ontonet: XML, or Infoset? Need to be clear what is being discussed

    Liam: Agreed that the Workshop title refers to the Infoset; Rick's talk is about XML and does not consider the Infoset, being primarily about the serialisation.

    Noah Mendelsohn, IBM: The Infoset is not simply what results from parsing an XML document; that is only sometimes true. There are also synthetic infosets, e.g. created via the DOM, which may be serialised later but need not be. It is essential to be clear on definitions.

    [Note: a brief discussion period was set aside for people to discuss terminology and a definition of the Infoset.]
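The point about synthetic infosets can be made concrete: an infoset can be built programmatically and serialised only if and when needed. A sketch using Python's standard-library DOM (the element and attribute names are invented):

```python
from xml.dom.minidom import getDOMImplementation

# Build an infoset that was never parsed from an XML text stream.
impl = getDOMImplementation()
doc = impl.createDocument(None, "order", None)
item = doc.createElement("item")
item.setAttribute("sku", "A-1")
doc.documentElement.appendChild(item)

# Serialisation is a separate, optional step.
print(doc.documentElement.toxml())  # <order><item sku="A-1"/></order>
```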

  3. Michael Rys gave a Microsoft presentation.

    Steve Williams, HPT: Binary DOM [need not be] isomorphic with infoset

    Michael Rys, Microsoft: not a direct representation of the data model implied by the DOM; however you could support a DOM

    Louis Reich, NASA: Are the last three lines of your last slide out of scope for this workshop? Seem highly relevant to me. Problems currently outside XML scope might be brought into the fold by using a binxml solution (eg large binary blobs, etc). Which part of 80:20 are we looking at?

    Michael Rys, Microsoft: much of this is very domain specific, need some sort of packaging format

    Louis Reich, NASA: infoset can hold binary data. Our community would prefer a sub-optimal but standard way rather than our own, discipline-specific standard. A single W3C standard would be very exciting for us. Disagree with your assertion - people would indeed use this if it was a standard.

    Michael Rys, Microsoft: would need to extend the infoset to do this. Comparison with image compression, lots of special ways, most with encumbering IPR, most specific to particular type of content - would people abandon these to get a single binary representation? Or, better to use a packaging format and keep the image in its special, efficient format. It's a payload packaging problem.

    Larry Masinter, Adobe: Agree with much of the analysis, but puzzled by the conclusion about whether W3C should work on this. There are clearly other areas where the solution was not clear (e.g. the Semantic Web), so what is the threshold of research required? It is not at all clear that standardisation work would increase fragmentation.

    Michael Rys, Microsoft: semantic web is research, it's not standards work. It's still too early, research is needed and should not be done at W3C as it stifles innovation once a standard is set. MS has internal binary representations, but we use textual XML for interop. It's all that works in all cases.

    [aside: the Chair pointed out that the W3C Semantic Web research is at least partly funded externally]

    Eduardo Pelegri-Llopart, Sun: Interop in web services only gets to 80% using standards; the rest is reverse engineering and non-standardised extensions.

    John Schneider, AgileDelta: LZ or Huffman (frequency-based) compression only works on large messages with high character redundancy. It does not work for high-frequency streams of small messages, typical of mobile environments and Web services. Zip will often make these bigger instead of smaller.

    Also, I've heard many people expecting to see big improvements in user-visible performance. [supplied later by John Schneider: To be honest, it's not that clear that mobile users will see a noticeable speed increase given the high latency of mobile networks. The more significant benefits are economic. Carriers spend a great deal of money buying frequencies and putting up cell towers to increase the capacity of their pipes. If the size of the data shrinks by 10 times, carriers can now fit 10 times more customers on the same pipe, meaning they can generate 10 times more revenue without huge infrastructure investments. For always-on packet-switched networks where users pay by the kilobyte, these savings are passed along to the customer.]

    [originally minutes text: Latency eats up the performance improvement, but reduced bandwidth still helps the carrier get more customers on that network.]

    Michael Rys, Microsoft: Carriers are not worried about interop with other xml apps. They use gateways and can send whatever they want between the gateway and the mobile device.

    John Schneider, AgileDelta: Mobile devices don't necessarily use gateways any more. They can now hit any URL and access enterprise infrastructure directly. If I hit a SOAP Web service using my mobile device, the payload comes back as raw XML, with no gateway in between. So mobile devices definitely need efficient access to XML everywhere, not just through gateways.

    Zero is not an option - MPEG7 and ASN1 are already in train. We need a general purpose standard that can deal with mixed namespaces.

    [John expanded this in email later, for clarification: Like others, I also prefer zero standards to two. Unfortunately, however, zero is not an option. There are already two mainstream standards organizations working on binary encodings for XML, MPEG7 and ASN1. Are they compatible? No. Is either one of them general purpose enough to handle all mainstream XML applications? No. For example, neither one can handle mixed content, which is required for XHTML -- a pretty popular use case. We need a general purpose standard that can deal with the broad uses of XML.]

    David Orchard, BEA: Our position is not "no"; it's "be sure what you are designing" and "pick one". My question to Microsoft (who said there is no 80:20 point now): Do you think there will be an 80:20 point in the future?

    Michael Rys, Microsoft: Yes, we would reconsider then, but finding such a point is hard. MS, for example, has been trying since 1998 and has not yet found one that works across the whole company.

  4. [luncheon]
  5. Robin Berjon gave the Expway presentation [this is a ZIP archive of SVG].

    Noah Mendelsohn, IBM: Since you send compressed schemas, could you get benefits by sending the schema for Schemas?

    Robin Berjon, Expway: that schema is not valid; we had to hack it

    Noah Mendelsohn, IBM: you could send the schema on a one-shot basis - does this recurse?

    Robin Berjon, Expway: no, magical types need special processing. It does work, but needs normalisation. The proper solution is to do a mapping to the common schema model, then send that.

    Noah Mendelsohn, IBM: ok, so you do that with a 10-20k schema; what overhead to send it with each message?

    Robin Berjon, Expway: With each message? It depends on the richness of the schema. With each message, we only send the schema parts that are actually used that time. Or send a schema with a set of messages. We can also send incremental schema updates.

    John Schneider, AgileDelta: well done! You dealt with many of the problems that we encountered. About representing any general infoset - MPEG7 currently does not support mixed content models, namespace prefixes, and some other infoset items.

    Robin Berjon, Expway: Namespace prefixes are supposed to be disposable

    John Schneider, AgileDelta: needs to apply to all uses of xml, and hit the mass market rather than high cost niche solutions.

    [by email, John expanded this: Actually, prefixes are part of the infoset and while many people don't care about preserving prefixes, there are some communities that require it. Our solution needs to apply to all uses of xml, and hit the mass market rather than high cost niche solutions]

    Robin Berjon, Expway: BiM 1 does not support mixed content, BiM 2 does and so does our product. PI is also doable, comments could be added too.

    John Schneider, AgileDelta: Is open content (unexpected content, eg) supported?

    Robin Berjon, Expway: yes. Generic Infoset encoding. You lose some of the benefits, but not all of them, and it still works.

    Santiago Pericas-Geertsen, Sun: BiM has nice features like fragments, how much of that should be part of the format and how much left to other levels?

    Robin Berjon, Expway: need to be sure the low-level format is fragmentable, to let higher levels do it. Fragments need context like in-scope namespace declarations. BiM is for broadcast, so it was designed to do that.

  6. Oliver Goldman gave the Adobe Systems Inc. presentation.

    (during the presentation, someone noted that base64 is 30% bigger - but if you need the clean form of base64, that is 133% bigger. There were also questions of random access.)
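The arithmetic behind the base64 remark: base64 emits four output characters for every three input bytes, so the encoded form is 4/3 of the original size (about 33% bigger), and slightly more once MIME line breaks are added. A quick check in Python (the payload is arbitrary test data):

```python
import base64

payload = bytes(range(256)) * 12             # 3072 bytes of arbitrary binary data
encoded = base64.b64encode(payload)          # 4 output chars per 3 input bytes

print(len(encoded) / len(payload))           # 4/3 ≈ 1.333: ~33% expansion
print(len(base64.encodebytes(payload)) / len(payload))  # a bit more with line breaks
```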

    Louis Reich, NASA: PSVI: what is the problem exactly, 1 to 2 years down the road?

    Oliver Goldman, Adobe: some people use Schema, but many do not or have non-validated data, so it's important to work with that

    Robin Berjon, Expway: using the XQuery Data Model, both PSVI and the ordinary Infoset can be represented

    Oliver Goldman, Adobe: not sure, need to look at that more. Could I round trip through that?

    Robin Berjon, Expway: yes

    Mike Conner, IBM: problem with off-line access, schema not found - let's not do that

    Oliver Goldman, Adobe: same issue for form data defined by a schema

    John Schneider, AgileDelta: is that schema in the pdf file now, to show the form?

    Larry Masinter, Adobe: no, it just has the form data and the presentation

  7. Stephen Williams gave his position paper.

    Microsoft: having the wire form and the internal form the same is very constraining on the application's choice of internal form

    sw: yes it needs to be portable, not just a memory dump

    Don Brutzman, Web3D: Streaming over the wire - network byte order solves that

    unidentified/inaudible: (question about efficiency)

    sw: needs to be little overhead from getting the data off the wire and starting using it. Currently [with text/xml interchange] a whole lot happens, object creation, lots of moves of small amounts of data, pointer creation.

    Erik Wilde, ETH Zurich: parsing is expensive, you benefit from locality it seems, so how expensive are things like namespace information which is not necessarily local

    Steve Williams: I am not compiling, it's not an issue

    John Schneider, AgileDelta: some applications want to preserve namespace prefixes, so those applications are upset by namespace prefix redefinition

    Noah Mendelsohn, IBM: Making the in-memory and on-the-wire forms identical is at cross purposes with why many of us came to XML - so that parties don't have to agree on their APIs and internal models, many of which are preexisting and already deployed. We tried to do this with DCOM and CORBA; now we are using XML because it decouples needing to care about all that byte-level pointer stuff. Your API still has elements and attributes?

    Steve Williams: yes, it's like nested objects, most 3GL object oriented languages do this. Can choose to have overhead by mapping to native objects etc at some efficiency cost. Worst case is no worse than best case now, best case is a lot better.

    Larry Masinter, Adobe: two types of pushback:

    1. all the other parts are more expensive than parsing;
    2. reducing parsing is not enough, because of swapping byte order or other processing.

    These are contradictory: swapping two bytes is a lot less expensive than parsing a document.

    Need to scope the applicability of binary xml to those cases where parsing *is* a significant portion of the work.

  8. (break)
  9. Noah Mendelsohn gave IBM's position paper, which was accompanied by a presentation [PDF] by Mike Conner (also of IBM) on CBXML.

    Eduardo Pelegri-llopart, Sun: Throughput - how does it compare to processing character form

    Mike Conner, IBM: no, because parsing technology is making big leaps forward and production parsers are much faster than stock, free parsers. Code quality and maturity is a major determining factor. [network costs were not significant - 6% - extensive tests. Measuring instructions per character processed]

    You can't assume things that the schema has not told you. Table based processing has more predictability and improves performance. Avoid schema-dependent processing (not encoding, but processing).

    Doing a lot of UTF-8 to UTF-16 conversions (eg in SAX-based tree traversal) is very slow.
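The UTF-8/UTF-16 point is worth spelling out: each conversion is a complete decode-plus-re-encode pass over the text, and UTF-16 roughly doubles the size of ASCII-heavy markup, so doing it repeatedly (e.g. per SAX event) is pure overhead. A small illustration (the sample document is invented):

```python
# A UTF-8 -> UTF-16 conversion is a full decode plus re-encode pass over the
# data; repeated per SAX callback, that cost dominates for large documents.
utf8 = ("<name>Müller</name>" * 1000).encode("utf-8")
utf16 = utf8.decode("utf-8").encode("utf-16-le")

print(len(utf8), len(utf16))  # UTF-16 roughly doubles ASCII-heavy markup
```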

    Larry Masinter, Adobe: Examples are all message passing, but XML used in many other cases. Is a CD-ROM full of XML "a message"?

    Noah Mendelsohn, IBM: a 30% increase is an insignificant reason for standardising, but could be significant for terabyte-range data. Not convinced that random-access addressing is tightly bound to a binary infoset.

    Larry Masinter, Adobe: Not appropriate to look at the considerations independently.

    Michael Rys, MS: Random access and compression do relate.

  10. Santiago Pericas-Geertsen and Eduardo Pelegri-Llopart gave the Sun Microsystems position paper.

    Note: X.694, the W3C XML Schema to ASN.1 mapping, is linked from www.itu.int/ITU-T/asn1/database/itu-t/x/x694/2003/

    Noah: Is it using Java reflection to do the classes?

    Santiago Pericas-Geertsen, Sun: No.

    Michael Rys, Microsoft: Protocol encoding is highly based on the schema implementation. How does it handle open content?

    Santiago Pericas-Geertsen, Sun: It can be handled, the holes do not have full performance; they can be mixed ok.

    Michael Rys, Microsoft: it's a very message-oriented architecture. So you want one standard for the WS application area, even if it's unsuitable for other areas? This will produce fragmentation.

    Santiago Pericas-Geertsen, Sun: But WS are only interested in interop with other WS?

    (several): No!

    Michael Rys, Microsoft: MS, BEA, etc. have interop with textual XML; doing this will bifurcate web services.

    Eduardo Pelegri-Llopart, Sun: This is coming from customer pressure, there is a real problem to solve.

    Mark Nottingham: Could want to do XML signing, encryption etc and for that it needs to know the element and attribute names, etc. Variability in performance with fallback is less desirable than uniform performance.
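The signing point rests on canonicalisation: XML Security signs a canonical byte form, so any binary format must be able to reconstruct it exactly. Python's ElementTree (3.8+) exposes C14N, which is enough to show why lexically different documents must normalise to identical bytes before signing (the sample documents are invented):

```python
import xml.etree.ElementTree as ET

# Two lexically different but infoset-equivalent documents...
a = '<doc b="2" a="1"><leaf/></doc>'
b = "<doc a='1' b='2'><leaf></leaf></doc>"

# ...canonicalise to identical bytes, which is what a signature covers.
print(ET.canonicalize(a) == ET.canonicalize(b))  # True
```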

    David Orchard, BEA: So this is applicable to a part of WS, not even all of it, and you would rather do this than nothing?

    Santiago Pericas-Geertsen, Sun: Right.

    Eduardo Pelegri-Llopart, Sun: This addresses the needs of our customers, and they would rather do this with a standard. But they need it anyway.

    David Orchard, BEA: So if everyone did that - BEA did it and IBM did it - are you prepared to live with someone else's standard?

    Eduardo Pelegri-Llopart, Sun: We are not committed to this particular solution.

    Noah Mendelsohn, IBM: (Looking at the slide with two multicolored bars - Performance Results, time spent in layers) So, there are four messages in that, roughly a gigahertz processor - we are seeing better numbers than that for textual XML in our applications. You are saying it's taking a million instructions to do the one message? 2000 instructions per character? Your left bar is too big by a factor of 10 or so.

    Santiago Pericas-Geertsen, Sun: James Clark wrote about this on xml-dev, removing the whole network layer (scribe missed the point here)

    Steve (IBM): there are two optimisations; using the schema in the processing does get you better performance here

    Eduardo Pelegri-Llopart, Sun: Have to do a binding to do something, but it's optimised to transfer the objects. Removes the databinding level.

    Steve Williams: We cannot compare these without a standard corpus of test files that are run on different implementations.

    Steve (IBM): RMI slows way down when there are deeply nested structures

    ad: did you disable Nagle's algorithm, because of throttling effects?

    Santiago Pericas-Geertsen, Sun: no, we did not

    Larry Masinter (Adobe): Goal should be to improve the average over likely implementations, not to finely tune one implementation to the max - it would be just one sample point, and not address the wide variety of use cases. As far as creating a working group in W3C, do we need consensus on requirements to form a working group? Or can we find a group who is interested enough in a subset of requirements, and that everyone else agrees to leave them alone to do that work?

    (several) laughs

    Liam: no, this group is not expected to come to consensus on requirements; we are looking for rough consensus on the desirability of further work in this area. The first step would be a requirements document, probably only that, and in a fixed time period. We need to know whether it's generic or industry-sector specific.

    Larry Masinter (Adobe): so it might not happen in xml core

    Liam: no, XML Core has not expressed interest in that area; a new group would need people experienced in that area, and also we don't want to take time from XML Core's existing workload.

discussion

Thursday 25th September

  1. Introduction

  2. Break-out Groups to discuss remaining papers

  3. noon: luncheon

  4. First reports from break-out groups

  5. Break-out Groups to discuss requirements

  6. Summaries of Break-Out Work

    Requirements

    These are not in any particular order. During the meeting we tried to merge ones that were obviously the same, but we did not work on consolidating them.

    1. Maintain universal interoperability - all tools, all documents.

    2. Continue to work with existing parsers and tools?

    3. Do not want a domain-specific solution [e.g. wireless]

    4. Efficient storage is important [compactness]

    5. Efficient transmission is important

    6. Support both storage & messaging representations

    7. Want fast decompression, even if it means slower compression

    8. 10x faster than 2003 best practice with textual XML

    9. Support parsing on low power device

    10. Must reduce processing time (compared with parsing) including data binding time

    11. Want to be able to create binary thingies directly, not just via pointy brackets

    12. Performance comparable to (or better than) RMI

    13. Must support packaging - bundle multiple files together, maybe with nesting

    14. Want to be able to send deltas [updates, e.g. arbitrary sets of changes]

      e.g. with versioned fragments

    15. Must support random access based on infoset (XPath) or other boundaries, e.g. image, page, indexed; indexes built from stream

    16. should be able to update in place based on XPath

    17. Fragments - be able to start reading at fragment boundaries

      [e.g. repeated sub-elements for repetitive broadcast]

      interchange fragment context (e.g. location in document tree)

    18. Must support progressive downloading [e.g. progressive rendering]

      look at the header before seeing the whole thing

    19. Support streaming

    20. Want progressive generation (e.g. no packet-length at start)

    21. Want XML Fragment Interchange, e.g. for query results & document subsets

    22. Transmitted format shall be easy to convert to/from native XML Schema data types

    23. Must support XML Security, e.g. via canonical XML & reconstruction thereof

    24. Want arbitrary precision numerical data formats

    25. Want directory at the start, e.g. for allocation

    26. Negotiation: fall back to text format if receiver can't understand binary

    27. Must be able to distinguish text XML from binary format on inspection

    28. Must be clear about MIME media type to be used

    29. Must not rely on HTTP [e.g. file support]

    30. Support one-way communication

    31. Must cope with asymmetry in bandwidth

    32. Can use schemas to help encoding

      • Must support schema version detection, multiple schemas at once

      • Robustness in face of mismatched schemas [e.g. reader makes right]

      • May support download of new schemas, or some other way of schema evolution

        • Not all devices will accept new schemas:

        • support read-only schemas for recipient

      • Must support open content, e.g. elements, values, subtrees not in the schema

      • also want to be able to modify understood items & pass on (forward) subtrees, including elements you didn't understand

    33. Self-describing format

    34. Solution must be mass market, commodity price points

    35. Must support fast access to individual parts of the infoset, [e.g. for headers] mixed mode encoding, some parts compressed

    36. Must support custom & multiple compression schemes for data types

    37. Want to continue to work with SAX and DOM, Pull parsing, e.g. existing APIs

    38. Must require minimal changes to application layer

    39. Should support validation on receiving

    40. Must approach efficiency of hand-coded (binary) formats [as per information theory]

    41. Must work for both data and documents

    42. May be willing to have lossy compression, infoset subset

      e.g. might be able to lose comments, processing instructions, whitespace (some applications may be able to lose some kinds of content, e.g. precision on SVG coordinates)

    43. Want round-trippability w.r.t. canonical XML

    44. Must be able to represent complete infoset

    45. Should support arbitrary extensions to infoset

    46. Consider efficient support for other data structures, e.g. linked list, directed graph, e.g. id/idref optimization to pointers, point to any character, xpointer or something

    47. Consider option to propagate inherited data (e.g. xml:lang, xmlns:) down

    48. Must work equally well on high-frequency stream of small messages that may grow bigger if you use gzip on them; also large files with many small objects, e.g. a million floats

    49. Must minimize bandwidth for both small and large messages

    50. Must be easy to implement

    51. Want to specify the order of serialization... e.g. may want to send header up front, or may want leaf nodes first and may need to inform receiving end
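Several of these requirements (compactness, schema-assisted encoding, open content) come together in the token-table idea most binary-XML proposals use: names known from a schema become small integer codes, with reserved escapes for content the schema did not predict. A hypothetical, much-simplified sketch in Python (the table, event model, and byte layout are all invented for illustration):

```python
import struct

# Hypothetical token-table encoding: element names known from a schema are
# replaced by one-byte codes; text content is length-prefixed UTF-8.
TABLE = {"order": 1, "item": 2, "qty": 3}   # assumed, schema-derived

def encode(events):
    # events: list of ("start", name) / ("text", s) / ("end",) tuples
    out = bytearray()
    for ev in events:
        if ev[0] == "start":
            out += struct.pack("B", TABLE[ev[1]])       # known name -> 1 byte
        elif ev[0] == "text":
            data = ev[1].encode("utf-8")
            out += struct.pack("BB", 0xFE, len(data)) + data  # escape + length
        else:
            out += struct.pack("B", 0xFF)               # end-of-element marker
    return bytes(out)

events = [("start", "order"), ("start", "qty"), ("text", "2"), ("end",), ("end",)]
binary = encode(events)
text = "<order><qty>2</qty></order>"
print(len(binary), len(text))   # the token form is much smaller than the markup
```

A real design would also need codes for attributes, namespaces, and names outside the table (the open-content case), which is exactly where the trade-offs debated above appear.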

    Use cases & comments

    More cycles available for compression than decompression

    SOAP important

    cell phones have high latency

    memory & processor constrained but not power constrained

    satellites, large images coming over low bandwidth (and high latency?) and must be archived for between 5 & 50 years

    to explore - relationship between storage & message

    bandwidth is not the major concern; memory footprint in the box (CPU) is

    extensibility of Infoset

    existing military bandwidth requirements of 300 baud (effective)

    Exchange live infosets, e.g. share in-memory representation?

    i.e. use the binary format in a database

    Access XML-based data from mobile device, wireless link

    Store XML in db

    Program passes XML to another program

    pass live infoset between programs

    display SVG on cell phone, PDA

    SOAP messaging over wireless (e.g. cell phones)

    e.g. customers' data, i.e. not fixed payload

    deliver XML to low-powered devices over wireless or other low-bandwidth connections

    Store infoset in mqml [?] system

    Send infoset between devices on P2P network

    Route SOAP messages

    XQuery across multiple nodes

    Access any page in large document

    signing documents

    Editing a portion of a document

    Storing large documents

    Annotating a document

  7. adjournment (end of day two)

Friday 26th September: Plenary

Do we go ahead? If so, what needs to be done? Use cases, test data, benchmarks?

Minutes for Friday were taken by Anish Karmarkar (assisted by Don Brutzman)

Should W3C do more work in this area?

Jonathan Marsh (Microsoft) asked if we could have a discussion about people's view of the future, and the ramifications of not going ahead.

Jonathan: I don't have a good understanding as to what each of you want out of this workshop. I definitely want this to be part of XML. Products should be able to consume both formats, across the internet. Another option is that there are certain industries where there is a gateway between the internet and the private community. There could be common gateways which understand a common format. A third view is fast Web services.

Jim Trezzo: There could be a phased adoption. It could start out with isolated communities and then may end up with wider adoption. The ultimate goal remains total adoption.

Eduardo: Total adoption may never happen.

John Schneider: what people want is to extend the economic and interoperability benefits of XML to areas which cannot benefit from XML right now, e.g. wireless. I would expect that these areas (where the pain is greatest) would be the first adopters. Once this is available, other areas would adopt it as well. Everyone is not going to switch in one day. Migration is going to happen.

[We need a migration plan. This is not as difficult as it might sound. I've done it in other similar domains.]

Michael (IBM): one of the criteria for getting this right is to have the right tools, where I don't care whether on disk it is binary or text. We do need to factor in (for transition) that if binary XML is a solution for, say, footprint reduction, then such a solution requires that text XML not be supported.

John Schneider: you need to have some tool to read the binary format for human consumption. Unicode and ASCII do have a lot more tools right now than binary. I don't want to lose the ability to read XML and put it in specs, to write human-readable examples on a whiteboard so people can read them, or to put XML examples in books. We will always need the text XML encoding for human accessibility.

Prof. Kimmo Raatikainen: XML will be used more and more for machine-to-machine interaction. We should look at how trends in CPU speed, memory, etc. are evolving. They don't evolve in the same way.

Robin: footprint problem - those devices would only support binary. I don't see that as a problem.

Show of hands: who has edited XML using non-XML tools? Response: overwhelming.

Anish: Jonathan - in the options that you outlined, are you assuming that we will be able to come up with a single binary format that will satisfy everyone? It is not clear to me that we would indeed be able to do that.

Glen Adams: I am more in the gated community, which would (possibly) not be connected to the internet.

Stephen Williams: perhaps a standard framework is the way to go

Don Brutzman: binary xml is not just for wireless. benefits of binary xml should not be restricted to just small areas such as wireless.

Craig Bruce: if you take a text file and gzip it, then you have a binary file.

MarkN: but one has to go through that tool before editing it.

Robin: but gzip is lossless.

Noah: we should not fudge the different requirements which are all over the place and conflicting. We have to be careful about finding a sweet spot. Competing requirements.

MarkN: if we want ubiquitous support, we may have to compromise on other properties. What are the trade-offs?

Liam: discussion on trade-offs is a discussion on core requirements.

Eduardo: schema v. no-schema - all schema proposals have to deal with the no-schema case as well.

Liam: a related question is, what if we don't go ahead? What are the options if the verdict of the W3C is no? It seems to me that we can draw up a list of options:

  1. View of the future

    Jonathan Marsh of Microsoft

    1. would any "XML" product be able to consume/generate both formats?
    2. gateway between Internet and "gated community" of binary interchange
    3. some part of the Internet with negotiated fast format (e.g. WS)
    4. phased adoption from (b) to (a)?
    5. over time, binary interchange form replaces text form
    6. binary for machine/machine communication, text for humans
  2. what if we don't go ahead?

    1. everyone uses text only

    2. many binary formats that don't interoperate, fragmentation

      • but they may be standards themselves, e.g. ISO/ITU
      • W3C would lose control
    3. slow convergence on non-XML binary thingie
    4. some applications wait for faster CPUs, use own binary until then
    5. binary formats outside W3C not in sync with W3C specs, things you can't do, maybe no Semantic Web?
    6. someone else produces a single binary standard (e.g. ISO, de facto), and maybe W3C XML suffers?
  3. what if we go ahead?

    1. by the time it's a REC it's not needed so much 'cos computers are faster; but O(1) is always <= O(n), and 10x smaller is always 10x smaller (random access, compactness)
    2. repeat of WAP: marketplace wants XML
    3. maybe we discover it's not all that much better after all (success)
    4. tools may emerge to help people measure their needs + compare solutions
    5. roll-on effect to other XML specs, APIs, etc.
    6. everyone uses it, widespread adoption, increased interoperability
    7. XML-based movies: i.e. new users of XML, new communities
    8. reduces cost of using/deploying XML (bandwidth)
    9. clarification of requirements, e.g. fragmentation vs. compression
    10. possibly can't meet all requirements, so may have multiple specs
    11. might end up making a significant improvement in development and maintenance of applications
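
The "O(1) is always <= O(n)" point in 3(a) can be sketched as follows. With a fixed-width binary record layout one can seek straight to the n-th value, whereas textual XML must be scanned from the start. The encoding below is hypothetical, purely for illustration:

```python
import re
import struct

# Hypothetical binary encoding: each record is one little-endian 4-byte int.
RECORD = struct.Struct("<i")
values = list(range(100))
binary = b"".join(RECORD.pack(v) for v in values)
text = "".join(f"<v>{v}</v>" for v in values)

def nth_binary(data: bytes, n: int) -> int:
    # O(1): compute the byte offset and read the record directly.
    return RECORD.unpack_from(data, n * RECORD.size)[0]

def nth_text(doc: str, n: int) -> int:
    # O(n): must scan past all preceding markup to find the n-th element.
    return int(re.findall(r"<v>(\d+)</v>", doc)[n])

assert nth_binary(binary, 42) == nth_text(text, 42) == 42
```

The constant-factor compactness argument is independent of CPU speed: if a representation is 10x smaller, it stays 10x smaller however fast the hardware gets.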

Discussion...

Noah: a possible answer to the Tower of Babel. Binary XML may converge, but it is not XML; it may be compatible with XML.

MarkN: there are use cases and applications that don't find text XML useful; they won't use it until parsers and tools get efficient.

Liam: we have more and more layers, and with time things don't necessarily get faster (e.g. creating a window was faster in 1989 than it is now).

Eduardo: there already exist ISO and other standards. The future scenario we would really like to see is one where existing efforts don't get hosed by new standards from W3C. If we have a binary standard (even if it isn't from W3C), we will have dissonance. Everybody loses then.

Alessandro Triglia: there is a set of ISO/ITU-T standards already available. I want to stress that there are standards available now; they are final committee drafts. If W3C does not do anything, people can already use those standards.

Glen: a corollary to (b) is that W3C will effectively lose control.

Selim Balcisoy: option (b) is important to us. If we don't have a successful binary format then we lose the advantage of text XML as well. You may lose a substantial part of the web.

Liam: as soon as you have end-to-end non-XML traffic going on, that ends the story.

Selim: we do support XML right now, but the question is how long we can do that. There is a heavy burden on the devices. This is half of the web. What happens if we don't have a single binary implementation? There is a standards body for mobile devices, 3GPP, but they are looking to W3C to do this.

Don: if half of the data on the Internet is non-structured (from an XML point of view), then all the XML tools/standards (from XSLT up to the Semantic Web) cannot operate on that data, and we have partitioning.

Liam: maybe a binary XML format would make sense to people who want to store movies in XML. On one hand we are fragmenting XML; on the other, we are expanding it.

[the background here is that someone in the film industry had approached Liam asking about representing entire movies in XML. This might become feasible if there were some more compact and efficient representation than textual XML.]

MarkN: spend some time capturing the risks of doing this.

Craig: with respect to representing movies in XML, with my open-source implementation I get good compaction and performance. A binary representation of a movie can be embedded in an XML document.

MarkN: what is the purpose of this? You cannot use XPath on it, for example.

Liam: embedding is a good use case.
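Today, the usual way to embed opaque binary data in textual XML is base64 encoding, which inflates the payload by roughly a third - one motivation for a native binary representation. A minimal sketch (the element names and payload are hypothetical):

```python
import base64

# Hypothetical binary payload, e.g. a frame of video.
binary_payload = bytes(range(16))

encoded = base64.b64encode(binary_payload).decode("ascii")
xml_doc = f'<movie><frame encoding="base64">{encoded}</frame></movie>'

# base64 maps every 3 bytes to 4 characters (plus padding):
# 16 bytes become 24 characters, a ~4/3 size increase.
assert len(encoded) == 24
assert base64.b64decode(encoded) == binary_payload
```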

Robin: the SVG WG is dealing with this as well. If we have an XML representation of raster images then we can use XML tools.

Liam: we seem to have two groups: one says this makes great sense, whereas the other says this is crazy.

Eduardo: if a binary format came from one vendor then the other vendors suffer.

Noah: I would like to respond to what MarkN said about what could go wrong if we go ahead. I would like to reinforce that we are trying to chase the technology curve, e.g. whether cell phones will be really fast or not. It is worth looking at, but not decidable. I don't buy that over time things get worse (Liam's example); I don't think things like that happen in this area. Look at the Ethernet standards. One of the risks is that we won't need it as badly as we think we do. That is what happened with the non-TCP/IP protocols. This is a risk.

Santiago: one approach is to wait for Moore's law to kick in. But that is not the best way to go, because we may need this for other devices such as Java Card.

Arnaud Le Hors: look at the past to predict the future. We were looking at getting HTML onto cell phones; people did not think that would happen, but it has.

Selim Balcisoy: bandwidth does not follow Moore's law. Nor does battery life. We will not have one browser, but multiple applications. So there will be a constant need for more (for the wireless industry); a binary format is one way to get there.

John: with respect to Moore's law, +1 to Selim's comment. Expectations: when I sit in front of my desktop, I have certain expectations. For mobile devices it's different. I will always want to squeeze more out of my mobile device to come closer to matching the capabilities of my desktop machine. Also, don't forget the tragedy of the commons: as more computing power becomes freely available, our computing needs rapidly expand to use the available power.

David Orchard (BEA): two points. Let us assume that we go through a benchmarking exercise and it comes out that binary doesn't make much of a difference in most cases. This might be a perception issue; if we have a benchmark that says it is not a big deal, we can convince the world that it is not necessary. Second point: customers say we need faster XML. They are going to do something anyway (with respect to faster XML), and they don't necessarily make the best choices. One advantage of having a single spec (even if it is not the best solution) is that at least it is "one way" and a standard, and economies of scale may kick in.

MarkN: a tool that provides performance numbers may be useful for comparison. Some of the solutions actually add functionality that has nothing to do with binary encoding. There might be a need for a reworking of XML specs/APIs.

Oliver: 3(a) is a dangerous argument. Compactness is considered important, and that can be addressed by technology getting faster. But as machines get faster, people produce bigger and bigger documents. There are certain requirements that do need to be addressed, and we cannot assume that technology will take care of them.

David Orchard (BEA): there are certain aspects that are not addressed by Moore's law, such as random access to documents. Non-linear improvements are worth focusing on.

Noah: XML is tree-based, and I am not sure the community needs random access; we need to think about this. Some improvements such as random access may affect how we consider using XML, not just binary XML.

Liam: one way to do this is to look at what happens in databases. For example, there isn't a common table format, but there is a common query language. This may be applicable to random access.

Liam: there is a difference between doing something more or better, and doing something that we could not do before.

Eduardo: we should separate things like binary XML and fragmentation (which are not binary-XML requirements).

Jonathan: there are a lot of risks, such as diluting the XML brand or harming interop. But one thing we need to consider is that we are not going to satisfy all the requirements, so we are going to disappoint someone, and they will come up with their own format anyway.

break

Liam: we need to get into the discussion of where do we go from here/next step?

Liam: anyone here thinks there should not be any further discussion?

(no one raised their hands)

Liam: we should have a forum for this discussion. I would like to propose that we create a new W3C list for such a discussion.

ACTION: Liam to create a public W3C mailing list for discussion

John: should we document the objectives and deliverables for the list?

Noah: a lot of people are asking whether W3C can do some really serious work in this area. Beyond discussing this, how do you write a charter for the WG? There is a debate as to whether anything should be chartered at all; part of writing the charter is figuring out whether it is a bad idea. Perhaps a task force is the way to go. The debate will surely happen in the AC. I am not happy with the idea that we should have a WG to figure out if we should have a WG; that will result in a lot of delay. Or we say there is disagreement and the AC does what it does. We could find a process for deciding, and this requires more than just a mailing list. We need someone from W3C listening in, figuring out where this is going.

David Orchard (BEA): to decide the yes/no question of a WG, it is highly likely that we will need to do some kind of benchmarking. There is push-back on this because of the time required. It is more important to decide what the next deliverable is than the mechanism for producing it.

Noah: one other aspect is that we have to involve a broader range of people than are in this room.

Eduardo: I don't think we should do benchmarks. Maybe micro-benchmarks or experiments, so as to convince people. It would be good to focus on the deliverables rather than the mechanism, but the mechanism is tied to the deliverables.

Jim Trezzo: the core people don't want to spend all their time arguing about this with the world. All we need is to show feasibility, rather than benchmarks. We need to demonstrate that it is not a research project. And we want to do this really fast.

Liam: we could actually have a second event where we discuss the findings, similar to an interop workshop.

JimT: I don't want to take this too far. This may get too competitive.

David Orchard (BEA): various companies have benchmarks, but they are not comparable (as they are not normalized).

MarkN: a mailing list is a horrible place to discuss a charter, as this requires some hard decisions. We should document the state of the community, use cases for binary formats, and use cases that don't need them.

Nokia: I have concerns with benchmarks. Benchmarks do not deal with solutions that may come in the future; they should be for information purposes only. I would like to see some deadlines. A public email list is not the best way.

Liam: I don't think we mean to choose the one with the fastest benchmark. It is for demonstration purposes and for comparison.

Margaret Green: there are avenues other than email - blogging, webspace, a Wiki. There are alternatives to email.

Liam: a mailing list is good for having everyone's opinions heard, and is possibly a good process for bringing something to the attention of the AC. I don't want to get bogged down in the mechanism. Archives are helpful for mailing lists.

Margaret: as far as measurements/benchmarks go, they should include a range of devices and domains. We should be able to demonstrate the boundaries of what can be solved.

Stephen: we should use a Wiki for collaboration. Measurements should acknowledge that there will be some discussion of possible approaches.

Liam: some W3C IGs have had mailing lists where people were asked to stick to a particular topic for a period of time and then move to a different topic.

Don: big-picture objective - near term, an IG/WG so that we have a process to guide us. In the long term, I would suggest a goal: if we consider XML to be structured data, then the over-arching question is whether XML will be text plus binary, or just text.

John: about benchmarks - we should compare apples to apples. The danger is that if I have a super-fast parser and compare it with binary XML built on Xerces/DOM, we don't have a meaningful comparison. Comparing numbers from different vendors doesn't really mean anything.

Liam: it is entirely possible that the act of saying we need less bandwidth, memory, etc. may prompt people to improve their parsers. There are DOM implementations that use two orders of magnitude more memory than others. Clearly one can push back on the boundaries.

John: the point I am trying to make is that it is really hard to compare benchmarks without a controlled environment. This is not just limited to the hardware but also the software that is used.

Laamanen Heimo: measurements on top of different systems are very different (GPRS, etc.).

Liam: quite right. We need to have a shared set of data. We are not trying to benchmark implementations, but to see whether text or binary does better.

David Orchard (BEA): if we don't normalize the data sets and environments, the AC/public is going to poke holes in the benchmarks.

Liam: this will also challenge people to do better with text XML.

Stephen: a test framework for various systems (wireless, wired, etc.) would be useful. There is some value in looking at what it would take in terms of the programs.

MarkN: all these things (benchmarks, test framework, measurements) require a WG to reach consensus on. An IG is not going to reach consensus.

Liam: for a WG there are IPR issues and commitment requirements.

Arnaud: typically WGs have Rec-track deliverables. Participation requirements are much harder: attendance, standing, etc. There are commitments.

Liam: I should point out that not all WGs have Rec deliverables.

MarkN: a WG has IPR requirements, and that gives me more comfort.

Don: people who care about this should read the process documents. I agree that a WG is more appropriate. Perhaps we can start with an IG and progress to a WG: the goal is a WG, the fallback an IG for continued deliberate work towards a WG. A mailing list alone is a waste of time.

Dmitry: we need to decide on two deadlines: one for the IG/mailing list and another for the WG/Rec/standards. The IG/mailing-list deadline should be short.

David Orchard (BEA): IPR disclosures are required for a WG but not for an IG.

Liam: in principle we can invent something like an RF IG, or make it a task force under the auspices of an existing WG.

David Orchard (BEA): the reason a WG makes sense is that a commitment requirement maps to a WG. Having said that, if IBM could not live with a WG but could live with an IG, then we need to be flexible in order to be inclusive.

Arnaud: we have moved from discussing whether to have an IG to how to form a WG, but we don't know what the charter is. I don't think this work justifies a charter yet. We need a discussion on that.

Liam: if we came up with a charter for binary XML (assume that for a minute), how many people would participate? (This requires commitment.)

some hands raised

Liam: how many people would not participate in a WG, but would in a mailing list/IG/etc.?

a very few hands were raised

Liam: would anyone oppose the creation of a WG, or cannot live with the question?

Jonathan: there are a lot of voices outside this room that have not been heard. This is not an easy question to answer.

no hands raised

Liam: are there people here who think W3C should not do further work?

no hands raised

Liam: there is consensus that W3C should do further work in this area.

Kimmo Raatikainen: can we discuss the deadline for a draft charter before we break.

Liam: perhaps before the AC meeting in November, or in May. May seems too far off to me; it makes sense to aim for the November AC meeting.

Workshop Adjourned.

Next Steps

  1. Forum for further discussion

    ACTION: Liam will create a public W3C mailing list for discussion

  2. We need either a WG or IG to:

    • approve / agree on a possible WG charter, or decide it's a bad idea
    • discuss requirements
    • find set of use cases
    • publish comparisons/measurements on shared test data on range of devices and domains, demonstrate boundaries of what can be solved
    • Take the Taste Challenge: compare with text approaches
    • maybe discussion on possible approaches
    • need deadlines
  3. consider interoperability workshop to compare measurements
  4. consider Wiki for collaborative editing of requirements

Volunteers for drafting a possible WG charter for Nov (to figure out what to do with binary XML): Mark Nottingham (BEA), Robin Berjon (Expway), Stephen Williams, Santiago Pericas-Geertsen (Sun), Selim Balcisoy (Nokia), Kimmo Raatikainen (University of Helsinki), John Schneider (AgileDelta), Alex Danilo (Canon), Don Brutzman (Web3D Consortium), Mike Cokus (Mitre)

Position Papers

W3C Team

Representative: Liam Quin

Representative: Chris Lilley

Representative: Philippe Le Hegaret

Sun

Representative: Eduardo Pelegri-Llopart

Representative: Santiago Pericas-Geertsen

Nokia

Representative: Selim Balcisoy

Representative: Steve Lewontin

Timed Text

Representative: Glenn A. Adams

France Telecom

Representative: Fabrice Desré

CubeWerx Inc

Representative: Craig Bruce

Advanced Technologies Group NDS Ltd

Representative: Nigel Dallard

Telia Sonera

Representative: Laamanen Heimo

University of Helsinki

Representative: Prof. Kimmo Raatikainen

Dennis Sosnoski

Representative: Dennis M. Sosnoski

High Performance Technologies, Inc

Representative: Stephen D. Williams

Lionet

Representative: Christian Horn

Rick Marshall

CCSDS Packaging Working Group

Representative: Louis Reich

Expway

Representative: Cédric Thiénot

Representative: Robin Berjon

Software AG

Representative: Mike Champion

Representative: Trevor Ford

Canon

Representative: Alex Danilo

Representative: Jun Fujisawa

IBM

Representative: Michael (Mike) Conner

Representative: Noah Mendelsohn

XimpleWare

Representative: Jimmy Zhang

Representative: Kevin Lovette

Media Fusion

SAP

Representative: Canyang Kevin Liu

Systematic Software Engineering Ltd

Representative: Andrew Graham

Representative: Bjørn Reese

Swiss federal institute of technology (eth), Zurich

Representative: Erik Wilde

L3 communication

Representative: Bill Eller

Representative: Krissa Ross

Ontonet

Representative: Margaret Green

MITRE Corporation

Representative: Mike Cokus

Representative: Dr. Scott Renner

BEA

Representative: David Orchard

Representative: Mark Nottingham

Tarari

Representative: Michael Leventhal

Representative: Eric Lemoine

Adobe Systems Inc.

Representative: Oliver Goldman

Representative: Larry Masinter

Microsoft

Representative: Shankar Pal

Representative: Jonathan Marsh

AgileDelta

Representative: John Schneider

Representative: Jim Trezzo

Oracle

Representative: Dmitry Lenkov

Representative: Anish Karmarkar

OSS Nokalva

Representative: Alessandro Triglia

Representative: Michael Marchegay

HiT Software, Inc.

Representative: Giovanni Guardalben

KDDI

Representative: Kazunori Matsumoto

Representative: Takanari Hayama, Ph.D

Cisco

Representative: Alex Yiu-Man Chan

Web3D

Representative: Don Brutzman

Representative: Alan D. Hudson

Siemens

Resource Statement

This activity will consume 30% of the time of one W3C staff member for chairing the workshop, and 10% of the time of [the same] W3C staff member for managing the workshop website. This workshop is part of the W3C XML Activity.

