Binary XML Position Paper

The Need for Standard Schema-based and Hybrid Compression

 

Mike Cokus

Scott Renner

Dan Winkowski

 

The MITRE Corporation

{msc,sar,winkowski}@mitre.org

Executive Summary

 

US Air Force policy is to use XML for data which flows from one system to another, especially for data that flows from one automated process to another (as opposed to data intended for presentation to humans).  Some of these automated data flows will necessarily take place over wireless data links, where bandwidth is and will always be a severe constraint. Many of these exchanges are very small, well under 1000 bits.

 

Our position is:

1.        We want to use XML in these bandwidth-limited situations -- all of the usual advantages of web services apply here as well.

2.        In our application-level data exchanges -- especially in our wireless data link exchanges -- we can assume that data producers and consumers always have the XML schema for every document that is exchanged.

3.        By exploiting this shared schema knowledge in compressing the data, and by using an appropriate binary representation, we may keep the advantages of using XML, while reducing the number of bits that must be transmitted to something close to the information-theoretic optimum.  We have both theoretical hand-waving and experimental results in support of this belief.

4.        We need a single standard for this schema-based compression, such that any producer with the same schema and valid document will always produce the same binary, and anyone with the same schema and binary will always produce the same source document (or infoset). Otherwise, our configuration management problems will explode.  If we ever find ourselves worrying that "XYZ may not be able to handle this bitstream, because he may not have the ABC codec", then we're” hosed”.

5.        We believe that a hybrid (combining schema-based and redundancy schemes) approach will be required to support a comprehensive method of XML compression.  We need a single standard for hybrid compression methods for use as needed.  The identification of the compression method used should be part of the encoded form of the XML document.

6.        We do not want a military-only standard.  Those are bad.  We want to buy this capability as commercial off-the-shelf (COTS).  We believe there are civilian applications with similar needs. We want a W3C standard.

Experimental Results

 

In 2002, MITRE undertook a study [1] for the US Air Fore to examine how the verbosity of XML may be mitigated using existing technologies. The study focused on a message class consisting of mostly structured content. The message standard was designed originally for efficient binary transmission and a large binary message sample set was available for comparison. MITRE developed an XML W3C schema that was a faithful reflection of the structure of the message standard and following common practice used descriptive XML element names. The element content was defined using XML datatypes to represent the ASCII or numeric expression of the binary data. A gateway was developed to transform the binary sample message set into XML consistent with the schema definition.

 

The study examined three basic compression approaches, namely redundancy-based compression, schema-based encoding and hybrid techniques.  Redundancy-based compression methods take advantage of the repetition of character groups within the source document.  Schema-based methods leverage design knowledge of the structure and content of the source document to produce efficient field encodings.  Hybrid compression methods apply a combination of redundancy-based and schema-based methods to the same source document. Metrics on file size optimization were collected. Due to the differing maturity of the applications used in this study metrics on compression/decompression speeds are not reportable.

 

Two redundancy techniques were examined in the study. WinZip [2] is a PC application based on the Info-ZIP specification implementing the Lempel-Ziv (LZ77) and Huffman algorithms. We also implemented a “WBXML-like” encoding that applied WBXML [3] tag tokenization to the XML source documents based on code pages generated from an XML schema rather than a DTD.

 

Two schema-based compression methods were also selected. These methods work by encoding the file based on design knowledge about the permitted content and structure of the source document. It is analogous to a compilation stage that applies standard encoding rules to the source file to produce content suitable for machine-to-machine processing. In schema-based compression, the full XML tags are not represented in the encoded form but can be reconstructed using design knowledge during the decoding phase.  The technique also applies design knowledge to efficiently encode the data contents and places structural markers only when necessary into the file. Additionally, schema-based methods may allow content to be decoded in streaming mode.

 

The two schema-based techniques investigated were MPEG-7/BiM and Abstract Syntax Notation One (ASN.1). MPEG-7/BiM [4] is an ISO/IEC standard developed by Moving Picture Experts Group (MPEG). BiM (Binary Format for MPEG-7) is MPEG-7's XML schema-based compression standard.  Abstract Syntax Notation One [5], another ISO/IEC standard set, is used for defining messages for tree structured information exchange without regard to the implementation. Recent enhancements to the standard include support for XML.

 

Hybrid compression methods apply both redundancy-based and schema-based compression on the same source document.  Specifically, redundancy-based compression is applied after applying schema-based compression to further reduce the size of the compressed format.  Three hybrid approaches were selected, XMill, ASN.1 combined with WinZip, and MPEG-7/BiM combined with WinZip.

 

XMill [6], an open source research prototype developed by University of Pennsylvania and AT&T Labs, builds on the gzip library to provide an XML-specific compressor.  The path encoding options support selective application of type-specific (i.e., based on design knowledge of the source document) encoders to data content.  The data and content of the source document are compressed separately using redundancy compression.  Since XMill was the only native hybrid application selected for this study, the experiment included tests of each schema-based method followed by redundancy compression to provide a basis for comparison.

 

Tests were run using two sample sets of messages. The first, "Test Set A", did not have associated an equivalent binary sample due to a technical error. The second, "Test Set B", were produce directly from the binary sample set and therefore results include comparisons to the binary messages. Test Set A, being derived from the same message standard, is similar in structure and content to Test Set B and was included only because resources did not permit application of the ASN.1 techniques to Test Set B.

 

The message collections were grouped into files consisting of 6853, 500, 200, and 8 messages. The XML translated messages compared to the corresponding binary formats exhibited a ratio of 17:1 in size.

 

Test

Description

 Byte Size

Ratio vs. Binary

Ratio vs. XML

 

 

 

 

 

Test Set A XML

No Equivalent Binary Messages

 

 

 

 

6853 messages

9,878,662

 

 

 

8 messages

6,010

 

 

 

 

 

 

 

Test Set B XML

Equivalent Binary Messages Available for comparison

 

 

 

 

6853 messages

11,421,822

1694%

 

 

500 messages

855,297

1687%

 

 

200 messages

339,629

1699%

 

 

8 messages

12,209

1598%

 

 

 

 

 

 

Test Set A Results

 

 

 

 

 

 

 

 

 

WinZip Results for Test Set A with Max Compression

 

 

 

 

 

6853 messages WinZip (Max)

319,303

 

3.23%

 

8 messages WinZip (Max)

669

 

11.13%

 

 

 

 

 

WBXML-like Results for Test Set A

 

 

 

 

 

6853 messages, XSD-WBXML

4,148,070

 

41.99%

 

8 messages XSD-WBXML

2,567

 

42.71%

 

 

 

 

 

XMill Results for Test Set A

 

 

 

 

 

6853 messages, XMill Default

116,000

 

1.17%

 

6853 messages, XMill, plus path encoder

89,136

 

0.90%

 

 

 

 

 

 

8 messages, XMill Default

578

 

9.62%

 

8 messages, XMill, plus path encoder

1,152

 

19.17%

 

 

 

 

 

MPEG-7 Results for Test Set A

 

 

 

 

 

6853 messages, BiM (MPEG-7)

783,403

 

7.93%

 

8 messages, BiM (MPEG-7)

181

 

3.01%

 

 

 

 

 

ASN.1 (PER) Results for Test Set A

 

 

 

 

 

6853 Messages, PER Aligned

827983

 

8.39%

 

8 Messages, PER Aligned

58

 

0.98%

 

 

 

 

 

ASN.1 (PER)/WinzZip (Max) Results for Test Set A

 

 

 

 

 

6853 Messages, PER Aligned, WinZip (Max)

236,488

 

2.40%

 

8 Messages, PER Aligned, WinZip (Max)

154

 

2.61%

 

 

 

 

 

MPEG-7/WinzZip (Max) Results for Test Set A

 

 

 

 

 

6853 messages, BiM (MPEG-7), WinZip Max

231,762

 

2.35%

 

8 messages, BiM (MPEG-7), WinZip Max

181

 

3.01%

 

 

 

 

 

 

 

 

 

 

Test Set B Results

 

 

 

 

 

 

 

 

 

WinZip Results for Test Set B

 

 

 

 

 

6853 messages WinZip (Max)

330,189

49%

2.89%

 

500 messages WinZip (Max)

26,552

52%

3.10%

 

200 messages WinZip (Max)

11,844

59%

3.49%

 

8 messages WinZip (Max)

1,498

196%

12.27%

 

 

 

 

 

 

 

 

 

 

MPEG-7 Results for Test Set B

 

 

 

 

 

6853 messages BiM (MPEG-7)

1,046,827

155%

9.17%

 

500 messages BiM (MPEG-7)

75,744

149%

8.86%

 

200 messages BiM (MPEG-7)

30,196

151%

8.89%

 

8 messages BiM (MPEG-7)

1,210

158%

9.91%

 

 

 

 

 

WBXML-like Results for Test Set B

 

 

 

 

 

6853 messages Revised XSD-WBXML

4857021

721%

42.52%

 

500 messages Revised XSD-WBXML

361664

713%

42.29%

 

200 messages Revised XSD-WBXML

143671

719%

42.30%

 

8 messages Revised XSD-WBXML

5198

680%

42.58%

 

 

 

 

 

XMill Results for Test Set B

 

 

 

 

 

6853 messages, XMill default

117,652

17%

1.03%

 

6853 messages, XMill plus path encoder

91,007

14%

0.80%

 

 

 

 

 

 

500 messages, XMill default

12,694

25%

1.48%

 

500 messages, XMill plus path encoder

10,885

21%

1.27%

 

 

 

 

 

 

200 messages, XMill default

6,635

33%

1.95%

 

200 messages, XMill plus path encoder

5,967

30%

1.76%

 

 

 

 

 

 

8 messages, XMill default

1,422

186%

11.65%

 

8 messages, XMill plus path encoder

1,750

229%

14.33%

 

 

 

 

 

 

 

 

 

 

MPEG-7/WinzZip (Max) Results for Test Set B

 

 

 

 

 

6853 messages BiM (MPEG-7), WinZip Max

291,607

43%

2.55%

 

500 messages BiM (MPEG-7), WinZip Max

22,966

45%

2.69%

 

200 messages BiM (MPEG-7), WinZip Max

10,148

51%

2.99%

 

8 messages BiM (MPEG-7), WinZip Max

1,061

139%

8.69%

 

 

The study findings are detailed in the paper. Generally, we note that there exists a point at which redundancy based compression overtakes the efficiency of schema-based encoding and that this point is relatively low for the data set tested, roughly estimated to be in the neighborhood of 10,000 bytes. Schema-based techniques appear to be optimal for smaller XML files due to the efficiencies of compacted localized field and structural encoding. Application of redundancy compression for smaller XML files provides no benefit or proves to be a disadvantage.

 

Hybrid techniques, combining schema-based and redundancy compression, produced the best results overall with the exception of the smallest sample in Test A for which ASN.1 PER encoding yielded better results. XMill particularly demonstrated a significant advantage over WinZip. We expect schema-based encodings to be of significant advantage versus the other techniques for XML messages of this type for sizes 10,000 bytes or less.

 

Multiple techniques are required for a generalized approach to XML compression in order to obtain optimal size reduction across a broad range of file sizes.

 

Finally, XML compressed/encoded messages proved to be competitive in size to the original binary formats. Larger XML message collections, when reduced were significantly smaller than the original binary messages. Small XML messages when reduced proved competitive in size to their binary equivalent.

 

Position

 

The US Air Force desires that the W3C develop a standard for optimizing the size of XML content for transmission and storage. Secondary optimization factors such as compression/decompression speed and decoder application processing requirements should be considered. Update, query and navigability of the compressed form in our view should not be a factor in developing a standard. A streamed decoding capability typical of schema-based encoding techniques is desirable.

 

Though there is a wide range on the size of information exchanges that take place over networks, a particular bottleneck is the transmission of small exchanges that occur over network paths of limited bandwidth. Due to spectrum limitations these communications links are at a severe disadvantage when it comes to passing XML formatted content and this fact complicates IT architectures and impedes adoption of XML in the enterprise. The class of information exchanges designed as machine-to-machine exchanges where the use of XML syntax may degrade performance in high transaction environments is another impediment.

 

We believe the standard should combine both redundancy and schema-based algorithms. Consideration should be given as to how to ensure that use of this standard can be transparently integrated into XML processors and transmission protocols. Another consideration is packaging to identify the necessary parameters so that the appropriate decompression algorithm can be applied.

 

There exists a high potential for an optimized interchange standard as it applies to US military IT architectures and messaging standards. Current binary military message design practice, driven by requirements for bandwidth efficiency, are typically crafted to express field content and structure using the fewest bits possible. Design approaches vary but normally the binary physical encoding of the structure and fields are captured by rules unique to each specification. Codification of these rules is the hallmark of the schema-based approach. We believe that separating message design from the physical encoding requirements is a good design technique that needn’t sacrifice efficiency and view binary and XML serialization of a message model as alternate encodings. A standard as we described would provided for the unimpeded flow of information across the entire network, from a enterprise application server down to an embedded or hand held device.

 

 

References

1. “XML Sizing and Compression Study for Military Wireless Data” Mike Cokus & Dan Winkowski, MITRE XML 2002  http://www.idealliance.org/papers/xml02/dx_xml02/papers/06-02-04/06-02-04.html  Dec 2002.

 

2. WinZip, http://www.winzip.com/aes_info.htm

 

3. WBXML, http://www.wapforum.org

 

4. MPEG-7, http://www.mpeg-industry.com

 

5. ASN.1, http://www.itu.int/ITU-T/asn1/

 

6. XMill, http://www.research.att.com/sw/tools/xmill