Xerox Document Services Document Model

Workshop Position Paper 12 May 2004

Editor:
Leigh L. Klotz, Jr., Xerox Corporation <Leigh.Klotz@pahv.xerox.com>

Abstract

The Xerox Document Services Document Model (XDSDM) is part of the Xerox Document Services Platform, a prototype XML application under development for sequencing invocations of operations on a document model. The model represents the identity, content, and meta-data of compound documents, with a focus on scanned (image) documents and documents that are derived from them, although it is applicable to other document types as well. The document model, the sequencing mechanism, and the operations themselves are defined in a modular and extensible way. This paper discusses work in progress on the document model.

The structure of the XDSDM is designed to allow easy access and manipulation of the model by XPath expressions (see [XPath 1.0]), as is done in XForms (see [XForms 1.0]). Document Services that operate on the XDSDM can retrieve documents, metadata, and renditions from the model by using XPath expressions, and can update the model to add or change renditions or metadata through the use of XML Events (see [XML Events]) with DOM mutation actions, whose location is specified by XPath expressions and whose contents are described in the event payload.

The document content itself is not stored in the model, but is referred to by URI (see [RFC 2396]).

The following issues arise related to combining multiple vocabularies:

[Issue xml:id]
[Issue mustUnderstand]
[Issue Foreign Elements and Ordering]

The following issue arises related to web applications:

[Issue XML Handlers]



Status of this Document

This document is a position paper for [The W3C Workshop on Web Applications and Compound Documents]. It represents a work in progress, and is not supported by Xerox Corporation.

Table of Contents

1 Introduction
    1.1 Background
    1.2 Documentation Conventions
2 Document Structure
    2.1 Common Attributes
        2.1.1 Attribute doc:id
        2.1.2 Attribute doc:mustUnderstand
    2.2 Elements related to documents
        2.2.1 Element documents
        2.2.2 Element document
    2.3 Elements related to renditions
        2.3.1 Element doc:renditions
        2.3.2 Element doc:rendition
        2.3.3 Element doc:renditionSequence
        2.3.4 Namespace http://www.example.com/rendition-info
    2.4 Elements related to Meta Data
        2.4.1 Element doc:metadata
        2.4.2 Dublin Core Elements
3 Glossary Of Terms

Appendices

A Schemas for Xerox Document Service Document Model
    A.1 Schema for Document Model
    A.2 Schema for Rendition Information
B References
    B.1 Normative References
    B.2 Informative References
C Xerox Document Service Document Model Use Example (Non-Normative)
    C.1 Before OCR
    C.2 After OCR
D Changelog (Non-Normative)
E Acknowledgments (Non-Normative)
F Production Notes (Non-Normative)


1 Introduction

1.1 Background

While electronic documents are a dominant and increasing part of the business world, paper documents have not disappeared, and are in fact increasing in absolute terms. While paper documents offer significant advantages in reading, understanding, some kinds of editing, under certain legal conditions, and in the many other affordances of paper, electronic documents are undeniably the coin of the realm in today's business world. Increasingly, business is dealing with compound documents containing both paper and electronic forms. Storing copies of paper documents electronically gives the best of both worlds: it allows the full ease of electronic document operations to be applied to paper documents, yet still allows paper access to electronic documents when necessary and valuable. Capturing paper documents in a usable electronic form, and being able to print, copy, or otherwise operate on them from a desktop computer or a networked document appliance, is of great importance to business.

Documents are most valuable when they are situated, and can serve as memory association triggers. Many paper documents are situated by physical filing systems and human spatial memory, and not all of the information about the documents (meta-data) is in a form that is easily captured electronically. Electronic documents are usually situated by use of meta-data, or named properties of documents. Scanned electronic documents rapidly lose their value if the association between the document identity, document content, and document meta-data is lost.

Capturing paper documents has traditionally been an expensive business process. First, documents are scanned and saved to removable media or an extranet, then checked for quality and possibly re-scanned, then sent to a "coding bureau" to have meta-data typed in and associated, and then finally shipped to a document repository. The total cost of this cycle is tremendous, as information present at each step is lost before the next and must be recreated, at great cost. In summary, capturing meta-data for scanned documents is labor intensive, and is best done closest to the source of the documents, as it can be expensive to recover the meta-data at a later date, or by someone other than the document's owner.

Document Services re-envisions the paper-electronic boundary, and uses a capture technology that associates the meta-data with the document as soon as possible, when the document is still situated, and gives immediate feedback about the document quality, thus reducing the cost of both the capture and QA steps of document capture.

A typical paper business document achieves importance by being in the hands of a knowledge worker, who not only knows the value of the document, but also knows the context. Thus, he or she is an ideal person to capture the meta-data associated with the value and context of the document, and to approve the quality of its capture. Unfortunately, traditional document capture technologies are cumbersome and time-consuming, so it is not cost-effective to pay knowledge workers to handle their own documents. A Document Services-based system aims to reduce the cost of handling and capturing documents to produce rich repositories of electronic knowledge, at low cost, by integrating the handling of paper and electronic documents into the normal work practice of knowledge workers, with operations that are defined in their terms, rather than focused on traditional scanning procedures.

This specification defines one important part of a compound paper-electronic document processing system, the Xerox Document Services Document Model, which is an XML instance document modeling the documents under processing and holding their rendition and meta-data information. Other key components of a document processing system are mentioned here but are beyond the scope of this paper: a Document Service Orchestrator, which accepts a workflow definition XML document describing a process for performing document services, and the Document Services themselves. Document Services include capturing a document, adding meta-data to it, performing quality assurance, applying transformations such as OCR to both renditions and meta-data, storing the document in a document repository, and dispatching the document to a target such as a printer or e-mail address.

Manipulation of the XDSDM by document services is done through XPath expressions (see [XPath 1.0]), as is done in XForms (see [XForms 1.0]). In fact, the XDSDM itself is similar to the instance document in XForms, but instead of being modified through user interface controls, it is modified by document services.

Document Services that operate on the XDSDM can retrieve documents, metadata, and renditions from the model by using XPath expressions, and can update the model to add or change renditions or metadata through the use of XML Events (see [XML Events]) with DOM mutation actions, whose location is specified by XPath expressions and whose contents are described in the event payload. Unfortunately, the XML Events specification does not specify action handlers, and so the DOM mutation handlers are presently implemented as shorthand for an XSLT transformation (see [XSLT 1.0]): the action handler body is transformed into an XSLT transformation, which is then applied to the identified element document in the XDSDM, with the event payload available as the result of an XSLT extension function in XPath expressions.
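
As a rough illustration of this approach, the following is a minimal sketch of the kind of XSLT 1.0 stylesheet that a DOM mutation handler for adding a rendition might expand to. It is not the stylesheet generated by the prototype: the document id input-document, the rendition id text-rendition, and the src URI are assumptions made for the example, and the new rendition is written literally here rather than obtained from the event payload through an extension function.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:doc="http://www.example.com/document"
    xmlns:ri="http://www.example.com/rendition-info">

  <!-- Identity template: by default, copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Mutation: the match pattern plays the role of the XPath location;
       the literal doc:rendition stands in for the event payload (assumed values) -->
  <xsl:template match="doc:document[@id='input-document']/doc:renditions">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
      <doc:rendition id="text-rendition" src="file://localhost/documents/example.txt">
        <ri:contentType>text/plain</ri:contentType>
      </doc:rendition>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Applied to a model that contains a doc:document with id input-document, this stylesheet copies the model unchanged except that the new doc:rendition is appended as the last child of that document's doc:renditions.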

Issue (issue-xml-event-handlers):

XML Handlers

A recommendation for DOM Mutation and scripting in XML Event handlers would be most welcome.



The document content itself is not stored in the model, but is referred to by URI (see [RFC 2396]); this approach is compatible with the XForms 1.0 element upload and with the XForms 1.0 submission method multipart-related serialization.

1.2 Documentation Conventions

Throughout this document, the following namespace prefixes and corresponding namespace identifiers are used:

doc: The Document Services Document Model namespace (http://www.example.com/document); see A.1 Schema for Document Model
ri: The Document Services Rendition Information namespace (http://www.example.com/rendition-info); see A.2 Schema for Rendition Information
xsd: The XML Schema namespace (http://www.w3.org/2001/XMLSchema); see [XML Schema part 1]
xsi: The XML Schema for instances namespace (http://www.w3.org/2001/XMLSchema-instance); see [XML Schema part 1]
my: Any user-defined namespace

2 Document Structure

The XDSDM is derived from the document model of [System 33], in which a document is separated into a triple:

an identity with a unique handle
a set of parallel renditions representing the content of the document, each rendition having a series of named properties
a set of named metadata items

In XDSDM, each of the items in this triple is represented by an element: the identity by the element document, the renditions by a sequence of elements rendition, and the metadata items by a containing element metadata. The content of the renditions themselves is not stored in the model, but is referenced by the attribute src on rendition.

An XDSDM element documents contains zero or more documents, each of which can have zero or more renditions (content), and zero or more pieces of metadata. XML Schema descriptions for the XDSDM instance, document, rendition, and metadata structures are given. These schemas use XML namespaces for extensibility. Other XML applications such as [Guidelines for implementing Dublin Core in XML] are used where appropriate.
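
As a minimal sketch of this structure, the fragment below shows one document with its identity, a single rendition, and one metadata item. The id values and the src URI are illustrative assumptions; the id attributes on doc:renditions and doc:metadata are included because A.1 Schema for Document Model declares them as required.

```xml
<doc:documents xmlns:doc="http://www.example.com/document"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Identity: the handle is unique only within this model -->
  <doc:document id="doc-1">
    <!-- Content: referenced by URI, never embedded in the model -->
    <doc:renditions id="doc-1-renditions">
      <doc:rendition id="doc-1-scan" src="file://localhost/documents/example.tiff"/>
    </doc:renditions>
    <!-- Named metadata items, here a single Dublin Core title -->
    <doc:metadata id="doc-1-metadata">
      <dc:title>Purchase Order</dc:title>
    </doc:metadata>
  </doc:document>
</doc:documents>
```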

2.1 Common Attributes

2.1.1 Attribute doc:id

The attribute doc:id is common to most elements in this proposal; however, the use of multiple namespaces complicates the question of the namespace for the declaration of attribute id.

Issue (issue-id-attribute):

xml:id

An attribute xml:id added to the XML namespace would simplify matters greatly for XML applications using containing languages and multiple namespaces.



Foreign attributes are generally allowed, and services may ignore them.

2.1.2 Attribute doc:mustUnderstand

Services must process all elements and attributes in the following namespaces:

http://www.example.com/document
http://www.example.com/rendition-info
http://purl.org/dc/elements/1.1/

The attribute doc:mustUnderstand is used on any child element of metadata or rendition to indicate that any service processing the document must understand that element, and must not process the document if it does not. This concept is borrowed from [SOAP 1.2] and [XForms 1.0].
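
The fragment below is a small sketch of how the attribute might appear in a metadata block. The my namespace URI, the element names my:retentionPolicy and my:clientNumber, and the lexical value "true" are assumptions made for illustration; this paper does not yet fix the value space of doc:mustUnderstand.

```xml
<doc:metadata id="doc-1-metadata"
              xmlns:doc="http://www.example.com/document"
              xmlns:dc="http://purl.org/dc/elements/1.1/"
              xmlns:my="http://www.example.com/my-metadata">
  <!-- Dublin Core is in the set of namespaces every service must process -->
  <dc:title>Purchase Order</dc:title>
  <!-- Foreign element carrying the flag: a service that does not understand it
       must not process the document -->
  <my:retentionPolicy doc:mustUnderstand="true">seven-years</my:retentionPolicy>
  <!-- Foreign element without the flag: no such obligation is placed on the service -->
  <my:clientNumber>7764</my:clientNumber>
</doc:metadata>
```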

Issue (issue-mustUnderstand-attribute):

mustUnderstand

A common namespace for this concept would be beneficial to producers and services of loosely coupled multiple-namespace documents.



2.2 Elements related to documents

2.2.1 Element documents

XDSDM provides a containing element documents which holds a sequence of elements document.

Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.

2.2.2 Element document

In XDSDM, the document identity is provided by element document, with the unique handle provided by attribute id, which is unique only to the particular model. The model is composed of an XML document containing a sequence of zero or more elements document.

Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.

Issue (issue-foreign-elements-ordering):

Foreign Elements and Ordering

We would like element document to contain at most one element renditions and at most one element metadata, but any number of foreign elements. It is difficult to express the unordered choice of zero or one of these specified elements and at the same time allow any number of unordered foreign elements. This problem puts uncomfortable constraints on documents with multiple XML applications.
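
To make the problem concrete, the following sketch shows an instance that the repeated choice in the documentType content model of A.1 Schema for Document Model accepts, even though it contains two doc:metadata elements. The my namespace and its element names are hypothetical.

```xml
<doc:document id="doc-1"
              xmlns:doc="http://www.example.com/document"
              xmlns:dc="http://purl.org/dc/elements/1.1/"
              xmlns:my="http://www.example.com/my-extension">
  <!-- Foreign elements may appear anywhere and in any number, as intended -->
  <my:auditTrail/>
  <doc:metadata id="m1"><dc:title>Purchase Order</dc:title></doc:metadata>
  <my:annotation/>
  <!-- Undesired second doc:metadata, still accepted by the repeated choice -->
  <doc:metadata id="m2"><dc:title>Invoice</dc:title></doc:metadata>
</doc:document>
```

Rejecting the second doc:metadata while still allowing the foreign elements in any position is the constraint that is difficult to express.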

2.3 Elements related to renditions

2.3.1 Element doc:renditions

The element doc:renditions serves as a containing element for elements doc:rendition.

Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.

2.3.2 Element doc:rendition

A document can have zero or more child elements doc:rendition. Each rendition is a whole rendition of the document, though the content type, quality, fidelity, and other attributes of the rendition may vary.

Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.

2.3.3 Element doc:renditionSequence

Some documents are composed of an ordered sequence of renditions; for example, a document consisting of a scanned TIFF file [TIFF 6.0] followed by a PDF file [PDF 3.0] would have a doc:rendition containing a doc:renditionSequence, which in turn contains a sequence of two doc:rendition elements.

Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.
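
The TIFF-followed-by-PDF case described above might be written as in the following sketch. The id values and src URIs are assumptions; the id on doc:renditionSequence is present because A.1 Schema for Document Model declares it as required.

```xml
<doc:rendition id="combined"
               xmlns:doc="http://www.example.com/document"
               xmlns:ri="http://www.example.com/rendition-info">
  <!-- One logical rendition made of two parts, in reading order -->
  <doc:renditionSequence id="combined-parts">
    <doc:rendition id="part-1" src="file://localhost/documents/cover.tiff">
      <ri:contentType>image/tiff</ri:contentType>
    </doc:rendition>
    <doc:rendition id="part-2" src="file://localhost/documents/body.pdf">
      <ri:contentType>application/pdf</ri:contentType>
    </doc:rendition>
  </doc:renditionSequence>
</doc:rendition>
```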

2.3.4 Namespace http://www.example.com/rendition-info

While A.1 Schema for Document Model provides basic information about the existence of renditions and the location of their content, it provides no information about the rendition itself. Elements in any other namespace are allowed as children of doc:rendition, but for interoperability, this paper proposes a canonical set of rendition information elements in the namespace http://www.example.com/rendition-info.

While all renditions of a document are in some sense equivalent, they do have different properties; for example, an original scanned image will have near 100% fidelity to the paper document, but an OCR'd version of the document as plain text would have a low fidelity, perhaps 10%, and an uncorrected accuracy of perhaps 85%. The Schema for these and other common properties of renditions is given in A.2 Schema for Rendition Information.
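
As an illustrative sketch, the two renditions described above could carry their properties as follows. The id values and src URIs are assumptions, and the percentages are simply the figures quoted in the text.

```xml
<doc:renditions id="doc-1-renditions"
                xmlns:doc="http://www.example.com/document"
                xmlns:ri="http://www.example.com/rendition-info">
  <!-- Original scan: near-complete fidelity to the paper document -->
  <doc:rendition id="scan" src="file://localhost/documents/example.tiff">
    <ri:contentType>image/tiff</ri:contentType>
    <ri:fidelity>100</ri:fidelity>
  </doc:rendition>
  <!-- Plain-text OCR output: low fidelity, uncorrected accuracy -->
  <doc:rendition id="text" src="file://localhost/documents/example.txt">
    <ri:contentType>text/plain</ri:contentType>
    <ri:fidelity>10</ri:fidelity>
    <ri:accuracy>85</ri:accuracy>
  </doc:rendition>
</doc:renditions>
```

With doc:renditions as the context node, a service could then select the location of a high-fidelity rendition with an XPath 1.0 expression such as doc:rendition[ri:fidelity >= 90]/@src; the threshold of 90 is arbitrary.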

2.4 Elements related to Meta Data

2.4.1 Element doc:metadata

The element doc:metadata specifies a sequence of any items in any other namespace. It is up to the application using the document model to place constraints on the type of metadata to be gathered; however, see 2.4.2 Dublin Core Elements.

Foreign attributes are allowed, and services may ignore them. Foreign elements are allowed, and processing is subject to 2.1.2 Attribute doc:mustUnderstand.

2.4.2 Dublin Core Elements

The [Guidelines for implementing Dublin Core in XML] specify an embedding of Dublin Core elements in XML, and it is proposed that Dublin Core metadata items be used where practical. Services must understand these elements, and must not process a document if they do not.

3 Glossary Of Terms

document service

[Definition: A generic term for systems and services that process documents, but in this paper used specifically to refer to services on scanned image documents and their derivatives. Services include document capture, transformation, and distribution. ]

document services document model

[Definition: A Document Services Document Model is a single element documents which serves as a container for a series of documents]

document services-orchestrator

[Definition: A processor designed to apply a sequence of document services to a document services document model.]

OCR

[Definition: A class of document service that accepts a rendition of mediaType image/*, produces a new rendition of type text/* (or similar coded type), and optionally also produces new metadata.]

document repository

[Definition: A document storage and retrieval facility, such as a file server, web server, or other system.]

situated

[Definition: Situated documents obtain meaning from physical context. ]

target

[Definition: A destination for a document, such as a document repository or a printer.]

meta-data

[Definition: Data about a document, separate from its content; for example, the type of document (contract, letter, newspaper clipping) is meta-data.]

document

[Definition: In this paper, "document" refers to a scanned image document or a coded document derived from one.]

rendition

[Definition: A rendition of a document is a reference to the content of the document, as distinct from the location or identity of the document, or its meta-data. Documents can have multiple renditions, each with different properties; for example, there may be both an image and a text rendition of a document.]

A Schemas for Xerox Document Service Document Model

The example XML Schemas for XDSDM and related Rendition Information and Meta-Data namespaces are below:

A.1 Schema for Document Model

This is the XML Schema for the Document Model

<xs:schema xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:doc="http://www.example.com/document" targetNamespace="http://www.example.com/document" elementFormDefault="qualified">

  <xs:import namespace="http://purl.org/dc/elements/1.1/" schemaLocation="http://dublincore.org/schemas/xmls/qdc/2003/04/02/dc.xsd" />

  <!-- Foreign attributes allowed -->
  <xs:attributeGroup name="Common.Attributes">
    <xs:anyAttribute namespace="##other" />
  </xs:attributeGroup>

  <!-- elements -->
  <xs:element name="documents" type="doc:documentsType" />
  <xs:element name="document" type="doc:documentType" />
  <xs:element name="metadata" type="doc:metadataType" />
  <xs:element name="rendition" type="doc:renditionType" />
  <xs:element name="renditions" type="doc:renditionsType" />
  <xs:element name="renditionSequence" type="doc:renditionSequenceType" />

  <!-- types -->
  <xs:complexType name="documentsType">
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:choice>
        <xs:element ref="doc:document" />
        <xs:any namespace="##other" />
      </xs:choice>
    </xs:sequence>
    <xs:attributeGroup ref="doc:Common.Attributes" />
    <xs:attribute name="id" type="xs:ID" use="optional" />
  </xs:complexType>

  <xs:complexType name="documentType">
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:choice>
        <!-- We really want doc:renditions to be maxOccurs="1" -->
        <xs:element ref="doc:renditions" />
        <!-- We really want doc:metadata to be maxOccurs="1" -->
        <xs:element ref="doc:metadata" />
        <xs:any namespace="##other" />
      </xs:choice>
    </xs:sequence>
    <xs:attributeGroup ref="doc:Common.Attributes" />
    <!-- There should be an xml:id attribute so that documents
         do not have conflicts with the containing language
    -->
    <xs:attribute name="id" type="xs:ID" use="required" />
  </xs:complexType>

  <xs:complexType name="metadataType">
    <xs:sequence minOccurs="0" maxOccurs="unbounded">
      <xs:choice>
        <xs:any namespace="http://purl.org/dc/elements/1.1/" />
        <xs:any namespace="##other" />
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="document" type="xs:IDREF" use="optional" />
    <xs:attribute name="id" type="xs:ID" use="required" />
  </xs:complexType>

  <!-- Each rendition in a renditions is considered a separate, equivalent rendition of the document -->
  <xs:complexType name="renditionsType">
    <xs:sequence>
      <xs:element ref="doc:rendition" minOccurs="0" maxOccurs="unbounded" />
      <xs:any namespace="##other" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
    <xs:attributeGroup ref="doc:Common.Attributes" />
    <xs:attribute name="document" type="xs:IDREF" use="optional" />
    <xs:attribute name="id" type="xs:ID" use="required" />
  </xs:complexType>

  <!-- A renditionSequence is an ordered sequence of renditions constituting one rendition -->
  <xs:complexType name="renditionSequenceType">
    <xs:sequence>
      <xs:element ref="doc:rendition" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
    <xs:attributeGroup ref="doc:Common.Attributes" />
    <xs:attribute name="document" type="xs:IDREF" use="optional" />
    <xs:attribute name="id" type="xs:ID" use="required" />
  </xs:complexType>

  <xs:complexType name="renditionType">
    <xs:sequence>
      <xs:element ref="doc:renditionSequence" minOccurs="0" maxOccurs="unbounded" />
      <xs:any namespace="##other" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
    <xs:attribute name="src" type="xs:anyURI" use="optional" />
    <xs:attribute name="document" type="xs:IDREF" use="optional" />
    <xs:attribute name="id" type="xs:ID" use="optional" />
    <xs:attributeGroup ref="doc:Common.Attributes" />
  </xs:complexType>

</xs:schema>

A.2 Schema for Rendition Information

This is the XML Schema for Rendition Information. Rendition Information is a common set of rendition properties that are expected to be understood by all services, but are not the exclusive set of properties.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:ri="http://www.example.com/rendition-info" targetNamespace="http://www.example.com/rendition-info" elementFormDefault="qualified">

  <!-- Element declarations -->
  <!-- Use elements rather than attributes to allow for @mustUnderstand -->
  <xs:element name="language" type="ri:languageType" />
  <xs:element name="typesetting" type="ri:typesettingType" />
  <xs:element name="filename" type="xs:string" />
  <xs:element name="xresolution" type="xs:decimal" />
  <xs:element name="yresolution" type="xs:decimal" />
  <!-- 0-100% for fidelity and accuracy? -->
  <xs:element name="fidelity" type="xs:decimal" />
  <xs:element name="accuracy" type="xs:decimal" />
  <xs:element name="contentType" type="xs:string" />
  <xs:element name="contentLength" type="xs:nonNegativeInteger" />
  <xs:element name="pageCount" type="xs:nonNegativeInteger" />

  <!-- types -->
  <xs:simpleType name="languageType">
    <xs:restriction base="xs:string" />
  </xs:simpleType>

  <xs:simpleType name="typesettingType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="printed" />
      <xs:enumeration value="dotMatrix24" />
      <xs:enumeration value="dotMatrix9" />
      <xs:enumeration value="handPrint" />
    </xs:restriction>
  </xs:simpleType>

</xs:schema>

B References

B.1 Normative References

XForms 1.0
XForms 1.0, M. Dubinko, et al., 2003. W3C Recommendation available at http://www.w3.org/TR/xforms/ .
RFC 2396
RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax, T. Berners-Lee, R. Fielding, L. Masinter, 1998. Available at http://www.ietf.org/rfc/rfc2396.txt.
XHTML Modularization
Modularization of XHTML, M. Altheim, et al., 2001. W3C Recommendation available at http://www.w3.org/TR/xhtml-modularization/ .
XML Base
XML Base, Jonathan Marsh, 2001. W3C Recommendation available at http://www.w3.org/TR/xmlbase/ .
XML 1.0
Extensible Markup Language (XML) 1.0 (Second Edition), Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, 2000. W3C Recommendation available at http://www.w3.org/TR/REC-xml
XML Names
Namespaces in XML, Tim Bray, Dave Hollander, Andrew Layman, 1999. W3C Recommendation available at http://www.w3.org/TR/REC-xml-names .
SOAP 1.2
SOAP Version 1.2 Part 0: Primer, Nilo Mitra, 2003. W3C Recommendation available at http://www.w3.org/TR/soap12-part0/ .
XPath 1.0
XML Path Language (XPath) Version 1.0, James Clark, Steve DeRose, 1999. W3C Recommendation available at http://www.w3.org/TR/xpath .
XSLT 1.0
XSL Transformations (XSLT) Version 1.0, James Clark, 1999. W3C Recommendation available at http://www.w3.org/TR/xslt .
XML Schema part 1
XML Schema Part 1: Structures, Henry S. Thompson, David Beech, Murray Maloney, Noah Mendelsohn, 2001. W3C Recommendation available at http://www.w3.org/TR/xmlschema-1/ .
XML Schema part 2
XML Schema Part 2: Datatypes, Paul V. Biron, Ashok Malhotra, 2001. W3C Recommendation available at http://www.w3.org/TR/xmlschema-2/ .

B.2 Informative References

XML Events
XML Events - An events syntax for XML, Steven Pemberton, T. V. Raman, Shane P. McCarron, 2003. W3C Recommendation available at http://www.w3.org/TR/xml-events/ .
XHTML 1.0
XHTML 1.0: The Extensible HyperText Markup Language - A Reformulation of HTML 4 in XML 1.0, Steven Pemberton, et al., 2000. W3C Recommendation available at http://www.w3.org/TR/xhtml1 .
XML Schema part 0
XML Schema Part 0: Primer, David C. Fallside, 2001. W3C Recommendation available at http://www.w3.org/TR/xmlschema-0/ .
System 33
Design and Implementation of the System 33 Document Service, Putz, Steve, 1993. Xerox PARC P93-00112. Available at http://wwww.parc.com/about/history/publications/bw-ps/system33.ps .
Guidelines for implementing Dublin Core in XML
Guidelines for implementing Dublin Core in XML, Powell, Andy, et al. Available at http://dublincore.org/documents/dc-xml-guidelines/ .
The W3C Workshop on Web Applications and Compound Documents
The W3C Workshop on Web Applications and Compound Documents. Available at http://www.w3.org/2004/04/webapps-cdf-ws/ .
TIFF 6.0
TIFF 6.0, Adobe Systems, Incorporated, 1992. Available at http://partners.adobe.com/asn/developer/pdfs/tn/TIFF6.pdf.
PDF 3.0
PDF Reference, Third Edition, Version 1.4. Adobe Systems, Incorporated, 2003. Addison-Wesley, ISBN 0-201-75839-3. Available at http://partners.adobe.com/asn/acrobat/docs/File_Format_Specifications/PDFReference.pdf .

C Xerox Document Service Document Model Use Example (Non-Normative)

This section presents an example use of the XDSDM. The first example shows a job document before OCR, and the second shows how the documents instance is updated by the OCR service.

C.1 Before OCR

<o:job xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:transformation="http://www.example.com/transformation" xmlns:template="http://www.example.com/template" xmlns:services="http://www.example.com/services" xmlns:rq="http://www.example.com/rendition-request" xmlns:ri="http://www.example.com/rendition-info" xmlns:rfc822="urn:IANA:namespace:rfc822" xmlns:repositories="http://www.example.com/repositories" xmlns:ocr="http://www.example.com/ocr" xmlns:o="http://www.example.com/orchestration" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:emx="urn:ietf:params:email-xml" xmlns:email="http://www.example.com/email" xmlns:doc="http://www.example.com/document" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:containers="http://www.example.com/containers">

  <!-- The Orchestration data model -->
  <o:data>
    <!-- The Xerox Document Services Document Model before addition of the new rendition generated by OCR. -->
    <doc:documents>
      <doc:document id="input-document">
        <doc:metadata xmlns="">
          <dc:description xml:lang="en">Example.com</dc:description>
          <dc:title>Purchase Order</dc:title>
          <ClientNumber>7764</ClientNumber>
        </doc:metadata>
        <doc:renditions>
          <doc:rendition id="scanned-rendition" src="file://localhost/documents/03fbd3c8.tiff">
            <ri:contentType>image/tiff</ri:contentType>
            <ri:xresolution>300</ri:xresolution>
            <ri:yresolution>300</ri:yresolution>
            <ri:fidelity>100</ri:fidelity>
            <ri:pageCount>42</ri:pageCount>
            <ri:contentLength>12427305</ri:contentLength>
          </doc:rendition>
          <doc:rendition id="ocr-rendition" />
        </doc:renditions>
      </doc:document>
    </doc:documents>
  </o:data>

  <!-- A step in a simple Document Services Orchestration document that operates on the above document model. -->
  <o:step>
    <containers:TransformDocument implementation="OCR">
      <!-- A TransformDocument container is ready-to-run when all previous steps have
           finished contributing renditions and at least one rendition on the
           specified document matches the renditionsRequired constraints.
           During execution of any handler in this container, the renditions() XPath
           function returns the nodeset of renditions matching these requirements.
      -->
      <containers:input document="InputDoc">
        <rq:renditionRequest>
          <ri:content-type>image/tiff image/*</ri:content-type>
          <rq:minimum><ri:resolution>300</ri:resolution></rq:minimum>
        </rq:renditionRequest>
      </containers:input>
      <containers:output rendition="OCR" />
      <action ev:event="containers:invoke">
        <containers:invoke>
          <template:template name="services:TransformData">
            <renditions>
              <template:copy select="ocr:bestRendition(renditions())" />
            </renditions>
            <transformation:renditionRequest xsi:type="ocr:ocrRenditionRequest">
              <ocr:recognizeText>
                <ri:language>en</ri:language>
                <ocr:textFormat>Searchable PDF</ocr:textFormat>
                <ocr:tradeOff>speed</ocr:tradeOff>
                <ri:typesetting>printed</ri:typesetting>
                <ocr:layout>auto</ocr:layout>
              </ocr:recognizeText>
            </transformation:renditionRequest>
          </template:template>
        </containers:invoke>
      </action>
    </containers:TransformDocument>
  </o:step>

  <!-- See http://www.ninebynine.org/IETF/Messaging/Intro.html
       and the current RFC draft draft-klyne-message-xml-00a.txt
  -->
  <o:step>
    <containers:SendDocument implementation="Email" groupName="Aaron">
      <containers:input document="InputDoc" rendition="instance('documentData')/id('OCR')" />
      <action ev:event="step">
        <containers:invoke>
          <template:template name="services:SendData">
            <emx:Message>
              <rfc822:subject><template:value select="metadata()/Title" /></rfc822:subject>
              <emx:content type="text/plain">
                <template:value select="metadata()/Description" />
              </emx:content>
              <emx:content>
                <template:attribute name="type" value="rendition()/@content-type" />
                <template:copy select="rendition()" />
              </emx:content>
              <rfc822:to>
                <emx:Address>
                  <emx:adrs>mailto:Fred.Derf@example.com</emx:adrs>
                  <emx:name>Fred Derf</emx:name>
                </emx:Address>
              </rfc822:to>
            </emx:Message>
          </template:template>
        </containers:invoke>
      </action>
    </containers:SendDocument>
  </o:step>

  <o:completion>
    <NamedEmailConfirmation>
      <ConfirmationAddress>
        <emx:Address>
          <emx:adrs>mailto:Fred.Derf@usa.xerox.com</emx:adrs>
          <emx:name>Fred Derf</emx:name>
        </emx:Address>
      </ConfirmationAddress>
    </NamedEmailConfirmation>
  </o:completion>

</o:job>

C.2 After OCR

<o:job xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:transformation="http://www.example.com/transformation" xmlns:template="http://www.example.com/template" xmlns:services="http://www.example.com/services" xmlns:rq="http://www.example.com/rendition-request" xmlns:ri="http://www.example.com/rendition-info" xmlns:rfc822="urn:IANA:namespace:rfc822" xmlns:repositories="http://www.example.com/repositories" xmlns:ocr="http://www.example.com/ocr" xmlns:o="http://www.example.com/orchestration" xmlns:ev="http://www.w3c.org/2002/xml-events" xmlns:emx="urn:ietf:params:email-xml" xmlns:email="http://www.example.com/email" xmlns:doc="http://www.example.com/document" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:containers="http://www.example.com/containers">

  
  <!-- The Orchestration data model -->
  <o:data>
    <!-- The Xerox Document Services Document Model after addition of the new rendition generated by OCR. -->
    <doc:documents>
      <doc:document id="input-document">
        <doc:metadata xmlns="">
          <dc:description xml:lang="en">Example.com</dc:description>
          <dc:title>Purchase Order</dc:title>
          <ClientNumber>7764</ClientNumber>
        </doc:metadata>
        <doc:renditions>
          <doc:rendition id="scanned-rendition" src="file://localhost/documents/03fbd3c8.tiff">
            <ri:contentType>image/tiff</ri:contentType>
            <ri:xresolution>300</ri:xresolution>
            <ri:yresolution>300</ri:yresolution>
            <ri:fidelity>100</ri:fidelity>
            <ri:pageCount>42</ri:pageCount>
            <ri:contentLength>12427305</ri:contentLength>
          </doc:rendition>
          <doc:rendition id="ocr-rendition" src="file://localhost/documents/273ffde.txt">
            <ri:contentType>text/plain</ri:contentType>
            <ri:contentLength>42930</ri:contentLength>
            <ri:fidelity>10</ri:fidelity>
            <ri:accuracy>90</ri:accuracy>
          </doc:rendition>
        </doc:renditions>
      </doc:document>
    </doc:documents>
  </o:data>

  
  <!-- A step in a simple Document Services Orchestration document that operates on the above document model. -->
  <o:step>
    <containers:TransformDocument implementation="OCR">
      <!-- A TransformDocument container is ready-to-run when all previous steps have
           finished contributing renditions and at least one rendition on the
           specified document matches the renditionsRequired constraints.
           During execution of any handler in this container, the renditions() XPath
           function returns the nodeset of renditions matching these requirements.
      -->
      <containers:input document="InputDoc">
        <rq:renditionRequest>
          <ri:content-type>image/tiff image/*</ri:content-type>
          <rq:minimum><ri:resolution>300</ri:resolution></rq:minimum>
        </rq:renditionRequest>
      </containers:input>
      <containers:output rendition="OCR" />
      <action ev:event="containers:invoke">
        <containers:invoke>
          <template:template name="services:TransformData">
            <renditions>
              <template:copy select="ocr:bestRendition(renditions())" />
            </renditions>
            <transformation:renditionRequest xsi:type="ocr:ocrRenditionRequest">
              <ocr:recognizeText>
                <ri:language>en</ri:language>
                <ocr:textFormat>Searchable PDF</ocr:textFormat>
                <ocr:tradeOff>speed</ocr:tradeOff>
                <ri:typesetting>printed</ri:typesetting>
                <ocr:layout>auto</ocr:layout>
              </ocr:recognizeText>
            </transformation:renditionRequest>
          </template:template>
        </containers:invoke>
      </action>
    </containers:TransformDocument>
  </o:step>

    
  <!-- See http://www.ninebynine.org/IETF/Messaging/Intro.html
       and the current RFC draft draft-klyne-message-xml-00a.txt
  -->
  <o:step>
    <containers:SendDocument implementation="Email" groupName="Aaron">
      <containers:input document="InputDoc" rendition="instance('documentData')/id('OCR')" />
      <action ev:event="step">
        <containers:invoke>
          <template:template name="services:SendData">
            <emx:Message>
              <rfc822:subject><template:value select="metadata()/Title" /></rfc822:subject>
              <emx:content type="text/plain">
                <template:value select="metadata()/Description" />
              </emx:content>
              <emx:content>
                <template:attribute name="type" value="rendition()/@content-type" />
                <template:copy select="rendition()" />
              </emx:content>
              <rfc822:to>
                <emx:Address>
                  <emx:adrs>mailto:Fred.Derf@example.com</emx:adrs>
                  <emx:name>Fred Derf</emx:name>
                </emx:Address>
              </rfc822:to>
            </emx:Message>
          </template:template>
        </containers:invoke>
      </action>
    </containers:SendDocument>
  </o:step>

  
  <o:completion>
    <NamedEmailConfirmation>
      <ConfirmationAddress>
        <emx:Address>
          <emx:adrs>mailto:Fred.Derf@usa.xerox.com</emx:adrs>
          <emx:name>Fred Derf</emx:name>
        </emx:Address>
      </ConfirmationAddress>
    </NamedEmailConfirmation>
  </o:completion>

</o:job>
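
The comment on the TransformDocument container says the step becomes ready-to-run once at least one rendition of the input document satisfies the rendition request. The Python sketch below illustrates one possible reading of that check against a reduced copy of the model above. The matching semantics are assumptions, not taken from the platform: the space-separated content types are treated as glob-style alternatives, and the single minimum resolution is applied to both ri:xresolution and ri:yresolution. The function `matching_renditions` is hypothetical, and the document is looked up by the id used in the model (`input-document`).

```python
import fnmatch
import xml.etree.ElementTree as ET

NS = {
    "doc": "http://www.example.com/document",
    "ri": "http://www.example.com/rendition-info",
}

# Reduced copy of the document model from this appendix.
MODEL = """
<doc:documents xmlns:doc="http://www.example.com/document"
               xmlns:ri="http://www.example.com/rendition-info">
  <doc:document id="input-document">
    <doc:renditions>
      <doc:rendition id="scanned-rendition" src="file://localhost/documents/03fbd3c8.tiff">
        <ri:contentType>image/tiff</ri:contentType>
        <ri:xresolution>300</ri:xresolution>
        <ri:yresolution>300</ri:yresolution>
      </doc:rendition>
      <doc:rendition id="ocr-rendition" src="file://localhost/documents/273ffde.txt">
        <ri:contentType>text/plain</ri:contentType>
      </doc:rendition>
    </doc:renditions>
  </doc:document>
</doc:documents>
"""


def matching_renditions(root, document_id, type_patterns, min_resolution):
    """Return the ids of renditions that satisfy the assumed request semantics."""
    matches = []
    for document in root.findall("doc:document", NS):
        if document.get("id") != document_id:
            continue
        for rendition in document.findall("doc:renditions/doc:rendition", NS):
            content_type = rendition.findtext("ri:contentType", default="", namespaces=NS)
            # Assumption: any one of the listed content-type patterns may match.
            if not any(fnmatch.fnmatchcase(content_type, p) for p in type_patterns):
                continue
            # Assumption: both resolutions must be present and meet the minimum.
            resolutions = [
                rendition.findtext(tag, namespaces=NS)
                for tag in ("ri:xresolution", "ri:yresolution")
            ]
            if any(r is None or int(r) < min_resolution for r in resolutions):
                continue
            matches.append(rendition.get("id"))
    return matches


root = ET.fromstring(MODEL)
matched = matching_renditions(root, "input-document", "image/tiff image/*".split(), 300)
print(matched)  # ['scanned-rendition']
```

Under these assumptions only the scanned TIFF rendition qualifies, so a non-empty result would correspond to the container being ready-to-run.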

D Changelog (Non-Normative)

This section summarizes changes since the previous draft of this document.

E Acknowledgments (Non-Normative)

This model was produced with the participation of the following individuals:

F Production Notes (Non-Normative)

This document was encoded in the XMLspec DTD (which has documentation available). The XML sources were transformed using the xmlspec.xsl stylesheet. The XML Schemas and examples were rendered with the xmlverbatim XSLT stylesheet. Emacs was used for editing. The XML was validated using xmllint (part of the GNOME libxml package) and transformed using xsltproc (part of the GNOME libxslt package).