Archive Module

EXPath Candidate Module 12 May 2014

This version:
http://expath.org/spec/archive/20140512
Latest version:
http://expath.org/spec/archive
Previous versions:
http://expath.org/spec/archive/20131205
http://expath.org/spec/archive/20130930
http://expath.org/spec/zip/20101012
Editor:
John Lumley, Saxonica Ltd <john@saxonica.com>
Contributors:
Christian Grün, BaseX GmbH <christian.gruen@gmail.com>
Matthias Brantner, 28msec GmbH <matthias.brantner@28msec.com>
Florent Georges, H2O Consulting

This document is also available in these non-normative formats: XML.


Abstract

This proposal provides an API for XPath 2.0 and XPath 3.0 to handle archive data (i.e. collected and possibly compressed sets of files and directories). It defines extension functions to read data from and write data to such archive files, including creation, determining and setting properties, listing and extracting contents, and adding and updating entries. It has been designed to be compatible with XQuery 1.0 and XSLT 2.0, as well as any other XPath 2.0 usage. Some additional features for use in XPath 3.0 are also defined.

Table of Contents

1 Status of this document
2 Introduction
    2.1 Namespace conventions
    2.2 Error management
    2.3 Archive representation
    2.4 Archive types
    2.5 Optional interfaces
3 Use cases
    3.1 Creating a simple EPUB document
    3.2 Examining a JAR file
4 Describing archives and entries
    4.1 Archive properties and options
    4.2 Entry descriptions
5 Loading and saving archives
6 Information about an archive and its contents
    6.1 arch:options
    6.2 arch:entry-names
    6.3 arch:entries
7 Extracting entries from an archive
    7.1 arch:extract-binary
    7.2 arch:extract-text
8 Updating entries in an archive
    8.1 arch:delete
    8.2 arch:update
9 Creating an archive
    9.1 arch:create
10 Creating and extracting complete archives from and to file systems
    10.1 arch:from-files
    10.2 arch:to-files
11 Convenience functions
    11.1 arch:text
    11.2 arch:xml
12 Functions using XSLT3.0 map() type
    12.1 Using map types to describe entries and options
        12.1.1 Archive property maps
        12.1.2 Entry property maps
    12.2 arch:options-map
    12.3 arch:entries-map
    12.4 arch:extract-map
    12.5 arch:extract-binary-map
    12.6 arch:extract-text-map
    12.7 arch:create-map
    12.8 arch:update-map
    12.9 arch:delete-map

Appendices

A References
B Summary of error conditions


1 Status of this document

This document is in an interim draft stage. Comments are welcome on the public-expath@w3.org mailing list (archive).

2 Introduction

2.1 Namespace conventions

The module defined by this document defines several functions, all contained in the namespace http://expath.org/ns/archive. In this document, the arch prefix, when used, is bound to this namespace URI.

Error codes are defined in the same namespace (http://expath.org/ns/archive), and in this document are displayed with the same prefix, arch.

Note:

This follows the suggestion (in late August 2013) for a coherent naming standard in EXPath modules.

Binary file I/O, to read and write complete archives to files, uses facilities defined in [EXPath File], which defines functions in the namespace http://expath.org/ns/file. In this document, the file prefix, when used, is bound to this namespace URI.

Manipulation of binary data itself can employ functions from [EXPath Binary], which defines functions in the namespace http://expath.org/ns/binary. In this document, the bin prefix, when used, is bound to this namespace URI.

2.2 Error management

Error conditions are identified by a code (a QName). When such an error condition is reached in the evaluation of an expression, a dynamic error is thrown with the corresponding error code, as if the standard XPath function error() had been called. The namespace of the code follows that of the module within whose processing the error occurs, i.e. http://expath.org/ns/archive for errors in archive manipulation, http://expath.org/ns/file for errors in file operations and http://expath.org/ns/binary for errors in processing binary data.

2.3 Archive representation

Archives in this module are represented principally as items of type xs:base64Binary, i.e. in their basic binary (byte sequence) forms.

Archives are treated as being arranged structurally as a description of overall options of the archive and a sequence of named entries. Each entry has:

  • A name, which is treated as a sequence of Unicode characters. In many cases the solidus character (/) is used to imply the entries being logically arranged in positions within a directory tree, but this is not mandatory.

  • A set of properties, denoting at least the uncompressed size of the entry, archive internal properties for the entry, such as the compression method used on the stored data and other indications such as the date of last modification.

  • Data, treated as (possibly null) binary data.

It is most common that archives are considered to be arranged logically as directories, using the entry names to denote paths and file names (e.g. tests/qt3/archive/main.xml). In such circumstances, archives may contain entries to represent the directories themselves (e.g. tests/qt3/archive/), presumably with no data. [This could be used such that full extraction of an archive to a file system generates empty output directories, for example.] This specification makes no distinction between these two cases – if an archive has an empty 'directory' entry it will be treated similarly to any other 'file' entry. Semantic interpretation of entry names as files in directory trees is an application issue.

Note:

Behaviour when entries with duplicate names are detected in an archive is implementation dependent. Nevertheless, if an error is not thrown, only one entry should be returned when reading. Implementations must not write duplicate entries in result archives.

2.4 Archive types

The module is designed to be able to support a number of different types of archive, providing a coherent access mechanism.

The following archive types are required to be supported:

  • [ZIP]: This also covers derivative archive formats, such as JAR or OpenDocument.

  • [GZIP] : A compressed archive of a sequence of files

    Note:

    Within GZIP, names of entries (original file names) are optional, on a per-file basis, so special measures may need to be taken to handle 'unnamed' sections.

Specific issues arise from i) archives used in streaming situations, where the internal manifests of the archives cannot be completed until all data is written, and ii) archives where the order of entries is important, such as JAR, where the manifest entries need to be first.

Note:

Currently there are no proposals within this module to cover encrypted archives.

2.5 Optional interfaces

This module defines two distinctly different interface schemes for reporting on and manipulating archive data. The first uses XML-structured trees to describe entries, their names and their properties, leaving (binary) data described in separate arguments to or results from the functions defined. All conformant implementations must support this interface.

An alternative interface, using the proposed XPath3.0 map() type (see 12 Functions using XSLT3.0 map() type), may be supported by an implementation. This significantly increases the coherence of the connection between entries and their data (as binary data can be the 'value' of a map entry), at the minor cost of having to specify entry order for those archive usages which are order sensitive (e.g. EPUB). This map interface can co-exist with the XML-structured one.

3 Use cases

Development of this specification was driven by requirements which some XML developers regularly encounter in examining or generating data which is presented in archival forms. Some typical use cases include:

3.1 Creating a simple EPUB document

An [EPUB] document is a collection of content sections, written in XHTML, with a metadata descriptor (usually the content.opf file) and a navigation description (usually the toc.ncx file), all collected together and potentially compressed in a ZIP format. A simple example of creating such a document in XQuery is:

arch:create(
    (
      "mimetype",
      "META-INF/container.xml",
      "OEBPS/content.opf",
      "OEBPS/Text/title.xhtml",
      "OEBPS/Text/chap01.xhtml",
      "OEBPS/toc.ncx"
    ),
    (
      content:mimetype(),
      content:metainf(),
      content:oebps-content(),
      content:title(),
      content:chapter(),
      content:toc()
    )
  )

The user-supplied XQuery function content:mimetype() returns the appropriate mimetype description for the EPUB document as a base64-encoding of a string ("application/epub+zip"). Each of the other content:*() functions generates a serialized form of the appropriate XML structure again in a base64 encoding, e.g.:

declare function content:title() as xs:base64Binary
{
  bin:encode-string(fn:serialize(
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Title Page</title>
    </head>
    
    <body>
      <div>
        <h2 id="heading_id_2">Sample Book</h2>
    
        <h2 id="heading_id_3">A Sample .epub Book</h2>
    
        <h3 id="heading_id_4">Title Page</h3>
      </div>
    </body>
    </html>
  ))
};

For an EPUB document the mimetype entry must be uncompressed (so effectively it can be read by simple string searching), but other entries may be compressed.
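
As an illustration of the description above, the user-supplied content:mimetype() function might be sketched as follows. This is only a sketch: the content namespace URI is a hypothetical placeholder, and it assumes that the processor makes the [EXPath Binary] functions available under the declared bin namespace.

```xquery
declare namespace bin = "http://expath.org/ns/binary";
(: Hypothetical namespace for the user-supplied content:* functions. :)
declare namespace content = "http://example.org/content";

declare function content:mimetype() as xs:base64Binary
{
  (: Encode the fixed EPUB media type string as binary data. :)
  bin:encode-string("application/epub+zip")
};

content:mimetype()
```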

3.2 Examining a JAR file

JAR files contain class code and definitions for Java classes, in entries whose names are path/classname.class. Local classes (classes defined within a class) have separate code entries with a classname outerclass$innerclass. To find all the main package-qualified classes the following XPath should suffice:

for $e in arch:entry-names(file:read-binary("lib/saxon9-sql.jar"))[ends-with(.,'.class') and not(contains(.,'$'))] 
  return replace(replace($e,'\.class$',''),'/','.')
=> 
   "net.sf.saxon.option.sql.SQLClose", 
   "net.sf.saxon.option.sql.SQLColumn", 
   "net.sf.saxon.option.sql.SQLConnect",
   ....,
   "net.sf.saxon.option.sql.SQLUpdate" 

4 Describing archives and entries

The properties of overall archives and individual entries at the XDM level are described by small structured elements, with optional information attached. In common with description of serialization parameters, these i) use child elements as the property key and ii) place scalar values as the @value attribute of that child.

4.1 Archive properties and options

Archive options and properties are described as a structured element (element(arch:options)) with the following child elements, all of whose values are described in their @value attribute:

  • arch:format: the type of the archive, e.g. "zip". This is mandatory.

  • arch:algorithm: the default compression used in the archive, e.g. "deflate".

Other properties may be available, depending upon the type of the archive and the implementation.

4.2 Entry descriptions

Entries within the archive can be accessed by name (xs:string) or a structured element (element(arch:entry)). In the latter case the entry name is the value of the @value attribute of the arch:name child.

When describing an existing entry in an archive, element(arch:entry) may be returned with the following (optional) children, all of whose values are described in the @value attribute:

  • arch:name: the (path) name of the entry. REQUIRED

  • arch:size: the original file size of the entry.

  • arch:compressed-size: the compressed file size of the entry, i.e. the number of bytes it occupies in the archive.

  • arch:last-modified: the date of last modification of this entry, in xs:dateTime notation.

  • arch:compression-level: an indicator of the level of (lossless?) compression.

When used to create or update an entry in an archive, element(arch:entry) may also have the following (optional) children:

  • arch:name: the (path) name of the entry. REQUIRED

  • arch:last-modified: the date of last modification to be written on this entry, in xs:dateTime notation.

  • arch:compression-level: the level of (lossless?) compression to be used in writing the entry into the archive.

  • arch:encoding: the encoding to be used for converting textual items to a byte sequence, prior to possible compression and writing to the archive. The only values which every implementation is required to recognize are utf-8 and utf-16.

(In writing actions, unknown children are ignored. In the case of duplicate children, the value of the first child is taken.)
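
For illustration, an entry description for a writing action could be built with a direct element constructor along these lines. The entry name and date are made-up sample values, and arch:compression-level is left out because this document does not define its permitted values.

```xquery
declare namespace arch = "http://expath.org/ns/archive";

(: A sample entry description; every property is carried in @value. :)
<arch:entry>
  <arch:name value="OEBPS/Text/chap01.xhtml"/>
  <arch:last-modified value="2014-05-12T09:30:00"/>
  <arch:encoding value="utf-8"/>
</arch:entry>
```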

5 Loading and saving archives

This module defines no specific functions for reading and writing archives from files, as distinct from their binary data. The EXPath File Module [EXPath File] provides two suitable functions to do this: file:read-binary and file:write-binary.

Note:

The functions detailed in 10.2 arch:to-files and 10.1 arch:from-files may be used to transfer between file system directory trees and archives in a single operation.
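
A minimal sketch of a load, modify and save round trip is shown below. The two file paths are hypothetical, and the sketch assumes that the processor provides this module and [EXPath File] under the declared namespaces.

```xquery
declare namespace arch = "http://expath.org/ns/archive";
declare namespace file = "http://expath.org/ns/file";

(: Hypothetical input and output paths. :)
declare variable $in  as xs:string := "in.zip";
declare variable $out as xs:string := "out.zip";

(: Load the archive as binary data, drop one entry, and save the result. :)
let $archive := file:read-binary($in)
let $trimmed := arch:delete($archive, "lumley.jpg")
return file:write-binary($out, $trimmed)
```

Note that arch:delete raises [arch:unknown-entry] if the named entry is not present, so a caller may wish to check arch:entry-names first.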

6 Information about an archive and its contents

6.1 arch:options

Summary

Returns a description of the type and properties of a given archive.

Signature

arch:options($archive as xs:base64Binary) as element(arch:options)*

Rules

The description is returned as an element <arch:options> with an unordered sequence of child elements describing the details. The following are currently supported:

  • arch:format: format of this archive
  • arch:algorithm: the compression algorithm that was used.

If the archive format supports a compression algorithm varying on a per-entry basis, and more than one algorithm has been used in the archive, mixed is returned for arch:algorithm.

Error Conditions

[arch:read-error] is raised if there is an unspecified problem in reading the archive.

Examples

Finding the properties of the archive stored in a file located at $uri:

arch:options(file:read-binary($uri))
=> <arch:options>
     <arch:format value="ZIP"/>
     <arch:algorithm value="deflate"/>
   </arch:options>

6.2 arch:entry-names

Summary

Returns the entry names for all the entries found within the archive as a sequence of string values in the order in which they appear in the archive.

Signature

arch:entry-names($archive as xs:base64Binary) as xs:string*

Rules

Returns the entry names for all the entries found within the archive as a sequence of string values in the order in which they appear in the archive.

Error Conditions

[arch:read-error] is raised if there is an unspecified problem in reading the archive.
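
Examples

As a hedged illustration, the following sketch lists the names of the apparent files (as opposed to directory entries) below a tests/ prefix. The prefix and the external variable $uri are assumptions made for the example.

```xquery
declare namespace arch = "http://expath.org/ns/archive";
declare namespace file = "http://expath.org/ns/file";

(: Location of the archive file, supplied by the caller. :)
declare variable $uri as xs:string external;

let $names := arch:entry-names(file:read-binary($uri))
(: Keep names under tests/ that do not look like directory entries. :)
return $names[starts-with(., 'tests/') and not(ends-with(., '/'))]
```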

6.3 arch:entries

Summary

Returns the set of entry descriptors for all the entries found within the archive.

Signature

arch:entries($archive as xs:base64Binary) as element(arch:entry)*

Rules

Each descriptor is an element <arch:entry> whose arch:name child gives, in its @value attribute, the path of the file within the archive. For more details of this structure see 4.2 Entry descriptions.

The entries are returned in the order in which they are encountered serially within the archive.

Error Conditions

[arch:read-error] is raised if there is an unspecified problem in reading the archive.

Notes

There may be a case for providing a sorted version, probably using some form of collation.

Examples

Finding the entries of the archive stored in a file located at $uri:

arch:entries(file:read-binary($uri))
=> <arch:entry>
      <arch:name value="build.xml"/>
      <arch:size value="2194"/>
      <arch:compressed-size value="652"/>
      <arch:last-modified value="2013-07-18T11:22:12"/>
   </arch:entry>
   <arch:entry>
      <arch:name value="lumley.jpg"/>
      <arch:size value="84983"/>
      <arch:compressed-size value="84872"/>
      <arch:last-modified value="2009-03-23T11:15:06"/>
   </arch:entry>
   <arch:entry>
      <arch:name value="tests/qt3/binary/binary.xml"/>
      <arch:size value="10058"/>
      <arch:compressed-size value="1381"/>
      <arch:last-modified value="2013-08-06T13:14:08"/>
   </arch:entry>

Summing the sizes of the apparent XML files in the previous example:

sum(arch:entries(file:read-binary($uri))[ends-with(arch:name/@value,'.xml')]/arch:size/@value)
=> 12252
     

7 Extracting entries from an archive

The module does not attempt to discern the 'type' of an entry (such as 'text', 'XML', 'raw-binary'), leaving that to the programmer. Two forms of reading result are supported: raw binary (xs:base64Binary) and decoded text (xs:string).

7.1 arch:extract-binary

Summary

Returns the sequence of requested entries from the archive as binary data.

Signature

arch:extract-binary($archive as xs:base64Binary,
$entries as xs:string*) as xs:base64Binary*

Rules

Returns as binary data each entry in the archive $archive that corresponds to the entry name input, in sequence.

The entries must be returned in the order corresponding to that of the entries requested in $entries, not in the order in which they may exist in the archive.

Multiple requests for the same entry will be honoured, with copies of the entry appearing in corresponding multiple locations in the output sequence.

Error Conditions

[arch:unknown-entry] is raised if an entry requested does not exist in this archive.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

There have been suggestions for a signature arch:extract-binary($archive as xs:base64Binary) returning all the entries. In the absence of maps in the return type, this does not make sense, since the entries are totally unlabelled, and to get anything meaningful, a parallel call on arch:entries() would be required.

Examples

Returning the binary data for an entry in the archive stored in a file located at $uri:

arch:extract-binary(file:read-binary($uri),'build.xml')
=> stuff
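
The ordering and duplication rules above can be illustrated with a short sketch. It assumes that both named entries exist in the archive at $uri; otherwise [arch:unknown-entry] would be raised.

```xquery
declare namespace arch = "http://expath.org/ns/archive";
declare namespace file = "http://expath.org/ns/file";

declare variable $uri as xs:string external;

let $archive := file:read-binary($uri)
(: Request one entry twice, in an order unrelated to the archive order. :)
let $data := arch:extract-binary(
               $archive,
               ('tests/qt3/binary/binary.xml', 'build.xml', 'tests/qt3/binary/binary.xml'))
(: One result per request; items 1 and 3 come from the same entry. :)
return (count($data), $data[1] eq $data[3])
```

Under the rules above, the count should be 3 and the first and third items should compare equal.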
     

7.2 arch:extract-text

Summary

Returns the sequence of requested entries from the archive as strings. If $encoding is specified the strings are decoded appropriately, otherwise UTF-8 encoding is assumed.

Signatures

arch:extract-text($archive as xs:base64Binary,
$entries as xs:string*) as xs:string*
arch:extract-text($archive as xs:base64Binary,
$entries as xs:string*,
$encoding as xs:string) as xs:string*

Rules

Returns as a string each entry in the archive $archive that corresponds to the entry name input, in sequence.

If $encoding is specified the strings are decoded appropriately, otherwise UTF-8 encoding is assumed.

The entries must be returned in the order corresponding to that of the entries requested in $entries, not in the order in which they may exist in the archive.

Multiple requests for the same entry will be honoured, with copies of the entry appearing in corresponding multiple locations in the output sequence.

Error Conditions

[arch:unknown-entry] is raised if an entry requested does not exist in this archive.

[arch:unknown-encoding] is raised if the encoding requested is unknown or unsupported.

[arch:decoding-error] is raised if there was an error in decoding the entry.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

This function should be equivalent to the use of arch:extract-binary() and the function bin:decode-string() from [EXPath Binary]:

arch:extract-binary($archive,$entries) ! bin:decode-string(.,$encoding) [XPath 3.0]
for $b in arch:extract-binary($archive,$entries) return bin:decode-string($b,$encoding) [XPath 2.0]

Further conversion into XML can be achieved using the XPath 3.0 function fn:parse-xml() on each of the returned strings.

There have been suggestions for a signature arch:extract-text($archive as xs:base64Binary) returning all the entries. In the absence of maps in the return type, this does not make sense, since the entries are totally unlabelled, and to get anything meaningful, a parallel call on arch:entries() would be required.

Examples

Returning the text data for an entry in the archive stored in a file located at $uri:

arch:extract-text(file:read-binary($uri),'build.xml','UTF-8')
=> stuff
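
Because the extracted strings are unlabelled, a caller that wants each result paired with its entry name can iterate over the names. The sketch below parses each apparent XML entry with the XPath 3.0 function fn:parse-xml(). The wrapper element name entry is a hypothetical choice, and the default UTF-8 decoding is assumed to be appropriate for every entry.

```xquery
declare namespace arch = "http://expath.org/ns/archive";
declare namespace file = "http://expath.org/ns/file";

declare variable $uri as xs:string external;

let $archive := file:read-binary($uri)
for $name in arch:entry-names($archive)[ends-with(., '.xml')]
return
  (: Wrap each parsed document element together with its entry name. :)
  <entry name="{$name}">{
    fn:parse-xml(arch:extract-text($archive, $name))/*
  }</entry>
```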
     

8 Updating entries in an archive

There are two atomic actions available to change entries within an archive: complete deletion of an entry, or complete updating (overwriting) of that entry – the latter adds new entries when the given name does not already exist in the archive.

8.1 arch:delete

Summary

Returns an archive with the given entries deleted.

Signature

arch:delete($archive as xs:base64Binary,
$entries as xs:string*) as xs:base64Binary

Rules

Returns an archive of the same format as $archive with all the entries named in $entries deleted.

The relative order of the remaining entries within the archive is preserved.

The uncompressed content, size and last-modified date of the remaining entries shall be the same as those for those entries before deletion. Compressed sizes may alter.

Duplicate entries in $entries are ignored.

If $entries is the empty sequence, the original archive shall be returned.

Error Conditions

[arch:unknown-entry] is raised if an entry requested for deletion does not exist in this archive.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

Whilst the uncompressed entries remaining after deletion should of course be the same size and content as those before deletion, depending upon the (lossless) compression algorithm used, the compressed sizes and content might not be. In the absence of a special check, in these circumstances $archive may not be identical to arch:delete($archive,()). This needs discussion.

Examples

Deleting an entry from the archive stored in a file located at $uri and listing the remaining entries:

arch:entries(arch:delete(file:read-binary($uri),'lumley.jpg'))
=> <arch:entry>
      <arch:name value="build.xml"/>
      <arch:size value="2194"/>
      <arch:compressed-size value="652"/>
      <arch:last-modified value="2013-07-18T11:22:12"/>
   </arch:entry>
   <arch:entry>
      <arch:name value="tests/qt3/binary/binary.xml"/>
      <arch:size value="10058"/>
      <arch:compressed-size value="1381"/>
      <arch:last-modified value="2013-08-06T13:14:08"/>
   </arch:entry>
     

8.2 arch:update

Summary

Returns an archive with each of the given entries in $entries updated to the corresponding values in the sequence $new. If an entry is not found, a new entry is added to the end of the archive.

Signatures

arch:update($archive as xs:base64Binary,
$entries as xs:string*,
$new as xs:base64Binary*) as xs:base64Binary
arch:update($archive as xs:base64Binary,
$entries as xs:string*,
$new as xs:base64Binary*,
$last-modified as xs:dateTime) as xs:base64Binary

Rules

Returns an archive of the same format as $archive with each of the given entries in $entries updated to the corresponding value in the sequence $new. If an entry is not found, a new entry for it is added to the end of the archive.

The relative order of all the existing and replaced entries within the archive is preserved. New entries appear at the end of the archive in the order in which they were specified in the call.

If specified, and the format supports it, the last-modified date for each of the updated entries will be set to $last-modified. In the absence of such a parameter, it is implementation-dependent whether last-modified information will be written on the updated entries. If such default last-modification is written, it should be comparable to the value of fn:current-dateTime() in an XSLT environment.

The uncompressed content, size and last-modified date of the entries that are not updated shall be the same as those for those entries before the update. Compressed sizes may alter.

The compression methods of the updated entries shall be preserved.

When duplicate names appear in the entry list, the value of the entry in the resulting archive will be that of the value of $new corresponding to the last matching entry name.

Error Conditions

[arch:entry-data-mismatch] is raised if count($entries) ne count($new).

[arch:read-error] is raised if there was an unspecified problem in reading or creating the archive.
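
Examples

The following sketch replaces one entry and adds another in a single call. The entry names, the replacement content and the timestamp are illustrative assumptions; arch:xml and arch:text are the convenience functions defined in 11 Convenience functions.

```xquery
declare namespace arch = "http://expath.org/ns/archive";
declare namespace file = "http://expath.org/ns/file";

declare variable $uri as xs:string external;

let $archive := file:read-binary($uri)
let $updated :=
  arch:update(
    $archive,
    ('build.xml', 'README.txt'),                 (: entry names :)
    (arch:xml(<project name="demo"/>),           (: content for build.xml :)
     arch:text('Read me first.')),               (: content for README.txt :)
    xs:dateTime('2014-05-12T12:00:00'))          (: last-modified for both :)
return arch:entry-names($updated)
```

If build.xml already exists and README.txt does not, the rules above imply that build.xml keeps its position and README.txt appears at the end of the returned names.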

9 Creating an archive

New archives can be created in empty or filled states.

9.1 arch:create

Summary

Returns a new archive with each of the given entries in $entries set to the corresponding values in the sequence $new.

Signatures

arch:create($entries as xs:string*, $new as xs:base64Binary*) as xs:base64Binary
arch:create($entries as xs:string*,
$new as xs:base64Binary*,
$options as element(arch:options)) as xs:base64Binary

Rules

Returns an archive of the format specified by $options with each of the given entries in $entries set to the corresponding value in the sequence $new.

The relative order of new entries within the archive follows that of the input.

Content provided for any entry considered to be a directory is ignored.

When duplicate names appear in the entry list, the value of the entry in the resulting archive will be that of the value of $new corresponding to the last matching entry name.

Error Conditions

[arch:entry-data-mismatch] is raised if count($entries) ne count($new).

[arch:read-error] is raised if there was an unspecified problem in reading or creating the archive.
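
Examples

A minimal sketch of the three-argument form is given below, using the option elements described in 4.1 Archive properties and options. The entry names and content are made up for the example.

```xquery
declare namespace arch = "http://expath.org/ns/archive";

(: Request a ZIP archive using the deflate algorithm by default. :)
let $options :=
  <arch:options>
    <arch:format value="zip"/>
    <arch:algorithm value="deflate"/>
  </arch:options>
return
  arch:create(
    ('hello.txt', 'data/greeting.xml'),
    (arch:text('Hello, world'), arch:xml(<greeting>Hello</greeting>)),
    $options)
```

The two sequences are matched by position, so each name must have exactly one corresponding content item.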

10 Creating and extracting complete archives from and to file systems

10.1 arch:from-files

Summary

Collects all the binary file contents from $files and writes them into a new archive which is returned.

Signature

arch:from-files($files as xs:string*) as xs:base64Binary

Rules

Collects all the binary file contents from $files and writes them into a new archive which is returned.

All file content is collected in binary mode, with no attempt at any conversion or decoding.

File and directory path names are normalized to use the solidus ('/') path separator.

Directories are written as empty entries.

Error Conditions

Error conditions from [EXPath File] may be raised if there are problems on reading from the filesystem.

Notes

This function should be equivalent to the following XSLT function:

<xsl:function name="arch:from-files" as="xs:base64Binary">
    <xsl:param name="files" as="xs:string*"/>
    <xsl:variable name="all" as="xs:string*"
      select="for $f in $files return 
                  if(file:is-dir($f)) 
                  then (for $f1 in file:list($f,true()) return concat($f,$f1)) 
                  else $f"/>
    <xsl:variable name="normalized.names" select="for $n in $all return replace($n,'\\','/')"/>
    <xsl:variable name="content" as="xs:base64Binary*"
      select="for $f in $normalized.names return 
                  if(file:is-dir($f)) 
                  then xs:base64Binary('') 
                  else file:read-binary($f)"/>
    <xsl:sequence select="arch:create($normalized.names,$content)"/>
</xsl:function>

This function may be provided by an XSLT package (which will probably use functions from [EXPath File], and from which appropriate error conditions may be propagated, or caught within the package) or by a purpose-built extension function that may be able to support such an operation within a context of streaming processing.

10.2 arch:to-files

Summary

Extracts all the entries from $archive and writes them into an equivalent tree of directories and files in the filesystem at the current directory.

Signature

arch:to-files($archive as xs:base64Binary) as ()

Rules

Extracts all the entries from $archive and writes them into an equivalent tree of directories and files in the filesystem at the current directory.

All entries are written in binary mode, with no attempt at any conversion or decoding.

Entry names are considered as file paths, with '/' and '\' separators normalized to the path separator for the execution operating system.

Necessary intermediate directories are created.

Error Conditions

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Error conditions from [EXPath File] may be raised if there are problems on writing to the filesystem, most notably:

  • [file:exists] is raised if the specified path, or any of its parent directories, points to an existing file.
  • [file:no-dir] is raised if the parent directory of an entry does not exist. (This should not happen.)
  • [file:is-dir] is raised if an entry is being written on to an existing directory.
  • [file:io-error] is raised if any other error occurs.

Notes

This function should be equivalent to the following XSLT function:

<xsl:function name="arch:to-files">
    <xsl:param name="archive" as="xs:base64Binary"/>
    <!-- Work on entry names (strings), so that string tests and distinct-values apply. -->
    <xsl:variable name="entries" as="xs:string*" select="arch:entry-names($archive)"/>
    <xsl:variable name="dirs" select="$entries[ends-with(.,'/')]"/>
    <xsl:variable name="files" select="$entries[not(ends-with(.,'/'))]"/>
    <xsl:variable name="required.dirs"
        select="distinct-values(for $r in $files return
                                    replace($r,'/[^/]+$','/'))[ends-with(.,'/')]"/>
    <xsl:sequence
        select="for $d in distinct-values(($required.dirs,$dirs)) return 
                    file:create-dir(replace($d,'/$',''))"/>
    <xsl:sequence
        select="for $f in $files return 
                    file:write-binary($f,arch:extract-binary($archive,$f))"/>
</xsl:function>

This function may be provided by an XSLT package (which will probably use functions from [EXPath File], and from which appropriate error conditions may be propagated, or caught within the package) or by a purpose-built extension function that may be able to support such an operation within a context of streaming processing.

11 Convenience functions

A small number of convenience functions are defined for common cases of content, specifically to ensure that 'empty' entries (empty binary data) are produced for empty sequences, to ensure coherence between members of the parallel entry name and entry content sequences.

11.1 arch:text

Summary

Encodes a string into binary data using a given encoding, suitable for content data for an entry.

Signatures

arch:text($in as xs:string*) as xs:base64Binary
arch:text($in as xs:string*, $encoding as xs:string) as xs:base64Binary

Rules

The $encoding argument is the name of an encoding. The values for this attribute follow the same rules as for the encoding attribute in an XML declaration. The only values which every implementation is required to recognize are utf-8 and utf-16.

If $encoding is omitted, utf-8 encoding is assumed.

If the value of $in is the empty sequence, the function returns empty binary data. This is unlike bin:encode-string(), which will return an empty sequence.

Error Conditions

[arch:unknown-encoding] is raised if $encoding is invalid or not supported by the implementation.

[error.encoding] is raised if there is an error or malformed input while encoding the string. Additional information about the error may be passed through suitable error reporting mechanisms – this is implementation-dependent.
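
Examples

The sketch below illustrates why the empty-sequence behaviour matters when building the parallel name and content sequences for arch:create. The entry names and text are assumptions made for the example.

```xquery
declare namespace arch = "http://expath.org/ns/archive";

let $names := ('notes.txt', 'empty.txt')
(: arch:text(()) yields empty binary data rather than an empty sequence,
   so the content sequence keeps one item per name. :)
let $content := (arch:text('some text'), arch:text(()))
return arch:create($names, $content)
```

Had bin:encode-string(()) been used for the second item, the content sequence would contain only one item and [arch:entry-data-mismatch] would be raised.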

11.2 arch:xml

Summary

Encodes the serialization of an XML tree into binary data using a given encoding, suitable for content data for an entry.

Signatures

arch:xml($args as item()*) as xs:base64Binary
arch:xml($args as item()*,
$params as element(output:serialization-parameters)?) as xs:base64Binary
arch:xml($args as item()*,
$params as element(output:serialization-parameters)?,
$encoding as xs:string) as xs:base64Binary

Rules

The single-argument version of this function has the same effect as the two-argument version called with $params set to an empty sequence. This in turn is the same as the effect of passing an output:serialization-parameters element with no child elements.

The $params argument is used to identify a set of serialization parameters. These are supplied in the form of an output:serialization-parameters element, having the format described in Section 3.1 Setting Serialization Parameters by Means of a Data Model Instance.

The $encoding argument is the name of an encoding. The values for this attribute follow the same rules as for the encoding attribute in an XML declaration. The only values which every implementation is required to recognize are utf-8 and utf-16.

If $encoding is omitted, utf-8 encoding is assumed.

Error Conditions

[arch:unknown-encoding] is raised if $encoding is invalid or not supported by the implementation.

[error.encoding] is raised if there is an error or malformed input while encoding the string. Additional information about the error may be passed through suitable error reporting mechanisms – this is implementation-dependent.

Notes

This function is equivalent to arch:text(fn:serialize($args,$params),$encoding).
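
The two-step composition in this note (serialize, then encode) can be modeled outside XPath. The following is a minimal Python sketch, not part of this module: xml_to_binary is a hypothetical name, serialization parameters are not modeled, and Python's own serializer and codec exceptions stand in for fn:serialize and the error conditions above.

```python
import xml.etree.ElementTree as ET

def xml_to_binary(element, encoding="utf-8"):
    """Hypothetical model of the composition: serialize first, then encode."""
    # Step 1: serialize the tree to a string (stand-in for fn:serialize).
    text = ET.tostring(element, encoding="unicode")
    # Step 2: encode the string to bytes (stand-in for arch:text).
    # An unknown encoding name raises LookupError here; a character that
    # cannot be represented raises UnicodeEncodeError.
    return text.encode(encoding)
```

In this model an unsupported encoding name and an unencodable character fail in different ways, which mirrors the two separate error conditions listed for the function.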

12 Functions using XSLT3.0 map() type

The map type (map(xs:untypedAtomic,item()*)) proposed for XSLT 3.0 can increase the coherence of the functions in this module significantly, mainly by retaining the structured connection between the entry name and its properties and content. In addition, the properties of the overall archive (and its defaults for new entries) can similarly be defined in a single map.

This section defines optional parallel functions to those above using maps for arguments or results. In general these functions have separate names (e.g. arch:entries-map()) derived from a consistent suffix ('-map') attached to the standard, element-based form.

Note:

map:keys($map as map(*)) as xs:anyAtomicType* returns the keys that are present in a map, in unpredictable order. This means that if order within an archive is important (either in extraction or updating) other mechanisms, such as the position property, are needed to track or set that order.

Note:

It should be possible to implement all the functions in this section as user-defined XSLT 3.0 functions using the library described above.

Note:

FOR DISCUSSION. In general, when using maps for denoting the entries to be manipulated, the arguments could be considered to be a (possibly empty) sequence of maps that are treated as if concatenated. [THIS NEEDS THOUGHT ABOUT OVERWRITING/MERGING COMMON KEYS]. In this draft the arguments are single maps.

12.1 Using map types to describe entries and options

An archive is described as a map name -> properties, where the properties of each entry are themselves represented as a further map. The 'content', i.e. the real data, of an archive entry is described by the content property of that map. Thus a set of archive entries has type map(xs:string, map(xs:string,item()*)).

Support for similar approaches using other map representations, such as [JSONiq] objects, may be implementation-dependent.

12.1.1 Archive property maps

The properties of an archive itself, as opposed to its entries, can be described or defined with a map with the following entries:

Property      Type        Meaning
format        xs:string   The format of this archive
compression   xs:string   The compression algorithm used for compressing the archive.

Note:

Using a reserved name within an overall map (such as arch:options) would allow the options/properties for an archive to be stored alongside the entries themselves.

12.1.2 Entry property maps

Entries within the archive can also be accessed or described by entries in a map (map(xs:string,map(xs:string,item()*))). In this case the map key gives the (path)name of the archive entry (e.g. build/build-j.xml) and the value is a map of the properties of that entry.

The keys are described in the following table, and specific use is described under each of the functions:

Property            Type                          Meaning
size                xs:integer                    The original file size of the entry
compressed-size     xs:integer                    The compressed file size of the entry, i.e. the number of bytes it occupies in the archive
last-modified       xs:dateTime                   The date of last modification of this entry
compression-level   xs:string                     An indicator of the level of (lossless?) compression
content             xs:base64Binary or xs:string  The value of the entry read from the archive. This will only be set from arch:entries-map() if $return-content is requested in the call.
encoding            xs:string                     The encoding to be used for converting textual items to or from a byte sequence. The absence of such an entry implies binary content. The only values which every implementation is required to recognize are utf-8 and utf-16
position            xs:integer                    The position of the entry in the archive, starting at 1.

12.2 arch:options-map

Summary

Returns a description of the type and properties of a given archive as a map.

Signature

arch:options-map($archive as xs:base64Binary) as map(xs:string,item()?)

Rules

The description is returned as a map map(xs:string,item()?) with entries describing the details. The following are currently supported:

  • format: format of this archive
  • compression: the compression algorithm that was used.

If the archive format supports a compression algorithm varying on a per-entry basis, and more than one algorithm has been used in the archive, mixed is returned for the compression entry.
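
The 'mixed' rule can be illustrated with a short Python sketch over ZIP archives, using the standard zipfile module. This is not part of the module: compression_summary, build_zip and the label strings 'stored' and 'deflate' are assumptions made for the illustration, and the result for an empty archive is left undefined here.

```python
import io
import zipfile

# Hypothetical labels for this sketch only.
_LABELS = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflate"}

def compression_summary(archive_bytes):
    """Return one label for the whole archive, or 'mixed' if entries differ."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        methods = {info.compress_type for info in zf.infolist()}
    if len(methods) > 1:
        return "mixed"
    if not methods:
        return None  # empty archive: not covered by this sketch
    return _LABELS.get(methods.pop(), "unknown")

def build_zip(entries):
    """Build a small ZIP in memory from (name, data, method) triples."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data, method in entries:
            zf.writestr(name, data, compress_type=method)
    return buf.getvalue()
```

An archive built with one stored and one deflated entry reports 'mixed', whereas one built with only deflated entries reports the single label.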

Error Conditions

[arch:read-error] is raised if there is an unspecified problem in reading the archive.

Examples

Finding the properties of the archive stored in a file located at $uri:

arch:options-map(file:read-binary($uri))
=> map{'format' : 'zip', 'compression' : 'deflate'}

12.3 arch:entries-map

Summary

Returns the entry descriptors for all the entries found within the archive as a map, optionally each with their content.

Signatures

arch:entries-map($archive as xs:base64Binary) as map(xs:string,map(xs:string,item()*))
arch:entries-map($archive as xs:base64Binary,
$return-content as xs:boolean) as map(xs:string,map(xs:string,item()*))

Rules

Keys to the returned map are the entry (path) names.

The value for each map entry is a map describing the properties of that entry. For more details of this structure see 12.1.2 Entry property maps. The specific properties returned are:

  • size
  • compressed-size
  • last-modified
  • position
  • content: this will be set only if $return-content is supplied and is true(). The type will be xs:base64Binary.

Error Conditions

[arch:read-error] is raised if there is an unspecified problem in reading the archive.

Notes

Because the order of keys returned by map:keys() is implementation-dependent, the result of the function arch:entry-names(xs:base64Binary) as xs:string* can be used as a key sequence to iterate through this map in archive order. Alternatively, the keys can be sorted on the position property.

Using $return-content makes it possible to return a complete archive in a single call. Archive options can be added by building a compound map, as shown in the examples.
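
The role of the position property can be sketched in Python for ZIP archives with the standard zipfile module. This is an illustration under assumptions, not this module's API: entries_map and names_in_archive_order are hypothetical names, and a Python dict of dicts stands in for the map of maps.

```python
import io
import zipfile
from datetime import datetime

def entries_map(archive_bytes):
    """Model of an entries map: name -> properties, including position."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # infolist() yields entries in archive order, so enumerate gives
        # a 1-based position for each entry.
        for position, info in enumerate(zf.infolist(), start=1):
            result[info.filename] = {
                "size": info.file_size,
                "compressed-size": info.compress_size,
                "last-modified": datetime(*info.date_time),
                "position": position,
            }
    return result

def names_in_archive_order(entries):
    """Recover archive order from position, not from key order."""
    return sorted(entries, key=lambda name: entries[name]["position"])
```

Sorting on position gives the same sequence regardless of how the map implementation happens to order its keys.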

Examples

Finding the entries of the archive stored in a file located at $uri:

arch:entries-map(file:read-binary($uri))
=> map{ 
  "build.xml" : map{ "size" : 2194, "compressed-size" : 652, "last-modified" : "2013-07-18T11:22:12"},
  "lumley.jpg" : map{ "size" : 84983, "compressed-size" : 84872, "last-modified" : "2009-03-23T11:15:06"},
  "tests/qt3/binary/binary.xml" : map{ "size" : 10058, "compressed-size" : 1381, "last-modified" : "2013-08-06T13:14:08"}}
     

Counting the number of apparent XML files in the previous example:

count(map:keys(arch:entries-map(file:read-binary($uri)))[ends-with(.,'.xml')])
=> 2
     

Returning an archive complete with options:

map:new((map:entry('arch:options',arch:options-map($archive)),arch:entries-map($archive)))
=> map{
  "arch:options" : map{ "format" : "ZIP", "compression" : "flat" },
  "build.xml" : map{ "size" : 2194, "compressed-size" : 652, "last-modified" : "2013-07-18T11:22:12"},
  "lumley.jpg" : map{ "size" : 84983, "compressed-size" : 84872, "last-modified" : "2009-03-23T11:15:06"},
  "tests/qt3/binary/binary.xml" : map{ "size" : 10058, "compressed-size" : 1381, "last-modified" : "2013-08-06T13:14:08"}}
     

12.4 arch:extract-map

Summary

Returns a copy of $entries with the content entries set to binary or decoded string data for the appropriate entry in the archive.

Signature

arch:extract-map($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()?))) as map(xs:string,map(xs:string,item()?))

Rules

Return a copy of $entries with the content property of each entry set to binary or decoded string data for the appropriate entry in the archive.

The map entries in $entries define whether binary or decoded string data is to be returned. (For details of properties see 12.1.2 Entry property maps.) The only relevant property is:

  • encoding: if this is set, then the entry will be decoded from binary to xs:string according to the named encoding. If absent, then the type will be xs:base64Binary.

The value for each map entry in the return is the original entry from $entries plus an additional or replaced property:

  • content: the type will be xs:string or xs:base64Binary dependant upon the presence of the encoding property.

The behaviour of this function is defined by equivalent XPath:

map:new(for $k in map:keys($entries)
   return
     let $a := $entries($k),
         $text := map:contains($a,'encoding'),
         $encoding := ($a('encoding'),'UTF-8')[1],
         $data := arch:extract-binary($archive,$k) (: error if not found :)
     return
         map:entry($k,
             map:new(($a,
               map:entry('content',if ($text) then bin:decode-string($data,$encoding) else $data)
               ))
         )
   )
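
The same behaviour can be modeled in Python for ZIP archives, as a sketch under assumptions rather than a definition: extract_map is a hypothetical name, dicts stand in for maps, bytes and str stand in for xs:base64Binary and xs:string, and LookupError stands in for arch:unknown-entry.

```python
import io
import zipfile

def extract_map(archive_bytes, entries):
    """Copy each requested entry's properties and add a 'content' property."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name, props in entries.items():
            try:
                data = zf.read(name)
            except KeyError:
                # Stand-in for arch:unknown-entry.
                raise LookupError(f"unknown entry: {name}") from None
            if "encoding" in props:
                content = data.decode(props["encoding"])  # text entry
            else:
                content = data  # binary entry
            # Original properties are kept; 'content' is added or replaced.
            result[name] = {**props, "content": content}
    return result
```

Because the original property dict is copied into the result, properties such as position survive the call unchanged.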
     
Error Conditions

[arch:unknown-entry] is raised if an entry requested does not exist in this archive.

[arch:decoding-error] is raised if there was an error in decoding an entry.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

As the original $entries are returned in the result map, with content added, other information, such as position, is retained.

Examples

To collect all the XML entries as XML:

let $archive := file:read-binary($uri),
    $entries := arch:entries-map($archive),
    $xml-names := map:keys($entries)[ends-with(.,'.xml')],
    $get := map:new($xml-names ! map:entry(.,map:entry('encoding','UTF-8'))),
    $content := arch:extract-map($archive,$get)
return
    $xml-names ! fn:parse-xml($content(.)('content'))
     

12.5 arch:extract-binary-map

Summary

Returns the sequence of requested entries from the archive as binary data.

Signature

arch:extract-binary-map($archive as xs:base64Binary,
$entries as map(xs:string,item()*)) as xs:base64Binary*

Rules

Returns as binary data each entry in the archive $archive that corresponds to map:keys($entries), in sequence.

Any information in the values of each entry of $entries is ignored.

Error Conditions

[arch:unknown-entry] is raised if an entry requested does not exist in this archive.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

Collection of all the entries as binary data can also be accomplished using arch:entries-map($archive,true()) and collecting the 'content' entry from each of the returned maps.

12.6 arch:extract-text-map

Summary

Returns the sequence of requested entries from the archive as decoded string data.

Signatures

arch:extract-text-map($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()?))) as xs:string*
arch:extract-text-map($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()?)),
$encoding as xs:string) as xs:string*

Rules

Returns as decoded string data each entry in the archive $archive that corresponds to map:keys($entries), in sequence.

If the property encoding appears in the entry in $entries, the entry is decoded according to that encoding; otherwise $encoding is used if it is specified; otherwise UTF-8 encoding is assumed.

The behaviour of this function is defined by equivalent XPath:

for $k in map:keys($entries) 
   return 
     let $a := $entries($k),
         $thisEncoding := ($a('encoding'),$encoding,'UTF-8')[1],
         $data := arch:extract-binary($archive,$k) (: error if not found :)
     return 
         bin:decode-string($data,$thisEncoding)
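
The encoding precedence in the expression above (per-entry property, then the $encoding argument, then UTF-8) can be sketched in Python. This is a model under assumptions: effective_encoding and extract_text are hypothetical names, and a plain dict of name to bytes stands in for a real archive.

```python
def effective_encoding(props, default_encoding=None):
    """First available of: entry's own encoding, the default, then UTF-8."""
    for candidate in (props.get("encoding"), default_encoding, "UTF-8"):
        if candidate is not None:
            return candidate

def extract_text(archive, entries, default_encoding=None):
    """archive: dict of name -> bytes, standing in for a real archive."""
    return [
        archive[name].decode(effective_encoding(props, default_encoding))
        for name, props in entries.items()
    ]
```

An entry that carries its own encoding property is therefore unaffected by the default supplied for the call as a whole.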
     
Error Conditions

[arch:unknown-entry] is raised if an entry requested does not exist in this archive.

[arch:unknown-encoding] is raised if an encoding requested is unknown or unsupported.

[arch:decoding-error] is raised if there was an error in decoding an entry.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

12.7 arch:create-map

Summary

Returns a new archive with each of the given entries named as a key in $entries set to the corresponding value in $entries($key)('content').

Signatures

arch:create-map($entries as map(xs:string,map(xs:string,item()*))) as xs:base64Binary
arch:create-map($entries as map(xs:string,map(xs:string,item()*)),
$options as map(xs:string,item()*)) as xs:base64Binary

Rules

Returns an archive of format specified by $options with each of the given entries named as a key in $entries set to the corresponding value in $entries($key)('content'), and with other properties defined by $entries($key)(*) or $options.

The map $options can contain properties both for the archive itself, and defaults for each entry. Relevant properties for the archive (see also 12.1.1 Archive property maps) are:

  • format
  • compression

Relevant properties for entries (see also 12.1.2 Entry property maps) are:

  • compression-level
  • last-modified
  • position: position order for entries. These need not be contiguous, but should not be duplicated.
  • encoding. If this is set, then the content entry will be encoded from xs:string to binary according to the named encoding. If absent, then content is assumed of type xs:base64Binary. The only values which every implementation is required to recognize are utf-8 and utf-16.
  • content: the content to write, treated either as xs:string or xs:base64Binary, dependent upon encoding.

The relative order of entries within the archive follows that of the position property, if specified, followed by all those lacking such a property, in an implementation-dependent order. The specific ordering is equivalent to:

<xsl:variable name="keys" select="map:keys($entries)"/>
<xsl:variable name="positioned" as="xs:string*">
  <xsl:perform-sort select="$keys[map:contains($entries(.),'position')]">
    <xsl:sort select="$entries(.)('position')"/>
  </xsl:perform-sort>
</xsl:variable>
<xsl:for-each select="$positioned, $keys[not(.=$positioned)]">
      .... process ....
</xsl:for-each>
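
The ordering rule, together with the duplicate-position check, can also be sketched in Python. This is an illustration under assumptions: creation_order is a hypothetical name, a dict of dicts stands in for $entries, ValueError stands in for arch:duplicate-position, and the unpositioned entries are simply kept in dict order, which is one permitted implementation-dependent choice.

```python
def creation_order(entries):
    """Return entry names in the order they would be written."""
    positions = [p["position"] for p in entries.values() if "position" in p]
    if len(positions) != len(set(positions)):
        # Stand-in for arch:duplicate-position.
        raise ValueError("duplicate position")
    # Positioned entries first, sorted by position (gaps are allowed).
    positioned = sorted(
        (name for name, p in entries.items() if "position" in p),
        key=lambda name: entries[name]["position"],
    )
    # Then entries without a position, in an implementation-chosen order.
    unpositioned = [name for name, p in entries.items() if "position" not in p]
    return positioned + unpositioned
```

For entries with positions 10 and 2 plus one entry without a position, the entry at position 2 is written first, then the one at position 10, then the unpositioned entry.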
     

If $options is specified, the overall archive properties (and defaults for the entries) are set to those specified in the map.

Error Conditions

[arch:read-error] is raised if there was an unspecified problem in creating the archive.

[arch:duplicate-position] is raised if two or more entries request the same position in the archive.

12.8 arch:update-map

Summary

Returns an archive with each of the given entries in the keys of $entries updated to the corresponding values in the $entries($key)('content') and with other properties defined by $entries($key)(*). If an entry is not found, a new entry is added to the end of the archive.

Signatures

arch:update-map($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()*))) as xs:base64Binary
arch:update-map($archive as xs:base64Binary,
$entries as map(xs:string,map(xs:string,item()*)),
$default.options as map(xs:string,item()*)) as xs:base64Binary

Rules

Returns an archive of the same format as $archive with each of the given entries in the keys of $entries updated to the corresponding values in the $entries($key)('content') and with other properties defined by $entries($key)(*) or $default.options. If an entry is not found, a new entry is added to the end of the archive. Relevant properties (see also 12.1.2 Entry property maps) are:

  • compression-level
  • last-modified
  • position: position order for new (as opposed to updated) entries. These need not be contiguous, but should not be duplicated.
  • encoding. If this is set, then the content entry will be encoded from xs:string to binary according to the named encoding. If absent, then content is assumed of type xs:base64Binary. The only values which every implementation is required to recognize are utf-8 and utf-16.
  • content: the content to write, treated either as xs:string or xs:base64Binary, dependent upon encoding.

If $default.options is specified, its values will be used as the default properties for each entry, which may be overridden by the property map for each individual entry.

The relative order of all the existing and replaced entries within the archive is preserved. New entries appear at the end of the archive: any which have a position property specified, are ordered according to that property, followed by any others in an implementation-dependent order.
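
The resulting entry order can be sketched in Python. This is a model under assumptions: updated_order is a hypothetical name, existing_names stands in for the archive's current entry names in order, and new entries without a position are kept in dict order as one permitted implementation-dependent choice.

```python
def updated_order(existing_names, entries):
    """Entry order after an update: existing order kept, new entries appended."""
    existing = set(existing_names)
    # Only names not already in the archive count as new entries.
    new = {name: p for name, p in entries.items() if name not in existing}
    positioned = sorted(
        (name for name, p in new.items() if "position" in p),
        key=lambda name: new[name]["position"],
    )
    unpositioned = [name for name, p in new.items() if "position" not in p]
    # Updated entries stay where they were; position applies to new ones only.
    return list(existing_names) + positioned + unpositioned
```

Note that in this model a position property on an entry that already exists has no effect on ordering, in line with the description of position above as applying to new entries.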

The uncompressed content, size and last-modified date of the entries that are not updated shall be the same as those for those entries before the update. Compressed sizes may alter.

The compression methods of the updated entries shall be preserved.

Error Conditions

[arch:read-error] is raised if there was an unspecified problem in reading or creating the archive.

[arch:duplicate-position] is raised if two or more entries request the same position in the archive.

Notes

Using the $default.options map, a common compression method, last-modification date and similar can be set for a set of entries, whose minimal map entries are map{"content" : $content}.

12.9 arch:delete-map

Summary

Returns an archive with the given entries deleted.

Signature

arch:delete-map($archive as xs:base64Binary,
$entries as map(xs:string,item()*)) as xs:base64Binary

Rules

Returns an archive of the same format as $archive with all the entries named in map:keys($entries) deleted.

The relative order of the remaining entries within the archive is preserved.

The uncompressed content, size and last-modified date of the remaining entries shall be the same as for those entries before deletion. Compressed sizes may alter.

If $entries is an empty map, the original archive shall be returned.

Any information in the values of each entry of $entries is ignored.
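
These rules can be sketched in Python for ZIP archives with the standard zipfile module. This is an illustration under assumptions, not this module's implementation: delete_entries is a hypothetical name, a collection of names stands in for map:keys($entries), and LookupError stands in for arch:unknown-entry.

```python
import io
import zipfile

def delete_entries(archive_bytes, names):
    """Return a new archive without the named entries, preserving order."""
    names = set(names)
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as src:
        missing = names - set(src.namelist())
        if missing:
            # Stand-in for arch:unknown-entry.
            raise LookupError(f"unknown entries: {sorted(missing)}")
        if not names:
            return archive_bytes  # empty request: original archive returned
        with zipfile.ZipFile(out, "w") as dst:
            for info in src.infolist():  # archive order
                if info.filename not in names:
                    # Reusing the ZipInfo keeps the date and compression method.
                    dst.writestr(info, src.read(info.filename))
    return out.getvalue()
```

The explicit early return for an empty request is the kind of special check that keeps the result identical to the input in that case, since rewriting the archive might otherwise change the compressed bytes.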

Error Conditions

[arch:unknown-entry] is raised if an entry requested for deletion does not exist in this archive.

[arch:read-error] is raised if there was an unspecified problem in reading the archive.

Notes

Whilst the uncompressed entries remaining after deletion should of course be the same size and content as those before deletion, depending upon the (lossless) compression algorithm used, the compressed sizes and content might not be. In the absence of a special check, implied in the rules, $archive may not be identical to arch:delete-map($archive,map:new()).

A References

EPUB
EPUB 3 Overview. International Digital Publishing Forum. Recommended Specification 11 October 2011.
EXPath File
File Module. Christian Grün and Matthias Brantner, editors. EXPath Candidate Module. 14 June 2012.
EXPath Binary
Binary Module. Jirka Kosek and John Lumley, editors. EXPath Module. 3 December 2013.
F&O 3.0
XPath and XQuery Functions and Operators 3.0. Michael Kay, editor. W3C Candidate Recommendation 21 May 2013.
GZIP
GZIP file format specification version 4.3. L. Peter Deutsch, 1996.
JSONiq
JSONiq – The JSON Query Language. FLWOR Foundation. 2013.
XML Schema 1.1 Part 2
W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. David Peterson et al, editors. W3C Recommendation 5 April 2012.
ZIP
ZIP File Format Specification. PKWare, Version 6.3.3, 1 September 2012.

B Summary of error conditions

arch:read-error
There was a general error in reading the archive.
arch:unknown-entry
The specified entry does not exist in this archive.
arch:entry-data-mismatch
The sequence of entry names is not the same length as the sequence of updated values.
arch:unknown-encoding
The specified encoding is not supported.
arch:decoding-error
Error in decoding a string.
arch:duplicate-position
Two entries are requesting the same order position in the archive.

Errors possibly generated by code executed from module [EXPath File]:

file:not-found
The specified path does not exist.
file:exists
The specified path already exists.
file:no-dir
The specified path does not point to a directory.
file:is-dir
The specified path points to a directory.
file:io-error
A generic file system error occurred.