<?xml version="1.0" encoding="UTF-8"?>

<?oxygen RNGSchema="http://www.w3.org/XML/1998/06/schema/xmlspec.rng" type="xml"?>

<spec role="editors-copy" xmlns:fg="http://www.fgeorges.org/xmlspec">
   <header>
      <title>HTTP Client Module</title>
      <subtitle>EXPath Candidate Module 9 January 2010</subtitle>
      <w3c-designation>w3c-designation</w3c-designation>
      <w3c-doctype>Candidate</w3c-doctype>
      <pubdate>
         <day>9</day>
         <month>January</month>
         <year>2010</year>
      </pubdate>
      <publoc>
         <loc href="http://www.expath.org/modules/http-client-20100109.html"/>
      </publoc>
      <altlocs>
         <loc href="http://www.expath.org/modules/http-client-20100109.xml">XML</loc>
      </altlocs>
      <latestloc>
         <loc href="http://www.expath.org/modules/http-client.html"/>
      </latestloc>
      <prevlocs>
         <loc href="http://www.expath.org/modules/http-client-20090413.html"/>
         <loc href="http://www.expath.org/modules/http-client-20090302.html"/>
         <loc href="http://www.expath.org/modules/http-client-20090112.html"/>
      </prevlocs>
      <authlist>
         <author>
            <name>Florent Georges</name>
            <affiliation>H2O Consulting</affiliation>
         </author>
      </authlist>
      <copyright>
         <p/>
      </copyright>
      <abstract>
         <p>
            This proposal provides an HTTP client interface for XPath
            2.0.  It defines one extension function to perform HTTP
            requests, and has been designed to be compatible with XQuery 1.0
            and XSLT 2.0, as well as any other XPath 2.0 usage.
         </p>
      </abstract>
      <status>
         <p>Must be ignored, but is required by the schema...</p>
      </status>
      <langusage>
         <language>langusage</language>
      </langusage>
      <revisiondesc>
         <p>revisiondesc</p>
      </revisiondesc>
   </header>
   <body>
      <div1>
         <head>Introduction</head>
         <p>TODO: Add @href to http:response, taking redirects into account.</p>
         <div2>
            <head>Namespace conventions</head>
            <p>The module defined by this document does define one function in the namespace
                  <code>http://expath.org/ns/http-client</code>. In this document, the
                  <code>http</code> prefix, when used, is bound to this namespace URI.</p>
            <p>Error codes are defined in the namespace <code>http://expath.org/ns/error</code>. In
               this document, the <code>err</code> prefix, when used, is bound to this namespace
               URI.</p>
         </div2>
         <div2>
            <head>Error management</head>
            <p>Error conditions are identified by a code (a <code>QName</code>). When such an error
               condition is reached during the execution of the function, a dynamic error is thrown,
               with the corresponding error code (as if the standard XPath function
                  <code>error</code> had been called).</p>
            <p>Error codes are defined through the spec. For too many reasons to enumerate here, the
               HTTP protocol layer can raise an error. In this case, if the error condition is not
               mentioned explicitly in the spec, the implementation must raise an error with an
               appropriate message <bibref ref="errHC001"/>.</p>
         </div2>
      </div1>
      <div1>
         <head>The <code>http:send-request</code> function</head>
         <p>This module defines an XPath extension function that sends an HTTP request and return
            the corresponding response. It supports HTTP multi-part messages. Here is the signature
            of this function:</p>
         <eg>
<fg:function>http:send-request</fg:function>($request as <fg:type>element(http:request)?</fg:type>,
                  $href as <fg:type>xs:string?</fg:type>,
                  $bodies as <fg:type>item()*</fg:type>) as <fg:type>item()+</fg:type></eg>
         <ulist>
            <item>
               <p><code>$request</code> contains the various parameters of the request, for instance
                  the HTTP method to use or the HTTP headers. Among other things, it can also
                  contain the other param's values: the URI and the bodies. If they are not set as
                  parameter to the function, their value in $request, if any, is used instead. See
                  the following section for the detailed definition of the http:request element. If
                  the parameter does not follow the grammar defined in this spec, this is an error
                     <bibref ref="errHC005"/>.</p>
            </item>
            <item>
               <p><code>$href</code> is the HTTP or HTTPS URI to send the request to. It is an
                  xs:anyURI, but is declared as a string to be able to pass literal strings (without
                  requiring to explicitly cast it to an xs:anyURI).</p>
            </item>
            <item>
               <p><code>$bodies</code> is the request body content, for HTTP methods that can
                  contain a body in the request (e.g. POST). This is an error if this param is not
                  the empty sequence for methods that must be empty (e.g. DELETE). The details of
                  the methods are defined in their respective specs (e.g. <bibref ref="rfc2616"/> or
                     <bibref ref="rfc4918"/>). In case of a multipart request, it can be a sequence
                  of several items, each one is the body of the corresponding body descriptor in
                     <code>$request</code>. See below for details.</p>
            </item>
         </ulist>
         <p>Besides the 3-params signature above, there are 2 other signatures that are convenient
            shortcuts (corresponding to the full version in which corresponding params have been set
            to the empty sequence). They are:</p>
         <eg>
<fg:function>http:send-request</fg:function>($request as <fg:type>element(http:request)</fg:type>) as <fg:type>item()+</fg:type>
<fg:function>http:send-request</fg:function>($request as <fg:type>element(http:request)?</fg:type>,
                  $href as <fg:type>xs:string?</fg:type>) as <fg:type>item()+</fg:type></eg>
      </div1>
      <div1>
         <head>Sending a request</head>
         <p>
            The functions defined in this module make one able to send a request to an
            HTTP server and receive the corresponding response.  Here is how the request
            is represented by the parameters to this function, and how they are used
            to generate the actual HTTP request to send.
         </p>
         <div2>
            <head>The request elements</head>
            <p>
               The <code>http:request</code> element represents all the needed
               information to send the HTTP request.  So it is always possible
               to create such an element that will carry over all the needed info
               for a particular request.  For some of those values though, you
               can use an additional param instead.  For instance, some signatures
               define the parameter <code>$href</code>.  If the value of this parameter
               is not the empty sequence, it will then be used instead of the value
               of the attribute <code>href</code> on the <code>http:request</code>
               element.
            </p>
            <eg>
&lt;http:request method = ncname
              href? = uri
              status-only? = boolean
              username? = string
              password? = string
              auth-method? = string
              send-authorization? = boolean
              override-media-type? = string
              follow-redirect? = boolean
              timeout? = integer>
   (http:header*,
     (http:multipart|
      http:body)?)
&lt;/http:request></eg>
            <ulist>
               <item>
                  <p>
                     <code>method</code> is the HTTP verb to use, as GET, POST, etc.  It is case
                     insensitive
                  </p>
               </item>
               <item>
                  <p>
                     <code>href</code> is the URI the request has to be sent to.  It can be overridden
                     by the parameter <code>$href</code>
                  </p>
               </item>
               <item>
                  <p>
                     <code>status-only</code> control how the response will look like; if it is
                     true, only the status code and the headers are returned, the content is not
                     (no http:body nor http:multipart, nor the interpreted additional value in the
                     returned sequence, see hereafter).
                  </p>
               </item>
               <item>
                  <p>
                     <code>username</code>, <code>password</code>, <code>auth-method</code>
                     and <code>send-authorization</code> are used for authentication (see section
                     below).
                  </p>
               </item>
               <item>
                  <p>
                     <code>override-media-type</code> is a MIME type.  It can be used only with
                     <code>http:request</code>, and will override the Content-Type header returned
                     by the server.
                  </p>
               </item>
               <item>
                  <p>
                     <code>follow-redirect</code> control whether an HTTP redirect is automatically
                     followed or not.  If it is false, the HTTP redirect is returned as the response.
                     If it is true (the default) the function tries to follow the redirect, by sending
                     the same request to the new address (including body, headers, and authentication
                     credentials).  Maximum one redirect is followed (there is no attempt to follow
                     a redirect in response to following a first redirect).
                  </p>
               </item>
               <item>
                  <p><code>timeout</code> is the maximum number of seconds to wait for the server to
                     respond. If this time duration is reached, an error is thrown <bibref
                        ref="errHC006"/>. (TODO: Allow one to ask for an empty sequence
                     instead?)</p>
               </item>
               <item>
                  <p>
                     <code>http:header</code> represent an HTTP header, either in the
                     <code>http:request</code> or in the <code>http:response</code> elements, as
                     defined below.
                  </p>
               </item>
               <item>
                  <p>
                     <code>http:multipart</code> represents a multi-part body, either in a request
                     or a response, as defined below.
                  </p>
               </item>
               <item>
                  <p><code>http:body</code> represents the body, either of a request or a response,
                     as defined below.</p>
               </item>
            </ulist>
            <p>The <code>http:header</code> element represents an HTTP header, either in a request
               or in a response:</p>
            <eg>
&lt;http:header name = string
             value = string/></eg>
            <p>The <code>http:body</code> element represents the body of either an HTTP request or
               of an HTTP response (in multipart requests and responses, it represents the body of a
               single one part):</p>
            <eg>
&lt;http:body media-type = string
           src? = uri
           method? = "xml" | "html" | "xhtml" | "text" | "binary"
             | qname-but-not-ncname
           byte-order-mark? = "yes" | "no"
           cdata-section-elements? = qnames
           doctype-public? = string
           doctype-system? = string
           encoding? = string
           escape-uri-attributes? = "yes" | "no"
           indent? = "yes" | "no"
           normalization-form? = "NFC" | "NFD" | "NFKC" | "NFKD"
             | "fully-normalized" | "none" | nmtoken
           omit-xml-declaration? = "yes" | "no"
           standalone? = "yes" | "no" | "omit"
           suppress-indentation? = qnames
           undeclare-prefixes? = "yes" | "no"
           output-version? = nmtoken>
   any*
&lt;/http:body></eg>
            <p>The <code>media-type</code> is the MIME media type of the body part. It is mandatory.
               In a request it is given by the user and is the default value of the Content-Type
               header if it is not set explicitly. In a response, it is given by the implementation
               from the Content-Type header returned by the server. The <code>src</code> attribute
               can be used in a request to set the body content as the content of the linked
               resource instead of using the children of the <code>http:body</code> element. When
               this attribute is used, only the <code>media-type</code> attribute must also be
               present, and there can be neither content in the <code>http:body</code> element, nor
               any other attribute, or this is an error <bibref ref="errHC004"/>.</p>
            <p>All the attributes, except <code>src</code>, are used to set the corresponding
               serialization parameter defined in <bibref ref="xserial"/>, as defined for the XPath
               2.1 function <code>fn:serialize()</code>
               <bibref ref="fo11"/>. A difference here is that the serialization parameter
                  <code>include-content-type</code> does not make sense, so it is not available on
               the <code>http:body</code> element (its value is always "yes"). Those attributes can
               be given by the user on a request to control the way a part body is serialized. In
               the response, the implementation can, but is not required, to provide some of them if
               it has the corresponding information (some of them do not make any sense in a
               response, therefore they will never be on a response element, for instance
                  <code>output-version</code>).</p>
            <p>The <code>http:multipart</code> element represents an HTTP multi-part request or
               response:</p>
            <eg>
&lt;http:multipart media-type = string
                boundary? = string>
   (http:header*,
    http:body)+
&lt;/http:multipart></eg>
            <p>The <code>media-type</code> attribute is the media type of the whole request or
               response, and has to be a multipart media type (that is, its main type must be
                  <code>multipart</code>). The <code>boundary</code> attribute is the boundary
               marker used to separate the several parts in the message (the value of the attribute
               is prefixed with "<code>--</code>" to form the actual boundary marker in the request;
               on the other way, this prefix is removed from the boundary marker in the response to
               set the value of the attribute).</p>
         </div2>
         <div2>
            <head>Serializing the request content</head>
            <p>If the request can have content (one body or several body parts), it can be specified
               by the <code>http:multipart</code> element, the <code>http:body</code> element,
               and/or the parameter <code>$bodies</code>. For each body, the content of the HTTP
               body is generated as follow.</p>
            <p>Except when its attribute <code>src</code> is present, a <code>http:request</code>
               element can have several attributes representing serialization parameters, as defined
               in <bibref ref="xserial"/>. This spec defines in addition the method
                  <code>'binary'</code>; in this case the body content must be either an
               xs:hexBinary or an xs:base64Binary item, and no other serialization parameter can be
               set besides <code>media-type</code>.</p>
            <p>The default value of the serialization method depends on the <code>media-type</code>:
               it is <code>'xml'</code> if it is an XML media type, <code>'html'</code> if it is an
               HTML media type, <code>'xhtml'</code> if it is <code>application/xhtml+xml</code>,
                  <code>'text'</code> if it is a textual media type, and <code>'binary'</code> for
               any other case.</p>
            <p>When a body element has an empty content (i.e. it has no child node at all) its
               content is given by the parameter <code>$bodies</code>. In a single part request,
               this param must have at most one item. If the body is empty, the param cannot be the
               empty sequence. In a multipart request, <code>$bodies</code> must have as many items
               as there are empty body elements. If there are three empty body elements, the content
               of the first of them is <code>$bodies[1]</code>, and so on. The number of empty body
               elements must be equal to the number of items in <code>$bodies</code>.</p>
         </div2>
         <div2>
            <head>Authentication</head>
            <p>
               HTTP authentication when sending a request is controlled by the attributes 
               <code>username</code>, <code>password</code>, <code>auth-method</code> and
               <code>send-authorization</code> on the element <code>http:request</code>.
               If <code>username</code> has a value, <code>password</code> and
               <code>auth-method</code> must have a value too.  And if any one of the three
               other attributes have been set, <code>username</code> must be set too.
            </p>
            <p>
               The attribute <code>auth-method</code> can be either <code>"Basic"</code> or
               <code>"Digest"</code>, but other values can also be used, in an
               implementation-defined way.  The handling of those attributes must be done
               in conformance to <bibref ref="rfc2617"/>.  If <code>send-authorization</code>
               is true (default value is false) and the authentication method supports
               generating the header <code>Authorization</code> without challenge, the
               request contains this header.  The default value is to send a non-authenticated
               request, and if the response is an authentication challenge, then only send
               the credentials in a second message.
            </p>
         </div2>
      </div1>
      <div1>
         <head>Dealing with the response</head>
         <p>
            After having sent the request to the HTTP server, the function waits for
            the response.  It analyses it and returns a sequence representing this
            response.  This sequence has an <code>http:response</code> element as
            first item, which is followed be an additional item for each body or
            body part in the response.
         </p>
         <div2>
            <head>The result element</head>
            <eg>
&lt;http:response status = integer
                  message = string>
   (http:header*,
     (http:multipart |
      http:body)?)
&lt;/http:response></eg>
            <p>
               This is the first item returned by the function defined in this module.
               The <code>status</code> attribute is the HTTP status code returned by the
               server, and <code>message</code> is the message coming with the status on the
               status line.  The <code>http:header</code> elements are as defined for the
               request, but represent instead the response headers.  The <code>http:body</code>
               and <code>http:multipart</code> elements are also like in the request, but
               <code>http:body</code> elements must be empty.
            </p>
         </div2>
         <div2>
            <head>Representing the result content</head>
            <p>
               Instead of being inserted within the <code>http:response</code> element, the
               content of each body is returned as a single item in the return sequence.
               Each item is in the same order (after the <code>http:response</code> element)
               than the <code>http:body</code> elements.  For each body, the way this item
               is built from the HTTP response is as follow.
            </p>
            <p>
               If the <code>status-only</code> attribute has the value <code>true</code>
               (default is <code>false</code>), the returned sequence will only contain the
               <code>http:response</code> element (with the headers, but also the empty
               <code>http:body</code> or <code>http:multipart</code> elements, as if
               <code>status-only</code> was false), and the following items, representing
               the bodies content are not generated from the HTTP response.
            </p>
            <p>
               For each body that has to be interpreted, the following rules apply in order to
               build the corresponding item.  If the body media type is a text media type, the
               item is a string, containing the body content.  If the media type is an XML
               media type, the content is parsed and the item is the resulting document node.
               If the media type is an HTML type, the content is <emph>tidied up</emph> and
               parsed (this process is implementation-dependant) and the item is the resulting
               document node.  If this is a binary media type, the content is returned as a
               base64Binary item.  From the previous rules, a result item can then be either a
               document node (from XML or HTML), a string or a base64Binary.
            </p>
            <p>When the type of a part is either XML or HTML, its body has to be parsed into a
               document node. If there is any error when parsing the content, an error is raised
               with an appropriate message <bibref ref="errHC002"/>.</p>
            <p>If the attribute <code>override-media-type</code> is set on the request, its value is
               used instead of the Content-Type returned by the HTTP server. If the Content-Type of
               the response is a multipart type, the value of <code>override-media-type</code> can
               only be a multipart type, or <code>application/octet-stream</code> (to get the raw
               entity as a binary item). If it is not, this is an error <bibref ref="errHC003"
               />.</p>
            <!--p>
               TODO: we should be able to select the part to use (for instance, for
               a multipart text and xhtml, always select xhtml...)
            </p-->
         </div2>
      </div1>
      <div1>
         <head>Content types handling</head>
         <p>
            In both requests and responses, MIME type strings are used to choose the way the
            entity content has to be respectively serialized or parsed.  Four different kinds
            of type are defined here, which are used in the above text about sending request
            and receiving response.  The intent is to provide the spirit of the entity content
            handling regarding its content type, but an implementation is encouraged to deviate
            from those rules if it is obvious that a particular type should be treated in a
            specific way (normally, that would be the case only to treat a binary type as
            another type).
         </p>
         <ulist>
            <item>
               <p>
                  An XML media type has a MIME type of <code>text/xml</code>,
                  <code>application/xml</code>, <code>text/xml-external-parsed-entity</code>,
                  or <code>application/xml-external-parsed-entity</code>, as defined in
                  <bibref ref="rfc3023"/> (except that <code>application/xml-dtd</code> is
                  considered a text media type).  MIME types ending by <code>+xml</code> are
                  also XML media types.
               </p>
            </item>
            <item>
               <p>
                  An HTML media type has a MIME type of <code>text/html</code>.
               </p>
            </item>
            <item>
               <p>
                  Text media types are the remaining types beginning with <code>text/</code>.
               </p>
            </item>
            <item>
               <p>
                  Binary types are all the other types.  An implementation can treat some of
                  those binary types as either an XML, HTML or text media type if it is more
                  appropriate (this is implementation-defined).
               </p>
            </item>
         </ulist>
      </div1>
   </body>
   <back>
      <div1>
         <head>References</head>
         <p>
            The structure of most of the elements and most of the attributes used in this
            candidate are inspired from the corresponding step in <bibref ref="xproc"/>.
         </p>
         <blist>
            <bibl id="tidy" key="HTML Tidy">
               <loc href="http://tidy.sf.net/">HTML Tidy Library Project</loc>. SourceForge project.
            </bibl>
            <bibl id="rfc1521" key="RFC 1521">
               <loc href="http://www.ietf.org/rfc/rfc1521.txt">RFC 1521: MIME (Multipurpose Internet
                  Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of
                  Internet Message Bodies</loc>.  N. Borenstein, N. Freed, editors. Internet
               Engineering Task Force. September, 1993.
            </bibl>
            <bibl id="rfc2616" key="RFC 2616">
               <loc href="http://www.ietf.org/rfc/rfc2616.txt">RFC 2616: Hypertext Transfer Protocol
               — HTTP/1.1</loc>.  R. Fielding, J. Gettys, J. Mogul, et. al., editors. Internet
               Engineering Task Force. June, 1999.
            </bibl>
            <bibl id="rfc2617" key="RFC 2617">
               <loc href="http://www.ietf.org/rfc/rfc2617.txt">RFC 2617: HTTP Authentication: Basic
               and Digest Access Authentication</loc>.  J. Franks, P. Hallam-Baker, J. Hostetler,
               S. Lawrence, P. Leach, A. Luotonen, L. Stewart. June, 1999.
            </bibl>
            <bibl id="rfc3023" key="RFC 3023">
               <loc href="http://www.ietf.org/rfc/rfc3023.txt">RFC 3023: XML Media Types</loc>.
               M. Murata, S. St. Laurent, and D. Kohn, editors. Internet Engineering Task Force.
               January, 2001.
            </bibl>
            <bibl id="rfc4918" key="RFC 4918">
               <loc href="http://www.ietf.org/rfc/rfc4918.txt">RFC 4918: HTTP Extensions for Web
                  Distributed Authoring and Versioning (WebDAV)</loc>. L. Dusseault, editor.
               Internet Engineering Task Force. June, 2007. </bibl>
            <bibl id="xserial" key="Serialization">
               <loc href="http://www.w3.org/TR/xslt-xquery-serialization/">XSLT 2.0 and XQuery 1.0
               Serialization</loc>.  Scott Boag, Michael Kay, Joanne Tong, Norman Walsh, and Henry
               Zongaro, editors. W3C Recommendation. 23 January 2007.
            </bibl>
            <bibl id="fo11" key="F&amp;O 1.1">
               <loc href="http://www.w3.org/TR/xpath-functions-11/">XPath and XQuery Functions and
                  Operators 1.1</loc>. Michael Kay, editor. W3C Working Draft. 15 January 2009.
            </bibl>
            <bibl id="tagsoup" key="TagSoup">
               <loc href="http://ccil.org/~cowan/XML/tagsoup/">TagSoup - Just Keep On Truckin'</loc>.
               John Cowan.
            </bibl>
            <bibl id="xproc" key="XProc">
               <loc href="http://www.w3.org/TR/xproc/">XProc: An XML Pipeline Language</loc>.
               N. Walsh, A. Milowski, and H. S. Thompson, editors. W3C Candidate Recommendation.
               26 November 2008.
            </bibl>
            <bibl id="xslt20" key="XSLT 2.0">
               <loc href="http://www.w3.org/TR/xslt20/">XSL Transformations (XSLT) Version 2.0</loc>.
               Michael Kay, editor. W3C Recommendation. 23 January 2007.
            </bibl>
         </blist>
      </div1>
      <div1>
         <head>Summary of Error Conditions</head>
         <blist>
            <bibl id="errHC001" key="err:HC001">An HTTP error occurred.</bibl>
            <bibl id="errHC002" key="err:HC002">Error parsing the entity content as XML or HTML.</bibl>
            <bibl id="errHC003" key="err:HC003">With a multipart response, the override-media-type
               must be either a multipart media type or application/octet-stream.</bibl>
            <bibl id="errHC004" key="err:HC004">The src attribute on the body element is mutually
               exclusive with all other attribute (except the media-type).</bibl>
            <bibl id="errHC005" key="err:HC005">The request element is not valid.</bibl>
            <bibl id="errHC006" key="err:HC006">A timeout occurred waiting for the response.</bibl>
         </blist>
      </div1>
   </back>
</spec>
