xmlTreeParse               package:XML               R Documentation

_X_M_L _P_a_r_s_e_r

_D_e_s_c_r_i_p_t_i_o_n:

     Parses an XML file or string and generates an R structure
     representing the XML tree.

_U_s_a_g_e:

     xmlTreeParse(file, ignoreBlanks=TRUE, handlers=NULL,
                  replaceEntities=FALSE, asText=FALSE, trim=TRUE,
                  validate=FALSE, getDTD=FALSE, isURL=FALSE, asTree=FALSE)

_A_r_g_u_m_e_n_t_s:

    file: The name of the file containing the XML contents. This can
          contain `~', which is expanded to the user's home directory.
          It can also be a URL; see `isURL'. Additionally, the file can
          be compressed (gzip), in which case it is read directly
          without the user having to decompress (gunzip) it.

ignoreBlanks: logical value indicating whether text elements made up
          entirely of white space should be omitted from the resulting
          tree.

handlers: Optional collection of functions used to map the different
          XML nodes to R objects. This is a named list of functions,
          and a closure can be used to provide local data. This
          provides a way of filtering the tree as it is being created,
          adding or removing nodes, and generally processing them as
          they are constructed in the C code.

replaceEntities: logical value indicating whether to substitute entity
          references with their text directly. This should be left as
          `FALSE'. The text still appears as the value of the node, but
          more information about its source is retained, allowing the
          parse to be reversed with full reference information.

  asText: logical value indicating that the first argument, `file',
          should be treated as the XML text to parse, not the name of
          a file. This allows the contents of documents retrieved from
          different sources (e.g. HTTP servers, XML-RPC) to be parsed
          with this parser.

    trim: logical value indicating whether to strip white space from
          the beginning and end of text strings.

validate: logical value indicating whether to use a validating parser,
          that is, whether to check the contents against the DTD
          specification. If this is `TRUE', warning messages will be
          displayed about errors in the DTD and/or document, but
          parsing will proceed unless a fatal error occurs.

  getDTD: logical flag indicating whether the DTD (both internal and
          external) should be returned along with the document nodes.
          This changes the return type.

   isURL: logical value indicating whether the `file' argument refers
          to a URL (accessible via ftp or http) or a regular file on
          the system. If `asText' is `TRUE', this should not be
          specified. The function attempts to determine whether the
          data source is a URL by using `grep' to look for http or ftp
          at the start of the string. The connection to servers is
          handled by the libxml parser, not by the R facilities (e.g.
          `scan.url').

  asTree: this only applies when one passes a value for the `handlers'
          argument; it then determines whether the DOM tree or the
          `handlers' object is returned.

_D_e_t_a_i_l_s:

     The `handlers' argument is used similarly to the one specified in
     xmlEventParse. When an XML tag (element) is processed, we look for
     a function in this collection with the same name as the tag's
     name. If this is not found, we look for one named startElement.
     If this is not found either, we use the default built-in
     converter. The same works for comments, entity references, etc.
     The default entries should be named `comment', `startElement',
     `externalEntity', `processingInstruction' and `text'. They should
     take the XMLNode as their first argument. In the future, other
     information may be passed via `...', for example the depth in the
     tree. Specifically, the second argument will be the parent node
     into which the node is being added; this is not currently
     implemented, so that argument should have a default value
     (`NULL').

     Each of these functions can return arbitrary values that are then
     entered into the tree in place of the default node passed to the
     function as the first argument.  This allows the caller to
     generate the nodes of the resulting document tree exactly as they
     wish.  In the future, if the function returns `NULL', the node
     will be dropped from the tree.
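
     As a hedged illustration of the lookup order described above, the
     following sketch parses a small inline document. The document text
     and the variable `seen' are made up for this example, and `...' in
     each handler simply absorbs any extra arguments the parser may
     pass. It is a sketch under these assumptions, not a definitive
     recipe.

     ```r
     library(XML)

     # Hypothetical inline document, used only for illustration.
     txt <- '<doc><item id="a">first</item><item id="b">second</item></doc>'

     # Records which handler was invoked for which element.
     seen <- character(0)

     handlers <- list(
       # Tag-specific handler: looked up first, by the element's name.
       item = function(node, ...) {
         seen <<- c(seen, paste("item:", xmlName(node)))
         node                      # return the node unchanged
       },
       # Fallback for elements without a handler of their own (<doc>).
       startElement = function(node, ...) {
         seen <<- c(seen, paste("startElement:", xmlName(node)))
         node
       }
     )

     # asTree = TRUE asks for the tree rather than the handlers object.
     doc <- xmlTreeParse(txt, asText = TRUE, handlers = handlers,
                         asTree = TRUE)
     print(seen)
     ```

     Here the `item' handler is used for the <item> elements, while
     <doc>, which has no handler of its own, falls through to
     `startElement'.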

_V_a_l_u_e:

     By default, an object of class XML doc is returned, which contains
     fields/slots named  `file', `version' and `children'. 

  `file': The (expanded) name of the file containing the XML.

`version': A string identifying the version of XML used by the
          document.

`children': A list of the XML nodes at the top of the document. Each of
          these is of class `XMLNode' and is made up of 4 fields.

          `_n_a_m_e' The name of the element.

          `_a_t_t_r_i_b_u_t_e_s' For regular elements, a named list of XML
                 attributes converted from the  <tag x="1" y="abc">

          `_c_h_i_l_d_r_e_n' List of sub-nodes.

          `_v_a_l_u_e' Used only for text entries. Some nodes
                 specializations of `XMLNode', such as  `XMLComment',
                 `XMLProcessingInstruction', `XMLEntityRef' are used.

                 If the value of the argument getDTD is TRUE, the
                 return value is a list of length 2.  The first element
                 is as the document as described above.  The second
                 element is a list containing the external and internal
                 DTDs. Each of these contains 2 lists - one for
                 elements and another for entities. See `parseDTD'. 
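
     A minimal sketch of inspecting the returned tree. It assumes the
     accessor functions `xmlRoot', `xmlName', `xmlAttrs', `xmlSize' and
     `xmlValue' from the XML package rather than direct access to the
     fields listed above, and the inline document is made up for
     illustration.

     ```r
     library(XML)

     # Hypothetical inline document, used only for illustration.
     txt <- '<tag x="1" y="abc"><child>some text</child></tag>'
     doc <- xmlTreeParse(txt, asText = TRUE)

     root <- xmlRoot(doc)          # top-level element of the document
     tagName <- xmlName(root)      # name of the element
     attrs <- xmlAttrs(root)       # attributes, accessible by name
     nChildren <- xmlSize(root)    # number of sub-nodes
     childText <- xmlValue(root[[1]])  # text held by the first child

     print(tagName)
     print(attrs)
     print(nChildren)
     print(childText)
     ```

     Using the accessors keeps the code independent of how the fields
     are stored in the node objects.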

_N_o_t_e:

     Make sure that the necessary 3rd party libraries are available.

_A_u_t_h_o_r(_s):

     Duncan Temple Lang

_R_e_f_e_r_e_n_c_e_s:

     <URL: http://xmlsoft.org>,  <URL: http://www.w3.org/xml>

_S_e_e _A_l_s_o:

     xmlEventParse

_E_x_a_m_p_l_e_s:

      fileName <- system.file("data", "test.xml", package="XML")
      # Parse the document and return it in its standard format.
      xmlTreeParse(fileName)

      # Parse the document, discarding comments.
      xmlTreeParse(fileName, handlers=list("comment"=function(x, parent=NULL){NULL}))

      # Print the name and value of each node passed to the `entity' handler.
      invisible(xmlTreeParse(fileName,
                             handlers=list(entity=function(x) {
                               cat("In entity", x$name, x$value, "\n")
                             })))

      # Parse some XML text.
      # Read the text from the file into a single string.
      xmlText <- paste(readLines(fileName), collapse="\n")
      xmlTreeParse(xmlText, asText=TRUE)

      # Read a MathML document and convert each node
      # so that the primary class is 
      #   <name of tag>MathML
      # so that we can use method  dispatching when processing
      # it rather than conditional statements on the tag name.
      # See plotMathML() in examples/.
      fileName <- system.file("data", "mathml.xml", package="XML")
      m <- xmlTreeParse(fileName,
                        handlers=list(startElement=function(node) {
                          # Prepend a class named <name of tag>MathML.
                          cname <- paste(xmlName(node), "MathML", sep="")
                          class(node) <- c(cname, class(node))
                          node
                        }))


       # In this example, we extract _just_ the names of the
       # variables in the mtcars.xml file. 
       # The names are the contents of the <variable>
       # tags. We discard all other tags by returning NULL
       # from the startElement handler.
       #
       # We accumulate the names of the variables in a character
       # vector named `vars'.
       # We define this within a closure and define the 
       # variable function within that closure so that it
       # will be invoked when the parser encounters a <variable>
       # tag.
       # This is called with 2 arguments: the XMLNode object (containing
       # its children) and the list of attributes.
       # We get the variable name via call to xmlValue().

       # Note that we define the closure function in the call and then 
       # create an instance of it by calling it directly as
       #   (function() {...})()

       # Note that we could also get the names by parsing the entire
       # document in the usual manner and then executing
       # xmlSApply(xmlRoot(doc)[[1]], function(x) xmlValue(x[[1]]))
       # which is simpler but more costly in terms of memory.
      fileName <- system.file("data", "mtcars.xml", package="XML")
      doc <- xmlTreeParse(fileName, handlers = (function() {
                            vars <- character(0)
                            list(variable=function(x, attrs) {
                                   # Record the name and drop the node.
                                   vars <<- c(vars, xmlValue(x[[1]]))
                                   NULL
                                 },
                                 startElement=function(x, attr) {
                                   NULL
                                 },
                                 names=function() {
                                   vars
                                 })
                          })())

       # Here we just print the variable names to the console
       # with a special handler.
      doc <- xmlTreeParse(fileName, handlers = list(
                            variable=function(x, attrs) {
                              print(xmlValue(x[[1]]))
                              TRUE
                            }))



      # Parse an XML document directly from a URL.
      # Requires Internet access.
      # The first argument is a URL, not XML text, so `isURL' is used
      # rather than `asText'.
      xmlTreeParse("http://www.omegahat.org/Scripts/Data/mtcars.xml", isURL=TRUE)

