XML documents

Funnelback can index XML documents and there are some additional configuration files that are applicable to indexing XML files.

Funnelback can be configured to index XML content, creating an index with searchable, fielded data.

Funnelback metadata classes are used for the storage of XML data — with configuration that maps XML element paths to internal Funnelback metadata classes — the same metadata classes that are used for the storage of HTML page metadata. An element path is a simple XML XPath.

XML files can be optionally split into records based on an XPath. This is useful as XML files often contain a number of records that should be treated as individual result items.

Each record is then indexed with the XML fields mapped to internal Funnelback metadata classes as defined in the XML mappings configuration file.

XML configuration

The collection’s XML configuration defines how Funnelback’s XML parser will process any XML files that are found when indexing.

The XML configuration is made up of two parts:

  1. XML special configuration

  2. Metadata classes containing XML field mappings

The XML parser is used for the parsing of XML documents and also for indexing of most non-web data. The XML parser is used for:

  • XML, CSV and JSON files,

  • Database, social media, directory, HP CM/RM/TRIM and most custom collections.

XML element paths

Funnelback element paths are simple XPaths that select on fields and attributes.

Absolute and unanchored XPaths are supported, however for some special XML fields absolute paths are required.

  • If the path begins with / then the path is absolute (it matches from the top of the XML structure).

  • If the path begins with // it is unanchored (it can be located anywhere in the XML structure).

XML attributes can be used by adding @attribute to the end of the path.

Element paths are case sensitive.

Attribute values are not supported in element path definitions.

Example element paths:

XPath Valid Funnelback element path

/items/item

VALID

//item/keywords/keyword

VALID

//keyword

VALID

//image@url

VALID

/items/item[@type=value]

NOT VALID

Interpretation of field content

  • CDATA tags can be used with fields that contain reserved characters, or the characters should be HTML encoded.

  • Fields containing multiple values should be delimited with a vertical bar character, or the field repeated with a single value in each repeated field.

e.g. The indexed value of //keywords/keyword and //subject below would be identical.

<keywords>
    <keyword>keyword 1</keyword>
    <keyword>keyword 2</keyword>
    <keyword>keyword 3</keyword>
</keywords>

<subject>keyword 1|keyword 2|keyword 3</subject>

XML special configuration

There are a number of special properties that can be configured when working with XML files. These options are defined from the XML configuration screen, by selecting XML processing from the data source management screen of the search dashboard.

xml-admin-1.png

XML document splitting

This option does not apply to push data sources. To split XML for a push data source use the split XML or HTML plugin.

A single XML file is commonly used to describe many items. Funnelback includes built-in support for basic splitting of an XML file into separate records.

There are some limitations when using the built-in splitting:

  • Built-in splitting occurs when the XML is indexed, so it’s not possible to make further modifications to your records by adding filters.

  • The cache controller does not work with records that have been split using the built-in splitter.

  • The built-in splitter does not work with push data sources.

  • Absolute XPaths must be used and should reference the root element of the items that should be considered as separate records.

  • The XPath support is limited to absolute element paths.

The split XML or HTML plugin removes most of these limitations and is now the recommended way of splitting an XML document.

Document URL

The document URL field can be used to identify XML fields containing a unique identifier that will be used by Funnelback as the URL for the document. If the document URL is not set then Funnelback auto-generates a URL based on the URL of the XML document. This URL is used by Funnelback to internally identify the document, but is not a real URL.

  • Setting the document URL to an XML attribute is not supported.

  • Setting a document url is not available on Push collections.

  • If the 'Inner HTML or XML documents' feature is used (see below), the field containing the URL must be above all fields containing inner HTML or XML documents.

Document file type

The document file type field can be used to identify an XML field containing a value that indicates the filetype that should be assigned to the record. This is used to associate a file type with the item that is indexed. XML records are commonly used to hold metadata about a record (e.g. from a records management system) and this may be all the information that is available to Funnelback when indexing a document from such as system.

Special document elements

The special document elements can be used to tell Funnelback how to handle elements containing content.

Inner HTML or XML documents

The content of the XML field will be treated as a nested document and parsed by Funnelback and must be XML encoded (i.e. with entities) or wrapped in a CDATA declaration to ensure that the main XML document is well-formed.

The indexer will guess the nested document type and select the appropriate parser:

  • The nested document will be parsed as XML if (once decoded) it is well-formed XML and starts with an XML declaration similar to <?xml version="1.0" encoding="UTF-8" />. If the inner document is identified as XML it will be parsed with the XML parser and any XPaths of the nested document can also be mapped. Note: the special XML fields configured on the advanced XML processing screen do not apply to the nested document. For example, this means you can’t split a nested document.

  • The nested document will be parsed as HTML if (once decoded) when it starts with a root <html> tag. Note that if the inner document contains HTML entities but doesn’t start with a root <html> tag, it will not be detected as HTML. If the inner document is identified as HTML and contains metadata then this will be parsed as if it was a HTML document, with embedded metadata and content extracted and associated with the XML records. This means that metadata fields included in the embedded HTML document can be mapped in the metadata mappings along with the XML fields.

    You can use the wrap XML elements in HTML tags plugin to wrap a field in HTML tags if your candidate inner document doesn’t have the required <html> tags.
  • If a field in the XML should be parsed as HTML and treated as the title of the record/result, this field must be surrounded by a root <html> tag and a <head> tag and a <title> tag. Without the <head> and <title> tags, the HTML parser will treat that field as if it was content in the body of an HTML document, not as a title.

The inner document in the example below will not be detected as HTML:

<root>
  <name>Example</name
  <inner>This is &lt;strong&gt;an example&lt;/strong&gt;</inner>
</root>

This one will:

<root>
  <name>Example</name>
  <inner>&lt;html&gt;This is &lt;strong&gt;an example&lt;/strong&gt;&lt;/html&gt;</inner>
</root>

Example of an inner HTML document using a CDATA section:

<root>
  <name>Example</name>
  <inner><![CDATA[<html>This is <strong>an example</strong></html>]]></inner>
</root>

Example of a field parsed as HTML and assigned as the title of the record.

<root>
  <name>&lt;html&gt;&lt;head&gt;&lt;title&gt;Department of Family &amp; Support Services&lt;/title&gt;&lt;/head&gt;&lt;/html&gt;</name>
  <description>...</description>
  <url>...</url>
</root>

Example of a field parsed as HTML using a CDATA section and assigned as the title of the record.

<root>
  <name><![CDATA[<html><head><title>Department of Family &amp; Support Services</title></head></html>]]></name>
  <description>...</description>
  <url>...</url>
</root>

Indexable document content

Any listed XPaths will be indexed as unfielded document content - this means that the content of these fields will be treated as general document content but not mapped to any metadata class.

For example, if you have the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <title>Example</title>
  <inner>
  <![CDATA[
  <html>
    <head>
      <meta name="author" content="John Smith">
    </head>
    <body>
      This is an example
    </body>
  </html>
  ]]>
  </inner>
</root>

With an Indexable document content path to //root/inner, the document content will have This is an example; however, the metadata author will not be mapped to "John Smith". To have the metadata mapped as well, the Inner HTML or XML document path should be used instead.

Include unmapped elements as content

If there are no indexable document content paths mapped, Funnelback can optionally choose to how to handle the unmapped fields. When this option is selected then all unmapped XML fields will be considered part of the general document content.

Example: electronic books

Let’s say you had a number of XML files representing electronic-books similar to:

<book>
  <info>
  <title> The Adventures Of Sherlock Holmes </title>
  <author> Arthur Conan Doyle </author>
  </info>

  <contents>
    <chapter>A Scandal in Bohemia</chapter>
    <chapter>The Red-headed League</chapter>
...
    <chapter>The Adventure of the Copper Beeches</chapter>
  </contents>
</book>

Metadata

To map this XML structure to metadata classes for the author (to bookAuthor), title (bookTitle) and chapters (bookChapter), create mappings for the following :

  • map the //author or /book/author XPath to a metadata class named bookAuthor.

  • map the //title or /book/title XPath to a metadata class named bookTitle.

  • map the //chapter, //contents/chapter or /book/contents/chapter XPath to a metadata class named bookChapter.

When this data is indexed, the text from these elements will be indexed and assigned to the specified metadata classes.

Presentation

An XSL stylesheet can be defined to present the XML record as a HTML web page, so that a page is rendered when clicking on the search result link.

  1. Create the template.xsl stylesheet to convert the XML into HTML.

  2. Change the collection’s search forms to use the cache_url instead of the live_url.

Crawling XML files

To crawl XML files you will need to ensure that the crawler.parser.mimeTypes parameter includes text/xml as one of the MIME types the web crawler will accept.