Implementer training - Working with XML content

Funnelback can index XML documents and there are some additional configuration files that are applicable when indexing XML files.

  • You can map elements in the XML structure to Funnelback metadata classes.

  • You can display cached copies of the document via XSLT processing.

Funnelback can be configured to index XML content, creating an index with searchable, fielded data.

Funnelback metadata classes are used for the storage of XML data – with configuration that maps XML element paths to internal Funnelback metadata classes – the same metadata classes that are used for the storage of HTML page metadata. An element path is a simple XML X-Path.

XML files can be optionally split into records based on an X-Path. This is useful as XML files often contain a number of records that should be treated as individual result items.

Each record is then indexed with the XML fields mapped to internal Funnelback metadata classes as defined in the XML mappings configuration file.

XML configuration

The data source’s XML configuration defines how Funnelback’s XML parser will process any XML files that are found when indexing.

The XML configuration is made up of two parts:

  1. XML special configuration

  2. Metadata classes containing XML field mappings

The XML parser is used for the parsing of XML documents and also for indexing of most non-web data. The XML parser is used for:

  • XML, CSV and JSON files,

  • Database, social media, directory, HP CM/RM/TRIM and most custom data sources.

XML element paths

Funnelback element paths are simple X-Paths that select on fields and attributes.

Absolute and unanchored X-Paths are supported, however for some special XML fields absolute paths are required.

  • If the path begins with / then the path is absolute (it matches from the top of the XML structure).

  • If the path begins with // it is unanchored (it can be located anywhere in the XML structure).

XML attributes can be used by adding @attribute to the end of the path.

Element paths are case sensitive.

Attribute values are not supported in element path definitions.

Example element paths:

X-Path Valid Funnelback element path

/items/item

VALID

//item/keywords/keyword

VALID

//keyword

VALID

//image@url

VALID

/items/item[@type=value]

NOT VALID

Interpretation of field content

CDATA tags can be used with fields that contain reserved characters or the characters should be HTML encoded.

Fields containing multiple values should be delimited with a vertical bar character or the field repeated with a single value in each repeated field.

e.g. The indexed value of //keywords/keyword and //subject below would be identical.

<keywords>
    <keyword>keyword 1</keyword>
    <keyword>keyword 2</keyword>
    <keyword>keyword 3</keyword>
</keywords>

<subject>keyword 1|keyword 2|keyword 3</subject>

XML special configuration

There are a number of special properties that can be configured when working with XML files. These options are defined from the XML configuration screen, by selecting configure XML processing from the settings panel on the data source management screen.

manage data source panel settings
xml special configuration 01

XML document splitting

A single XML file is commonly used to describe many items. Funnelback includes built-in support for splitting an XML file into separate records.

Absolute X-Paths must be used and should reference the root element of the items that should be considered as separate records.

Splitting an XML document using this option is not available on push indexes. A filter can be used as an alternative approach for splitting documents when using a push index.

Document URL

The document URL field can be used to identify XML fields containing a unique identifier that will be used by Funnelback as the URL for the document. If the document URL is not set then Funnelback auto-generates a URL based on the URL of the XML document. This URL is used by Funnelback to internally identify the document, but is not a real URL.

Setting the document URL to an XML attribute is not supported.

Setting a document url is not available on push collections.

Document file type

The document file type field can be used to identify an XML field containing a value that indicates the filetype that should be assigned to the record. This is used to associate a file type with the item that is indexed. XML records are commonly used to hold metadata about a record (e.g. from a records management system) and this may be all the information that is available to Funnelback when indexing a document from such as system.

Special document elements

The special document elements can be used to tell Funnelback how to handle elements containing content.

Inner HTML or XML documents

The content of the XML field will be treated as a nested document and parsed by Funnelback and must be XML encoded (i.e. with entities) or wrapped in a CDATA declaration to ensure that the main XML document is well-formed.

The indexer will guess the nested document type and select the appropriate parser:

The nested document will be parsed as XML if (once decoded) it is well-formed XML and starts with an XML declaration similar to <?xml version="1.0" encoding="UTF-8" />. If the inner document is identified as XML it will be parsed with the XML parser and any X-Paths of the nested document can also be mapped. Note: the special XML fields configured on the advanced XML processing screen do not apply to the nested document. For example, this means you can’t split a nested document.

The nested document will be parsed as HTML if (once decoded) when it starts with a root <html> tag. Note that if the inner document contains HTML entities but doesn’t start with a root <html> tag, it will not be detected as HTML. If the inner document is identified as HTML and contains metadata then this will be parsed as if it was an HTML document, with embedded metadata and content extracted and associated with the XML records. This means that metadata fields included in the embedded HTML document can be mapped in the metadata mappings along with the XML fields.

The inner document in the example below will not be detected as HTML:

<root>
  <name>Example</name>
  <inner>This is &lt;strong&gt;an example&lt;/strong&gt;</inner>
</root>

This one will:

<root>
  <name>Example</name>
  <inner>&lt;html&gt;This is &lt;strong&gt;an example&lt;/strong&gt;&lt;/html&gt;</inner>
</root>

Indexable document content

Any listed X-Paths will be indexed as un-fielded document content - this means that the content of these fields will be treated as general document content but not mapped to any metadata class (similar to the non-metadata field content of an HTML file).

For example, if you have the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <title>Example</title>
  <inner>
  <![CDATA[
  <html>
    <head>
      <meta name="author" content="John Smith">
    </head>
    <body>
      This is an example
    </body>
  </html>
  ]]>
  </inner>
</root>

With an Indexable document content path to //root/inner, the document content will have This is an example; however, the metadata author will not be mapped to "John Smith". To have the metadata mapped as well, the Inner HTML or XML document path should be used instead.

Include unmapped elements as content

If there are no indexable document content paths mapped, Funnelback can optionally choose to how to handle the unmapped fields. When this option is selected then all unmapped XML fields will be considered part of the general document content.

Tutorial: Creating an XML data source

In this exercise you will create a searchable index based off records contained within an XML file.

A web data source is used to create a search of a content sourced by crawling one or more websites. This can include the fetching of one or more specific URLs for indexing.

For this exercise we will use a web data source, though XML content can exist in many different data sources types (e.g. custom, database, directory, social media).

The XML file that we will be indexing includes a number of individual records that are contained within a single file, an extract of which is shown below.

<?xml version="1.0" encoding="UTF-8"?>
<tsvdata>
  <row>
    <Airport_ID>1</Airport_ID>
    <Name>Goroka</Name>
    <City>Goroka</City>
    <Country>Papua New Guinea</Country>
    <IATA_FAA>GKA</IATA_FAA>
    <ICAO>AYGA</ICAO>
    <Latitude>-6.081689</Latitude>
    <Longitude>145.391881</Longitude>
    <Altitude>5282</Altitude>
    <Timezone>10</Timezone>
    <DST>U</DST>
    <TZ>Pacific/Port_Moresby</TZ>
    <LATLONG>-6.081689;145.391881</LATLONG>
  </row>
  <row>
    <Airport_ID>2</Airport_ID>
    <Name>Madang</Name>
    <City>Madang</City>
    <Country>Papua New Guinea</Country>
    <IATA_FAA>MAG</IATA_FAA>
    <ICAO>AYMD</ICAO>
    <Latitude>-5.207083</Latitude>
    <Longitude>145.7887</Longitude>
    <Altitude>20</Altitude>
    <Timezone>10</Timezone>
    <DST>U</DST>
    <TZ>Pacific/Port_Moresby</TZ>
    <LATLONG>-5.207083;145.7887</LATLONG>
  </row>
  <row>
    <Airport_ID>3</Airport_ID>
    <Name>Mount Hagen</Name>
    <City>Mount Hagen</City>
    <Country>Papua New Guinea</Country>
    <IATA_FAA>HGU</IATA_FAA>
...

Indexing an XML file like this requires two main steps:

  1. Configuring Funnelback to fetch and split the XML file

  2. Mapping the XML fields for each record to Funnelback metadata classes.

Exercise steps

  1. Log in to the search dashboard where you are doing your training.

    See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
  2. Create a new search package named Airports.

  3. Add a web data source to this search package with the following attributes:

    Type

    web

    Name

    Airports data

    What website(s) do you want to crawl?

    https://docs.squiz.net/training-resources/airports/airports.xml

  4. Update the data source.

  5. Create a results page (as part of the Airports search package) named Airport finder.

  6. Run a search for !showall using the search preview. Observe that only one result (to the source XML file) is returned.

    if you place an exclamation mark before a word in a query this negates the term and the results will include items that don’t contain the word. e.g. !showall, used in the example above, means return results that do not contain the word showall.
    exercise creating an xml collection 01
  7. Configure Funnelback to split the XML files into records. To do this we need to inspect the XML file(s) to see what elements are available.

    Ideally you will know the structure of the XML file before you start to create your collection. However, if you don’t know this and the XML file isn’t too large you might be able to view it by inspecting the cached version of the file, available from the drop down link at the end of the URL. If the file is too large the browser may not be able to display the file.

    Inspecting the XML (displayed above) shows that each airport record is contained within the <row> element that is nested beneath the top level <tsvdata> element. This translates to an X-Path of /tsvdata/row.

    Navigate to the manage data source screen and select configure XML processing from the settings panel. The XML processing screen is where all the special XML options are set.

    manage data source panel settings
  8. The XML document splitting field configures the X-Path(s) that are used to split an XML document. Select /tsvdata/row from the listed fields.

    exercise creating an xml collection 03
  9. If possible also set the document URL to an XML field that contains a unique identifier for the XML record. This could be a real URL, or some other sort of ID. Inspecting the airports XML shows that the Airport_ID can be used to uniquely identify the record. Select /tsvdata/row/Airport_ID from the dropdown for the document URL field.

    exercise creating an xml collection 04
    If you don’t set a document URL Funnelback will automatically assign a URL.
  10. Leave the other fields unchanged then click the save button.

  11. The index must be rebuilt for any XML processing changes to be reflected in the search results. Return to the manage data source screen then rebuild the index by selecting start advanced update from the update panel, then selecting rebuild live index.

  12. Run a search for !showall using the search preview and confirm that the XML file is now being split into separate items. The search results display currently only shows the URLs of each of the results (which in this case is just an ID number). In order to display sensible results the XML fields must be mapped to metadata and displayed by Funnelback.

    exercise creating an xml collection 05
  13. Map XML fields by selecting configure metadata mappings from the settings panel.

    manage data source panel settings
  14. The metadata screen lists a number of pre-configured mappings. Because this is an XML data set, the mappings will be of no use so clear all the mappings by selecting clear all metadata mappings from the tools menu.

    exercise creating an xml collection 07
  15. Click the add new button to add a new metadata mapping. Create a class called name that maps the <Name> xml field. Enter the following into the creation form:

    Class

    name

    Type

    Text

    Search behavior

    Searchable as content

    exercise creating an xml collection 08
  16. Add a source to the metadata mapping. Click the add new button in the sources box. This opens up a window that displays the metadata sources that were detected when the index was built. Display the detected XML fields by clicking on the XML path button for the type of source, then choose the Name (/tsvdata/row/Name) field from the list of available choices the click the save button.

    exercise creating an xml collection 09
  17. You are returned to the create mapping screen. Click the add new mapping button to create the mapping.

    exercise creating an xml collection 10
  18. The metadata mappings screen updates to show the newly created mapping for Name.

    exercise creating an xml collection 11
  19. Repeat the above process to add the following mappings. Before adding the mapping switch the editing context to XML - this will mean that XML elements are displayed by default when selecting the sources.

    exercise creating an xml collection 12
    Class name Source Type Search behaviour

    city

    /tsvdata/row/City

    text

    searchable as content

    country

    /tsvdata/row/Country

    text

    searchable as content

    iataFaa

    /tsvdata/row/IATA_FAA

    text

    searchable as content

    icao

    /tsvdata/row/ICAO

    text

    searchable as content

    altitude

    /tsvdata/row/Altitude

    text

    searchable as content

    latlong

    /tsvdata/row/LATLONG

    text

    display only

    latitude

    /tsvdata/row/Latitude

    text

    display only

    longitude

    /tsvdata/row/Longitude

    text

    display only

    timezone

    /tsvdata/row/Timezone

    text

    display only

    dst

    /tsvdata/row/DST

    text

    display only

    tz

    /tsvdata/row/TZ

    text

    display only

  20. Note that the metadata mappings screen is displaying a message:

    These mappings have been updated since the last index, perform a re-index to apply all of these mappings.

    Rebuild the index by navigating to the manage data source screen then selecting start advanced update from the settings panel. Select rebuild live index from the rebuild live index section and click the update button.

  21. Display options will need to be configured so that the metadata is returned with the search results. Reminder: this needs to be set up on the airport finder results page. Edit the results page configuration, and add summary fields to include the name, city, country, iataFaa, icao and altitude metadata to the query processor options:

    -stem=2 -SF=[name,city,country,iataFaa,icao,altitude]
  22. Rerun the search for !showall and observe that metadata is now returned for the search result.

    exercise creating an xml collection 13
  23. Inspect the data model. Reminder: edit the url changing search.html to search.json. Inspect the response element - each result should have fields populated inside the listMetadata sub elements of the result items. These can then be accessed from your Freemarker template and printed out in the search results.

    exercise creating an xml collection 14