Splitting XML files

Background

This article discusses two different techniques for splitting XML files into separate items within the search index.

Splitting XML using the indexer

The Funnelback indexer includes built-in support for splitting XML files using a specified XPath.

Using the indexer to split the XML document is ideal if the XML source does not need to be transformed in any way.

After splitting, each record matched by the element path will be indexed as a separate document within Funnelback.
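To illustrate what splitting on an element path means, the following minimal sketch uses Python's standard library rather than Funnelback itself. The sample XML, the `split_records` function and the `./item` path are all hypothetical. Note that Python's ElementTree supports only a subset of XPath and evaluates paths relative to the root element, so this shows the concept and not how the indexer is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical source file containing three records.
SAMPLE_XML = """<catalog>
  <item><id>1</id><title>First</title></item>
  <item><id>2</id><title>Second</title></item>
  <item><id>3</id><title>Third</title></item>
</catalog>"""


def split_records(xml_text, record_path):
    """Return each element matched by record_path as its own XML string."""
    root = ET.fromstring(xml_text)
    # Each matching element is serialised separately, so it can be
    # treated as a standalone document.
    return [
        ET.tostring(element, encoding="unicode").strip()
        for element in root.findall(record_path)
    ]


records = split_records(SAMPLE_XML, "./item")
for record in records:
    print(record)
```

Each of the three `item` elements is returned as a separate XML string, which corresponds to one search result per record once indexed.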

Method: XML processing options

The XPath used to split the XML is configured on the XML processing options screen, available from the administer tab in the administration interface. Set the XPath as the value of the XML document splitting field on this screen.

Splitting XML using the filter framework

This method does not apply to the SXC version of Funnelback, where a custom plugin that implements this functionality is required instead.

The filter framework can be used to split an XML document into multiple documents that can then be processed further in subsequent filters within the collection’s filter chain.

A string document filter can be implemented that parses the input document text into an XML object, and then splits it into separate documents with unique URLs.
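The approach described above can be sketched as follows. This is a language-neutral illustration in Python and does not use the Funnelback filter API. The `SplitDocument` type, the `split_document` function and the `/record/N` URL scheme are assumptions made for the example, and a real filter may derive URLs differently (for example, from an ID field within each record).

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class SplitDocument(NamedTuple):
    """Hypothetical container for one split record."""
    url: str
    content: str


def split_document(source_url, xml_text, record_path):
    """Parse the input text as XML and return one document per matched record."""
    root = ET.fromstring(xml_text)
    documents = []
    for index, element in enumerate(root.findall(record_path), start=1):
        # Assumed URL scheme: append the record position to the source URL
        # so that every split document has a unique URL.
        url = f"{source_url}/record/{index}"
        content = ET.tostring(element, encoding="unicode").strip()
        documents.append(SplitDocument(url, content))
    return documents


docs = split_document(
    "http://example.com/feed.xml",
    "<feed><entry>a</entry><entry>b</entry></feed>",
    "./entry",
)
for doc in docs:
    print(doc.url, doc.content)
```

Deriving the URL from the record position is the simplest way to guarantee uniqueness, but the URLs will change if records are reordered in the source file. A stable identifier taken from the record itself is usually a better choice when one exists.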

A sample filter that can be used for XML document splitting is available on the Funnelback GitHub site. Note that it has only had basic testing in Funnelback 15.18. It should also work in earlier versions, although versions that don’t support the Grapes/Grab syntax used in the filter will require some modification. The code can also be adapted to use Groovy’s XML parser, though element paths must then be specified using GPaths instead of XPaths.

Download: SplitXML filter