Splitting XML files

Background

This article discusses two different techniques for splitting XML files into separate items within the search index.

Splitting XML using the indexer

The Funnelback indexer includes built-in support for splitting XML files using a specified XPath.

Using the indexer to split the XML document is ideal if the XML source does not need to be transformed in any way.

After splitting, each record matched by the element path will be indexed as a separate document within Funnelback.
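The effect of splitting on an element path can be illustrated outside Funnelback. The following is a minimal Python sketch of the concept only, not of the indexer's implementation: the sample XML, the `split_records` function and the `./product` path are all hypothetical. Python's ElementTree supports only a subset of XPath and requires a path relative to the root element, so the path shown here will not necessarily match the form Funnelback expects.

```python
import xml.etree.ElementTree as ET

# Hypothetical source file containing two records.
SAMPLE = """<products>
  <product><id>1</id><name>Widget</name></product>
  <product><id>2</id><name>Gadget</name></product>
</products>"""


def split_records(xml_text, record_path):
    """Return each element matched by record_path as its own XML string."""
    root = ET.fromstring(xml_text)
    # Each matched element becomes one standalone document; strip() removes
    # the whitespace that followed the element in the source file.
    return [
        ET.tostring(element, encoding="unicode").strip()
        for element in root.findall(record_path)
    ]


records = split_records(SAMPLE, "./product")
for record in records:
    print(record)
```

Each string in `records` corresponds to what would be indexed as a separate document after splitting.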

Method: XML processing options (Funnelback 15.14 and newer)

The XML processing screen, available from the administer tab in the administration interface, allows configuration of the XPath used to split the XML. Set this XPath as the XML document splitting field.

Method: xml.cfg (Funnelback 15.12 and earlier)

Funnelback 15.12 and earlier used xml.cfg to configure both the XML field mappings and other XML options.

The docurl field in xml.cfg sets the XPath used to split the XML file into individual documents.

Splitting XML using the filter framework (Funnelback 15.8 and newer)

The filter framework in Funnelback 15.8 and newer can be used to split an XML document into multiple documents that can then be processed further in subsequent filters within the collection’s filter chain.

A string document filter can be implemented that parses the input document text into an XML object, and then splits it into separate documents with unique URLs.
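The logic of such a filter can be sketched as follows. This is an illustration in Python under stated assumptions, not the Funnelback filter API (Funnelback filters are written in Groovy) and not the sample filter's code. The `Doc` type, the `split_document` function and the `#record-N` URL scheme are hypothetical; a real filter may derive unique URLs differently, for example from an identifier inside each record.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Doc(NamedTuple):
    """Hypothetical stand-in for a document passing through a filter chain."""
    url: str
    content: str


def split_document(doc: Doc, record_path: str) -> List[Doc]:
    """Split one XML document into one document per matched record."""
    root = ET.fromstring(doc.content)
    records = root.findall(record_path)
    if not records:
        # Nothing matched: pass the original document through unchanged.
        return [doc]
    parts = []
    for index, record in enumerate(records, start=1):
        record.tail = None  # drop whitespace that followed the element
        parts.append(
            Doc(
                # Assumed URL scheme: source URL plus a per-record fragment.
                url=f"{doc.url}#record-{index}",
                content=ET.tostring(record, encoding="unicode"),
            )
        )
    return parts


source = Doc("http://example.com/feed.xml",
             "<feed><item>a</item><item>b</item></feed>")
parts = split_document(source, "./item")
for part in parts:
    print(part.url, part.content)
```

Returning the original document when nothing matches keeps the filter safe to apply to inputs that do not contain the expected records.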

A sample filter that can be used for XML document splitting is available on the Funnelback GitHub site. Note that it has only had basic testing in Funnelback 15.18. It should work in earlier versions, but versions that do not support the Grapes/Grab syntax used in the filter will require some modification. The code can also be adapted to use Groovy’s XML parser, although element paths must then be specified using GPaths instead of XPaths.

Download: SplitXML filter