Plugin: Split XML or HTML

Other versions of this plugin may exist. Please ensure you are viewing the documentation for the version that you are currently using. If you are not running the latest version of your plugin we recommend upgrading. See: list of all available versions of this plugin.

Purpose

Splits HTML and XML documents based on XPath or CSS selector patterns.

Usage

Enable the plugin

Enable the split-html-xml-filter plugin on your data source from the extensions screen in the administration dashboard, or add the following data source configuration to enable the plugin:

plugin.split-html-xml-filter.enabled=true
plugin.split-html-xml-filter.version=1.0.0

Add com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter to the filter chain:

filter.classes=<OTHER-FILTERS>:com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter:<OTHER-FILTERS>

The com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter should be placed at an appropriate position in the filter chain. In most circumstances this should be located towards the start of the filter chain, but after any filters that convert source content into XML.

The plugin will take effect after a full update of the data source.

Plugin configuration settings

The following options can be set in the data source configuration to configure the plugin:

  • plugin.split-html-xml-filter.config.defaultXMLSplit: The default split pattern for XML documents processed by the update. This pattern is used if the JSON configuration file is missing, or if it contains no entry matching the URL of the document. The value must be a valid XPath.

  • plugin.split-html-xml-filter.config.defaultHTMLSplit: The default split pattern for HTML documents processed by the update. This pattern is used if the JSON configuration file is missing, or if it contains no entry matching the URL of the document. The value must be a valid Jsoup selector, which is similar to a CSS selector.

  • plugin.split-html-xml-filter.config.keepOriginal: Determines whether the original document is included in the set of documents produced by the split. The default value is false, which means the original document is not included.

Alternatively, the plugin can be configured to run on individual URLs through the use of a configuration file named plugin.split-html-xml-filter.rules.json, saved within the conf/plugin-configuration/split-html-xml-filter directory of the data source. This file can currently only be edited using WebDAV.

The content of the file must be valid JSON with a top-level array. Each entry in the array has the following fields:

  • url: The URL to which the split pattern applies. It must exactly match the URL of the document being processed. Each URL should be defined only once; if the same URL is defined more than once, the last pattern read applies.

  • splitPattern: The pattern used to split the XML or HTML document: an XPath for XML documents, or a Jsoup selector for HTML documents.
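The lookup rules described above (exact URL match, last duplicate wins, fall back to a default pattern) can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the plugin's actual code; the function names load_rules and resolve_pattern, the sample URLs and the sample patterns are all hypothetical.

```python
import json


def load_rules(text):
    """Parse rules JSON text into a url -> splitPattern mapping."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("rules file must contain a top-level array")
    rules = {}
    for entry in entries:
        # A later entry overwrites an earlier one with the same URL,
        # so the last pattern read is the one that applies.
        rules[entry["url"]] = entry["splitPattern"]
    return rules


def resolve_pattern(url, rules, default_pattern=None):
    """Return the pattern for an exact URL match, else the default."""
    return rules.get(url, default_pattern)


# Hypothetical rules text with a duplicated URL.
RULES_TEXT = """
[
    {"url": "http://example.com/a.xml", "splitPattern": "/items/item"},
    {"url": "http://example.com/a.xml", "splitPattern": "/items/entry"}
]
"""

rules = load_rules(RULES_TEXT)
matched = resolve_pattern("http://example.com/a.xml", rules, "/root/record")
fallback = resolve_pattern("http://example.com/b.xml", rules, "/root/record")
print(matched)   # pattern from the last entry for that URL
print(fallback)  # default pattern, because no entry matches
```

Running a check like this against a draft rules file before uploading it over WebDAV is one way to catch invalid JSON or accidental duplicate URLs early.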

The following example shows sample source input files and the corresponding configuration file:

Example: XML source input file (url: http://example.com/bookstore.xml)

<libraries>
    <library>
        <books>
            <book>Book1</book>
            <book>Mybook1</book>
        </books>
    </library>
    <library>
        <books>
            <book>Book2</book>
            <book>Mybook2</book>
        </books>
    </library>
</libraries>

Example: HTML source input file (url: http://example.com/itemdirectory/search.html)

<html>
    <body>
        <div>
            <ul class="item-list">
                <li class="item"> Item 1 </li>
                <li class="item"> Item 2 </li>
                <li class="item"> Item 3 </li>
                <li class="item"> Item 4 </li>
                <li class="item"> Item 5 </li>
                <li class="item"> Item 6 </li>
            </ul>
        </div>
    </body>
</html>

Plugin configuration - plugin.split-html-xml-filter.rules.json file:

[
    {
        "url": "http://example.com/bookstore.xml",
        "splitPattern": "/libraries/library/books/book"
    },
    {
        "url": "http://example.com/itemdirectory/search.html",
        "splitPattern": ".item-list .item"
    }
]

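With the above configuration, the XPath matches the four book elements in the XML file and the selector matches the six item elements in the HTML file, so four items from the XML file and six from the HTML file will be added to the index.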
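Before running a full update, it can be useful to check how many elements a split pattern selects in a sample document. The sketch below does this for the two sample files using Python's standard xml.etree.ElementTree module. It is an approximation made under two assumptions: ElementTree paths are relative to the root element, so the absolute XPath is rewritten in relative form; and Python's standard library has no Jsoup or CSS selector engine, so an XPath stand-in is used for the HTML selector (this only works because the sample HTML is also well-formed XML). The plugin itself may evaluate patterns differently.

```python
import xml.etree.ElementTree as ET

XML_SOURCE = """<libraries>
    <library>
        <books>
            <book>Book1</book>
            <book>Mybook1</book>
        </books>
    </library>
    <library>
        <books>
            <book>Book2</book>
            <book>Mybook2</book>
        </books>
    </library>
</libraries>"""

HTML_SOURCE = """<html>
    <body>
        <div>
            <ul class="item-list">
                <li class="item"> Item 1 </li>
                <li class="item"> Item 2 </li>
                <li class="item"> Item 3 </li>
                <li class="item"> Item 4 </li>
                <li class="item"> Item 5 </li>
                <li class="item"> Item 6 </li>
            </ul>
        </div>
    </body>
</html>"""

# The root element is <libraries>, so /libraries/library/books/book
# is written relative to it as ./library/books/book.
xml_root = ET.fromstring(XML_SOURCE)
xml_matches = xml_root.findall("./library/books/book")

# Assumption: XPath stand-in for the Jsoup selector ".item-list .item".
# It matches on the exact class attribute value, which is stricter
# than a CSS class selector but equivalent for this sample.
html_root = ET.fromstring(HTML_SOURCE)
html_matches = html_root.findall(".//ul[@class='item-list']//li[@class='item']")

print(len(xml_matches), [book.text for book in xml_matches])
print(len(html_matches), [li.text.strip() for li in html_matches])
```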
