Plugin: Split XML or HTML

Other versions of this plugin may exist. Please ensure you are viewing the documentation for the version that you are currently using. If you are not running the latest version of your plugin, we recommend upgrading. See: list of all available versions of this plugin.

Purpose

Splits HTML and XML documents based on XPath or CSS selector patterns.

Usage

Enable the plugin

Enable the split-html-xml-filter plugin on your data source from the extensions screen in the administration dashboard, or add the following data source configuration to enable the plugin:

plugin.split-html-xml-filter.enabled=true
plugin.split-html-xml-filter.version=1.0.1

Add com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter to the filter chain:

filter.classes=<OTHER-FILTERS>:com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter:<OTHER-FILTERS>

The com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter should be placed at an appropriate position in the filter chain. In most circumstances this should be located towards the start of the filter chain, but after any filters that convert source content into XML.

The plugin will take effect after a full update of the data source.

Plugin configuration settings

The following options can be set in the data source configuration to configure the plugin:

  • plugin.split-html-xml-filter.config.defaultXMLSplit: This field holds a default split pattern for XML files processed by the update. If the JSON config file is missing, or there are no matching URLs within the JSON configuration, then this default split pattern will be used. The default split path must be a valid XPath.

  • plugin.split-html-xml-filter.config.defaultHTMLSplit: This field holds a default split pattern for HTML documents processed by the update. If the JSON config file is missing, or there are no matching URLs within the JSON configuration, then this default split pattern will be used. The default split path must be a valid Jsoup selector, which is similar to a CSS selector.

  • plugin.split-html-xml-filter.config.keepOriginal: This flag determines whether the original document is included in the set of split result documents. The default value is false, which means the original document is not included.

  • plugin.split-html-xml-filter.config.defaultXMLUrlPath: (Optional) This field sets a default location from which to source the URL. The location is relative to a document after it has been split. If an absolute XPath is used, it must be based on the split document instead of the original document. If this is not set, an auto-generated URL based on the document URL will be assigned to the split documents. The default value defined in this key will be used if there is no JSON config file or there are no matching URLs within the JSON configuration. The default URL path must be a valid XPath.

  • plugin.split-html-xml-filter.config.defaultHTMLUrlPath: (Optional) This field sets a default location from which to source the URL. The location is relative to a document after it has been split. If an absolute path is used, it must be based on the split document instead of the original document. If this is not set, an auto-generated URL based on the document URL will be assigned to the split documents. The default value defined in this key will be used if there is no JSON config file or there are no matching URLs within the JSON configuration. The default URL path must be a valid Jsoup selector, which is similar to a CSS selector.

  • plugin.split-html-xml-filter.config.defaultHtmlUrlPathAttribute: (Optional, is only effective if plugin.split-html-xml-filter.config.defaultHTMLUrlPath is set) This field sets a default attribute of the plugin.split-html-xml-filter.config.defaultHTMLUrlPath Jsoup selector to get the URL. If this is not set, it will use the text inside the Jsoup selector. If the JSON config file is missing, or there are no matching URLs within the JSON configuration, then this default Jsoup selector attribute will be used.

  • plugin.split-html-xml-filter.config.defaultXMLUrlFormat: (Optional, is only effective if plugin.split-html-xml-filter.config.defaultXMLUrlPath is set) This field holds a default format of the generated URL. It uses Java MessageFormat to generate the text. The default format is {0}. If the JSON config file is missing, or there are no matching URLs within the JSON configuration then this default URL format will be used.

  • plugin.split-html-xml-filter.config.defaultHTMLUrlFormat: (Optional, is only effective if plugin.split-html-xml-filter.config.defaultHTMLUrlPath is set) This field holds a default format of the generated URL. It uses Java MessageFormat to generate the text. The default format is {0}. If the JSON config file is missing, or there are no matching URLs within the JSON configuration, then this default URL format will be used.

If the element targeted by the XPath or Jsoup selector does not exist, or the generated URL is not a valid absolute URL, the plugin falls back to the auto-generated URL based on the document URL.
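
The URL format and fallback behaviour described above can be illustrated with a short Python sketch. This is not the plugin's code: the plugin uses Java MessageFormat, which this sketch only approximates for the single {0} placeholder. The function names and the FALLBACK value are hypothetical, since the exact form of the auto-generated URL is not documented here.

```python
from urllib.parse import urlparse


def format_url(url_format, value):
    """Approximate MessageFormat for the single {0} placeholder only."""
    return url_format.replace("{0}", value)


def is_absolute_url(url):
    """Treat a URL as absolute when it has both a scheme and a host."""
    parts = urlparse(url)
    return bool(parts.scheme) and bool(parts.netloc)


def choose_url(extracted, url_format, fallback_url):
    """Return the formatted URL, or the fallback when it cannot be used."""
    if extracted is None:  # the URL path matched nothing in the split document
        return fallback_url
    candidate = format_url(url_format, extracted)
    return candidate if is_absolute_url(candidate) else fallback_url


# Hypothetical stand-in for the plugin's auto-generated URL (assumption).
FALLBACK = "http://example.com/bookstore.xml#auto-1"

# A format that prefixes the extracted value yields an absolute URL.
print(choose_url("/Book1.html", "https://example.com/bookstore/books{0}", FALLBACK))
# With the default format {0}, a relative value is not absolute, so the fallback is used.
print(choose_url("/Book1.html", "{0}", FALLBACK))
# No matching element also results in the fallback.
print(choose_url(None, "{0}", FALLBACK))
```

The sketch shows why a URL format is usually needed when the source element holds a relative path: with the default format of {0}, a value such as /Book1.html is not a valid absolute URL on its own.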

Alternatively, the plugin can be configured to run on individual URLs through the use of a configuration file named plugin.split-html-xml-filter.rules.json, saved within the conf/plugin-configuration/split-html-xml-filter directory of the data source. This file can currently only be edited using WebDAV.

The content of the file must be valid JSON with a top-level array. Each element of the array is an object with the following fields:

  • url: The URL to which the split pattern applies. It must exactly match the URL of the document being processed for the pattern to apply. Each URL must be unique. If the same URL is defined more than once, the last pattern read will apply.

  • splitPattern: Pattern used to split the XML/HTML document.

  • urlPath: (Optional) Pattern used to locate the URL within each split document.

  • urlPathAttr: (Optional, is only effective if urlPath is set and the document is an HTML document) The attribute of the element matched by the urlPath selector from which to read the URL.

  • urlFormat: (Optional, is only effective if urlPath is set) The format of the URL.
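
The matching rules for the url field (exact match, last duplicate wins, default used otherwise) can be sketched in Python as follows. This is an illustration of the documented behaviour, not the plugin's implementation. The helper names and the default pattern /root/item are hypothetical.

```python
import json

# Two rules for the same URL, to show that the last one read applies.
RULES_JSON = """
[
    {"url": "http://example.com/bookstore.xml", "splitPattern": "/libraries/library"},
    {"url": "http://example.com/bookstore.xml", "splitPattern": "/libraries/library/books/book"}
]
"""


def load_rules(text):
    """Index rules by URL; a later duplicate overwrites an earlier one."""
    rules = json.loads(text)
    if not isinstance(rules, list):
        raise ValueError("rules file must contain a top-level array")
    return {rule["url"]: rule for rule in rules}


def split_pattern_for(url, rules, default_pattern):
    """Exact-match lookup, falling back to the data source default."""
    rule = rules.get(url)
    return rule["splitPattern"] if rule else default_pattern


rules = load_rules(RULES_JSON)

# Exact match: the second (last) rule for this URL is used.
print(split_pattern_for("http://example.com/bookstore.xml", rules, "/root/item"))
# Any difference in the URL, such as a query string, means no rule matches.
print(split_pattern_for("http://example.com/bookstore.xml?page=2", rules, "/root/item"))
```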

Examples

Example source input and configuration files are shown below:

Example: XML source input file

<libraries>
    <library>
        <books>
            <book>
                <name>Book1</name>
                <link>/Book1.html</link>
            </book>
            <book>
                <name>Mybook1</name>
                <link>/MyBook1.html</link>
            </book>
        </books>
    </library>
    <library>
        <books>
            <book>
                <name>Book2</name>
                <link>/Book2.html</link>
            </book>
            <book>
                <name>Mybook2</name>
                <link>/MyBook2.html</link>
            </book>
        </books>
    </library>
</libraries>

Example: HTML source input file

<html>
    <body>
        <div>
            <ul class="item-list">
                <li class="item"> <a href="/items1.html">Item 1</a> </li>
                <li class="item"> <a href="/items2.html">Item 2</a> </li>
                <li class="item"> <a href="/items3.html">Item 3</a> </li>
                <li class="item"> <a href="/items4.html">Item 4</a> </li>
                <li class="item"> <a href="/items5.html">Item 5</a> </li>
                <li class="item"> <a href="/items6.html">Item 6</a> </li>
            </ul>
        </div>
    </body>
</html>

Example: plugin configuration

Configuration file: plugin.split-html-xml-filter.rules.json

[
    {
        "url": "http://example.com/bookstore.xml",
        "splitPattern": "/libraries/library/books/book",
        "urlPath": "//link",
        "urlFormat": "https://example.com/bookstore/books{0}"
    },
    {
        "url": "http://example.com/itemdirectory/search.html",
        "splitPattern": ".item-list .item",
        "urlPath": ".item a",
        "urlPathAttr": "href",
        "urlFormat": "https://example.com/product{0}"
    }
]

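The above configuration will result in four items being added to the index from the XML file (one for each book element matched by the split pattern), and six from the HTML file (one for each li element with the item class).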
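
The effect of the XML rule above can be approximated with Python's standard library. This is a sketch only: the plugin evaluates full XPath in Java, whereas ElementTree supports a limited subset and cannot evaluate an absolute path on an element, so the split pattern is written relative to the libraries root element. The {0} substitution again stands in for Java MessageFormat.

```python
import xml.etree.ElementTree as ET

SOURCE_XML = """<libraries>
    <library>
        <books>
            <book><name>Book1</name><link>/Book1.html</link></book>
            <book><name>Mybook1</name><link>/MyBook1.html</link></book>
        </books>
    </library>
    <library>
        <books>
            <book><name>Book2</name><link>/Book2.html</link></book>
            <book><name>Mybook2</name><link>/MyBook2.html</link></book>
        </books>
    </library>
</libraries>"""

URL_FORMAT = "https://example.com/bookstore/books{0}"

root = ET.fromstring(SOURCE_XML)

# Equivalent of splitPattern "/libraries/library/books/book",
# expressed relative to the <libraries> root for ElementTree.
books = root.findall("library/books/book")

# Equivalent of urlPath "//link", evaluated against each split document.
urls = [URL_FORMAT.replace("{0}", book.find(".//link").text) for book in books]

for url in urls:
    print(url)
```

Each matched book element becomes its own document, and its link value is inserted into the URL format to produce that document's URL.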

Change log

[1.0.1]

Added

  • Added the ability to source the URL from a location within the split document instead of receiving an auto-assigned URL.
