Plugin: Split XML or HTML

Purpose

Use this plugin if you have XML or HTML documents that you wish to split and index as separate search result items. The plugin splits HTML and XML documents based on XPath or CSS selector patterns.

Common usage scenarios

You have an XML file that contains multiple items that should be indexed separately

Consider the following XML fragment that contains multiple book items:

<?xml version="1.0" encoding="utf-8" ?>
<library>
    <book>
        <title>To Kill A Mockingbird</title>
        <author>Harper Lee</author>
        <url>http://examplelib.com/classics/089748543343.html</url>
    </book>
    <book>
        <title>Animal Farm</title>
        <author>George Orwell</author>
        <url>http://examplelib.com/classics/57234677556.html</url>
    </book>
    <book>
        <title>Jane Eyre</title>
        <author>Charlotte Brontë</author>
        <url>http://examplelib.com/classics/466666367563.html</url>
    </book>
...
</library>

The plugin can be used to split this XML file into individual book items, each of which would appear in the search as a distinct result.
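As a rough illustration of the idea only (the plugin itself runs as a filter inside the data source update, and this is not its implementation), the Python sketch below splits a shortened copy of the fragment above into one XML string per book. The helper name split_items and the ./book path are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the library fragment above (two books only).
LIBRARY_XML = """<library>
    <book>
        <title>To Kill A Mockingbird</title>
        <author>Harper Lee</author>
        <url>http://examplelib.com/classics/089748543343.html</url>
    </book>
    <book>
        <title>Animal Farm</title>
        <author>George Orwell</author>
        <url>http://examplelib.com/classics/57234677556.html</url>
    </book>
</library>"""


def split_items(xml_text, path):
    """Return one standalone XML string per element matched by `path`.

    Hypothetical helper: `path` uses ElementTree's limited XPath subset
    and is evaluated relative to the root element.
    """
    root = ET.fromstring(xml_text)
    return [ET.tostring(el, encoding="unicode").strip() for el in root.findall(path)]


books = split_items(LIBRARY_XML, "./book")
for book in books:
    # Each split item is a complete XML document in its own right.
    print(ET.fromstring(book).findtext("title"))
```

Each string in books could then be treated as an independent document, which is the effect the plugin has on the index.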

You have an HTML page that lists out multiple items that should be indexed separately

Consider the following HTML page that lists out some upcoming events.

...
    <li class="event">
        <h4>Spring in the park</h4>
        <cite>City Park, 5 Sep 2023, all day</cite>
        <p>
        Spend a relaxing day out in the park, with live music and food.</p>
    </li>
    <li class="event">
        <h4>George Gershwin in concert</h4>
        <cite>City Concert Hall, 5 Sep 2023, 8pm</cite>
        <p>
        Spend a relaxing day out in the park, with live music and food.</p>
        <p>
        Tickets: <a href="https://eventtickets.com/event/5757434255693278875">Buy tickets</a></p>
    </li>
    <li class="event">
 ...

The plugin can be configured to index each of these event entries as separate items by using a CSS style selector to target the list of event items.
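To make the selector idea concrete, here is a minimal Python sketch that mimics what a selector such as li.event would pick out. It uses only the standard library rather than Jsoup, the class name EventSplitter is hypothetical, and the wrapping ul element is assumed because the page markup above is abbreviated.

```python
from html.parser import HTMLParser


class EventSplitter(HTMLParser):
    """Collect the text of each <li class="event"> element as one item.

    Simplification: unclosed void tags such as <br> inside an event would
    upset the depth counter, so this only suits tidy markup.
    """

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0  # 0 means "not inside an event <li>"
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1
        elif tag == "li" and "event" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.items.append(" ".join(self._parts))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._parts.append(data.strip())


# Shortened copy of the events markup above.
EVENTS_HTML = """<ul>
    <li class="event">
        <h4>Spring in the park</h4>
        <cite>City Park, 5 Sep 2023, all day</cite>
    </li>
    <li class="event">
        <h4>George Gershwin in concert</h4>
        <cite>City Concert Hall, 5 Sep 2023, 8pm</cite>
    </li>
</ul>"""

splitter = EventSplitter()
splitter.feed(EVENTS_HTML)
splitter.close()
for item in splitter.items:
    print(item)
```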

Splitting JSON

This plugin can also be used to split JSON documents by combining it with the built-in JSONToXML filter.

To set this up, ensure that the JSONToXML filter is set in your filter chain (filter.classes) and that the SplitHtmlXmlFilterStringFilter filter runs after the JSONToXML filter.

Added metadata

This plugin adds three additional metadata fields to the split content:

  • X-Funnelback-SplitHtmlXmlOriginalUrl: This contains the URL of the original document prior to splitting.

  • X-Funnelback-SplitHtmlXmlSplitPath: This contains the element path that was used to split the document.

  • X-Funnelback-SplitHtmlXmlExtractedItem: This is set to true if this item was generated as a result of the splitter running.

These fields will be available for mapping on the metadata mappings configuration screen, as header metadata.

Usage

Enable the plugin

  1. Select Plugins from the side navigation pane and click on the Split XML or HTML tile.

  2. From the Location section, select the data source on which you would like to enable this plugin from the Select a data source select list.

The plugin takes effect once the setup steps are complete and an advanced > full update of the data source has finished.

Configuration settings

The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.

The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.

Keep the original document

Configuration key

plugin.split-html-xml-filter.config.keepOriginal

Data type

boolean

Default value

false

Required

This setting is optional

Should the original (source) document be kept and included in the index? The original document is usually discarded; however, it can be retained by setting this to 'Yes'.

Default XPath for splitting XML documents

Configuration key

plugin.split-html-xml-filter.config.defaultXMLSplit

Data type

string

Required

This setting is optional

Defines an XPath, relative to the source document root, which is used to split any XML files processed during the update.

Any matching rules in the JSON configuration file will override this default value. The split path must be a valid XPath.

Default XPath for sourcing the URL of split XML documents

Configuration key

plugin.split-html-xml-filter.config.defaultXMLUrlPath

Data type

string

Required

This setting is optional

This XPath defines the location, relative to the split document root, from which to source the split document’s URL.

If an absolute XPath is used, it must be based on the split document instead of the original document. If this is not set, an auto-generated URL based on the document URL will be assigned to the split documents.

Any matching rules in the JSON configuration file will override this default value. The default XML URL path must be a valid XPath.

Default URL format for XML

Configuration key

plugin.split-html-xml-filter.config.defaultXMLUrlFormat

Data type

string

Default value

{0}

Required

This setting is optional

Defines the default format of the generated URL using Java Message Format.

Any matching rules in the JSON configuration file will override this default value.

This setting only applies if the default XPath for sourcing the URL of split XML documents is set.
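The interaction between the URL path and the URL format can be sketched as follows. This is an approximation under stated assumptions: build_url is a hypothetical helper, the sample book element is invented for the example, and Python's str.format stands in for Java Message Format (the two only behave alike for a plain {0} placeholder).

```python
import xml.etree.ElementTree as ET


def build_url(split_xml, url_path, url_format="{0}"):
    """Read the text at `url_path` inside one split item and format it.

    Returns None when the path matches nothing; what the plugin does in
    that case is not modelled here.
    """
    item = ET.fromstring(split_xml)
    # The path is resolved against the split item, not the original file.
    value = item.findtext(url_path)
    if value is None:
        return None
    # str.format is a stand-in for Java Message Format (assumption).
    return url_format.format(value.strip())


# Hypothetical split item containing its own link.
book = "<book><name>Book1</name><link>/Book1.html</link></book>"

print(build_url(book, "./link", "https://example.com/bookstore/books{0}"))
print(build_url(book, "./link"))
```

With the default format of {0}, the extracted value is used unchanged, so the source element would need to hold a complete URL.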

Default Jsoup selector for splitting HTML documents

Configuration key

plugin.split-html-xml-filter.config.defaultHTMLSplit

Data type

string

Required

This setting is optional

Defines a Jsoup selector, relative to the source document root, which is used to split any HTML files processed during the update.

Any matching rules in the JSON configuration file will override this default value. The split path must be a valid Jsoup selector, which is similar to a CSS selector.

Default Jsoup selector for sourcing the URL of split HTML documents

Configuration key

plugin.split-html-xml-filter.config.defaultHTMLUrlPath

Data type

string

Required

This setting is optional

This Jsoup selector defines the location, relative to the split document’s root, from which to source the split document’s URL.

If an absolute selector is used, it must be based on the split document. If this is not set, an auto-generated URL based on the document URL will be assigned to the split documents.

Any matching rules in the JSON configuration file will override this default value. The URL path must be a valid Jsoup selector, which is similar to a CSS selector.

Default Jsoup attribute for sourcing the URL of split HTML documents

Configuration key

plugin.split-html-xml-filter.config.defaultHtmlUrlPathAttribute

Data type

string

Required

This setting is optional

Defines the attribute to use in conjunction with the defined Jsoup selector when assigning the document URL.

If this is not set, it will use the text inside the Jsoup selector.

Any matching rules in the JSON configuration file will override this default value.

Default URL format for HTML

Configuration key

plugin.split-html-xml-filter.config.defaultHTMLUrlFormat

Data type

string

Default value

{0}

Required

This setting is optional

Defines the default format of the generated URL using Java Message Format.

Any matching rules in the JSON configuration file will override this default value.

This setting only applies if the default Jsoup selector for sourcing the URL of split HTML documents is set.
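The difference between sourcing the URL from an attribute and from the element text can be sketched in Python as below. This is illustrative only: a bare tag name stands in for a full Jsoup selector, the names UrlSource and item_url are hypothetical, and str.format approximates Java Message Format for a plain {0} placeholder.

```python
from html.parser import HTMLParser


class UrlSource(HTMLParser):
    """Find the first <tag> element and record its attribute value or text."""

    def __init__(self, tag, attribute=None):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.value = None
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        if tag != self.tag or self.value is not None:
            return
        if self.attribute is not None:
            # An attribute was configured: take its value.
            self.value = dict(attrs).get(self.attribute)
        else:
            # No attribute configured: take the text inside the element.
            self._capturing = True

    def handle_data(self, data):
        if self._capturing:
            self.value = data.strip()
            self._capturing = False


def item_url(fragment, tag, attribute=None, url_format="{0}"):
    """Return the formatted URL for one split HTML item, or None."""
    parser = UrlSource(tag, attribute)
    parser.feed(fragment)
    parser.close()
    return None if parser.value is None else url_format.format(parser.value)


item = '<li class="item"> <a href="/items1.html">Item 1</a> </li>'
print(item_url(item, "a", "href", "https://example.com/product{0}"))
print(item_url(item, "a"))  # no attribute: falls back to the link text
```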

Default include HTML title for all split HTML documents

Configuration key

plugin.split-html-xml-filter.config.includeHTMLTitle

Data type

boolean

Default value

true

Required

This setting is optional

Determines if the original (source) document’s title tag is added to all the split documents.

Default include HTML metadata for all split HTML documents

Configuration key

plugin.split-html-xml-filter.config.includeHTMLMeta

Data type

boolean

Default value

true

Required

This setting is optional

Determines if the source document metadata, contained within <meta> tags, is included in the split documents.

XML namespaces are not supported due to a limitation in the Jsoup library.

However, you can still split a document with namespaced keys by omitting the namespace in your XPaths.

For example, if the XML has a namespace like xmlns:ns="http://www.example.com", the XPath should be /root/element instead of /root/ns:element.

See: Split an XML file with elements that contain a namespace for a detailed example.
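If you have existing XPaths that carry prefixes, a small helper along these lines could rewrite them. This is a sketch for simple location paths only; the function name is hypothetical, and prefixes inside predicates, attributes, or after an explicit axis are deliberately left untouched.

```python
import re

# A prefix such as "ns:" at the start of a path step. The negative
# lookahead avoids touching an axis such as "child::".
_STEP_PREFIX = re.compile(r"(?:^|(?<=/))[A-Za-z_][\w.-]*:(?!:)")


def strip_prefixes(xpath):
    """Remove namespace prefixes from the steps of a simple XPath."""
    return _STEP_PREFIX.sub("", xpath)


print(strip_prefixes("/root/ns:element"))
print(strip_prefixes("/libraries/ns:library/ns:books/ns:book"))
```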

Filter chain configuration

This plugin uses filters which are used to apply transformations to the gathered content.

The filters run in sequence and need to be set in an order that makes sense. The plugin supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.

Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.

Filter classes

This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter

Drag the com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter plugin filter to where you wish it to run in the filter chain sequence.

Configuration files

This plugin also uses the following configuration files to provide additional configuration.

rules.json

Description

Use this configuration file if you need to set split rules that apply to specific URLs.

Configuration file format

json

Configuration file example: rules.json
[
    {
        "url": "http://example.com/bookstore.xml",
        "splitPattern": "/libraries/library/books/book",
        "urlPath": "//link",
        "urlFormat": "https://example.com/bookstore/books{0}"
    },
    {
        "url": "http://example.com/itemdirectory/search.html",
        "splitPattern": ".item-list .item",
        "urlPath": ".item a",
        "urlPathAttr": "href",
        "urlFormat": "https://example.com/product{0}",
        "includeTitle": true,
        "includeMetadata": true
    }
]
The contents of this file must be valid JSON, starting with a top-level array. You should check your file with a JSON validator before uploading. If you upload a malformed JSON file your data source update will fail.
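A quick local check before uploading might look like the Python sketch below. It is not part of the plugin; check_rules is a hypothetical helper, and treating url and splitPattern as mandatory is an assumption based on the other fields being the only ones marked optional.

```python
import json

# Assumed to be mandatory because the other fields are marked optional.
REQUIRED = ("url", "splitPattern")


def check_rules(text):
    """Return a list of problems found in rules.json content (empty if none)."""
    try:
        rules = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg} (line {err.lineno})"]
    if not isinstance(rules, list):
        return ["top-level value must be an array"]
    problems = []
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule {i}: must be an object")
            continue
        for key in REQUIRED:
            if not isinstance(rule.get(key), str) or not rule[key]:
                problems.append(f"rule {i}: missing {key}")
    return problems


sample = '[{"url": "http://example.com/bookstore.xml", "splitPattern": "/libraries/library/books/book"}]'
print(check_rules(sample))
```

An empty list means the content passed these basic checks; it does not confirm that the XPath or selector values themselves are valid.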

The JSON fields are:

  • url: URL against which the split pattern will be applied. The URL must be an exact match to the URL of the source document being processed for the pattern to apply. The URL must be unique. If the same URL is defined more than once, the last pattern read will apply.

  • splitPattern: Pattern used to split the XML/HTML document. For XML documents the split pattern must be a valid XPath 1.0 expression. Note: because namespaces are disabled, queries are expressed using the element’s local name only. For HTML documents the split pattern must be a valid Jsoup selector.

  • urlPath: (Optional) Pattern used to locate the source of the URL to assign to the split document. For XML documents the pattern must be a valid XPath. For HTML documents the pattern must be a valid Jsoup selector.

  • urlPathAttr: (Optional, is only effective if urlPath is set and the document is an HTML document) The attribute to use in conjunction with the urlPath selector when assigning the document URL.

  • urlFormat: (Optional, is only effective if urlPath is set) Defines the format of the generated URL using Java Message Format.

  • includeTitle: (Optional, is only effective if the document is an HTML document) If set to true the title of the HTML document will be included in the split document. This will override the global configuration setting - Default include HTML title for all split HTML documents.

  • includeMetadata: (Optional, is only effective if the document is an HTML document) If set to true the metadata of the HTML document will be included in the split document. This will override the global configuration setting - Default include HTML metadata for all split HTML documents.
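The matching and override behaviour described in the field list can be sketched in Python as follows. The structure is an assumption made for illustration: the plugin-wide defaults are represented here with rule field names, and index_rules and settings_for are hypothetical helpers rather than anything the plugin exposes.

```python
# Hypothetical stand-ins for the plugin-wide defaults, keyed by rule field name.
DEFAULTS = {"urlFormat": "{0}", "includeTitle": True, "includeMetadata": True}


def index_rules(rules):
    """Map each URL to its rule; a later duplicate replaces an earlier one."""
    by_url = {}
    for rule in rules:
        by_url[rule["url"]] = rule
    return by_url


def settings_for(url, by_url, defaults=DEFAULTS):
    """Merge the rule for `url` (exact match only) over the defaults."""
    rule = by_url.get(url)
    if rule is None:
        return dict(defaults)
    return {**defaults, **rule}


RULES = [
    {"url": "http://example.com/itemdirectory/search.html",
     "splitPattern": ".item-list"},
    # Same URL defined again: this later entry is the one that applies.
    {"url": "http://example.com/itemdirectory/search.html",
     "splitPattern": ".item-list .item", "includeTitle": False},
]

by_url = index_rules(RULES)
settings = settings_for("http://example.com/itemdirectory/search.html", by_url)
print(settings["splitPattern"], settings["includeTitle"])
```

Because matching is exact, a URL that differs only by a query string or trailing slash would not pick up the rule in this sketch.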

If you upload this file via WebDAV then this file should be saved within the conf/plugin-configuration/split-html-xml-filter directory of the data source.

Examples

The example below shows configuration of the XML or HTML split plugin to split a specific XML file and HTML file if found during the update.

Consider a web data source that crawls a set of files including the XML and HTML files shown below:

XML file: http://example.com/bookstore.xml
<?xml version="1.0" encoding="UTF-8"?>
<libraries>
    <library>
        <books>
            <book>
                <name>Book1</name>
                <link>/Book1.html</link>
            </book>
            <book>
                <name>Mybook1</name>
                <link>/MyBook1.html</link>
            </book>
        </books>
    </library>
    <library>
        <books>
            <book>
                <name>Book2</name>
                <link>/Book2.html</link>
            </book>
            <book>
                <name>Mybook2</name>
                <link>/MyBook2.html</link>
            </book>
        </books>
    </library>
</libraries>
HTML file: http://example.com/itemdirectory/search.html
<!DOCTYPE html>
<html>
    <head>
        <title>List of items</title>
        <meta name="type" content="examples"/>
    </head>
    <body>
        <div>
            <ul class="item-list">
                <li class="item"> <a href="/items1.html">Item 1</a> </li>
                <li class="item"> <a href="/items2.html">Item 2</a> </li>
                <li class="item"> <a href="/items3.html">Item 3</a> </li>
                <li class="item"> <a href="/items4.html">Item 4</a> </li>
                <li class="item"> <a href="/items5.html">Item 5</a> </li>
                <li class="item"> <a href="/items6.html">Item 6</a> </li>
            </ul>
        </div>
    </body>
</html>

The web data source can be configured to split both of these pages so that the <book> elements from the XML file and the <li class="item"> items from the HTML are all indexed as separate items.

By default, the split HTML documents will include the <title> of the source document in each generated document. This can be turned off by setting Default include HTML title for all split HTML documents to no when configuring your plugin. Similarly, the Default include HTML metadata for all split HTML documents setting controls whether the metadata from <meta> tags is included.

Setting the following rules in the plugin configuration:

Plugin configuration file: rules.json
[
    {
        "url": "http://example.com/bookstore.xml",
        "splitPattern": "/libraries/library/books/book",
        "urlPath": "//link",
        "urlFormat": "https://example.com/bookstore/books{0}"
    },
    {
        "url": "http://example.com/itemdirectory/search.html",
        "splitPattern": ".item-list .item",
        "urlPath": ".item a",
        "urlPathAttr": "href",
        "urlFormat": "https://example.com/product{0}",
        "includeTitle": true,
        "includeMetadata": true
    }
]

will result in four items being added to the index from the XML file (one for each <book> element), and six from the HTML file.
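To check what these rules select, the Python sketch below applies equivalent logic to compact copies of the two sample files and prints one URL per split item. It is a standard-library approximation, not the plugin: ElementTree paths are relative to the root element, a small parser stands in for the Jsoup selectors, and f-strings approximate the {0} URL format.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Compact copies of the two sample files above.
BOOKSTORE_XML = (
    "<libraries>"
    "<library><books>"
    "<book><name>Book1</name><link>/Book1.html</link></book>"
    "<book><name>Mybook1</name><link>/MyBook1.html</link></book>"
    "</books></library>"
    "<library><books>"
    "<book><name>Book2</name><link>/Book2.html</link></book>"
    "<book><name>Mybook2</name><link>/MyBook2.html</link></book>"
    "</books></library>"
    "</libraries>"
)
SEARCH_HTML = '<ul class="item-list">' + "".join(
    f'<li class="item"> <a href="/items{n}.html">Item {n}</a> </li>'
    for n in range(1, 7)
) + "</ul>"


class ItemLinks(HTMLParser):
    """Collect the href of each <a> found inside an <li class="item">."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "item" in (attrs.get("class") or "").split():
            self._in_item = True
        elif tag == "a" and self._in_item and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False


# /libraries/library/books/book becomes ./library/books/book because
# ElementTree evaluates paths relative to the <libraries> root.
books = ET.fromstring(BOOKSTORE_XML).findall("./library/books/book")
book_urls = [f"https://example.com/bookstore/books{b.findtext('.//link')}" for b in books]

links = ItemLinks()
links.feed(SEARCH_HTML)
links.close()
item_urls = [f"https://example.com/product{href}" for href in links.hrefs]

for url in book_urls + item_urls:
    print(url)
```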

Split an XML file with elements that contain a namespace

Consider a web data source that crawls a set of files including the XML file shown below:

XML file: http://example.com/bookstore.xml
<?xml version="1.0" encoding="UTF-8"?>
<libraries xmlns:ns="http://www.example.com">
    <ns:library>
        <ns:books>
            <ns:book>
                <name>Book1</name>
                <link>/Book1.html</link>
            </ns:book>
            <ns:book>
                <name>Mybook1</name>
                <link>/MyBook1.html</link>
            </ns:book>
        </ns:books>
    </ns:library>
    <ns:library>
        <ns:books>
            <ns:book>
                <name>Book2</name>
                <link>/Book2.html</link>
            </ns:book>
            <ns:book>
                <name>Mybook2</name>
                <link>/MyBook2.html</link>
            </ns:book>
        </ns:books>
    </ns:library>
</libraries>
The XPath used for the Default XPath for splitting XML documents should not include the namespace. For example, the Default XPath for splitting XML documents for the XML file above should be /libraries/library/books/book instead of /libraries/ns:library/ns:books/ns:book.
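The effect of matching on local names only can be reproduced in Python with the sketch below, which rewrites each tag to its local name before applying the prefix-free path. This only illustrates the matching behaviour described above; it says nothing about how the plugin achieves it internally, and strip_namespaces is a hypothetical helper. The sample is a shortened copy of the file above.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Rewrite every tag in place to its local name, e.g. {uri}book -> book."""
    for el in root.iter():
        if isinstance(el.tag, str) and "}" in el.tag:
            el.tag = el.tag.split("}", 1)[1]
    return root


# Shortened copy of the namespaced file above (first library only).
NAMESPACED_XML = (
    '<libraries xmlns:ns="http://www.example.com">'
    "<ns:library><ns:books>"
    "<ns:book><name>Book1</name><link>/Book1.html</link></ns:book>"
    "<ns:book><name>Mybook1</name><link>/MyBook1.html</link></ns:book>"
    "</ns:books></ns:library>"
    "</libraries>"
)

root = strip_namespaces(ET.fromstring(NAMESPACED_XML))
# Relative form of /libraries/library/books/book, with no prefixes.
books = root.findall("./library/books/book")
names = [b.findtext("name") for b in books]
print(names)
```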

Change log

[3.0.0]

Added

  • Added the ability to include a source document’s <meta> tags in split documents when splitting an HTML document.

  • Added the ability to process an XML file with namespaced XML fields.

  • Added the ability to configure Default include HTML title for all split HTML documents and Default include HTML metadata for all split HTML documents in a configuration file.

Changed

  • Upgraded the XPath processing to use native Jsoup XPath support, enabling the processing of namespaced XML tags. Existing XPaths should continue to work, however, due to the change in the underlying implementation of how XPaths are handled we recommend you carefully check your search after upgrading to this version.

[2.0.0]

Changed

  • The additional metadata appended to item records (X-Funnelback-SplitHtmlXmlOriginalUrl, X-Funnelback-SplitHtmlXmlSplitPath and X-Funnelback-SplitHtmlXmlExtractedItem) is no longer inserted into the document content but is provided as header metadata. If you have mapped these values then you may need to update your metadata mappings to reference this data from headers instead.

Fixed

  • Fixed a bug where the Split XML plugin was not generating valid XML.

[1.3.0]

Changed

  • Added the ability to exclude the HTML title tag.

[1.2.0]

Changed

  • Upgraded the Jsoup dependency from v1.14.3 to v1.17.1.

[1.1.0]

Changed

  • Updated to the latest version of the plugin framework (Funnelback shared v16.20) to enable integration with the new plugin management dashboard.

[1.0.1]

Added

  • Added the ability to source the URL from a location within the split document instead of receiving an auto-assigned URL.