Plugin: Split XML or HTML
Purpose
Use this plugin if you have XML or HTML documents that you wish to split and index as separate search result items. The plugin splits HTML and XML documents based on XPath or CSS selector patterns.
Common usage scenarios
- You have an XML file that contains multiple items that should be indexed separately
Consider the following XML fragment that contains multiple book items:
<?xml version="1.0" encoding="utf-8" ?>
<library>
<book>
<title>To Kill A Mockingbird</title>
<author>Harper Lee</author>
<url>http://examplelib.com/classics/089748543343.html</url>
</book>
<book>
<title>Animal Farm</title>
<author>George Orwell</author>
<url>http://examplelib.com/classics/57234677556.html</url>
</book>
<book>
<title>Jane Eyre</title>
<author>Charlotte Brontë</author>
<url>http://examplelib.com/classics/466666367563.html</url>
</book>
...
</library>
The plugin can be used to split this XML file into individual book items, each of which would appear in the search as a distinct result.
- You have an HTML page that lists multiple items that should be indexed separately
Consider the following HTML page that lists out some upcoming events.
...
<li class="event">
<h4>Spring in the park</h4>
<cite>City Park, 5 Sep 2023, all day</cite>
<p>
Spend a relaxing day out in the park, with live music and food.</p>
</li>
<li class="event">
<h4>George Gershwin in concert</h4>
<cite>City Concert Hall, 5 Sep 2023, 8pm</cite>
<p>
Spend a relaxing day out in the park, with live music and food.</p>
<p>
Tickets: <a href="https://eventtickets.com/event/5757434255693278875">Buy tickets</a></p>
</li>
<li class="event">
...
The plugin can be configured to index each of these event entries as a separate item by using a CSS-style selector to target the event list items.
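The first scenario can be illustrated outside the plugin. The following is a minimal Python sketch, not the plugin's implementation: it uses the standard library's ElementTree module (which supports only a subset of XPath) to cut the library document shown above into one standalone XML string per book. The function name split_books and the trimmed sample data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the library fragment shown above (declaration and "..." omitted).
LIBRARY_XML = """<library>
<book>
<title>To Kill A Mockingbird</title>
<author>Harper Lee</author>
<url>http://examplelib.com/classics/089748543343.html</url>
</book>
<book>
<title>Animal Farm</title>
<author>George Orwell</author>
<url>http://examplelib.com/classics/57234677556.html</url>
</book>
<book>
<title>Jane Eyre</title>
<author>Charlotte Brontë</author>
<url>http://examplelib.com/classics/466666367563.html</url>
</book>
</library>"""


def split_books(xml_text):
    """Return one standalone XML string per <book> child of the root (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # "./book" is the ElementTree form of the XPath /library/book.
    return [ET.tostring(book, encoding="unicode").strip() for book in root.findall("./book")]


if __name__ == "__main__":
    for item in split_books(LIBRARY_XML):
        # Each item parses on its own, with <book> as its root element.
        book = ET.fromstring(item)
        print(book.findtext("title"), "->", book.findtext("url"))
```

Each returned string has <book> as its root element, which mirrors the idea that every split item is treated as a document in its own right.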
Splitting JSON
This plugin can also be used to split JSON documents by combining it with the built-in JSONToXML filter.
To set this up, ensure that the JSONToXML filter is set in your filter chain (filter.classes), and when you configure this plugin ensure that the SplitHtmlXmlFilterStringFilter filter runs after the JSONToXML filter.
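The ordering requirement can be checked mechanically. The sketch below is a hypothetical Python helper, not part of the product: it assumes the filter chain has already been read into a list of filter names in the order they run (how filter.classes is delimited is not covered here), and it uses the filter names exactly as they are written on this page.

```python
JSON_TO_XML = "JSONToXML"
SPLIT_FILTER = "com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter"


def split_runs_after_json(chain):
    """Return True if both filters are present and the split filter runs later.

    `chain` is an assumed representation: filter names listed in run order.
    """
    if JSON_TO_XML not in chain or SPLIT_FILTER not in chain:
        return False
    return chain.index(JSON_TO_XML) < chain.index(SPLIT_FILTER)


if __name__ == "__main__":
    print(split_runs_after_json([JSON_TO_XML, SPLIT_FILTER]))  # True
    print(split_runs_after_json([SPLIT_FILTER, JSON_TO_XML]))  # False
```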
Usage
Enable the plugin
- Select Plugins from the side navigation pane and click on the Split XML or HTML tile.
- From the Location section, select the data source on which you would like to enable this plugin from the Select a data source select list.
The plugin will take effect after the setup steps and an advanced > full update of the data source have completed.
Configuration settings
The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.
The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.
Keep the original document
Configuration key |
Data type | boolean
Default value |
Required | This setting is optional
Controls whether the original (source) document is kept and included in the index. The original document is usually discarded; to retain it, set this to 'Yes'.
Default XPath for splitting XML documents
Configuration key |
Data type | string
Required | This setting is optional
Defines an XPath, relative to the source document root, which is used to split any XML files processed during the update.
Any matching rules in the JSON configuration file will override this default value. The split path must be a valid XPath.
Default XPath for sourcing the URL of split XML documents
Configuration key |
Data type | string
Required | This setting is optional
This XPath defines the location, relative to the split document root, from which to source the split document’s URL.
If an absolute XPath is used, it must be based on the split document instead of the original document. If this is not set, an auto-generated URL based on the document URL will be assigned to the split documents.
Any matching rules in the JSON configuration file will override this default value. The default XML URL path must be a valid XPath.
Default URL format for XML
Configuration key |
Data type | string
Default value |
Required | This setting is optional
Defines the default format of the generated URL using Java MessageFormat syntax, where {0} is replaced by the value sourced from the split document.
Any matching rules in the JSON configuration file will override this default value.
This setting only applies if the default XPath for sourcing the URL of split XML documents is set.
Default Jsoup selector for splitting HTML documents
Configuration key |
Data type | string
Required | This setting is optional
Defines a Jsoup selector, relative to the source document root, which is used to split any HTML files processed during the update.
Any matching rules in the JSON configuration file will override this default value. The split path must be a valid Jsoup selector, which is similar to a CSS selector.
Default Jsoup selector for sourcing the URL of split HTML documents
Configuration key |
Data type | string
Required | This setting is optional
This Jsoup selector defines the location, relative to the split document's root, from which to source the split document's URL.
If an absolute selector is used, it must be based on the split document instead of the original document. If this is not set, an auto-generated URL based on the document URL will be assigned to the split documents.
Any matching rules in the JSON configuration file will override this default value. The URL selector must be a valid Jsoup selector, which is similar to a CSS selector.
Default Jsoup attribute for sourcing the URL of split HTML documents
Configuration key |
Data type | string
Required | This setting is optional
Defines the attribute to read from the element matched by the Jsoup URL selector when assigning the document URL.
If this is not set, the text inside the matched element is used instead.
Any matching rules in the JSON configuration file will override this default value.
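The difference between sourcing the URL from an attribute and sourcing it from the element text can be seen in a small sketch. This is illustrative Python built on the standard library's html.parser, not Jsoup and not the plugin's code: the hard-coded matching of a elements inside li elements with class item stands in for a selector such as ".item a", and the class name ItemLinkParser, the function url_sources, and the sample markup (modeled on the item list in the Examples section) are assumptions.

```python
from html.parser import HTMLParser

SAMPLE_HTML = (
    '<ul class="item-list">'
    '<li class="item"> <a href="/items1.html">Item 1</a> </li>'
    '<li class="item"> <a href="/items2.html">Item 2</a> </li>'
    '</ul>'
)


class ItemLinkParser(HTMLParser):
    """Collects attributes and text of each <a> found inside <li class="item">."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.in_link = False
        self.links = []  # one {"attrs": {...}, "text": "..."} entry per matched <a>

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "li" and "item" in (attr_map.get("class") or "").split():
            self.in_item = True
        elif tag == "a" and self.in_item:
            self.in_link = True
            self.links.append({"attrs": attr_map, "text": ""})

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_link:
            self.links[-1]["text"] += data


def url_sources(html, attr=None):
    """Return the attribute value if `attr` is given, else the link text."""
    parser = ItemLinkParser()
    parser.feed(html)
    parser.close()
    if attr:
        return [link["attrs"].get(attr) for link in parser.links]
    return [link["text"].strip() for link in parser.links]


if __name__ == "__main__":
    print(url_sources(SAMPLE_HTML, "href"))  # ['/items1.html', '/items2.html']
    print(url_sources(SAMPLE_HTML))          # ['Item 1', 'Item 2']
```

For markup like this, the attribute is the useful choice, because the link text is a label rather than a path.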
Default URL format for HTML
Configuration key |
Data type | string
Default value |
Required | This setting is optional
Defines the default format of the generated URL using Java MessageFormat syntax, where {0} is replaced by the value sourced from the split document.
Any matching rules in the JSON configuration file will override this default value.
This setting only applies if the default Jsoup selector for sourcing the URL of split HTML documents is set.
Filter chain configuration
This plugin uses filters to apply transformations to the gathered content.
The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.
Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.
Filter classes
This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter
Drag the com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter plugin filter to where you wish it to run in the filter chain sequence.
Configuration files
This plugin also uses the following configuration files to provide additional configuration.
rules.json
Description | Use this configuration file if you need to set split rules that apply to specific URLs.
Configuration file format | json
rules.json
[
{
"url": "http://example.com/bookstore.xml",
"splitPattern": "/libraries/library/books/book",
"urlPath": "//link",
"urlFormat": "https://example.com/bookstore/books{0}"
},
{
"url": "http://example.com/itemdirectory/search.html",
"splitPattern": ".item-list .item",
"urlPath": ".item a",
"urlPathAttr": "href",
"urlFormat": "https://example.com/product{0}"
}
]
The contents of this file must be valid JSON, starting with a top-level array. You should check your file with a JSON validator before uploading. If you upload a malformed JSON file your data source update will fail.
The JSON fields are:
- url: URL against which the split pattern will be applied. The URL must exactly match the URL of the source document being processed for the pattern to apply. The URL must be unique. If the same URL is defined more than once, the last pattern read will apply.
- splitPattern: Pattern used to split the XML/HTML document. For XML documents the split pattern must be a valid XPath. For HTML documents the split pattern must be a valid Jsoup selector.
- urlPath: (Optional) Pattern used to locate the source of the URL to assign to the split document. For XML documents the pattern must be a valid XPath. For HTML documents the pattern must be a valid Jsoup selector.
- urlPathAttr: (Optional; only effective if urlPath is set and the document is an HTML document) The attribute to read, in conjunction with the urlPath selector, when assigning the document URL.
- urlFormat: (Optional; only effective if urlPath is set) Defines the format of the generated URL using Java Message Format.
If you upload this file via WebDAV then this file should be saved within the conf/plugin-configuration/split-html-xml-filter directory of the data source.
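Because a malformed file causes the update to fail, it can help to check rules.json before uploading it. The following Python sketch is one possible pre-upload check, not a tool shipped with the plugin. It tests only what this page states: valid JSON, a top-level array, unique url values, and the dependency of urlPathAttr and urlFormat on urlPath. Treating url and splitPattern as required is an assumption, based on the field list not marking them optional. The function name check_rules is hypothetical.

```python
import json

REQUIRED = ("url", "splitPattern")           # assumed required: not marked optional above
NEEDS_URL_PATH = ("urlPathAttr", "urlFormat")


def check_rules(text):
    """Return a list of problems found in rules.json content (empty if none)."""
    try:
        rules = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(rules, list):
        return ["top-level value must be an array"]

    problems = []
    seen_urls = set()
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule {i}: must be an object")
            continue
        for field in REQUIRED:
            if not isinstance(rule.get(field), str) or not rule[field]:
                problems.append(f"rule {i}: missing {field}")
        url = rule.get("url")
        if isinstance(url, str):
            if url in seen_urls:
                # Per the field description, only the last rule for a URL applies.
                problems.append(f"rule {i}: duplicate url {url}")
            seen_urls.add(url)
        if "urlPath" not in rule:
            for field in NEEDS_URL_PATH:
                if field in rule:
                    problems.append(f"rule {i}: {field} has no effect without urlPath")
    return problems


if __name__ == "__main__":
    with open("rules.json", encoding="utf-8") as handle:
        for problem in check_rules(handle.read()):
            print(problem)
```

The check does not validate the XPath or Jsoup selector syntax itself, so a file that passes may still contain patterns the plugin rejects.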
Examples
The example below shows how to configure the Split XML or HTML plugin to split a specific XML file and a specific HTML file if they are found during the update.
Consider a web data source that crawls a set of files including the XML and HTML files shown below:
http://example.com/bookstore.xml
<libraries>
<library>
<books>
<book>
<name>Book1</name>
<link>/Book1.html</link>
</book>
<book>
<name>Mybook1</name>
<link>/MyBook1.html</link>
</book>
</books>
</library>
<library>
<books>
<book>
<name>Book2</name>
<link>/Book2.html</link>
</book>
<book>
<name>Mybook2</name>
<link>/MyBook2.html</link>
</book>
</books>
</library>
</libraries>
http://example.com/itemdirectory/search.html
<html>
<body>
<div>
<ul class="item-list">
<li class="item"> <a href="/items1.html">Item 1</a> </li>
<li class="item"> <a href="/items2.html">Item 2</a> </li>
<li class="item"> <a href="/items3.html">Item 3</a> </li>
<li class="item"> <a href="/items4.html">Item 4</a> </li>
<li class="item"> <a href="/items5.html">Item 5</a> </li>
<li class="item"> <a href="/items6.html">Item 6</a> </li>
</ul>
</div>
</body>
</html>
The web data source can be configured to split both of these pages so that the <book> elements from the XML file and the <li class="item"> items from the HTML file are all indexed as separate items.
Setting the following rules in the plugin configuration:
rules.json
[
{
"url": "http://example.com/bookstore.xml",
"splitPattern": "/libraries/library/books/book",
"urlPath": "//link",
"urlFormat": "https://example.com/bookstore/books{0}"
},
{
"url": "http://example.com/itemdirectory/search.html",
"splitPattern": ".item-list .item",
"urlPath": ".item a",
"urlPathAttr": "href",
"urlFormat": "https://example.com/product{0}"
}
]
will result in four items being added to the index from the XML file (one for each <book> element), and six from the HTML file.
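The effect of the XML rule can be approximated with a short Python sketch. This is an illustration under stated assumptions, not the plugin's code: ElementTree supports only a subset of XPath, so the absolute splitPattern is rewritten relative to the root element and "//link" is evaluated as ".//link" on each split item; Python's str.format stands in for Java Message Format, which behaves the same way for a plain {0} placeholder but differs in other details. The function name split_urls is hypothetical, and the auto-generated URL used when no link is found is not modeled.

```python
import xml.etree.ElementTree as ET

# Copy of http://example.com/bookstore.xml as shown above.
BOOKSTORE_XML = """<libraries>
<library><books>
<book><name>Book1</name><link>/Book1.html</link></book>
<book><name>Mybook1</name><link>/MyBook1.html</link></book>
</books></library>
<library><books>
<book><name>Book2</name><link>/Book2.html</link></book>
<book><name>Mybook2</name><link>/MyBook2.html</link></book>
</books></library>
</libraries>"""

XML_RULE = {
    "url": "http://example.com/bookstore.xml",
    "splitPattern": "/libraries/library/books/book",
    "urlPath": "//link",
    "urlFormat": "https://example.com/bookstore/books{0}",
}


def split_urls(xml_text, rule):
    """Return the URL that would be assigned to each split item (None if no link)."""
    root = ET.fromstring(xml_text)
    prefix = "/" + root.tag
    if not rule["splitPattern"].startswith(prefix + "/"):
        raise ValueError("splitPattern does not start at the document root")
    # "/libraries/library/books/book" -> "./library/books/book"
    relative_split = "." + rule["splitPattern"][len(prefix):]
    urls = []
    for item in root.findall(relative_split):
        link = item.find("." + rule["urlPath"])  # ".//link", relative to the split item
        if link is None:
            urls.append(None)  # the plugin's auto-generated URL is not modeled here
        else:
            urls.append(rule["urlFormat"].format(link.text))
    return urls


if __name__ == "__main__":
    for url in split_urls(BOOKSTORE_XML, XML_RULE):
        print(url)
```

Note that urlPath is evaluated against each split <book> item rather than against the original document, which is why "//link" yields a different value for every item.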