Plugin: Split XML or HTML
Other versions of this plugin may exist. Please ensure you are viewing the documentation for the version that you are currently using. If you are not running the latest version of your plugin we recommend upgrading. See: list of all available versions of this plugin.
Usage
Enable the plugin
Enable the split-html-xml-filter plugin on your data source from the extensions screen in the administration dashboard or add the following data source configuration to enable the plugin:
plugin.split-html-xml-filter.enabled=true
plugin.split-html-xml-filter.version=1.0.1
Add com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter to the filter chain:
filter.classes=<OTHER-FILTERS>:com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter:<OTHER-FILTERS>
The com.funnelback.plugin.splitxmlhtmlfilter.SplitHtmlXmlFilterStringFilter should be placed at an appropriate position in the filter chain. In most circumstances this should be located towards the start of the filter chain, but after any filters that convert source content into XML.
The plugin will take effect after a full update of the data source.
Plugin configuration settings
The following options can be set in the data source configuration to configure the plugin:
- plugin.split-html-xml-filter.config.defaultXMLSplit: Default split pattern for XML files processed by the update. It is used if the JSON config file is missing, or if there are no matching URLs within the JSON configuration. The default split path must be a valid XPath.
- plugin.split-html-xml-filter.config.defaultHTMLSplit: Default split pattern for HTML documents processed by the update. It is used if the JSON config file is missing, or if there are no matching URLs within the JSON configuration. The default split path must be a valid Jsoup selector, which is similar to a CSS selector.
- plugin.split-html-xml-filter.config.keepOriginal: Determines whether the original document is included in the set of split result documents. The default value is false, which means the original document is not included in the split results.
- plugin.split-html-xml-filter.config.defaultXMLUrlPath: (Optional) Default location from which to source the URL of each split document. The location is relative to a document after it has been split. If an absolute XPath is used, it must be based on the split document, not the original document. If this is not set, an auto-generated URL based on the document URL is assigned to the split documents. This default is used if there is no JSON config file, or if there are no matching URLs within the JSON configuration. The default URL path must be a valid XPath.
- plugin.split-html-xml-filter.config.defaultHTMLUrlPath: (Optional) Default location from which to source the URL of each split document. The location is relative to a document after it has been split. If an absolute path is used, it must be based on the split document. If this is not set, an auto-generated URL based on the document URL is assigned to the split documents. This default is used if there is no JSON config file, or if there are no matching URLs within the JSON configuration. The default URL path must be a valid Jsoup selector, which is similar to a CSS selector.
- plugin.split-html-xml-filter.config.defaultHtmlUrlPathAttribute: (Optional, only effective if plugin.split-html-xml-filter.config.defaultHTMLUrlPath is set) Default attribute of the element selected by the plugin.split-html-xml-filter.config.defaultHTMLUrlPath Jsoup selector from which to read the URL. If this is not set, the text inside the selected element is used. This default is used if the JSON config file is missing, or if there are no matching URLs within the JSON configuration.
- plugin.split-html-xml-filter.config.defaultXMLUrlFormat: (Optional, only effective if plugin.split-html-xml-filter.config.defaultXMLUrlPath is set) Default format of the generated URL. It uses Java MessageFormat to generate the text. The default format is {0}. This default is used if the JSON config file is missing, or if there are no matching URLs within the JSON configuration.
- plugin.split-html-xml-filter.config.defaultHTMLUrlFormat: (Optional, only effective if plugin.split-html-xml-filter.config.defaultHTMLUrlPath is set) Default format of the generated URL. It uses Java MessageFormat to generate the text. The default format is {0}. This default is used if the JSON config file is missing, or if there are no matching URLs within the JSON configuration.
If the XPath/Jsoup selector element does not exist or the generated URL is not a valid absolute URL, it will fall back to the default URL based on the document URL.
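The URL format and fallback behaviour described above can be illustrated with a minimal Java sketch. It is not the plugin's code: the class and method names are hypothetical, the fallback value is a placeholder because the exact shape of the auto-generated URL is not documented here, and the validity check (java.net.URI with a scheme and host) is only an assumption about what "valid absolute URL" means.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.text.MessageFormat;

// Hypothetical helper, for illustration only.
class UrlFormatSketch {

    // Substitutes the value extracted by the URL path into the {0} placeholder.
    static String formatUrl(String urlFormat, String extracted) {
        return MessageFormat.format(urlFormat, extracted);
    }

    // Returns the formatted URL if it looks like a valid absolute URL,
    // otherwise the supplied fallback (assumed validity rule: scheme and host present).
    static String resolve(String urlFormat, String extracted, String fallbackUrl) {
        if (extracted == null) {
            return fallbackUrl; // the selector matched nothing
        }
        String candidate = formatUrl(urlFormat, extracted.trim());
        try {
            URI uri = new URI(candidate);
            return (uri.isAbsolute() && uri.getHost() != null) ? candidate : fallbackUrl;
        } catch (URISyntaxException e) {
            return fallbackUrl;
        }
    }

    public static void main(String[] args) {
        String fallback = "<auto-generated URL>"; // placeholder, not the real format

        // A format with a host prefix turns a relative link into an absolute URL.
        System.out.println(resolve("https://example.com/bookstore/books{0}", "/Book1.html", fallback));

        // With the default format {0}, a relative link stays relative, so the fallback is used.
        System.out.println(resolve("{0}", "/Book1.html", fallback));
    }
}
```

Because the format is a MessageFormat pattern, a literal single quote in the format must be written as two single quotes, and literal braces must be quoted.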
Alternatively, the plugin can be configured to run on individual URLs through the use of a configuration file named plugin.split-html-xml-filter.rules.json, saved within the conf/plugin-configuration/split-html-xml-filter directory of the data source. This file can currently only be edited using WebDAV.
The content of the file must be valid JSON, starting with a top-level array. The JSON fields are:
- url: URL against which the split pattern will be applied. The URL must be an exact match to the URL of the document being processed for the pattern to apply. The URL must be unique. If the same URL is defined more than once, the last pattern read will apply.
- splitPattern: Pattern used to split the XML/HTML document.
- urlPath: (Optional) Pattern used to locate the path of the URL in the split document.
- urlPathAttr: (Optional, only effective if urlPath is set and the document is an HTML document) The attribute of the element selected by urlPath to use for the path of the URL.
- urlFormat: (Optional, only effective if urlPath is set) The format of the URL.
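The matching rules for the url field (exact match, last duplicate wins, defaults used when nothing matches) can be sketched as a simple map lookup. This is only an illustration under assumptions: the class, the method names and the use of string pairs to represent rules are hypothetical, and the plugin's internal data structures are not documented here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of per-URL rule selection.
class RuleLookupSketch {

    // Each rule is a {url, splitPattern} pair. Rules are read in order, so a
    // later rule with the same url overwrites an earlier one.
    static Map<String, String> indexByUrl(List<String[]> rules) {
        Map<String, String> byUrl = new LinkedHashMap<>();
        for (String[] rule : rules) {
            byUrl.put(rule[0], rule[1]);
        }
        return byUrl;
    }

    // Exact string match on the document URL; otherwise the configured default applies.
    static String splitPatternFor(String documentUrl, Map<String, String> byUrl, String defaultSplit) {
        return byUrl.getOrDefault(documentUrl, defaultSplit);
    }

    public static void main(String[] args) {
        Map<String, String> rules = indexByUrl(List.of(
            new String[] {"http://example.com/bookstore.xml", "/libraries/library"},
            new String[] {"http://example.com/bookstore.xml", "/libraries/library/books/book"}));

        // The second definition for the same URL wins.
        System.out.println(splitPatternFor("http://example.com/bookstore.xml", rules, "/default"));

        // A URL that differs in any way does not match, so the default is used.
        System.out.println(splitPatternFor("http://example.com/other.xml", rules, "/default"));
    }
}
```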
Examples
Example source input and configuration files are shown below:
Example: XML source input file
Example URL: http://example.com/bookstore.xml
<libraries>
  <library>
    <books>
      <book>
        <name>Book1</name>
        <link>/Book1.html</link>
      </book>
      <book>
        <name>Mybook1</name>
        <link>/MyBook1.html</link>
      </book>
    </books>
  </library>
  <library>
    <books>
      <book>
        <name>Book2</name>
        <link>/Book2.html</link>
      </book>
      <book>
        <name>Mybook2</name>
        <link>/MyBook2.html</link>
      </book>
    </books>
  </library>
</libraries>
Example: HTML source input file
Example URL: http://example.com/itemdirectory/search.html
<html>
  <body>
    <div>
      <ul class="item-list">
        <li class="item"> <a href="/items1.html">Item 1</a> </li>
        <li class="item"> <a href="/items2.html">Item 2</a> </li>
        <li class="item"> <a href="/items3.html">Item 3</a> </li>
        <li class="item"> <a href="/items4.html">Item 4</a> </li>
        <li class="item"> <a href="/items5.html">Item 5</a> </li>
        <li class="item"> <a href="/items6.html">Item 6</a> </li>
      </ul>
    </div>
  </body>
</html>
Example plugin configuration
Configuration file: plugin.split-html-xml-filter.rules.json
[
  {
    "url": "http://example.com/bookstore.xml",
    "splitPattern": "/libraries/library/books/book",
    "urlPath": "//link",
    "urlFormat": "https://example.com/bookstore/books{0}"
  },
  {
    "url": "http://example.com/itemdirectory/search.html",
    "splitPattern": ".item-list .item",
    "urlPath": ".item a",
    "urlPathAttr": "href",
    "urlFormat": "https://example.com/product{0}"
  }
]
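The above configuration will result in four items being added to the index from the XML file (one for each book element matched by the split pattern), and six from the HTML file (one for each li element with the class item).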
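To see how the XML rule plays out, the following self-contained Java sketch applies the same splitPattern, urlPath and urlFormat to the bookstore example using the standard javax.xml.xpath API. It is an approximation under assumptions, not the plugin's implementation: the class and method names are hypothetical, and copying each matched element into its own document is simply one way to evaluate the URL path relative to the split document.

```java
import java.io.StringReader;
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of splitting an XML document and deriving URLs.
class XPathSplitSketch {

    // The bookstore example, inlined so the sketch is self-contained.
    static final String XML =
        "<libraries>"
        + "<library><books>"
        + "<book><name>Book1</name><link>/Book1.html</link></book>"
        + "<book><name>Mybook1</name><link>/MyBook1.html</link></book>"
        + "</books></library>"
        + "<library><books>"
        + "<book><name>Book2</name><link>/Book2.html</link></book>"
        + "<book><name>Mybook2</name><link>/MyBook2.html</link></book>"
        + "</books></library>"
        + "</libraries>";

    static List<String> splitUrls(String xml, String splitPattern, String urlPath, String urlFormat) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document source = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Every node matched by the split pattern becomes one split document.
            NodeList matches = (NodeList) xpath.evaluate(splitPattern, source, XPathConstants.NODESET);
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                // Copy the match into its own document so that an absolute
                // path such as //link is resolved against the split document.
                Document part = builder.newDocument();
                part.appendChild(part.importNode(matches.item(i), true));

                String extracted = xpath.evaluate(urlPath, part); // text of the first match
                urls.add(MessageFormat.format(urlFormat, extracted));
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalStateException("Could not split document", e);
        }
    }

    public static void main(String[] args) {
        List<String> urls = splitUrls(XML, "/libraries/library/books/book", "//link",
            "https://example.com/bookstore/books{0}");
        urls.forEach(System.out::println); // four URLs, one per book element
    }
}
```

Evaluating //link against the original document instead of the copied fragment would return the first link in the whole file for every book, which is why the URL path has to be based on the split document.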