Plugin: Wrap XML element in html tags

Purpose

This plugin wraps the content of specified XML elements in <html>...</html> tags so that PADRE can index them as inner HTML documents. This is useful when an XML or JSON feed provided by a client contains nested HTML.

When to use this plugin

Use this plugin when you index an XML file using the HTML inner document mode and the XML field you want to index contains bare HTML code (without surrounding <html> tags). Without the plugin, PADRE will fail to index the inner document as HTML.

Usage

Enable the plugin

Enable the xml-element-html-wrapper-filter plugin on your data source from the Extensions screen in the administration dashboard, or add the following data source configuration:

plugin.xml-element-html-wrapper-filter.enabled=true
plugin.xml-element-html-wrapper-filter.version=1.0.0

This plugin requires a full update of the data source to take effect.

Plugin configuration settings

The XmlElementHtmlWrapperFilter filter must be added to the filter chain for the plugin to work correctly. Add it to filter.classes in the data source configuration:

filter.classes=<OTHER-FILTERS>:com.funnelback.plugins.xmlElementHtmlWrapperFilter.XmlElementHtmlWrapperFilter:<OTHER-FILTERS>

The filter should be placed at an appropriate position in the filter chain. In most circumstances this should be located towards the end of the filter chain.

The following option must be set in the data source configuration to configure the plugin:

  • plugin.xml-element-html-wrapper-filter.config.xpath=/xpath/to/html/to/wrap: Defines the XPath of the XML element containing HTML that should have its content wrapped in <html>...</html> tags.

When defining the XPath, take into account any changes to the XML structure introduced by earlier filters. For example, if you split the XML file into individual XML documents that you then filter, the XPath must be adjusted to reflect the structure of the individual XML record.
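
The effect of an earlier splitting step on the XPath can be illustrated with a short Python sketch. This is not part of the plugin: split_records is a hypothetical stand-in for whatever earlier filter splits the feed, xpath_matches is a hypothetical helper that handles only simple absolute paths, and the assumption that each split record keeps its own element (here file) as the new root may not match your filter chain.

```python
import xml.etree.ElementTree as ET

FEED = """<files>
  <file><doc>&lt;p&gt;One&lt;/p&gt;</doc></file>
  <file><doc>&lt;p&gt;Two&lt;/p&gt;</doc></file>
</files>"""


def split_records(xml_text, record_tag):
    """Hypothetical stand-in for an earlier filter that splits the feed.

    Assumes each record becomes its own document rooted at record_tag.
    """
    root = ET.fromstring(xml_text)
    return [
        ET.tostring(record, encoding="unicode").strip()
        for record in root.iter(record_tag)
    ]


def xpath_matches(xml_text, xpath):
    """Count elements matched by a simple absolute path such as /a/b/c."""
    root = ET.fromstring(xml_text)
    steps = xpath.strip("/").split("/")
    if steps[0] != root.tag:
        return 0  # the first step must name the document's root element
    rest = "/".join(steps[1:])
    return len(root.findall(rest)) if rest else 1


records = split_records(FEED, "file")

# Against the whole feed, the original path matches both doc elements.
print(xpath_matches(FEED, "/files/file/doc"))
# Against a split record, the root is now <file>, so the path must change.
print(xpath_matches(records[0], "/files/file/doc"))
print(xpath_matches(records[0], "/file/doc"))
```

Under these assumptions the setting would become plugin.xml-element-html-wrapper-filter.config.xpath=/file/doc once the wrapper filter runs after the split.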

Example

Consider the following XML document:

<files>
  <file>
    <title>Example title 1</title>
    <description>This is an example document</description>
    <url>http://example.com/example-files/file.html</url>
    <doc>&lt;p&gt;An example HTML document.&lt;/p&gt;</doc>
  </file>
  <file>
    <title>Example title 2</title>
    <description>This is another example document</description>
    <url>http://example.com/example-files/file2.html</url>
    <doc><![CDATA[<p>Another example HTML document.</p>]]></doc>
  </file>
</files>

The doc element contains HTML code that you want to index as an HTML document.

Configuring the plugin with:

plugin.xml-element-html-wrapper-filter.config.xpath=/files/file/doc

will result in the following XML file being produced for PADRE to index:

<files>
  <file>
    <title>Example title 1</title>
    <description>This is an example document</description>
    <url>http://example.com/example-files/file.html</url>
    <doc>&lt;html&gt;&lt;p&gt;An example HTML document.&lt;/p&gt;&lt;/html&gt;</doc>
  </file>
  <file>
    <title>Example title 2</title>
    <description>This is another example document</description>
    <url>http://example.com/example-files/file2.html</url>
    <doc>&lt;html&gt;&lt;p&gt;Another example HTML document.&lt;/p&gt;&lt;/html&gt;</doc>
  </file>
</files>
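
The transformation in this example can be approximated in a few lines of Python. This is a minimal sketch for illustration only, not the plugin's actual implementation: the function name wrap_in_html is hypothetical, it supports only simple absolute paths such as /files/file/doc, and it assumes the matched elements contain text (escaped or CDATA) rather than child elements.

```python
import xml.etree.ElementTree as ET


def wrap_in_html(xml_text: str, xpath: str) -> str:
    """Wrap the text of every element matched by xpath in <html>...</html>."""
    root = ET.fromstring(xml_text)
    # ElementTree only evaluates relative paths, so check the leading
    # "/<root-name>" step by hand and search with the remainder.
    steps = xpath.strip("/").split("/")
    if steps[0] != root.tag:
        return xml_text  # the path does not apply to this document
    relative = "/".join(steps[1:])
    targets = root.findall(relative) if relative else [root]
    for element in targets:
        # The parser has already decoded entities and CDATA into plain text.
        element.text = "<html>" + (element.text or "") + "</html>"
    # Serialising escapes the markup again (&lt;html&gt;...).
    return ET.tostring(root, encoding="unicode")


sample = (
    "<files>"
    "<file><doc>&lt;p&gt;An example HTML document.&lt;/p&gt;</doc></file>"
    "<file><doc><![CDATA[<p>Another example HTML document.</p>]]></doc></file>"
    "</files>"
)
print(wrap_in_html(sample, "/files/file/doc"))
```

Because the parser decodes both the entity-escaped and the CDATA form to the same text, both doc elements come out in the escaped form, consistent with the output shown above.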

Caveats

  • Applying this plugin can have unintended consequences when used in conjunction with the Inner HTML or XML documents feature in the Funnelback special XML configuration. If that feature is used, the field containing the Document URL must be above all fields containing inner HTML or XML documents.
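
A record can be checked against this ordering requirement before indexing. The sketch below is one possible check, not a Funnelback tool: the function name url_precedes_inner_docs is hypothetical, the element names url and doc are taken from the example above, and it reads "above" as "earlier in document order", which is an assumption.

```python
import xml.etree.ElementTree as ET


def url_precedes_inner_docs(record_xml, url_tag, inner_tags):
    """Return True if url_tag occurs before every element in inner_tags."""
    root = ET.fromstring(record_xml)
    # iter() walks the tree in document order, so list positions
    # reflect where each element appears in the source.
    order = [element.tag for element in root.iter()]
    if url_tag not in order:
        return False  # no URL field at all
    url_position = order.index(url_tag)
    return all(
        order.index(tag) > url_position for tag in inner_tags if tag in order
    )


good = "<file><url>http://example.com/a.html</url><doc>&lt;p&gt;A&lt;/p&gt;</doc></file>"
bad = "<file><doc>&lt;p&gt;A&lt;/p&gt;</doc><url>http://example.com/a.html</url></file>"
print(url_precedes_inner_docs(good, "url", ["doc"]))
print(url_precedes_inner_docs(bad, "url", ["doc"]))
```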