Plugin: Strip HTML markup

Purpose

Strips HTML tags from XML document content.

The plugin also provides options to preserve specific tags and attributes, and to skip specific XML fields entirely.

Usage

Enable the plugin

Enable the strip-html-tags-filter plugin on your data source from the plugins screen in the search dashboard, or add the following data source configuration:

plugin.strip-html-tags-filter.enabled=true
plugin.strip-html-tags-filter.version=1.0.0

Add com.funnelback.plugins.striphtmltagsfilter.StripHtmlTagsFilter to the filter chain:

filter.classes=<OTHER-FILTERS>:com.funnelback.plugins.striphtmltagsfilter.StripHtmlTagsFilter:<OTHER-FILTERS>

The filter should be placed at an appropriate position in the filter chain (which applies the filters from left to right). In most circumstances this should be located towards the end of the filter chain. The plugin will take effect after a full update of the data source.
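
As a rough illustration of the placement advice, the following Python sketch appends the plugin class to the end of an existing colon-separated filter.classes value. The helper name append_filter and the two existing filter class names are hypothetical and used only for illustration; this is not part of the plugin or the product.

```python
# Class name taken from the documentation above.
PLUGIN = "com.funnelback.plugins.striphtmltagsfilter.StripHtmlTagsFilter"


def append_filter(filter_classes: str, new_filter: str = PLUGIN) -> str:
    """Return the filter chain with new_filter at the end, unless already present."""
    # The chain is a colon-separated list that is applied from left to right.
    chain = [name for name in filter_classes.split(":") if name]
    if new_filter not in chain:
        chain.append(new_filter)
    return ":".join(chain)


if __name__ == "__main__":
    # Hypothetical existing chain; substitute your data source's real value.
    existing = "com.example.FirstFilter:com.example.SecondFilter"
    print("filter.classes=" + append_filter(existing))
```

Placing the class last means it runs after every other filter in the chain, which matches the "towards the end" recommendation in most circumstances.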

Plugin configuration settings

The following options can be set in the data source configuration to configure the plugin:

  • plugin.strip-html-tags-filter.config.exclude-html-tags: comma-separated list of HTML tags to preserve. The listed tags are not stripped and remain in the content after processing.

  • plugin.strip-html-tags-filter.config.exclude-html-attributes.<HTML_TAG>: comma-separated list of attributes to preserve on the specified HTML_TAG. The listed attributes remain on that tag after processing.

  • plugin.strip-html-tags-filter.config.exclude-xml-fields: comma-separated list of XML fields to skip when processing. Any HTML contained within these XML fields is left untouched.
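
The effect of the two HTML-level options can be approximated with a short Python sketch. This is only an illustration of the documented behaviour, not the plugin's implementation: the names TagStripper, strip_html, keep_tags and keep_attrs are assumptions made for this example, and the sketch ignores XML fields, CDATA handling and output escaping.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Illustrative stripper: drops every tag unless it is explicitly kept."""

    def __init__(self, keep_tags=(), keep_attrs=None):
        super().__init__(convert_charrefs=True)
        # keep_attrs maps a tag name to the attribute names to preserve on it.
        self.keep_attrs = {t: set(a) for t, a in (keep_attrs or {}).items()}
        # A tag with preserved attributes is kept even if not listed in keep_tags.
        self.keep_tags = set(keep_tags) | set(self.keep_attrs)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.keep_tags:
            return  # strip the tag itself; its text content is still collected
        allowed = self.keep_attrs.get(tag, set())
        kept = "".join(
            f" {name}" if value is None else f' {name}="{value}"'
            for name, value in attrs
            if name in allowed
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text, keep_tags=(), keep_attrs=None):
    """Return text with HTML tags removed, except the kept tags and attributes."""
    parser = TagStripper(keep_tags, keep_attrs)
    parser.feed(text)
    parser.close()  # flush any trailing text
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = 'Here is the link <a href="https://my.link">Test link</a>'
    print(strip_html(sample))
    print(strip_html(sample, keep_tags=["a"]))
    print(strip_html(sample, keep_attrs={"a": ["href"]}))
```

In this sketch, keep_tags plays the role of exclude-html-tags and keep_attrs the role of exclude-html-attributes.<HTML_TAG>.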

Example

Applying the plugin to clean the following XML field:

<testField><![CDATA[This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>]]></testField>

results in the following clean XML:

<?xml version="1.0" encoding="UTF-8"?>
<item>
    <testField>
        This is my paragraph. Here is the link Test link
    </testField>
</item>

If you exclude <a> and <p> tags:

plugin.strip-html-tags-filter.config.exclude-html-tags=a,p

we get the following XML after processing:

<?xml version="1.0" encoding="UTF-8"?>
<item>
    <testField>
        This is my &lt;p&gt;paragraph&lt;/p&gt;. Here is the link &lt;a&gt;Test link&lt;/a&gt;
    </testField>
</item>

The <a> tag is preserved, but has all attributes removed.

To keep the <a> tag along with its href attribute, set:

plugin.strip-html-tags-filter.config.exclude-html-tags=a,p
plugin.strip-html-tags-filter.config.exclude-html-attributes.a=href

Note: a tag that has an exclude-html-attributes.<HTML_TAG> setting does not also need to be listed in exclude-html-tags. In this example, exclude-html-attributes.a=href is enough to keep the <a> tag.

This produces the following XML after processing:

<?xml version="1.0" encoding="UTF-8"?>
<item>
    <testField>
        This is my &lt;p&gt;paragraph&lt;/p&gt;. Here is the link &lt;a href="https://my.link"&gt;Test link&lt;/a&gt;
    </testField>
</item>

Please be aware that, for all processed XML fields, the <![CDATA[ ]]> wrapper is removed and all content is escaped.
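
The CDATA removal and escaping can be pictured with the following Python sketch. It is a simplified assumption about the visible result, not the plugin's code: the function name to_escaped_field is hypothetical, and the sketch handles only a field whose whole value is a single CDATA section.

```python
from xml.sax.saxutils import escape

CDATA_START = "<![CDATA["
CDATA_END = "]]>"


def to_escaped_field(raw: str) -> str:
    """Drop a surrounding CDATA wrapper, then escape &, < and > for XML."""
    if raw.startswith(CDATA_START) and raw.endswith(CDATA_END):
        raw = raw[len(CDATA_START):-len(CDATA_END)]
    # escape() leaves double quotes untouched, so attribute values stay readable.
    return escape(raw)


if __name__ == "__main__":
    field = '<![CDATA[Here is the link <a href="https://my.link">Test link</a>]]>'
    print(to_escaped_field(field))
```

Any consumer of the processed XML therefore needs to unescape the field content before treating the preserved tags as HTML.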

If we don’t want to process the <testField> element:

plugin.strip-html-tags-filter.config.exclude-xml-fields=testField

we get the following XML after processing:

<?xml version="1.0" encoding="UTF-8"?>
<item>
    <testField><![CDATA[This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>]]>
    </testField>
</item>
