Plugin: Strip HTML markup
Other versions of this plugin may exist. Please ensure you are viewing the documentation for the version that you are currently using. If you are not running the latest version of your plugin, we recommend upgrading. See the list of all available versions of this plugin.
Purpose
Strips HTML tags from XML document content.
This plugin also provides options to exclude specific tags and attributes when stripping the tags.
Usage
Enable the plugin
Enable the strip-html-tags-filter plugin on your data source from the plugins screen in the search dashboard, or add the following data source configuration:
plugin.strip-html-tags-filter.enabled=true
plugin.strip-html-tags-filter.version=1.0.0
Add com.funnelback.plugins.striphtmltagsfilter.StripHtmlTagsFilter to the filter chain:
filter.classes=<OTHER-FILTERS>:com.funnelback.plugins.striphtmltagsfilter.StripHtmlTagsFilter:<OTHER-FILTERS>
Place the filter at an appropriate position in the filter chain, which applies the filters from left to right. In most circumstances this is towards the end of the filter chain. The plugin takes effect after a full update of the data source.
Plugin configuration settings
The following options can be set in the data source configuration to configure the plugin:
- plugin.strip-html-tags-filter.config.exclude-html-tags: comma-separated list of HTML tags to leave in place. The listed tags remain in the content after stripping is complete.
- plugin.strip-html-tags-filter.config.exclude-html-attributes.<HTML_TAG>: comma-separated list of attributes to keep on the specified HTML_TAG. The listed attributes remain after stripping is complete.
- plugin.strip-html-tags-filter.config.exclude-xml-fields: comma-separated list of XML fields to skip when processing. Any HTML contained within these XML fields remains after stripping is complete.
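To make the effect of the tag and attribute exclusions concrete, here is a minimal Python sketch of the stripping behaviour described above. It is not the plugin's implementation (the plugin is a Java filter); the names TagStripper, strip_html, keep_tags and keep_attrs are hypothetical and stand in for the exclude-html-tags and exclude-html-attributes.<HTML_TAG> settings.

```python
from html import escape
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Illustrative stand-in for the stripping step (not the plugin's code)."""

    def __init__(self, keep_tags=(), keep_attrs=None):
        super().__init__(convert_charrefs=True)
        # keep_attrs maps a tag name to the attribute names preserved on it.
        self.keep_attrs = {
            tag.lower(): {name.lower() for name in names}
            for tag, names in (keep_attrs or {}).items()
        }
        # Assumption: listing attributes for a tag also keeps the tag itself.
        self.keep_tags = {t.lower() for t in keep_tags} | set(self.keep_attrs)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.keep_tags:
            return  # drop the tag; its text content is still collected
        allowed = self.keep_attrs.get(tag, set())
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in allowed and value is not None
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text, keep_tags=(), keep_attrs=None):
    """Return text with every tag removed except those explicitly kept."""
    parser = TagStripper(keep_tags, keep_attrs)
    parser.feed(text)
    parser.close()  # flushes any buffered trailing text
    return "".join(parser.parts)
```

Calling strip_html(content) removes every tag, strip_html(content, keep_tags=["a", "p"]) keeps both tags without attributes, and adding keep_attrs={"a": ["href"]} also retains the link target.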
Example
Applying the plugin to clean the following XML field:
<testField><![CDATA[This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>]]></testField>
results in the following clean XML:
<?xml version="1.0" encoding="UTF-8"?>
<item>
<testField>
This is my paragraph. Here is the link Test link
</testField>
</item>
If you exclude <a> and <p> tags:
plugin.strip-html-tags-filter.config.exclude-html-tags=a,p
we get the following XML after processing:
<?xml version="1.0" encoding="UTF-8"?>
<item>
<testField>
This is my <p>paragraph</p>. Here is the link <a>Test link</a>
</testField>
</item>
The <a> tag is preserved, but has all attributes removed.
If you want to keep the <a> tag along with its href attribute:
plugin.strip-html-tags-filter.config.exclude-html-tags=a,p
plugin.strip-html-tags-filter.config.exclude-html-attributes.a=href
You can omit the tag definition exclude-html-tags=a if you are going to exclude attributes for this tag with exclude-html-attributes.a=href.
we get the following XML after processing:
<?xml version="1.0" encoding="UTF-8"?>
<item>
<testField>
This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>
</testField>
</item>
Please be aware that, for all processed XML fields, the <![CDATA[ ]]> wrapper is removed and all content is escaped.
If we don’t want to process the <testField> element:
plugin.strip-html-tags-filter.config.exclude-xml-fields=testField
we get the following XML after processing:
<?xml version="1.0" encoding="UTF-8"?>
<item>
<testField><![CDATA[This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>]]>
</testField>
</item>
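The field-level skip can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the plugin's code: process_fields and its exclude_xml_fields parameter are hypothetical names, and the regular expression is a deliberately crude stand-in for real HTML stripping.

```python
import re
import xml.etree.ElementTree as ET

# Crude stand-in for tag stripping: matches things shaped like <tag ...> or </tag>.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^>]*>")


def process_fields(xml_text, exclude_xml_fields=()):
    """Map each non-empty XML field to its text, stripped unless the field is excluded."""
    skipped = set(exclude_xml_fields)
    root = ET.fromstring(xml_text)
    result = {}
    for element in root.iter():
        text = element.text or ""
        if not text.strip():
            continue  # container elements carry no text of their own
        if element.tag in skipped:
            result[element.tag] = text  # excluded field: HTML left untouched
        else:
            result[element.tag] = TAG_PATTERN.sub("", text)
    return result
```

With exclude_xml_fields=["testField"], the text of testField comes back with its HTML intact; without it, the tags are removed. The sketch returns plain strings rather than serialised XML, so it does not model how the plugin writes CDATA or escaping in its output.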