Plugin: Strip HTML markup from XML

Purpose

Use this plugin if you need to strip HTML markup tags from XML document content.

This plugin also provides options to exclude specific XML fields, HTML tags and attributes when stripping the HTML.

Usage

Enable the plugin

Select Plugins from the side navigation pane and click on the Strip HTML markup from XML tile.
From the Location section, choose the data source on which you would like to enable this plugin from the Select a data source select list.

The plugin takes effect after the setup steps and an advanced > full update of the data source have completed.

Configuration settings

The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.

The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.

Ignored HTML tags

Configuration key

plugin.strip-html-tags-filter.config.exclude-html-tags

Data type

array

Default value

an empty list

Required

This setting is optional

List of HTML tags that will be ignored when stripping HTML tags from the content. The listed tags will remain after the stripping process is complete.

Ignored HTML attributes

Configuration key

plugin.strip-html-tags-filter.config.exclude-html-attributes.*

Data type

array

Default value

an empty list

Required

This setting is optional

List of attributes that will be ignored for the HTML tag specified in the Parameter 1 field. The listed attributes will remain after the stripping process is complete.

Excluded XML fields

Configuration key

plugin.strip-html-tags-filter.config.exclude-xml-fields

Data type

array

Default value

an empty list

Required

This setting is optional

List of XML fields to skip when processing. Any HTML contained within these XML tags will remain after the stripping process is complete.

Filter chain configuration

This plugin uses filters, which apply transformations to the gathered content.

The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.

Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.

Filter classes

This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugins.striphtmltagsfilter.StripHtmlTagsFilter

Drag the com.funnelback.plugins.striphtmltagsfilter.StripHtmlTagsFilter plugin filter to where you wish it to run in the filter chain sequence.

Examples

Applying the plugin to clean the following XML field:

<testField><![CDATA[This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>]]></testField>

results in the following clean XML:

<?xml version="1.0" encoding="UTF-8"?>
<item>
    <testField>
        This is my paragraph. Here is the link Test link
    </testField>
</item>
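
The default behaviour shown above (every tag removed, text kept) can be imitated outside the plugin for experimentation. The following is a minimal Python sketch using only the standard library `html.parser` module. It is not the plugin's implementation, and the names `TagStripper` and `strip_all_tags` are hypothetical.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes and discards every HTML tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are never recorded.
        self.parts.append(data)


def strip_all_tags(fragment: str) -> str:
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TagStripper()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


# Content of the <testField> CDATA section from the example above.
content = 'This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>'
print(strip_all_tags(content))
# prints: This is my paragraph. Here is the link Test link
```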

If you exclude <a> and <p> tags:

Configuration key name	Value
Ignored HTML tags	`a,p`

We get the following XML after processing:

<?xml version="1.0" encoding="UTF-8"?>
<item>
    <testField>
        This is my &lt;p&gt;paragraph&lt;/p&gt;. Here is the link &lt;a&gt;Test link&lt;/a&gt;
    </testField>
</item>

The <a> tag is preserved, but has all attributes removed.

If you want to keep the <a> tag along with its href attribute:

Configuration key name	Parameter 1	Value
Ignored HTML tags		`a,p`
Ignored HTML attributes	`a`	`href`

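You can omit a tag from the Ignored HTML tags setting if you are going to exclude attributes for that tag. For example, the following setting alone is enough to keep the <a> tag together with its href attribute:

Configuration key name	Parameter 1	Value
Ignored HTML attributes	`a`	`href`

With Ignored HTML tags set to a,p and Ignored HTML attributes set to href for the a tag, we get the following XML after processing: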

<?xml version="1.0" encoding="UTF-8"?>
<item>
    <testField>
        This is my &lt;p&gt;paragraph&lt;/p&gt;. Here is the link &lt;a href="https://my.link"&gt;Test link&lt;/a&gt;
    </testField>
</item>

<![CDATA[ ]]> tags are removed from all processed XML fields and the content is escaped. This behavior is not configurable.
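
The tag and attribute exclusions described above, together with the escaping of the result, can be imitated in a small Python sketch. This is an illustration built on the standard library only, not the plugin's implementation. The names `SelectiveStripper`, `strip_tags`, `keep_tags` and `keep_attrs` are hypothetical stand-ins for the Ignored HTML tags and Ignored HTML attributes settings.

```python
from html import escape
from html.parser import HTMLParser


class SelectiveStripper(HTMLParser):
    """Drops every tag except those in keep_tags; keeps only listed attributes."""

    def __init__(self, keep_tags, keep_attrs):
        super().__init__(convert_charrefs=True)
        self.keep_attrs = {tag: set(names) for tag, names in keep_attrs.items()}
        # Assumption taken from the text: a tag that has kept attributes is
        # itself kept, even if it is not listed in keep_tags.
        self.keep_tags = set(keep_tags) | set(self.keep_attrs)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.keep_tags:
            return
        allowed = self.keep_attrs.get(tag, set())
        kept = "".join(
            f' {name}="{value}"'
            for name, value in attrs
            if name in allowed and value is not None
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.keep_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment, keep_tags=(), keep_attrs=None):
    """Strip HTML from a fragment, then escape it for use as XML text."""
    parser = SelectiveStripper(keep_tags, keep_attrs or {})
    parser.feed(fragment)
    parser.close()
    # quote=False escapes &, < and > but leaves double quotes untouched.
    return escape("".join(parser.parts), quote=False)


content = 'This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>'
print(strip_tags(content, keep_tags=("a", "p")))
print(strip_tags(content, keep_tags=("a", "p"), keep_attrs={"a": ["href"]}))
```

The first call corresponds to the example that keeps <a> and <p> without attributes, and the second to the example that also keeps href.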

If we don’t want to process the <testField> element:

Configuration key name	Value
Excluded XML fields	`testField`

we get the following XML after processing:

<?xml version="1.0" encoding="UTF-8"?>
<item>
    <testField><![CDATA[This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>]]>
    </testField>
</item>
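
The effect of Excluded XML fields can be sketched the same way. The Python example below uses the standard library `xml.dom.minidom` parser, which keeps CDATA sections as separate nodes, and leaves any field named in `excluded_fields` untouched. It is an assumption-laden illustration rather than the plugin's code: `strip_fields`, `excluded_fields` and the second field `otherField` are hypothetical, and only leaf elements are processed.

```python
from html.parser import HTMLParser
from xml.dom import Node, minidom


class _TextOnly(HTMLParser):
    """Keeps text nodes and drops every HTML tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def _strip_html(fragment):
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


def strip_fields(xml_text, excluded_fields=()):
    """Strip HTML from every leaf XML field except the excluded ones."""
    excluded = set(excluded_fields)
    doc = minidom.parseString(xml_text)
    for element in doc.getElementsByTagName("*"):
        children = list(element.childNodes)
        is_container = any(c.nodeType == Node.ELEMENT_NODE for c in children)
        if element.tagName in excluded or is_container:
            continue  # excluded fields keep their CDATA and HTML as-is
        raw = "".join(
            c.data
            for c in children
            if c.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
        )
        for c in children:
            element.removeChild(c)
        # A plain text node is escaped on output, so the CDATA wrapper is gone.
        element.appendChild(doc.createTextNode(_strip_html(raw)))
    return doc.documentElement.toxml()


sample = (
    "<item>"
    '<testField><![CDATA[This is my <p>paragraph</p>. Here is the link '
    '<a href="https://my.link">Test link</a>]]></testField>'
    "<otherField><![CDATA[Some <b>bold</b> text]]></otherField>"
    "</item>"
)
print(strip_fields(sample, excluded_fields={"testField"}))
```

In this sketch the excluded testField keeps its CDATA section and markup, while otherField is reduced to plain text.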

Change log

[1.2.1]

Changed

Upgraded Jsoup dependency from v1.16.2 to v1.19.1.

[1.2.0]

Fixed

Fixed an issue where extra whitespace would be added before the closing XML tag.

[1.1.0]

Changed

Updated to the latest version of the plugin framework (Funnelback shared v16.20) to enable integration with the new plugin management dashboard.
