Plugin: Strip HTML markup from XML
Purpose
Use this plugin if you need to strip HTML markup tags from XML document content.
This plugin also provides options to exclude specific XML fields, HTML tags and attributes when stripping the HTML.
Usage
Enable the plugin
- Select Plugins from the side navigation pane and click the Strip HTML markup from XML tile.
- From the Location section, use the Select a data source select list to choose the data source on which you want to enable this plugin.
The plugin takes effect once the setup steps and an advanced > full update of the data source have completed.
Configuration settings
The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.
The configuration key names below are only used if you are configuring this plugin manually. The keys are set in the data source configuration. When setting the keys manually, you need to type in (or copy and paste) the key name and value.
Ignored HTML tags
Configuration key | |
Data type | array |
Default value | |
Required | This setting is optional |
List of HTML tags that will be ignored when stripping HTML tags from the content. The listed tags will remain after the stripping process is complete.
Ignored HTML attributes
Configuration key | |
Data type | array |
Default value | |
Required | This setting is optional |
List of attributes that will be ignored for the HTML tag specified in the Parameter 1 field. The listed attributes will remain on that tag after the stripping process is complete.
Excluded XML fields
Configuration key | |
Data type | array |
Default value | |
Required | This setting is optional |
List of XML fields to skip when processing. Any HTML contained within these XML tags will remain after the stripping process is complete.
Filter chain configuration
This plugin uses filters to apply transformations to the gathered content.
The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.
Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.
Examples
Applying the plugin to clean the following XML field:
<testField><![CDATA[This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>]]></testField>
results in the following clean XML:
<?xml version="1.0" encoding="UTF-8"?>
<item>
<testField>
This is my paragraph. Here is the link Test link
</testField>
</item>
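To illustrate the effect shown above, here is a minimal Python sketch that removes every HTML tag from a field's text and keeps only the text content. It is not the plugin's implementation; the names `TagStripper` and `strip_html` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text nodes only, discarding every HTML tag (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Text between tags is kept; start and end tags are silently dropped.
        self.parts.append(data)


def strip_html(text):
    """Return text with all HTML markup removed."""
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.parts)


field = 'This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>'
print(strip_html(field))
# prints: This is my paragraph. Here is the link Test link
```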
If you set the <a> and <p> tags to be ignored:
Configuration key name | Value |
---|---|
Ignored HTML tags | |
We get the following XML after processing:
<?xml version="1.0" encoding="UTF-8"?>
<item>
<testField>
This is my <p>paragraph</p>. Here is the link <a>Test link</a>
</testField>
</item>
The <a> tag is preserved, but has all attributes removed.
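The behavior described here (ignored tags survive, but their attributes do not) can be sketched in Python as follows. This is an approximation under stated assumptions rather than the plugin's own code: `SelectiveStripper` and the `ignored_tags` parameter are hypothetical names chosen to mirror the Ignored HTML tags setting.

```python
from html.parser import HTMLParser


class SelectiveStripper(HTMLParser):
    """Drops all tags except those listed in ignored_tags (illustrative helper)."""

    def __init__(self, ignored_tags):
        super().__init__(convert_charrefs=True)
        self.ignored_tags = {tag.lower() for tag in ignored_tags}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Ignored tags are re-emitted without any of their attributes.
        if tag in self.ignored_tags:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.ignored_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text, ignored_tags=()):
    """Strip HTML from text, keeping only the tags named in ignored_tags."""
    parser = SelectiveStripper(ignored_tags)
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


field = 'This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>'
print(strip_html(field, ignored_tags=["a", "p"]))
# prints: This is my <p>paragraph</p>. Here is the link <a>Test link</a>
```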
If you want to keep the <a> tag along with its href attribute:
Configuration key name | Parameter 1 | Value |
---|---|---|
Ignored HTML tags | | |
Ignored HTML attributes | | |
You can skip the tag definition Ignore HTML tags with setting
We get the following XML after processing:
<?xml version="1.0" encoding="UTF-8"?>
<item>
<testField>
This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>
</testField>
</item>
<![CDATA[ ]]> tags are removed from all processed XML fields and the content is escaped. This behavior is not configurable.
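Keeping selected attributes on selected tags, as in the example above, could look roughly like the Python sketch below. It is a hedged illustration, not the plugin's implementation. The `ignored_attributes` mapping (tag name to attribute names) is a hypothetical stand-in for the Parameter 1 and Value columns, and the sketch assumes that naming a tag there is enough to preserve the tag itself.

```python
import html
from html.parser import HTMLParser


class AttributeAwareStripper(HTMLParser):
    """Keeps listed tags and, per tag, only the listed attributes (illustrative)."""

    def __init__(self, ignored_tags=(), ignored_attributes=None):
        super().__init__(convert_charrefs=True)
        self.ignored_attributes = {
            tag.lower(): {name.lower() for name in names}
            for tag, names in (ignored_attributes or {}).items()
        }
        # Assumption: a tag named in ignored_attributes is preserved as well.
        self.ignored_tags = {t.lower() for t in ignored_tags} | set(self.ignored_attributes)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.ignored_tags:
            return
        keep = self.ignored_attributes.get(tag, set())
        kept = []
        for name, value in attrs:
            if name in keep:
                escaped = html.escape(value or "", quote=True)
                kept.append(f' {name}="{escaped}"')
        self.parts.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in self.ignored_tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text, ignored_tags=(), ignored_attributes=None):
    parser = AttributeAwareStripper(ignored_tags, ignored_attributes)
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


field = 'This is my <p>paragraph</p>. Here is the link <a href="https://my.link" class="x">Test link</a>'
print(strip_html(field, ignored_tags=["p"], ignored_attributes={"a": ["href"]}))
# prints: This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>
```

The extra `class="x"` attribute in the sample input is added only to show that attributes not listed for the tag are dropped.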
If we don’t want to process the <testField> element:
Configuration key name | Value |
---|---|
Excluded XML fields | |
We get the following XML after processing:
<?xml version="1.0" encoding="UTF-8"?>
<item>
<testField><![CDATA[This is my <p>paragraph</p>. Here is the link <a href="https://my.link">Test link</a>]]>
</testField>
</item>
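Skipping excluded XML fields can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is an illustration under stated assumptions, not the plugin's code: `clean_fields`, `excluded_fields`, and the `<otherField>` element are hypothetical. Unlike the plugin output shown above, ElementTree does not preserve CDATA sections on output, so an excluded field's markup is serialized in escaped form.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text nodes only, discarding every HTML tag (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()
    return "".join(stripper.parts)


def clean_fields(xml_text, excluded_fields=()):
    """Strip HTML from every element's text unless its tag name is excluded."""
    root = ET.fromstring(xml_text)
    excluded = set(excluded_fields)
    for element in root.iter():
        if element.tag in excluded or element.text is None:
            continue  # leave excluded (and empty) fields untouched
        element.text = strip_html(element.text)
    return ET.tostring(root, encoding="unicode")


doc = (
    "<item>"
    '<testField><![CDATA[This is my <p>paragraph</p>. Here is the link '
    '<a href="https://my.link">Test link</a>]]></testField>'
    "<otherField><![CDATA[Some <b>bold</b> text]]></otherField>"
    "</item>"
)

result = ET.fromstring(clean_fields(doc, excluded_fields=["testField"]))
print(result.find("testField").text)   # HTML markup is still present
print(result.find("otherField").text)  # HTML markup has been stripped
```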