Plugin: Clean text input
Purpose
Use this plugin to remove non-printable or control characters from downloaded text content.
This plugin can be helpful when the Funnelback web crawler reports finding binary/control characters in the leading part of a URL, such as a zero-width no-break space character (U+FEFF), which some servers send.
When enabled, the plugin strips out any non-printable and control characters and then passes the cleaned text on to subsequent filters. For this reason, in most cases you should place this filter at the start of your filter chain.
The plugin also enables you to normalize text containing diacritic characters, replacing the diacritic characters with near-equivalents.
Diacritic character normalization is basic and only performs simple character replacements, such as replacing é with e. For example, it won’t perform German substitutions such as replacing ß with ss, or ö with oe.
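The behavior described above can be illustrated with a minimal Python sketch. This is not the plugin's implementation; it assumes a common approach (Unicode NFD decomposition followed by removal of combining marks), and the function name strip_diacritics is hypothetical. It shows why a simple character-level replacement turns é into e but never expands a character into a multi-letter substitution.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Replace accented characters with their unaccented base characters."""
    # NFD splits "é" into "e" followed by a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks leaves only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("André Gide"))  # Andre Gide
print(strip_diacritics("Köln"))        # Koln (not Koeln)
print(strip_diacritics("Straße"))      # Straße (ß has no decomposition)
```

In this sketch ß passes through unchanged because it has no decomposed form. How the plugin itself treats such characters is not specified here beyond the fact that it does not substitute ss.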
Usage
Enable the plugin
- Select Plugins from the side navigation pane and click on the Clean text input tile.
- From the Location section, select the data source on which you would like to enable this plugin from the Select a data source select list.
The plugin will take effect after the setup steps and an advanced > full update of the data source have completed.
Configuration settings
The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.
The configuration key names below are only needed if you are configuring this plugin manually. The keys are set in the data source configuration. When setting them manually, you need to type in (or copy and paste) the key name and value.
Remove non-printable characters
Configuration key |
Data type | boolean
Default value |
Required | This setting is optional
Removes any non-printable characters present in downloaded text content.
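As an illustration of what this kind of cleaning involves, here is a minimal Python sketch. The exact set of characters the plugin removes is not listed here, so the sketch assumes that Unicode control (Cc) and format (Cf) characters are dropped while newlines, carriage returns and tabs are kept. The names KEEP and remove_non_printable are hypothetical.

```python
import unicodedata

# Assumption: whitespace controls that text normally relies on are kept.
KEEP = {"\n", "\r", "\t"}


def remove_non_printable(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, keeping common whitespace."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) not in ("Cc", "Cf")
    )


# U+FEFF (category Cf) and the NUL byte (category Cc) are both removed.
sample = '\ufeff<?xml version="1.0"?>\n<item>a\x00b</item>'
print(remove_non_printable(sample))
```

Note that the Cf category also covers characters such as the soft hyphen and zero-width joiner, so a real filter may need a narrower rule than the one assumed here.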
Clean diacritic (accented) characters
Configuration key |
Data type | boolean
Default value |
Required | This setting is optional
Cleans diacritic/accented characters present in downloaded text content, replacing them with non-accented near-equivalents.
Filter chain configuration
This plugin uses filters, which apply transformations to the gathered content.
The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.
Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.
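To see why the order matters, consider the following Python sketch of a filter chain. It is only a conceptual model and does not use the Funnelback filter API: the filters strip_bom and tag_xml and the helper run_chain are all hypothetical. Each filter receives the output of the previous one, so a cleaning filter placed late in the chain cannot help the filters that ran before it.

```python
from functools import reduce
from typing import Callable, List

Filter = Callable[[str], str]


def strip_bom(text: str) -> str:
    """Hypothetical cleaning filter: drop any leading U+FEFF characters."""
    return text.lstrip("\ufeff")


def tag_xml(text: str) -> str:
    """Hypothetical downstream filter: label content that starts with <?xml."""
    label = "[xml] " if text.startswith("<?xml") else "[unknown] "
    return label + text


def run_chain(filters: List[Filter], text: str) -> str:
    """Apply each filter in order, feeding its output to the next one."""
    return reduce(lambda acc, f: f(acc), filters, text)


doc = '\ufeff<?xml version="1.0"?><item/>'

# Cleaning first: the downstream filter sees clean text and labels it [xml].
print(run_chain([strip_bom, tag_xml], doc))

# Cleaning last: the downstream filter has already seen the stray character
# and labelled the content [unknown].
print(repr(run_chain([tag_xml, strip_bom], doc)))
```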
Examples
Example: remove non-printable characters
The web crawler logs contain the following message
Detected binary/control characters in leading part of URL
when attempting to index an XML file.
Investigating the returned file (e.g. with a command line tool such as curl) shows that the XML file returned contains a binary character that precedes the opening <?xml> tag, which may look something like this in your text editor:
<X+FEFF><?xml version="1.0" encoding="UTF-8"?> <item> ...
This plugin can remove this character: enable the plugin, configure it to remove non-printable characters, and add the filter to the start of your filter chain.
Run a full update of your data source and the control character will be removed.
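If you want to confirm that a leading byte order mark is the cause, you can inspect the first bytes of a saved copy of the file. The Python sketch below assumes the content is UTF-8 encoded and uses an in-memory byte string in place of a downloaded file. The function name has_utf8_bom is hypothetical.

```python
import codecs


def has_utf8_bom(raw: bytes) -> bool:
    """Return True if the content starts with the UTF-8 encoded U+FEFF."""
    return raw.startswith(codecs.BOM_UTF8)


# Stand-in for the bytes of a downloaded XML file.
raw = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?> <item/>'

print(has_utf8_bom(raw))  # True
print(raw[:8])            # b'\xef\xbb\xbf<?xml'

# The "utf-8-sig" codec drops a leading BOM while decoding.
text = raw.decode("utf-8-sig")
print(text.startswith("<?xml"))  # True
```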
Example: normalize text
In this example you are indexing some text content that contains accented characters that you wish to be indexed without the accents.
... The Nobel Prize in Literature 1947 was awarded to André Gide for his comprehensive and artistically significant writings, in which human problems and conditions have been presented with a fearless love of truth and keen psychological insight. ...
This plugin can replace the accented characters with unaccented equivalents: enable the plugin, configure it to clean diacritics, and add the filter to your filter chain.
Run a full update of your data source and the text will be indexed as:
... The Nobel Prize in Literature 1947 was awarded to Andre Gide for his comprehensive and artistically significant writings, in which human problems and conditions have been presented with a fearless love of truth and keen psychological insight. ...