Plugin: Clean text input

Purpose

Use this plugin to remove non-printable or control characters from downloaded text content.

This plugin can be helpful when the Funnelback web crawler reports finding binary/control characters in the leading part of a URL, such as a zero-width non-breaking space character (U+FEFF), which some servers send.

When the plugin is enabled and configured to remove non-printable characters, it strips out any non-printable and control characters and then passes the cleaned content on to subsequent filters. As a result, you should place this filter at the start of your filter chain in most cases.
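The stripping step can be pictured with a short Python sketch. This is an illustration only, not the plugin's implementation: the function name is hypothetical, and treating Unicode categories Cc (control) and Cf (format, which includes U+FEFF) as "non-printable" is an assumption, since the exact character set the plugin removes is not listed here.

```python
import unicodedata

# Whitespace control characters that ordinary text relies on (assumption:
# a cleaner would want to keep these).
KEEP = {"\t", "\n", "\r"}

def remove_non_printable(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, keeping common whitespace."""
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) not in ("Cc", "Cf")
    )

dirty = "\ufeff<?xml version=\"1.0\"?>\n<item>a\x00b</item>"
print(remove_non_printable(dirty))
```

Because later filters receive the cleaned string, anything that inspects the first characters of the document sees `<?xml` rather than the invisible leading character.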

The plugin also enables you to normalize text containing diacritic characters, replacing the diacritic characters with near-equivalents.

Diacritic character normalization is basic and will only do simple character replacements such as é with e. For example, it won’t perform German substitutions such as replacing ß with ss, or ö with oe.
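As a rough illustration of this kind of basic normalization, the Python sketch below decomposes each character and drops the combining accent marks. It is one common technique that matches the behavior described above, not necessarily how the plugin is implemented. The function name is hypothetical, and leaving ß unchanged is an assumption: the description only says that it is not replaced with ss.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("André Gide"))  # Andre Gide
print(strip_diacritics("schön"))       # schon, not schoen
print(strip_diacritics("Straße"))      # Straße, the ß has no accent to strip
```

A simple replacement like this maps ö to o rather than oe, which is why language-specific substitutions need separate handling.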

Usage

Enable the plugin

  1. Select Plugins from the side navigation pane and click on the Clean text input tile.

  2. From the Location section, use the Select a data source list to choose the data source on which you would like to enable this plugin.

The plugin will take effect after the setup steps and an advanced > full update of the data source have completed.

Configuration settings

The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.

The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.

Remove non-printable characters

Configuration key: plugin.clean-text-input.config.removeNonPrintableChars

Data type: boolean

Default value: false

Required: This setting is optional

Removes any non-printable characters present in downloaded text content.

Clean diacritic (accented) characters

Configuration key: plugin.clean-text-input.config.cleanDiacritics

Data type: boolean

Default value: false

Required: This setting is optional

Cleans diacritic/accented characters present in downloaded text content, replacing them with non-accented equivalents.
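If you set the keys manually, the two settings might look like the following in the data source configuration. This is a sketch that assumes the usual key=value form. Both settings default to false, so set only the ones you need.

```
plugin.clean-text-input.config.removeNonPrintableChars=true
plugin.clean-text-input.config.cleanDiacritics=true
```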

Filter chain configuration

This plugin uses filters to apply transformations to the gathered content.

The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.

Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.

Filter classes

This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugin.CleanTextInput.CleanTextInputStringFilter

Drag the com.funnelback.plugin.CleanTextInput.CleanTextInputStringFilter plugin filter to where you wish it to run in the filter chain sequence.
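To see why the position in the sequence matters, consider the toy Python model below. It is not Funnelback code: `clean_text`, `tag_xml` and `run_chain` are hypothetical stand-ins for a cleaning filter, a later filter that inspects the start of the document, and the chain itself.

```python
from typing import Callable, List

Filter = Callable[[str], str]

def clean_text(doc: str) -> str:
    """Hypothetical stand-in for the cleaning filter: drop a leading U+FEFF."""
    return doc.lstrip("\ufeff")

def tag_xml(doc: str) -> str:
    """Hypothetical later filter that only recognizes XML starting with '<?xml'."""
    return "[xml] " + doc if doc.startswith("<?xml") else "[unknown] " + doc

def run_chain(filters: List[Filter], doc: str) -> str:
    """Apply each filter in order, feeding each one the previous output."""
    for f in filters:
        doc = f(doc)
    return doc

raw = "\ufeff<?xml version=\"1.0\"?><item/>"
print(run_chain([clean_text, tag_xml], raw).split(" ")[0])  # [xml]
print(run_chain([tag_xml, clean_text], raw).split(" ")[0])  # [unknown]
```

When the cleaning step runs first, the later filter recognizes the document. When it runs afterwards, the earlier filter has already seen the stray character and made its decision.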

Examples

Example: remove non-printable characters

The web crawler logs contain the following message when attempting to index an XML file:

Detected binary/control characters in leading part of URL

Investigating the returned file (e.g. with a command line tool such as curl) shows that it contains a binary character that precedes the opening <?xml ... ?> declaration, which may look something like this in your text editor:

<X+FEFF><?xml version="1.0" encoding="UTF-8"?>
<item>
...

This plugin can remove this character: enable the plugin, configure it to remove non-printable characters, and add the filter to the start of your filter chain.

Run a full update of your data source and the control character will be removed.
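If you want to confirm that a downloaded file carries this character, one option is to inspect its first bytes. The Python sketch below assumes the file is served as UTF-8, where U+FEFF is encoded as the three-byte sequence exposed by `codecs.BOM_UTF8`. The function name and the sample payloads are hypothetical.

```python
import codecs

def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the payload begins with the UTF-8 encoding of U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)

# Hypothetical payloads standing in for a downloaded XML file.
with_bom = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?>\n<item>'
without_bom = b'<?xml version="1.0" encoding="UTF-8"?>\n<item>'

print(starts_with_utf8_bom(with_bom))     # True
print(starts_with_utf8_bom(without_bom))  # False
```

In practice you would read the bytes from a saved copy of the file, for example with `open(path, "rb").read()`.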

Example: normalize text

In this example you are indexing some text content that contains accented characters that you wish to be indexed without the accents.

...
The Nobel Prize in Literature 1947 was awarded to André Gide for his comprehensive and artistically significant writings, in which human problems and conditions have been presented with a fearless love of truth and keen psychological insight.
...

This plugin can replace the accented characters with non-accented equivalents: enable the plugin, configure it to clean diacritics, and add the filter to your filter chain.

Run a full update of your data source and the text will be indexed as:

...
The Nobel Prize in Literature 1947 was awarded to Andre Gide for his comprehensive and artistically significant writings, in which human problems and conditions have been presented with a fearless love of truth and keen psychological insight.
...
