filter.jsoup.undesirable_text-source.[key_name]
Background
This setting provides additional 'undesirable text' sources for use in content auditor, enabling HTML documents to be analyzed for other kinds of unwanted wording, such as:
- offensive words
- weasel words
- plain English
- organizational-specific avoid words
Each additional source is a text file containing a set of words and phrases that can be uploaded via the file manager.
The sources are linked by adding a configuration key for each file:
filter.jsoup.undesirable_text-source.ID=ID
where:
- ID: a unique identifier for the source.
Setting the file name to an absolute path is deprecated. Old configuration that is imported into v16 must be updated to use a file name instead of a file path.
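The linkage between configuration keys and source files can be checked with a short script. The following is a minimal sketch, not part of the product: the function names (parse_sources, looks_like_path, expected_file_name) are hypothetical, and it assumes the configuration is available as plain key=value text and that source files follow the undesirable-text.ID.cfg naming convention described on this page.

```python
PREFIX = "filter.jsoup.undesirable_text-source."


def parse_sources(config_text):
    """Return {ID: value} for every undesirable-text source key in the text."""
    sources = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Skip unrelated keys and lines that are not key=value pairs.
        if not line.startswith(PREFIX) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        sources[key[len(PREFIX):].strip()] = value.strip()
    return sources


def looks_like_path(value):
    """Heuristic (an assumption): a value with a path separator is a file path."""
    return "/" in value or "\\" in value


def expected_file_name(source_id):
    """File name the file manager is expected to hold for this source ID."""
    return f"undesirable-text.{source_id}.cfg"
```

Values flagged by looks_like_path are candidates for updating to a file name before importing older configuration.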
The file is a list of undesirable word sequences, with one word or phrase per line.
Phrases should separate words with a single space character. The words should not include HTML entities. For example, the literal em dash character (\u2014) should be used instead of its HTML entity (&mdash;) where applicable.
Undesirable text files can be created from the administration dashboard file manager by creating or uploading an undesirable-text.*.cfg file. When creating the file, the name should follow the format undesirable-text.ID.cfg.
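A word list gathered from other tools can be cleaned up before it is uploaded. The following is a minimal sketch under stated assumptions: normalise_phrase and build_source_file are hypothetical helper names, and dropping duplicate and empty lines is a convenience choice made here, not a documented requirement.

```python
import html
import re


def normalise_phrase(raw):
    """Decode HTML entities and collapse whitespace runs to single spaces."""
    decoded = html.unescape(raw)
    return re.sub(r"\s+", " ", decoded).strip()


def build_source_file(phrases):
    """Return file content with one phrase per line, in first-seen order."""
    cleaned = []
    for raw in phrases:
        phrase = normalise_phrase(raw)
        # Assumption: empty lines and repeated phrases add nothing, so skip them.
        if phrase and phrase not in cleaned:
            cleaned.append(phrase)
    return "\n".join(cleaned) + "\n"
```

The returned text can be saved locally under a name such as undesirable-text.weasel-words.cfg and uploaded through the file manager.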
If any of these words are detected when the HTML document is analyzed the detected word or phrase will be added as a value to the following metadata fields:
- X-Funnelback-Undesirable-Text, if filter.jsoup.undesirable_text-separate-lists is disabled
- X-Funnelback-Undesirable-Text-ID, if filter.jsoup.undesirable_text-separate-lists is enabled
A count of the occurrences of all undesirable words found in the page will also be recorded. The count is a total of all the detected words, including duplicates. The count will be recorded in the following metadata field:
- X-Funnelback-Undesirable-Text-Count, if filter.jsoup.undesirable_text-separate-lists is disabled
- X-Funnelback-Undesirable-Text-[ID]-Count, if filter.jsoup.undesirable_text-separate-lists is enabled
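The field naming and counting rules above can be illustrated with a small simulation. This is a sketch only and does not reproduce the filter's actual matching logic: audit_text is a hypothetical function, it works on plain text rather than parsed HTML, and case-insensitive whole-word matching and recording each phrase once as a value are assumptions made for the example.

```python
import re
from collections import defaultdict


def audit_text(text, sources, separate_lists=False):
    """Return (values, counts) keyed by metadata field name.

    sources maps a source ID to its list of words and phrases.
    """
    values = defaultdict(list)
    counts = defaultdict(int)
    for source_id, phrases in sources.items():
        # One shared field when lists are combined, one field per ID otherwise.
        if separate_lists:
            field = f"X-Funnelback-Undesirable-Text-{source_id}"
        else:
            field = "X-Funnelback-Undesirable-Text"
        for phrase in phrases:
            # Assumption: whole-word, case-insensitive matching.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if hits:
                values[field].append(phrase)
                # The count is a total of all occurrences, duplicates included.
                counts[field + "-Count"] += hits
    return dict(values), dict(counts)
```

For the text "Very many people are very happy." and a weasel-words source containing many and very, the sketch records both words as values and a count of 3, because "very" occurs twice.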
Setting the key
Set this configuration key in the search package or data source configuration.
Use the configuration key editor to add or edit the filter.jsoup.undesirable_text-source.[key_name] key, and set the value. This can be set to any valid String value.
Default value
By default, content auditor uses a default-misspellings source (filter.jsoup.undesirable_text-source.default-misspellings) to populate the undesirable text report.
This source provides a list of commonly misspelled words in English based on Wikipedia’s list of common misspellings for machines.
The source can be set to a custom file, and additional sources can also be configured.
Examples
Use a custom misspellings file and include an additional source containing weasel words.
filter.jsoup.undesirable_text-source.default-misspellings=default-misspellings
filter.jsoup.undesirable_text-source.weasel-words=weasel-words
where the undesirable-text.weasel-words.cfg file contains:
many
various
very
fairly