filter.jsoup.undesirable_text-source.[key_name]

Background

This setting defines additional 'undesirable text' sources for the content auditor, so that HTML documents can also be checked for things such as:

  • offensive words

  • weasel words

  • plain English

  • organization-specific words to avoid

Each additional source is a text file containing a set of words and phrases. The file can be uploaded via the file manager.

The sources are linked by adding a configuration key for each file:

filter.jsoup.undesirable_text-source.ID=ID

where:

  • ID is a unique identifier for the source.
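
The key pattern above can be sketched in a few lines of Python. This is only an illustration of the naming convention: the helper name config_line is hypothetical and is not part of the product.

```python
# Prefix shared by every undesirable text source key.
KEY_PREFIX = "filter.jsoup.undesirable_text-source."


def config_line(source_id: str) -> str:
    """Return the 'key=value' line that links one undesirable text source.

    Hypothetical helper for illustration only.
    """
    source_id = source_id.strip()
    if not source_id:
        raise ValueError("source ID must not be empty")
    # The ID is used both as the key suffix and as the value.
    return f"{KEY_PREFIX}{source_id}={source_id}"


if __name__ == "__main__":
    for sid in ("default-misspellings", "weasel-words"):
        print(config_line(sid))
```

The resulting lines can then be entered in the configuration key editor.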

Setting the file name to an absolute path is deprecated. Old configuration that is imported into v16 must be updated to use a file name instead of a file path.

The file is a list of undesirable words and phrases, with one word or phrase per line.

Separate the words in a phrase with a single space character. Do not include HTML entities in the file; for example, use \u2014 instead of &mdash; where applicable.

Undesirable text files can be created from the administration dashboard file manager by creating or uploading an undesirable-text.*.cfg file. The file name should follow the format undesirable-text.ID.cfg.
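
Before uploading a list, it can help to check it against the format rules described above. The following Python sketch is one possible way to do that. The function names source_file_name and check_lines are hypothetical, and two behaviours are assumptions made for this illustration rather than documented rules: blank lines are silently skipped, and leading or trailing whitespace is reported as a problem.

```python
import re

# Matches named, decimal and hexadecimal HTML entities, e.g. &mdash; or &#8212;.
HTML_ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def source_file_name(source_id: str) -> str:
    """Build the file name for a source, following undesirable-text.ID.cfg."""
    return f"undesirable-text.{source_id}.cfg"


def check_lines(lines):
    """Return (line_number, problem) tuples for a list of words and phrases."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            # Assumption: blank lines are ignored rather than reported.
            continue
        if text != " ".join(text.split()):
            # Catches double spaces, tabs, and leading or trailing whitespace.
            problems.append((number, "words must be separated by a single space"))
        elif HTML_ENTITY.search(text):
            problems.append((number, "contains an HTML entity"))
    return problems


if __name__ == "__main__":
    sample = ["very", "a  lot", "pros &amp; cons", "kind of"]
    print(source_file_name("weasel-words"))
    for number, problem in check_lines(sample):
        print(f"line {number}: {problem}")
```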

If any of these words are detected when the HTML document is analyzed the detected word or phrase will be added as a value to the following metadata fields:

A count of the occurrences of all undesirable words found in the page will also be recorded. The count is a total of all the detected words, including duplicates. The count will be recorded in the following metadata field:

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the filter.jsoup.undesirable_text-source.[key_name] key, and set the value. This can be set to any valid String value.

Default value

Content auditor uses a default source, filter.jsoup.undesirable_text-source.default-misspellings, to populate the undesirable text report.

This source provides a list of commonly misspelled words in English based on Wikipedia’s list of common misspellings for machines.

The source can be set to a custom file, and additional sources can also be configured.

Examples

Use a custom misspellings file and also include an additional source containing weasel words.

filter.jsoup.undesirable_text-source.default-misspellings=default-misspellings
filter.jsoup.undesirable_text-source.weasel-words=weasel-words

where the undesirable-text.weasel-words.cfg file contains:

many
various
very
fairly
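
To illustrate how a count that includes duplicates behaves, the following Python sketch applies the weasel word list to a sample sentence. It is not the filter's actual matching logic: the function name count_undesirable is hypothetical, and case-insensitive, whole-word matching is an assumption made for this example.

```python
import re


def count_undesirable(text, phrases):
    """Return (detected phrases, total occurrences including duplicates)."""
    detected = []
    total = 0
    for phrase in phrases:
        # Assumption: whole-word, case-insensitive matching.
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        hits = len(pattern.findall(text))
        if hits:
            detected.append(phrase)
            total += hits  # every occurrence counts, not just the first
    return detected, total


weasel_words = ["many", "various", "very", "fairly"]
sample = "Many users were very, very happy with the fairly new release."

detected, total = count_undesirable(sample, weasel_words)
print(detected)  # phrases that occurred at least once
print(total)     # all occurrences, duplicates included
```

In this sample, "very" appears twice and contributes two to the total, while it is listed only once among the detected phrases.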