Defines the sources of undesirable text to detect and present within content auditor.
Can be set in: collection.cfg
This setting provides additional 'undesirable text' sources for use in content auditor, enabling analysis of HTML documents for other things such as:
organization-specific words to avoid
Each additional source is a text file containing a set of words and phrases that can be uploaded via the file manager.
The sources are linked by adding a configuration key for each file:
ID: a unique identifier for the source.
Setting the file name to an absolute path is deprecated. Old configuration that is imported into v16 must be updated to use a file name instead of a file path.
The format of the file at the given path is a list of undesirable word sequences, with one word or phrase per line.
Phrases should separate words with a single space character. Words should not include HTML entities; use the literal character where applicable, e.g. — rather than &mdash;.
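The formatting rules above can be checked before a file is uploaded. The following is a minimal sketch, not part of the product: the function name `normalize_entries` and the sample input are hypothetical, and it assumes the word list is available as a plain string. It decodes HTML entities into literal characters, collapses repeated whitespace into single spaces, and drops blank lines.

```python
import html
import re


def normalize_entries(raw_text):
    """Return cleaned entries, one word or phrase per list item.

    Hypothetical helper for illustration; not a Funnelback API.
    """
    entries = []
    for line in raw_text.splitlines():
        # Decode HTML entities such as &mdash; into literal characters.
        decoded = html.unescape(line)
        # Collapse runs of whitespace into single spaces and trim both ends.
        cleaned = re.sub(r"\s+", " ", decoded).strip()
        if cleaned:
            entries.append(cleaned)
    return entries


# Hypothetical sample input with a double space, an entity, and a blank line.
sample = "very  unique\nresearch &amp; development\n\n  fairly \n"
print("\n".join(normalize_entries(sample)))
```

Writing the returned entries back out joined by newlines gives a file with one word or phrase per line.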
Undesirable text files can be created from the administration dashboard file manager by creating or uploading an undesirable-text.*.cfg file. When creating the file, the name should follow that pattern, for example undesirable-text.weasel-words.cfg.
If any of these words or phrases are detected when the HTML document is analyzed, the detected word or phrase is added as a value to the following metadata fields:
A count of the occurrences of all undesirable words found in the page will also be recorded. The count is a total of all the detected words, including duplicates. The count will be recorded in the following metadata field:
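To illustrate how the detected values and the total count relate, here is a rough sketch under stated assumptions. It is not content auditor's actual implementation: the function name `find_undesirable` is hypothetical, and the whole-word, case-insensitive matching against already-extracted page text is an assumption made for the example.

```python
import re


def find_undesirable(text, entries):
    """Return (distinct entries found, total occurrences including duplicates).

    Hypothetical helper for illustration; matching rules are assumptions.
    """
    found = []
    total = 0
    for entry in entries:
        words = [re.escape(w) for w in entry.split()]
        if not words:
            continue  # skip blank entries
        # Assumption: match whole words only, ignoring case, and allow any
        # whitespace between the words of a phrase.
        pattern = r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            found.append(entry)
            total += hits
    return found, total


entries = ["many", "various", "very", "fairly"]
page_text = "There are very many options, and many of them are fairly good."
found, total = find_undesirable(page_text, entries)
print(found)  # ['many', 'very', 'fairly']
print(total)  # 4
```

In this sample, "many" occurs twice and is listed once among the detected values, while the count of 4 includes both occurrences.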
Content auditor uses a default-misspellings source, which is used to populate the undesirable text report.
This source provides a list of commonly misspelled words in English based on Wikipedia’s list of common misspellings for machines.
The source can be set to a custom file, and additional sources can also be configured.
Use a custom misspellings file, and also include an additional source containing weasel words.
The undesirable-text.weasel-words.cfg file contains:
many
various
very
fairly