Content auditor undesirable text report customization

The undesirable text report in content auditor identifies URLs that contain words considered undesirable. This includes common misspellings, and can be augmented with organisation-specific words, such as words to avoid under a company style policy, or industry-specific terms.

Funnelback uses Wikipedia’s common misspellings list to identify undesirable words. This list can be replaced or augmented with custom lists of terms.

Customization of undesirable text requires a full update of the data source to apply the changes.

Add additional words and phrases to the undesirable text report

Add individual words and phrases

If you have a small number of additional words to include in the content auditor undesirable text report, add them using individual configuration keys set in the data source configuration.

Add a list of words and phrases

If you have a list of additional words and phrases to include in the content auditor undesirable text report, add the list as an additional undesirable text source.

A full update is required. An incremental update is not sufficient because filter changes are not applied to content that is not downloaded during the update. After the update completes, return to the content auditor report for the collection and confirm that occurrences of the words added to the custom undesirable text file are now included in the words listed as undesirable text. Clicking one of the terms filters the report to show only pages containing the selected word.

Add additional undesirable text reporting as separate reports

Content auditor doesn’t currently support multiple undesirable text reports. However, the metadata reporting functionality within content auditor can be used to report on additional undesirable text sources.

To set up separate reporting for the additional sources:

  1. Add the additional sources as outlined above. See: Add additional words and phrases to the undesirable text report

  2. Configure the undesirable text filter to use the separated additional word lists. See: filter.jsoup.undesirable_text-separate-lists=true.

  3. Configure content auditor to display the additional word list metadata for the extra metadata fields (X-Funnelback-Undesirable-Text-ID). See: Content auditor overview and attributes metadata report customization

    The extra metadata fields will need to be mapped to Funnelback metadata classes using standard metadata mapping rules.
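
As a minimal sketch of step 2, the key named above would appear as a single line in the data source configuration. This assumes the key is set alongside the other filter settings for the data source. The configuration for the additional sources themselves (step 1) and the metadata mappings (step 3) are not shown here.

```
filter.jsoup.undesirable_text-separate-lists=true
```

As with the other customizations on this page, run a full update of the data source after changing this setting so that the filter is applied to all content.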
