Customize the content auditor undesirable text list

Background

The undesirable words report in content auditor identifies URLs containing words that are considered undesirable. By default these are common misspellings, but the list can be augmented with organisation-specific words, such as words to avoid under a company style policy, or industry-specific terms.

Funnelback uses Wikipedia’s common misspellings list to identify undesirable words. This list can be replaced or augmented with custom lists of terms.

Customization of undesirable text requires a full update of the data source to apply the changes.
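As a conceptual illustration only, the sketch below shows how a default list of misspellings combined with a custom list could flag pages. It is not Funnelback's implementation: the function name, the whole-word matching rule, the sample terms and the URLs are all assumptions made for the example.

```python
import re

def find_undesirable(text, terms):
    """Return the listed terms that occur in text as whole words (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return sorted(t for t in terms if t.lower() in words)

# Hypothetical sample data: two common misspellings plus one organisation-specific term.
default_terms = ["recieve", "seperate"]
custom_terms = ["synergy"]

# Hypothetical pages keyed by URL.
pages = {
    "https://example.com/a": "Please recieve our thanks for the synergy.",
    "https://example.com/b": "Nothing to report here.",
}

# Build a per-URL list of matched terms and keep only pages with at least one hit.
report = {url: find_undesirable(body, default_terms + custom_terms)
          for url, body in pages.items()}
flagged = {url: hits for url, hits in report.items() if hits}
print(flagged)
```

This simplified matcher only handles single-word terms; how content auditor tokenises and matches text in practice may differ.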

Process

  1. From the administration dashboard, select the desired data source and edit the data source configuration.

  2. Select Perl file manager from the settings section.

  3. Create a new undesirable-text.*.cfg file.

  4. Set the filename by editing the text field above the main content editor, replacing the * with a key (e.g. additional). Then add the list of undesirable terms to the file, one per line, and save it. This configures content auditor to identify pages that contain these words.

  5. Return to the data source configuration screen, select Edit the data source configuration, and add a new parameter as follows.

    Parameter key                             Key    Value
    filter.jsoup.undesirable_text-source.*    KEY    FILE_NAME

    Where KEY is the key name you specified in the file name (e.g. additional) and FILE_NAME is the full path to the file (e.g. $SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.additional.cfg).

  6. Run a full update of the data source.

    A full update is required. An incremental update is not sufficient because filter changes are not applied to content that is not downloaded again.
  7. After the update completes, return to the content auditor report for the collection and confirm that the words added to the custom undesirable text file now appear among the words listed as undesirable text. Clicking a term filters the report to only the pages containing that word.
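If the term list is maintained outside the administration dashboard, steps 3 to 5 could be prepared with a small script. The following is a minimal sketch under stated assumptions: the helper names `write_term_list` and `config_setting` are hypothetical, the sample terms are invented, and a temporary directory stands in for the data source's configuration folder. Only the file naming pattern, the parameter key and the example path come from the steps above.

```python
from pathlib import Path
import tempfile

# Parameter key prefix from step 5; the KEY suffix is appended to it.
CONFIG_PREFIX = "filter.jsoup.undesirable_text-source."

def write_term_list(conf_dir, key, terms):
    """Write undesirable-text.<key>.cfg with one term per line; return its path."""
    cleaned = []
    for term in terms:
        term = term.strip()
        if term and term not in cleaned:  # skip blanks and duplicates, keep order
            cleaned.append(term)
    path = Path(conf_dir) / f"undesirable-text.{key}.cfg"
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return path

def config_setting(key, file_name):
    """Return the (parameter, value) pair to add to the data source configuration."""
    return CONFIG_PREFIX + key, file_name

# Demo: a temporary directory stands in for the real configuration folder (assumption).
with tempfile.TemporaryDirectory() as conf_dir:
    path = write_term_list(conf_dir, "additional", ["utilise", " leverage ", "", "utilise"])
    file_contents = path.read_text(encoding="utf-8")

param, value = config_setting(
    "additional",
    "$SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.additional.cfg",
)
print(file_contents, end="")
print(f"{param}={value}")  # key=value layout is for readability only
```

The script only prepares the file contents and the parameter to enter; the parameter still has to be added to the data source configuration and a full update run, as described in steps 5 and 6.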

© 2015- Squiz Pty Ltd