Plugin: Clean title

Purpose

This plugin removes segments of a title that match user-defined regular expressions supplied via configuration.

This plugin supports two types of usage:

  • Cleaning the search result title: When enabled on a results page, the plugin modifies the value of the result.title data model element. Use the plugin on a results page if you only need to update the search result titles (regardless of the underlying data source type). Note: sorting by title may be incorrect with this method because the titles are cleaned after the result set has been sorted. This happens when a pattern changes the start of some titles but does not match every search result title.

  • Cleaning the title in HTML source data: When enabled on a data source, the plugin cleans the contents of the HTML <title> element. Note: this only applies to HTML documents. The advantage of modifying the source data is that the cleaned title is what gets indexed, so sorting by title works correctly when the search result title is based on the HTML <title> value.
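
The sorting caveat for results-page cleaning can be illustrated with a minimal sketch. The titles below are hypothetical, and Python's re module stands in for the plugin's own regex handling, so this models only the order of operations (sort first, clean afterwards), not the plugin's implementation.

```python
import re

# Hypothetical titles; only one of them carries the "ExampleOrg - " prefix.
titles = ["Apple pie", "ExampleOrg - Zebra facts", "Mango salad"]

# The result set is sorted by the original (uncleaned) titles first...
sorted_titles = sorted(titles)

# ...and the prefix is only removed afterwards, when results are rendered.
displayed = [re.sub(r"^ExampleOrg -\s+", "", t) for t in sorted_titles]

print(displayed)          # ['Apple pie', 'Zebra facts', 'Mango salad']
print(sorted(displayed))  # ['Apple pie', 'Mango salad', 'Zebra facts']
```

The displayed list is no longer in alphabetical order because "Zebra facts" was positioned according to its original "ExampleOrg - " prefix. Cleaning the title at the data source avoids this, since the cleaned value is the one that is indexed.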

Usage

Cleaning the search result title

  1. Enable the clean-title plugin on your results page from the Extensions screen in the administration dashboard, or add the following results page configuration to enable the plugin:

    plugin.clean-title.enabled=true
    plugin.clean-title.version=1.0.0
  2. Configure the plugin by setting the following option in the results page configuration:

    • plugin.clean-title.config.regex.<name>=<REGEX PATTERN>: Removes strings within the data model’s result.title element that match the supplied regular expression. The <name> part of the key must be unique, which allows multiple regex patterns to be defined. If multiple regex keys are defined, they are applied in sequence, ordered alphabetically by name. For example, if you have the three keys plugin.clean-title.config.regex.orange, plugin.clean-title.config.regex.apple and plugin.clean-title.config.regex.pear, the regexes are applied in the following order: plugin.clean-title.config.regex.apple, plugin.clean-title.config.regex.orange, plugin.clean-title.config.regex.pear.

The plugin will take effect immediately when enabled on a results page.
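
The ordering rule can be sketched as follows. This is an illustrative model only: the clean_title and application_order helpers, the three patterns and the sample title are all hypothetical, Python's re module stands in for the plugin's regex engine, and plain string sorting is assumed to match the plugin's alphabetical ordering.

```python
import re

PREFIX = "plugin.clean-title.config.regex."

def application_order(config):
    """Return the regex keys in the order they would be applied (alphabetical)."""
    return sorted(key for key in config if key.startswith(PREFIX))

def clean_title(title, config):
    """Remove every match of each configured pattern, one pattern at a time."""
    for key in application_order(config):
        title = re.sub(config[key], "", title)
    return title

# Hypothetical patterns, declared in a different order from the one they run in.
config = {
    PREFIX + "orange": r"^ExampleOrg -\s+",
    PREFIX + "apple": r"^Draft:\s+",
    PREFIX + "pear": r"\s+\|\s+Home$",
}

print([key[len(PREFIX):] for key in application_order(config)])
print(clean_title("ExampleOrg - Draft: Report | Home", config))
```

In this sketch the apple pattern runs first and matches nothing, because the title still starts with "ExampleOrg - " at that point, so the result is "Draft: Report" rather than "Report". Choosing names that sort in the intended order (for example 01-prefix, 02-draft) is one way to keep dependent patterns predictable.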

Cleaning the title in HTML source data

When used on a data source, only HTML documents are modified.

  1. Enable the clean-title plugin on your data source from the Extensions screen in the administration dashboard, or add the following data source configuration to enable the plugin:

    plugin.clean-title.enabled=true
    plugin.clean-title.version=1.0.0
  2. Add the clean title filter to the jsoup filter chain:

    filter.jsoup.classes=<OTHER-JSOUP-FILTERS>,com.funnelback.plugin.cleantitle.CleanTitleFilter,<OTHER-JSOUP-FILTERS>
    The clean title filter should be placed at an appropriate position in the filter chain (which applies the filters from left to right). In most circumstances this should be located towards the end of the jsoup filter chain.
  3. Configure the plugin by setting the following option in the data source configuration:

    • plugin.clean-title.config.regex.<name>=<REGEX PATTERN>: Removes strings within an HTML document’s <title> element that match the supplied regular expression. The <name> part of the key must be unique, which allows multiple regex patterns to be defined. If multiple regex keys are defined, they are applied in sequence, ordered alphabetically by name. For example, if you have the three keys plugin.clean-title.config.regex.orange, plugin.clean-title.config.regex.apple and plugin.clean-title.config.regex.pear, the regexes are applied in the following order: plugin.clean-title.config.regex.apple, plugin.clean-title.config.regex.orange, plugin.clean-title.config.regex.pear.

  4. Run a full update of the data source for any changes to apply. Note: a full update is required because all of your documents must be re-gathered and filtered for the changes to take effect. If you are using this with a push data source, you will need to resubmit every document to which the new filter should be applied.
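
As a rough illustration of what data source cleaning changes, the sketch below strips a prefix from the <title> element of a hypothetical page and leaves the rest of the document alone. It is not the CleanTitleFilter implementation: the real filter runs inside the jsoup filter chain, whereas this stand-in uses Python's re module and a simple pattern to find the title, and the clean_html_title helper is an assumption made for the example.

```python
import re

# Simplified stand-in for locating the <title> element in the document.
TITLE_RE = re.compile(r"(<title[^>]*>)(.*?)(</title>)", re.IGNORECASE | re.DOTALL)

def clean_html_title(html, patterns):
    """Strip every match of each pattern from the first <title> element only."""
    def _clean(match):
        text = match.group(2)
        for pattern in patterns:
            text = re.sub(pattern, "", text)
        return match.group(1) + text + match.group(3)
    return TITLE_RE.sub(_clean, html, count=1)

# Hypothetical page whose <title> and <h1> both carry the prefix.
page = (
    "<html><head><title>ExampleOrg - About us</title></head>"
    "<body><h1>ExampleOrg - About us</h1></body></html>"
)
cleaned = clean_html_title(page, [r"^ExampleOrg -\s+"])
print(cleaned)
```

Only the <title> text changes; the <h1> keeps its prefix. If search result titles are drawn from somewhere other than the <title> element, cleaning on the results page may be the more suitable option.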

Example: Clean a prefix and suffix from search result titles

This example applies to both use cases outlined above.

Suppose we have titles like:

ExampleOrg - Page title (www.example.com)

Many pages have titles that are prefixed with ExampleOrg - and end with the suffix (www.example.com).

You would like the Page title to be displayed as the hyperlinked title in your search results.

This could be achieved by setting the following configuration keys:

plugin.clean-title.config.regex.generic-prefix=^ExampleOrg -\s+
plugin.clean-title.config.regex.generic-suffix=\s+\(www\.example\.com\)$

This runs each of the regexes on the result title or <title> element, removing both the prefix and the suffix.

The names generic-prefix and generic-suffix are arbitrary, but remember that the names determine the order in which the patterns are applied.
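
The effect of the two keys on the sample title can be traced step by step. This is a minimal sketch that assumes Python's re module treats these two patterns the same way the plugin's regex engine does; it is meant as a quick way to try patterns before adding them to configuration, not as a reproduction of the plugin.

```python
import re

# The two keys from the example, deliberately listed out of order.
config = {
    "plugin.clean-title.config.regex.generic-suffix": r"\s+\(www\.example\.com\)$",
    "plugin.clean-title.config.regex.generic-prefix": r"^ExampleOrg -\s+",
}

title = "ExampleOrg - Page title (www.example.com)"

# Apply the patterns in alphabetical key order: generic-prefix, then generic-suffix.
for key in sorted(config):
    title = re.sub(config[key], "", title)
    print(key.rsplit(".", 1)[-1], "->", repr(title))

# generic-prefix -> 'Page title (www.example.com)'
# generic-suffix -> 'Page title'
```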

© 2015- Squiz Pty Ltd