HTML document (jsoup) filters

HTML document filters, or jsoup filters, are used to make modifications to the HTML document structure, or perform operations that select and transform the document’s DOM or content. Custom jsoup filters can be written to perform operations such as:

  • Cleaning page titles by removing the site name.

  • Scraping content (e.g. extracting breadcrumbs to metadata).
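The two operations above can be illustrated with a minimal, language-neutral sketch. It is written in Python using only the standard library and is not Funnelback or jsoup code: the names `clean_title` and `BreadcrumbScraper`, the `breadcrumb` class name and the separator list are all assumptions made for illustration. A real filter would operate on the jsoup document object instead.

```python
from html.parser import HTMLParser


def clean_title(title, site_name, separators=(" | ", " - ", " : ")):
    """Remove a trailing site name (e.g. ' | Example University') from a title."""
    for sep in separators:
        suffix = sep + site_name
        if title.endswith(suffix):
            return title[: -len(suffix)].strip()
    return title.strip()


class BreadcrumbScraper(HTMLParser):
    """Collect link text found inside an element with class 'breadcrumb'.

    Assumption: the breadcrumb markup contains no void elements (such as
    <br> or <img>), because depth tracking relies on matching end tags.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside the breadcrumb element
        self.in_link = False  # True while inside an <a> within the breadcrumb
        self.crumbs = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "breadcrumb" in classes:
            self.depth += 1
        if self.depth and tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.crumbs.append(data.strip())


# Usage with hypothetical page content.
print(clean_title("Courses | Example University", "Example University"))  # Courses

scraper = BreadcrumbScraper()
scraper.feed(
    '<ul class="breadcrumb"><li><a href="/">Home</a></li>'
    '<li><a href="/study">Study</a></li></ul>'
    '<p><a href="/x">Other</a></p>'
)
print(scraper.crumbs)  # ['Home', 'Study']
```

The scraped breadcrumb values would then be recorded as metadata on the document rather than printed.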

These jsoup filters should not be confused with the generic document filters, which apply transformations to the raw document. In fact, jsoup filters are run by a generic document filter, JsoupFilterProvider, which converts a document containing HTML code into a structured jsoup object representing the HTML document. This object is then manipulated by a series of configurable HTML document filters before being converted back to a text document containing HTML code, which is passed on to the next generic document filter. The following diagram shows the generic filter chain and the jsoup filters that run when the generic JsoupFilterProvider filter runs.

the-filter-chain-01.png

The jsoup filter chain

The HTML document filters through which the HTML object is passed make up the jsoup filter chain, which is configured via the filter.jsoup.classes configuration option.

The jsoup filter chain is a comma-separated list of jsoup filter class names which are applied only to HTML documents.

The jsoup filters that make up the chain are chosen from:

  • Built-in jsoup filters: jsoup filters that are part of the core Funnelback product.

  • Plugins: filters that are provided by enabling specific Funnelback plugins.

  • Custom Groovy jsoup filters: User-defined Groovy jsoup filters that implement custom filter logic.

The set of filters below would be processed in order: the content would pass through JsoupFilter1, then JsoupFilter2, then JsoupFilter3.

filter.jsoup.classes=JsoupFilter1,JsoupFilter2,JsoupFilter3

Filtering runs as part of the gather process and must be configured on the collection that gathers the content (and not on a meta collection).
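The ordering behaviour described above can be sketched as follows. This is an illustrative Python model only, not Funnelback's implementation: `parse_chain`, `run_chain`, `make_tracer` and the registry of filters are hypothetical names, and the "document" is a plain dictionary standing in for the parsed HTML object that each filter receives in turn.

```python
from typing import Callable, Dict, List

# Toy stand-in for the parsed HTML object that is handed from filter to filter.
Document = Dict[str, list]
JsoupFilter = Callable[[Document], None]


def parse_chain(value: str) -> List[str]:
    """Split a filter.jsoup.classes value into an ordered list of class names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def run_chain(document: Document, chain: List[str],
              registry: Dict[str, JsoupFilter]) -> Document:
    """Apply each named filter to the same document object, in list order."""
    for name in chain:
        if name not in registry:
            raise KeyError(f"Unknown jsoup filter: {name}")
        registry[name](document)
    return document


def make_tracer(name: str) -> JsoupFilter:
    """Build a dummy filter that only records that it ran."""
    def apply(document: Document) -> None:
        document["trace"].append(name)
    return apply


chain = parse_chain("JsoupFilter1,JsoupFilter2,JsoupFilter3")
registry = {name: make_tracer(name) for name in chain}
result = run_chain({"trace": []}, chain, registry)
print(result["trace"])  # ['JsoupFilter1', 'JsoupFilter2', 'JsoupFilter3']
```

Because every filter sees the changes made by the filters before it, the order of the class names in the list matters.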

Built-in jsoup filters

Funnelback ships with a number of built-in jsoup filters, which are used to produce the metadata required to build the content auditor and accessibility auditor reports:

Class name                    Description

MetadataScraper               Scrapes content and injects it as metadata.

ContentGeneratorUrlDetection  Detects additional URLs for the given content based on the site generator (e.g. CMS specific edit links).

FleschKincaidGradeLevel       Estimates how readable the document is, and records the estimate with the document.

UndesirableText               Detects occurrences of configured undesirable text and records them for content auditor to report upon.

TitleDuplicates               Detects occurrences of duplicate page titles for content auditor to report upon.
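To give a sense of what a readability estimate such as the one recorded by FleschKincaidGradeLevel involves, the sketch below applies the standard published Flesch-Kincaid grade level formula. It is a rough Python illustration, not Funnelback's implementation: the function names, the sentence splitting and the vowel-group syllable heuristic are assumptions, and the built-in filter may tokenise text differently.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, minimum of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Standard formula: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0  # nothing to score
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


easy = "The cat sat. The dog ran."
hard = "Comprehensive documentation facilitates understanding."
print(flesch_kincaid_grade(easy) < flesch_kincaid_grade(hard))  # True
```

Short sentences made of short words produce a low grade level, while long sentences with many multi-syllable words produce a high one.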

Plugins

Funnelback plugins can provide additional jsoup filters. When using a plugin, always follow the instructions in the plugin’s readme file.

Configure Funnelback to use a jsoup filter that is included as part of a plugin by:

  1. Enabling the plugin.

  2. Adding the class name, as detailed in the plugin’s readme, to the filter.jsoup.classes configuration option.

For example, to use only the example plugin jsoup filter with class name com.example.customPluginJsoupFilter:

filter.jsoup.classes=com.example.customPluginJsoupFilter

For example, to use the example plugin jsoup filter with class name com.example.customPluginJsoupFilter in addition to jsoup filters that are already configured, append it to the end of the existing filter.jsoup.classes value. The built-in filters shown here are illustrative; keep whatever the collection already lists:

filter.jsoup.classes=ContentGeneratorUrlDetection,FleschKincaidGradeLevel,UndesirableText,com.example.customPluginJsoupFilter

This runs the existing jsoup filters first, then passes the HTML document to the custom plugin filter.
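When adding a plugin filter to a filter.jsoup.classes value that already lists other jsoup filters, the new class name goes at the end of the comma-separated list. The small Python helper below sketches that edit; `append_filter` is a hypothetical name used only for illustration and is not part of Funnelback.

```python
def append_filter(current: str, class_name: str) -> str:
    """Return the comma-separated chain with class_name appended once."""
    names = [n.strip() for n in current.split(",") if n.strip()]
    if class_name not in names:
        names.append(class_name)  # new filters run after the existing ones
    return ",".join(names)


print(append_filter("JsoupFilter1,JsoupFilter2",
                    "com.example.customPluginJsoupFilter"))
# JsoupFilter1,JsoupFilter2,com.example.customPluginJsoupFilter
```

Checking for an existing entry keeps the same filter from being listed, and therefore run, twice.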

Custom Groovy jsoup filters

Custom Groovy jsoup filters are only available when using Funnelback Server.

Custom Groovy jsoup filters can be installed to be used with a collection by:

  1. Adding the custom filter files (including any dependencies) and sub-folder structure to the collection’s @groovy folder.

  2. Adding the class name, as detailed in the custom filter’s documentation, to the filter.jsoup.classes configuration option. The class names used by custom Groovy jsoup filters are similar to the class names used by plugin filters.

This requires backend (SSH or server desktop) access to the Funnelback server to manage files and sub-folders added to the @groovy folder.

Shared Groovy filters

Custom Groovy jsoup filters are only available when using Funnelback Server.

Custom Groovy jsoup filters can also be shared between all collections on a server by installing the filter into the global groovy folder.

Custom Groovy jsoup filters can be installed globally by:

  1. Adding the custom filter files (including any dependencies) and sub-folder structure to $SEARCH_HOME/lib/java/groovy.

  2. Adding the class name, as detailed in the custom filter’s documentation, to the filter.jsoup.classes configuration option. The class names used by custom Groovy jsoup filters are similar to the class names used by plugin filters.

This requires backend (SSH or server desktop) access to the Funnelback server to manage files and sub-folders added to the global groovy folder.