HTML document (jsoup) filters
HTML document filters, or jsoup filters, are used to make modifications to the HTML document structure, or perform operations that select and transform the document’s DOM or content. Custom jsoup filters can be written to perform operations such as:
-
Cleaning page titles by removing the site name.
-
Scraping content (e.g. extracting breadcrumbs to metadata).
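The two operations above can be sketched in a few lines. This is a conceptual illustration only: it uses Python's standard-library XML parser as a stand-in for jsoup, and the sample page, the `SITE_SUFFIX` value, the `breadcrumbs` class and both function names are assumptions made for the example, not part of Funnelback.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; kept well-formed so the stdlib XML parser can read it.
PAGE = """<html>
  <head><title>Contact us | Example University</title></head>
  <body>
    <ul class="breadcrumbs"><li>Home</li><li>About</li><li>Contact us</li></ul>
    <p>Call us on weekdays.</p>
  </body>
</html>"""

SITE_SUFFIX = " | Example University"  # assumed site-name suffix


def clean_title(root):
    """Remove the site name from the end of the <title> text, if present."""
    title = root.find("./head/title")
    if title is not None and title.text and title.text.endswith(SITE_SUFFIX):
        title.text = title.text[: -len(SITE_SUFFIX)]


def scrape_breadcrumbs(root, metadata):
    """Copy breadcrumb labels into a metadata dict without altering the page."""
    items = root.findall(".//ul[@class='breadcrumbs']/li")
    crumbs = [li.text.strip() for li in items if li.text]
    if crumbs:
        metadata["breadcrumbs"] = " > ".join(crumbs)


root = ET.fromstring(PAGE)
metadata = {}
clean_title(root)
scrape_breadcrumbs(root, metadata)
print(root.findtext("./head/title"))  # Contact us
print(metadata)  # {'breadcrumbs': 'Home > About > Contact us'}
```

The first function modifies the document itself, while the second leaves the markup alone and only records extra metadata; a real filter may do either or both.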
These jsoup filters should not be confused with the generic document filters, which apply transformations to the raw document. In fact, jsoup filters are run by a generic document filter, JsoupFilterProvider, which converts a document containing HTML code into a structured jsoup object representing the HTML document. This object is then manipulated by a series of configurable HTML document filters before being converted back to a text document containing HTML code, which is passed on to the next generic document filter. The following diagram shows the generic filter chain and the jsoup filters that run when the generic JsoupFilterProvider filter runs.
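The parse, filter and serialise flow described above might look like the following sketch. It is not Funnelback's implementation: the standard-library XML parser stands in for jsoup, and `html_filter_provider`, `drop_nav` and `uppercase_headings` are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET


def drop_nav(root):
    """Example HTML document filter: removes <nav> elements in place."""
    for parent in list(root.iter()):
        for nav in parent.findall("nav"):
            parent.remove(nav)


def uppercase_headings(root):
    """Example HTML document filter: upper-cases the text of <h1> elements."""
    for h1 in root.iter("h1"):
        if h1.text:
            h1.text = h1.text.upper()


def html_filter_provider(html_text, html_filters):
    """Stand-in for the generic filter: text in, text out, one parse in between."""
    root = ET.fromstring(html_text)  # text -> structured object
    for html_filter in html_filters:  # every filter works on the same object
        html_filter(root)
    return ET.tostring(root, encoding="unicode")  # structured object -> text


raw = "<html><body><nav>Menu</nav><h1>Welcome</h1></body></html>"
result = html_filter_provider(raw, [drop_nav, uppercase_headings])
print(result)  # <html><body><h1>WELCOME</h1></body></html>
```

The point of the design is that the document is parsed once and serialised once, however many HTML document filters sit in between.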
The jsoup filter chain
The HTML document filters through which the HTML object is passed are set in the jsoup filter chain, which is configured via the filter.jsoup.classes configuration option.
The jsoup filter chain is a comma-separated list of jsoup filter class names which are applied only to HTML documents.
The jsoup filters that make up the chain are chosen from:
-
Built-in jsoup filters: jsoup filters that are part of the core Funnelback product.
-
Plugins: filters that are provided by enabling specific Funnelback plugins.
-
Custom Groovy jsoup filters: User-defined Groovy jsoup filters that implement custom filter logic.
The set of filters below would be processed as follows: the content would pass through JsoupFilter1 before being passed on to JsoupFilter2 and then JsoupFilter3.
filter.jsoup.classes=JsoupFilter1,JsoupFilter2,JsoupFilter3
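To make the ordering concrete, the sketch below splits a filter.jsoup.classes line into an ordered list and applies one stand-in filter per name. The helper names `parse_filter_chain` and `make_filter` are hypothetical, and the stand-in filters only record their own name; this models the ordering, not how Funnelback loads filter classes.

```python
def parse_filter_chain(config_line):
    """Split 'filter.jsoup.classes=A,B,C' into an ordered list of class names."""
    key, _, value = config_line.partition("=")
    if key.strip() != "filter.jsoup.classes":
        raise ValueError("unexpected key: " + key.strip())
    return [name.strip() for name in value.split(",") if name.strip()]


def make_filter(name):
    """Hypothetical stand-in filter that records its name on the document."""
    def apply(document):
        document["applied"].append(name)
        return document
    return apply


chain = parse_filter_chain(
    "filter.jsoup.classes=JsoupFilter1,JsoupFilter2,JsoupFilter3"
)
document = {"applied": []}
for name in chain:
    # The output of each filter becomes the input of the next one.
    document = make_filter(name)(document)

print(document["applied"])  # ['JsoupFilter1', 'JsoupFilter2', 'JsoupFilter3']
```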
Filtering runs as part of the gather process and must be configured on the collection that gathers the content (and not on a meta collection).
Built-in jsoup filters
Funnelback ships with a number of built-in jsoup filters configured, which are used to produce the metadata required to build the content and accessibility auditor reports:
Class name | Description |
---|---|
| Scrapes content and injects it as metadata. |
| Detects additional URLs for the given content based on the site generator (e.g. CMS-specific edit links). |
| Estimates how readable the document is, and records the estimate with the document. |
| Detects occurrences of configured undesirable text and records them for content auditor to report upon. |
| Detects occurrences of duplicate page titles for content auditor to report upon. |
Plugins
Funnelback plugins can provide additional jsoup filters. When using a plugin always follow the instructions as outlined in the plugin’s readme file.
Configure Funnelback to use a jsoup filter that is included as part of a plugin by:
-
Adding the class name, as detailed in the plugin’s readme, to the filter.jsoup.classes configuration option.
For example, to use only the example plugin jsoup filter with class name com.example.customPluginJsoupFilter:
filter.jsoup.classes=com.example.customPluginJsoupFilter
To use the example plugin jsoup filter with class name com.example.customPluginJsoupFilter in addition to jsoup filters that are already configured, append it to the existing filter.jsoup.classes value. Using the JsoupFilter1,JsoupFilter2,JsoupFilter3 chain from the earlier example:
filter.jsoup.classes=JsoupFilter1,JsoupFilter2,JsoupFilter3,com.example.customPluginJsoupFilter
This runs the existing jsoup filters in the order listed, then feeds the resulting HTML document into the custom plugin filter.
Custom Groovy jsoup filters
Custom Groovy jsoup filters are only available when using Funnelback Server.
Custom Groovy jsoup filters can be installed to be used with a collection by:
-
Adding the custom filter files (including any dependencies) and sub-folder structure to the collection’s @groovy folder.
-
Adding the class name, as detailed in the custom filter’s documentation, to the filter.jsoup.classes configuration option. The class names used by custom Groovy jsoup filters are similar to the class names used by plugin filters.
This requires backend (SSH or server desktop) access to the Funnelback server to manage files and sub-folders added to the @groovy folder.
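The "sub-folder structure" mentioned above can be illustrated with a small helper that maps a class name to the file path it would normally occupy. This sketch assumes the usual Groovy/Java convention that each package segment becomes a sub-folder; that convention, the function name `groovy_source_path` and the class name `com.example.CleanTitleFilter` are assumptions for illustration, so check the custom filter's own documentation for the layout it expects.

```python
from pathlib import PurePosixPath


def groovy_source_path(groovy_folder, class_name):
    """Map a fully qualified class name to its expected .groovy file path.

    Assumes one sub-folder per package segment (standard Groovy/Java layout).
    """
    *packages, simple_name = class_name.split(".")
    return PurePosixPath(groovy_folder, *packages, simple_name + ".groovy")


# Hypothetical class name, used only to show the mapping.
path = groovy_source_path("@groovy", "com.example.CleanTitleFilter")
print(path)  # @groovy/com/example/CleanTitleFilter.groovy
```

Under this assumption, the value added to filter.jsoup.classes is the dotted class name, while the file itself sits in the matching nested folders.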
Shared Groovy filters
Custom Groovy jsoup filters are only available when using Funnelback Server.
Custom Groovy jsoup filters can also be shared between all collections on a server by installing the filter into the global groovy folder.
Custom Groovy jsoup filters can be installed globally by:
-
Adding the custom filter files (including any dependencies) and sub-folder structure to $SEARCH_HOME/lib/java/groovy.
-
Adding the class name, as detailed in the custom filter’s documentation, to the filter.jsoup.classes configuration option. The class names used by custom Groovy jsoup filters are similar to the class names used by plugin filters.
This requires backend (SSH or server desktop) access to the Funnelback server to manage files and sub-folders added to the global groovy folder.