Configuring document filtering
Introduction
Filtering is the process of transforming gathered content into content suitable for indexing by Funnelback.
Filtering passes the raw document through multiple chained filters, each of which may modify the document in a different way. These modifications may include converting the document from a binary format such as PDF to an indexable text format such as HTML, or modifying the document itself, for example by adding metadata or altering the document's URL.
Filtering is run during the gather phase of an update. For push collections, filtering can be run when a document is added.
A full update is required after making any changes to filters, as documents that are copied during an incremental update are not re-filtered. Full updates are started from the advanced update screen.
Generic document and HTML document (jsoup) filters
Funnelback supports two main types of filters:
- Generic document filters operate on the document as a whole, taking the complete document as input, applying some sort of transformation, then outputting a complete document which can then be fed into another filter. A generic document filter treats the document as either a binary byte stream or an unstructured blob of text.
- HTML document filters (jsoup filters) are special filters that apply only to HTML documents. An HTML document filter handles the HTML as a structured object, allowing precise and complex manipulation of the HTML document. HTML document (jsoup) filters are only run when the JSoupProcessingFilterProvider is included in the filter chain.
The filter chain
The filter chain specifies the set of generic document filters that will be applied to a document after it is gathered and before it is indexed. The contents of the gathered document pass through each filter in turn, with the modified output being passed on to the input of the next filter.
A typical filter chain is shown below. A binary document is converted to text using the Tika filters. This extracts the document text and outputs the document as HTML. This HTML is then passed through the JSoup filter (see the HTML document filters section below), which enables targeted modification of the HTML content and structure. Finally, a custom filter performs a number of modifications to the content.
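The idea of passing a document through successive filters can be illustrated with a minimal Python sketch. This is not Funnelback code: the `Document` type, the three filter functions and `run_chain` are hypothetical names invented for illustration, and the "conversion" is a trivial stand-in for what Tika actually does.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Document:
    """Hypothetical, simplified document: real gathered content may be binary."""
    url: str
    mime_type: str
    content: str


# A filter takes a document and returns a (possibly modified) document.
Filter = Callable[[Document], Document]


def pdf_to_html(doc: Document) -> Document:
    """Stand-in for a binary-to-HTML conversion; only acts on PDF input."""
    if doc.mime_type != "application/pdf":
        return doc
    html = f"<html><body>{doc.content}</body></html>"
    return replace(doc, mime_type="text/html", content=html)


def add_metadata(doc: Document) -> Document:
    """Stand-in for an HTML-aware step; only acts on HTML input."""
    if doc.mime_type != "text/html":
        return doc
    head = '<html><head><meta name="source" content="example"></head>'
    return replace(doc, content=doc.content.replace("<html>", head, 1))


def rewrite_url(doc: Document) -> Document:
    """Stand-in for a custom filter that alters the document's URL."""
    return replace(doc, url=doc.url.replace("http://", "https://", 1))


def run_chain(doc: Document, filters: List[Filter]) -> Document:
    """Feed the output of each filter into the input of the next."""
    for apply_filter in filters:
        doc = apply_filter(doc)
    return doc


raw = Document("http://example.com/report.pdf", "application/pdf", "Annual report text")
result = run_chain(raw, [pdf_to_html, add_metadata, rewrite_url])
print(result)
```

Each step sees only the output of the previous step, which is why the order of filters in the chain matters.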

Configuring the filter chain (filter.classes)
During the filter phase the document passes through a series of generic document filters, with the modified output of each filter being passed on to the next. This series of filters is referred to as the filter chain and is set using the filter.classes configuration option.
Each filter is represented by a Java class name, with filters separated by either a comma or a colon to denote a choice or a chain respectively (see below).
The filters that make up the chain are chosen from:
- Built-in filters: filters that are part of the core Funnelback product.
- Plugins: filters that are provided by enabling specific Funnelback plugins.
- Custom Groovy filters: user-defined Groovy filters that implement custom filter logic.
Choices and chains
The filter chain is made up of a set of filters that are executed in the order defined in the filter.classes configuration option.
The order is made up of choice and chain elements, indicated by the two types of delimiters.
Filters are chained together by separating the filters with a colon. The filters separated by colons are run in order from left to right. e.g. filter.classes=Filter1:Filter2 will result in filters being applied in the following order: Filter1 followed by Filter2.
It is possible to provide a set of filters for one step in the chain, of which at most one will run as part of that step. A choice is made between the filters in this group (which are separated with commas), with the order of preference being the reverse of the order in which they are listed. e.g. a choice indicated by Filter1,Filter2 would ensure that either Filter2 or Filter1 was chosen for this part of the chain. Filter2 will run in preference to Filter1, subject to the rules around when a filter can run (e.g. some filters only run under specific conditions, such as only for HTML documents).
For example, the filter chain Filter1,Filter2,Filter3:Filter4:Filter5 would be processed as follows:
The content would pass through either Filter3, Filter2, Filter1 or none of these (checked in that order) before passing through Filter4 and Filter5.
The diagram below shows this filter chain graphically:

NOTE:
- Choice sets are checked in reverse order, i.e. filters that appear last in the list will be used first if they are capable of filtering a given document type.
- When specifying a combination of choice and chain filters, ',' has higher precedence than ':'. In other words, it is possible to have a chain of choice filters, but it is not possible to have a choice between several chains of filters.
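The choice and chain rules above can be sketched in a few lines of Python. This is an illustration of the described behaviour, not Funnelback's implementation: `parse_filter_classes`, `select_filters` and the `can_run` callback are hypothetical names, and the assumption that Filter3 cannot handle the document is made up for the example.

```python
from typing import Callable, List


def parse_filter_classes(value: str) -> List[List[str]]:
    """Split a filter.classes value into chain steps (':'),
    each of which is a list of choices (',')."""
    steps = []
    for step in value.split(":"):
        choices = [name.strip() for name in step.split(",") if name.strip()]
        if choices:
            steps.append(choices)
    return steps


def select_filters(steps: List[List[str]], can_run: Callable[[str], bool]) -> List[str]:
    """For each step, pick at most one filter, checking choices in reverse order."""
    selected = []
    for choices in steps:
        for name in reversed(choices):
            if can_run(name):
                selected.append(name)
                break  # at most one filter runs per step
    return selected


steps = parse_filter_classes("Filter1,Filter2,Filter3:Filter4:Filter5")
print(steps)
# [['Filter1', 'Filter2', 'Filter3'], ['Filter4'], ['Filter5']]

# Assumption for this example only: Filter3 cannot handle the document.
unable = {"Filter3"}
chosen = select_filters(steps, lambda name: name not in unable)
print(chosen)
# ['Filter2', 'Filter4', 'Filter5']
```

Splitting on ':' first and ',' second mirrors the precedence rule: a chain is a sequence of choice groups, never a choice between chains.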
Default filter chain
The default filter chain runs the following built-in filters:
filter.classes=TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
Which translates to:
- Run Tika or an external filter (to convert a binary document to text).
- Pass the extracted text through the jsoup filter (which manipulates HTML documents). The default jsoup filters run a number of content auditing tasks. See: HTML document (jsoup) filters.
- Pass the manipulated HTML document through the document title fixer.
Built-in filters
Funnelback ships with the following built-in filters.
Configure Funnelback to use a built-in filter by adding the class name to the filter.classes configuration option.
e.g. filter.classes=ForceCSVMime:CSVToXML replaces the default filter chain with the two built-in filters for handling CSV indexing.
| Class name | Description |
|---|---|
| | Converts records in a CSV, TSV, SQL or Excel document to multiple XML documents. |
| | Analyses the document title and attempts to replace it if the title is not considered a good title. HTML documents only. |
| | Uses external programs to convert documents. |
| | Sets the MIME type of all documents to |
| | Sets the MIME type of all documents to |
| | Sets the MIME type of all documents to |
| | Automatically inserts noindex tags based on CSS selectors. |
| | Converts JSON documents to XML. |
| | Converts HTML documents to and from a jsoup object and runs an extra chain of jsoup filters. |
| | Used to normalise and replace metadata fields. |
| | Converts binary files of specific file formats (Microsoft Office files, PDF files, etc.) to HTML using Apache Tika. |
| | Used for detecting whether a URL contains textual content. Used by the Content Auditor. |
| | Used for generic filtering workflows (e.g. inserting metadata based on URL patterns, performing string replacements, etc.) |
Plugins
Funnelback plugins can provide additional filters. When using a plugin always follow the instructions as outlined in the plugin’s readme file.
Configure Funnelback to use a filter that is included as part of a plugin by:
- Adding the class name as detailed in the plugin's readme to the filter.classes configuration option.
e.g. to use only the example plugin filter with class name com.example.customPluginFilter:
filter.classes=com.example.customPluginFilter
e.g. to use the example plugin filter with class name com.example.customPluginFilter in addition to the default filters:
filter.classes=TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider:com.example.customPluginFilter
This runs the filters as defined in the default filters section (above) then feeds the document (with fixed title) into the custom plugin filter.
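As a small illustration of how the extended value is formed, the following Python sketch appends a filter class name as a new final step of an existing chain. The helper name `append_to_chain` is hypothetical and the plugin class name is the example name used above; the sketch only manipulates the configuration string and does not talk to Funnelback.

```python
# Default chain as listed in the "Default filter chain" section.
DEFAULT_CHAIN = (
    "TikaFilterProvider,ExternalFilterProvider:"
    "JSoupProcessingFilterProvider:DocumentFixerFilterProvider"
)


def append_to_chain(chain: str, class_name: str) -> str:
    """Add class_name as a new final chain step, unless it is already a step."""
    steps = [step for step in chain.split(":") if step]
    if class_name not in steps:
        steps.append(class_name)
    return ":".join(steps)


line = "filter.classes=" + append_to_chain(DEFAULT_CHAIN, "com.example.customPluginFilter")
print(line)
```

Because the new filter is joined with ':' rather than ',', it always runs as an additional step instead of competing with an existing filter in a choice group.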
Filtering runs as part of the gather process and must be configured on the collection that gathers the content (and not on a meta collection).
Custom Groovy filters
Custom Groovy filters are only available when using Funnelback Server.
Custom Groovy filters can be installed to be used with a collection by:
- Adding the custom filter files (including any dependencies) and sub-folder structure to the collection's @groovy folder.
- Adding the class name as detailed in the custom filter's documentation to the filter.classes configuration option. The class names used by custom Groovy filters are similar to the class names used by plugin filters.
Installing a custom Groovy filter requires backend (SSH or server desktop) access to the Funnelback server to manage the files and sub-folders added to the @groovy folder.
Shared Groovy filters
Custom Groovy filters can also be shared between all collections on a server by installing the filter into the global Groovy folder.
Custom Groovy filters can be installed globally by:
- Adding the custom filter files (including any dependencies) and sub-folder structure to $SEARCH_HOME/lib/java/groovy.
- Adding the class name as detailed in the custom filter's documentation to the filter.classes configuration option. The class names used by custom Groovy filters are similar to the class names used by plugin filters.
Installing a shared Groovy filter requires backend (SSH or server desktop) access to the Funnelback server to manage the files and sub-folders added to the global Groovy folder.
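The relationship between a filter's class name and its sub-folder structure can be sketched as follows. This assumes the usual Groovy/Java convention that the folder structure mirrors the package name and that the source file uses a .groovy extension; neither is stated above, so always follow the custom filter's own documentation. The function name `groovy_source_path` and the class name `com.example.MyFilter` are hypothetical.

```python
from pathlib import PurePosixPath


def groovy_source_path(groovy_folder: str, class_name: str) -> PurePosixPath:
    """Map a fully qualified class name to a file path under a Groovy folder.

    Assumption: package segments become sub-folders and the file is
    named <ClassName>.groovy.
    """
    *package, simple_name = class_name.split(".")
    return PurePosixPath(groovy_folder, *package, simple_name + ".groovy")


path = groovy_source_path("$SEARCH_HOME/lib/java/groovy", "com.example.MyFilter")
print(path)
# $SEARCH_HOME/lib/java/groovy/com/example/MyFilter.groovy
```

The same mapping would apply to a collection's @groovy folder by passing that folder as the first argument.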