Configuring document filtering

Introduction

Filtering is the process of transforming gathered content into content suitable for indexing by Funnelback.

Filtering passes the raw document through multiple chained filters, each of which may modify the document. Modifications include converting the document from a binary format such as PDF to an indexable text format such as HTML, adding metadata, or altering the document’s URL.

Filtering runs during the gather phase of an update. For push collections, filtering can be run when a document is added.

A full update is required after making any changes to filters as documents that are copied during an incremental update are not re-filtered. Full updates are started from the advanced update screen.

Generic document and HTML document (jsoup) filters

Funnelback supports two main types of filters:

  • Generic document filters operate on the document as a whole: they take the complete document as input, apply some sort of transformation, then output the complete document, which can then be fed into another filter. A generic document filter treats the document as either a binary byte stream or an unstructured blob of text.

  • HTML document filters (jsoup filters) are special filters that apply only to HTML documents. An HTML document filter handles the HTML as a structured object, allowing precise and complex manipulation of the HTML document. HTML document (jsoup) filters are only run when the JSoupProcessingFilterProvider is included in the filter chain.

The filter chain

The filter chain specifies the set of generic document filters that are applied to a document after it is gathered and before it is indexed. The contents of the gathered document pass through each filter in turn, with the modified output of one filter becoming the input of the next filter.

A typical filter chain is shown below. A binary document is converted to text using the Tika filters. This extracts the document text and outputs the document as HTML. This HTML is then passed through the JSoup filter (see the HTML document filters section below) which enables targeted modification of the HTML content and structure. Finally a custom filter performs a number of modifications to the content.

the-filter-chain-01.png
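The hand-off between filters can be pictured as plain function composition. The Python sketch below is illustrative only and is not Funnelback code: the three filter functions are hypothetical stand-ins for a conversion step, an HTML processing step and a custom step, and each one simply maps document text to document text.

```python
from functools import reduce
from typing import Callable, List

# A "filter" here is just a function from document text to document text.
# These are hypothetical stand-ins, not real Funnelback filters.
Filter = Callable[[str], str]

def to_html(doc: str) -> str:
    """Stand-in for a conversion step: wrap extracted text in HTML."""
    return f"<html><body><p>{doc}</p></body></html>"

def add_metadata(doc: str) -> str:
    """Stand-in for an HTML processing step: inject a meta tag."""
    head = '<html><head><meta name="source" content="demo"></head>'
    return doc.replace("<html>", head, 1)

def collapse_whitespace(doc: str) -> str:
    """Stand-in for a custom step: normalise runs of whitespace."""
    return " ".join(doc.split())

def run_chain(doc: str, chain: List[Filter]) -> str:
    """Feed the output of each filter into the input of the next."""
    return reduce(lambda current, f: f(current), chain, doc)

result = run_chain("Annual   report\n2024",
                   [to_html, add_metadata, collapse_whitespace])
print(result)
```

Because each step only sees the output of the previous one, reordering the list changes the result, which is why the order of filters in the chain matters.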

Configuring the filter chain (filter.classes)

During the filter phase the document passes through a series of generic document filters with the modified output being passed on to the next filter. The series of filters is referred to as the filter chain, set using the filter.classes configuration option.

Each filter is represented by a Java class name, with filters separated by either a comma or a colon to denote a choice or a chain respectively (see below).

The filters that make up the chain are chosen from:

  • Built-in filters: filters that are part of the core Funnelback product.

  • Plugins: filters that are provided by enabling specific Funnelback plugins.

  • Custom Groovy filters: User-defined Groovy filters that implement custom filter logic.

Choices and chains

The filter chain is made up of a set of filters that are executed in the order defined by filter.classes.

The order is made up of choice and chain elements, indicated by the two types of delimiters.

Filters are chained together by separating the filters with a colon. Filters separated by colons are run in order from left to right. For example, filter.classes=Filter1:Filter2 results in the filters being applied in the following order: Filter1 followed by Filter2.

A single step in the chain can instead offer a set of filters, of which at most one will run. The filters in such a choice are separated with commas and are considered in reverse order. For example, a choice written as Filter1,Filter2 means that Filter2, Filter1 or neither is run for this part of the chain. Filter2 runs in preference to Filter1, subject to the rules around when a filter can run (e.g. some filters only run under specific conditions, such as only for HTML documents).

For example, the filter chain Filter1,Filter2,Filter3:Filter4:Filter5 would be processed as follows:

The content would pass through either Filter3, Filter2, Filter1 or none of these (in that order) before passing through Filter4 and Filter5.

The diagram below shows this filter chain graphically:

filter-choices-chains.png

NOTE:

  • Choice sets are checked in reverse order. i.e. filters that appear last in the list will be used first if they are capable of filtering a given document type.

  • When specifying a combination of choice and chain filters ',' has higher precedence than ':'. In other words it is possible to have a chain of choice filters, but it is not possible to have a choice between several chains of filters.
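The choice and chain rules above can be sketched in a few lines of Python. This is a model of the rules as described on this page, not Funnelback's implementation: parse_filter_classes, resolve and the can_filter predicate are hypothetical names, and can_filter stands in for each filter's own rules about whether it can handle a given document.

```python
from typing import Callable, List

def parse_filter_classes(value: str) -> List[List[str]]:
    """Split a filter.classes value into chain steps, each a list of choices.

    ':' separates chain steps; ',' separates the choices within one step.
    Splitting on ':' first reflects ',' binding more tightly than ':'.
    """
    return [
        [name.strip() for name in step.split(",") if name.strip()]
        for step in value.split(":")
        if step.strip()
    ]

def resolve(value: str, can_filter: Callable[[str], bool]) -> List[str]:
    """Return the filters that would run, in order, for one document.

    can_filter is a hypothetical predicate: True if the named filter is
    able to handle the document being processed.
    """
    selected = []
    for choices in parse_filter_classes(value):
        # Choices are checked in reverse order: the last listed is tried first.
        for name in reversed(choices):
            if can_filter(name):
                selected.append(name)
                break  # at most one filter runs per step
    return selected

chain = "Filter1,Filter2,Filter3:Filter4:Filter5"
print(parse_filter_classes(chain))
# prints [['Filter1', 'Filter2', 'Filter3'], ['Filter4'], ['Filter5']]

# Suppose Filter3 cannot handle this document but every other filter can.
print(resolve(chain, lambda name: name != "Filter3"))
# prints ['Filter2', 'Filter4', 'Filter5']
```

The parsed structure is always a list of steps where each step is a flat list of alternatives, which is why a chain of choices can be expressed but a choice between chains cannot.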

Default filter chain

The default filter chain runs the following built-in filters:

filter.classes=TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider

Which translates to:

  1. Run Tika or an external filter (to convert a binary document to text)

  2. Pass the extracted text through the jsoup filter (which manipulates HTML documents). The default jsoup filters run a number of content auditing tasks. See: HTML document (jsoup) filters.

  3. Pass the manipulated HTML document through the document title fixer.
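As a rough check of the reading above, the following Python sketch turns a filter.classes value into a numbered description using only the delimiter rules from this page. The describe helper is hypothetical and exists only for illustration; the chain string is copied from the default shown above.

```python
from typing import List

# Default chain as given on this page, split across lines for readability.
DEFAULT_CHAIN = (
    "TikaFilterProvider,ExternalFilterProvider"
    ":JSoupProcessingFilterProvider"
    ":DocumentFixerFilterProvider"
)

def describe(value: str) -> List[str]:
    """Describe each chain step, listing choices in order of preference."""
    lines = []
    for number, step in enumerate(value.split(":"), start=1):
        choices = [name.strip() for name in step.split(",")]
        if len(choices) == 1:
            lines.append(f"{number}. Run {choices[0]}")
        else:
            # Choices are preferred in the reverse of their listed order.
            preference = ", then ".join(reversed(choices))
            lines.append(
                f"{number}. Run at most one of (in order of preference): {preference}"
            )
    return lines

for line in describe(DEFAULT_CHAIN):
    print(line)
```

Applying the reverse-order rule to the first step, ExternalFilterProvider is listed last and so is considered before TikaFilterProvider.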

Built-in filters

Funnelback ships with the following built-in filters.

Configure Funnelback to use a built-in filter by adding the class name to the filter.classes.

e.g. filter.classes=ForceCSVMime:CSVToXML replaces the default filter chain with the two built-in filters for handling CSV indexing.

Class name                     Description
-----------------------------  -----------
CSVToXML                       Converts records in a CSV, TSV, SQL or Excel document to multiple XML documents.
DocumentFixerFilterProvider    Analyses the document title and attempts to replace it if the title is not considered a good title. HTML documents only.
ExternalFilterProvider         Uses external programs to convert documents.
ForceCSVMime                   Sets the MIME type of all documents to text/csv.
ForceJSONMime                  Sets the MIME type of all documents to application/json.
ForceXMLMime                   Sets the MIME type of all documents to text/xml.
InjectNoIndexFilterProvider    Automatically inserts noindex tags based on CSS selectors.
JSONToXML                      Converts JSON documents to XML.
JSoupProcessingFilterProvider  Converts HTML documents to and from a jsoup object and runs an extra chain of jsoup filters.
MetadataNormaliser             Used to normalise and replace metadata fields.
TikaFilterProvider             Converts binary files of specific file formats (Microsoft Office files, PDF files, etc.) to HTML using Apache Tika.
TextDetectionFilterProvider    Used for detecting whether a URL contains textual content. Used by the Content Auditor.
WorkflowFilter                 Used for generic filtering workflows (e.g. inserting metadata based on URL patterns, performing string replacements, etc.)

Plugins

Funnelback plugins can provide additional filters. When using a plugin always follow the instructions as outlined in the plugin’s readme file.

Configure Funnelback to use a filter that is included as part of a plugin by:

  1. Enabling the plugin.

  2. Adding the class name as detailed in the plugin’s readme to the filter.classes.

e.g. to use only the example plugin custom filter with class name com.example.customPluginFilter:

filter.classes=com.example.customPluginFilter

e.g. to use the example plugin filter with class name com.example.customPluginFilter in addition to the default filters:

filter.classes=TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider:com.example.customPluginFilter

This runs the filters as defined in the default filters section (above) then feeds the document (with fixed title) into the custom plugin filter.
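When the value is built by a script rather than typed by hand, adding a filter to the end of an existing chain is a simple string operation. The Python sketch below assumes the colon and comma delimiters described earlier; append_to_chain is a hypothetical helper, not part of Funnelback, and the class name is the example one used above.

```python
DEFAULT_CHAIN = (
    "TikaFilterProvider,ExternalFilterProvider"
    ":JSoupProcessingFilterProvider"
    ":DocumentFixerFilterProvider"
)

def append_to_chain(chain: str, class_name: str) -> str:
    """Append class_name as a new final chain step.

    The chain is returned unchanged if class_name is already a step of
    its own, so calling this twice does not duplicate the filter.
    """
    steps = [step for step in chain.split(":") if step]
    if class_name in steps:
        return chain
    return ":".join(steps + [class_name])

new_chain = append_to_chain(DEFAULT_CHAIN, "com.example.customPluginFilter")
print("filter.classes=" + new_chain)
```

Appending rather than replacing keeps the conversion and jsoup steps in place, so the plugin filter receives the document only after those steps have run.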

Filtering runs as part of the gather process and must be configured on the collection that gathers the content (and not on a meta collection).

Custom Groovy filters

Custom Groovy filters are only available when using Funnelback Server.

Custom Groovy filters can be installed to be used with a collection by:

  1. Adding the custom filter files (including any dependencies) and sub-folder structure to the collection’s @groovy folder.

  2. Adding the class name as detailed in the custom filter’s documentation to the filter.classes. The class names used by custom Groovy filters are similar to the class names used by plugin filters.

Installing a custom Groovy filter requires backend (SSH or server desktop) access to the Funnelback server to manage the files and sub-folders added to the @groovy folder.

Shared Groovy filters

Custom Groovy filters can also be shared between all collections on a server by installing the filter into the global Groovy folder.

Custom Groovy filters can be installed globally by:

  1. Adding the custom filter files (including any dependencies) and sub-folder structure to $SEARCH_HOME/lib/java/groovy.

  2. Adding the class name as detailed in the custom filter’s documentation to the filter.classes. The class names used by custom Groovy filters are similar to the class names used by plugin filters.

Installing a shared Groovy filter requires backend (SSH or server desktop) access to the Funnelback server to manage the files and sub-folders added to the global Groovy folder.

© 2015- Squiz Pty Ltd