Document filtering

Filtering is the process of transforming gathered content into content suitable for indexing by Funnelback.

Filtering passes the raw document through multiple chained filters which may modify the document in different ways. These modifications may include converting the document from a binary format such as PDF to an indexable text format or modifying the document by adding metadata or altering the document’s URL.

Filtering is run during the gather phase of a data source update. For push data sources, filtering can be run when a document is added by setting the filter chain in the push API call.

A full update is required after making any changes to filters as documents that are copied during an incremental update are not re-filtered. Full updates are started from the data source advanced update screen.

Generic document and HTML document (jsoup) filters

Funnelback supports two main types of filters:

  • Generic document filters operate on the document as a whole: they take the complete document as input, apply some sort of transformation, then output a complete document which can be fed into another filter. A generic document filter treats the document as either a binary byte stream or an unstructured blob of text.

  • HTML document filters (jsoup filters) are special filters that apply only to HTML documents. An HTML document filter handles the HTML as a structured object, allowing precise and complex manipulation of the HTML document. HTML document (jsoup) filters are only run when the JSoupProcessingFilterProvider is included in the filter chain.

The filter chain

The filter chain specifies the set of generic document filters that are applied to a document after it is gathered and before it is indexed. The contents of the gathered document pass through each filter in turn, with the modified output of one filter becoming the input of the next filter.
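The chaining idea can be sketched in a few lines of Python. This is only a conceptual model, not Funnelback's filter interface: a filter is treated as a plain function from text to text, and the names `strip_whitespace`, `add_title` and `run_chain` are hypothetical.

```python
from functools import reduce
from typing import Callable, List

# A "filter" here is just a function from document text to document text.
# This models the idea only; it is not Funnelback's filter API.
DocumentFilter = Callable[[str], str]


def strip_whitespace(doc: str) -> str:
    """Hypothetical filter: trim leading and trailing whitespace."""
    return doc.strip()


def add_title(doc: str) -> str:
    """Hypothetical filter: prepend a title line to the document."""
    return "Title: untitled\n" + doc


def run_chain(doc: str, chain: List[DocumentFilter]) -> str:
    """Pass the document through each filter in turn, feeding each
    filter's output into the next filter's input."""
    return reduce(lambda current, f: f(current), chain, doc)


result = run_chain("  hello world  ", [strip_whitespace, add_title])
print(result)
```

Because each filter sees the output of the previous one, the order of the chain matters: swapping the two filters above would leave the whitespace in the body untouched.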

A typical filter chain is shown below. A binary document is converted to text using the Tika filter, which extracts the document text and outputs the document as HTML. This HTML is then passed through the JSoup filter (see the HTML document filters section below), which enables targeted modification of the HTML content and structure. Finally, a custom filter performs a number of modifications to the content.

the-filter-chain-01.png

Each filter is represented by a Java class name. Filters are separated by a comma to denote a choice, or by a colon to denote a chain (see below).

The filters that make up the chain are chosen from:

  • Built-in filters: filters that are part of the core Funnelback product.

  • Plugins: filters that are provided by enabling specific Funnelback plugins.

  • Custom Groovy filters: User-defined Groovy filters that implement custom filter logic. (Not available in Funnelback SXC)

Filter chain steps

The filter chain is made up of a series of filtering steps which are executed in order, with the output of each step being fed into the next step in the chain.

Each step in the chain must specify one or more filters. If more than one filter is specified, a choice is made between the specified filters and at most one of them is run. Note: it’s possible that none of the filters will match their execution rules, in which case the document content is passed through unchanged to the next step in the chain.

For the filter chain represented graphically below:

filter-choices-chains.png

The binary input content would pass through either Filter3, Filter2, Filter1 or none of these (in that order) before passing through Filter4 and Filter5.

  • Choice sets are checked in reverse order, i.e. filters that appear last in the list are used first if they are capable of filtering a given document type.

  • The filter framework only supports a single filter chain. A step in the filter chain may choose between a set of filters, but you can’t define a choice between a set of filter chains.
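The selection rules described in this section can be modelled with a short Python sketch. It assumes the chain in the figure has three steps (a choice between Filter1, Filter2 and Filter3, then Filter4, then Filter5) and invents the document types each filter accepts; for simplicity the document type does not change as filters run. `Filter`, `run_steps` and `accepts` are hypothetical names, not part of Funnelback.

```python
from typing import Callable, List, NamedTuple


class Filter(NamedTuple):
    name: str
    can_filter: Callable[[str], bool]  # does the filter accept this document type?


def run_steps(doc_type: str, steps: List[List[Filter]]) -> List[str]:
    """Return the names of the filters that would run, in order."""
    executed = []
    for choices in steps:
        # Choices are checked in reverse order: the last one listed is tried first.
        for candidate in reversed(choices):
            if candidate.can_filter(doc_type):
                executed.append(candidate.name)
                break  # at most one filter per step is run
        # If nothing matched, the document passes through this step unchanged.
    return executed


def accepts(*types: str) -> Callable[[str], bool]:
    return lambda doc_type: doc_type in types


# Assumed acceptance rules, for illustration only.
steps = [
    [
        Filter("Filter1", accepts("pdf", "doc")),
        Filter("Filter2", accepts("doc")),
        Filter("Filter3", accepts("xls")),
    ],
    [Filter("Filter4", lambda _: True)],
    [Filter("Filter5", lambda _: True)],
]

for doc_type in ("doc", "pdf", "html"):
    print(doc_type, run_steps(doc_type, steps))
```

With these assumed rules a `doc` document is handled by Filter2 (Filter3 is checked first but does not accept it), a `pdf` document falls through to Filter1, and an `html` document skips the first step entirely before reaching Filter4 and Filter5.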

Configuring the filter chain

During the filter phase the document passes through a series of generic document filters with the modified output being passed on to the next filter. The series of filters is referred to as the filter chain, and is set using the filter.classes configuration option.

To modify the filter chain:

  • Log in to the administration dashboard.

  • Locate the data source where the filters are to be applied and view the data source details screen.

  • Select edit data source configuration from the configuration section.

  • Edit the filter.classes configuration option. This is located in the workflow section of the configuration key editor and will also be displayed in the currently set keys section if the option has been modified from the default value.

  • The filter class names are case sensitive.

  • When editing the configuration key, each step of the chain (a single filter, or a set of choices separated with commas) must be set on a new line.

For example the default filter chain has three steps which run the following built-in filters:

filter-classes-key.png
  1. Shows a set of choices: run an external filter or Tika (in that order) to convert a binary document to text. Note the comma that is used to delimit the two filters.

  2. The steps of the filter chain. Each step is on a new line, with the filters being chained in the order defined from top to bottom. This configures Funnelback to choose between the external filter or Tika, then pass the extracted text through the jsoup filter (which manipulates HTML documents). The default jsoup filters run a number of content auditing tasks. See: HTML document (jsoup) filters. Finally, pass the manipulated HTML document through the document title fixer.

If you use the raw configuration key editor to set filter.classes, you must separate your chained filters with colons. The default filter chain shown above is represented as:

filter.classes=TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
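To make the structure of this value easier to read, the following Python sketch splits it into steps and choices. It assumes the delimiters used in the example value above (a colon between steps, a comma between choices); `parse_filter_classes` is a hypothetical helper, not a Funnelback tool.

```python
from typing import List


def parse_filter_classes(value: str) -> List[List[str]]:
    """Split a filter.classes value into steps (':') and choices (',')."""
    return [
        [name.strip() for name in step.split(",") if name.strip()]
        for step in value.split(":")
        if step.strip()
    ]


default_chain = (
    "TikaFilterProvider,ExternalFilterProvider:"
    "JSoupProcessingFilterProvider:"
    "DocumentFixerFilterProvider"
)

steps = parse_filter_classes(default_chain)
for number, choices in enumerate(steps, start=1):
    # Choices are listed in the order they are checked (last listed first).
    print(number, " or ".join(reversed(choices)))
```

The first step contains two choices and the remaining steps contain one filter each, matching the three steps described for the default chain.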

Built-in filters

Funnelback ships with the following built-in filters.

Configure Funnelback to use a built-in filter by adding its class name to filter.classes.

e.g. filter.classes=ForceCSVMime:CSVToXML replaces the default filter chain with the two built-in filters for handling CSV indexing.

Class name and description:

  • CSVToXML: Converts records in a CSV, TSV, SQL or Excel document to multiple XML documents.

  • DocumentFixerFilterProvider: Analyses the document title and attempts to replace it if the title is not considered a good title. HTML documents only.

  • ExternalFilterProvider: Uses external programs to convert documents.

  • ForceCSVMime: Sets the MIME type of all documents to text/csv.

  • ForceJSONMime: Sets the MIME type of all documents to application/json.

  • ForceXMLMime: Sets the MIME type of all documents to text/xml.

  • InjectNoIndexFilterProvider: Automatically inserts noindex tags based on CSS selectors.

  • JSONToXML: Converts JSON documents to XML.

  • JSoupProcessingFilterProvider: Converts HTML documents to and from a jsoup object and runs an extra chain of jsoup filters.

  • MetadataNormaliser: Used to normalise and replace metadata fields.

  • TikaFilterProvider: Converts binary files of specific file formats (Microsoft Office files, PDF files, etc.) to HTML using Apache Tika.

  • TextDetectionFilterProvider: Used for detecting whether a URL contains textual content. Used by the Content Auditor.

  • WorkflowFilter (deprecated): Used for generic filtering workflows (e.g. inserting metadata based on URL patterns, performing string replacements, etc.)

Plugins

Funnelback plugins can provide additional filters. When using a plugin always follow the instructions as outlined in the plugin’s readme file.

Configure Funnelback to use a filter that is included as part of a plugin by:

  1. Enabling the plugin.

  2. Adding the class name as detailed in the plugin’s readme to the filter.classes.

e.g. Add the example plugin custom filter with class name com.example.customPluginFilter to the default filter chain:

filter-classes-plugin-example.png

This runs the filters as defined in the default filters section (above) then feeds the document (with fixed title) into the custom plugin filter.
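As a rough illustration of the resulting raw configuration value, the Python sketch below appends the example plugin class as a new final step of the default chain. It assumes the colon step delimiter used in the filter.classes examples above; `append_step` is a hypothetical helper, and com.example.customPluginFilter is the placeholder class name from the example.

```python
def append_step(chain: str, class_name: str) -> str:
    """Add a new final step to a filter.classes value."""
    return f"{chain}:{class_name}" if chain else class_name


default_chain = (
    "TikaFilterProvider,ExternalFilterProvider:"
    "JSoupProcessingFilterProvider:"
    "DocumentFixerFilterProvider"
)

new_chain = append_step(default_chain, "com.example.customPluginFilter")
print("filter.classes=" + new_chain)
```

Adding the plugin filter as the last step means it receives the document only after the earlier steps (conversion, jsoup processing and title fixing) have run.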

Custom Groovy filters

Custom Groovy filters are not available when using Funnelback in the Squiz Experience Cloud.

Custom Groovy filters can be installed to be used with a data source by:

  1. Adding the custom filter files (including any dependencies) and sub-folder structure to the data source’s @groovy folder.

  2. Adding the class name as detailed in the custom filter’s documentation to the filter.classes. The class names used by custom Groovy filters are similar to the class names used by plugin filters.

Backend access to the Funnelback server is required to create custom Groovy filters.

Shared Groovy filters

This feature is only available to system administrators.

Custom Groovy filters can also be shared between all collections on a server by installing the filter into the global Groovy folder.

Custom Groovy filters can be installed globally by:

  1. Adding the custom filter files (including any dependencies) and sub-folder structure to $SEARCH_HOME/lib/java/groovy.

  2. Adding the class name as detailed in the custom filter’s documentation to the filter.classes. The class names used by custom Groovy filters are similar to the class names used by plugin filters.

© 2015- Squiz Pty Ltd