Filter Classes (collection.cfg)

Description

The Funnelback filtering framework lets you specify separate Java classes to filter content. Funnelback provides several filters for common file formats. If specialised filtering is required, however, the filters must be specified here.

Any filters specified here must implement either the com.funnelback.filter.api.filters.Filter interface (see example filter) or the com.funnelback.filter.IFilterProvider interface (a summary of which is provided below).

Filter names

The names given in this configuration option should be fully qualified Java class names. The only exception is that a class name specified by itself (i.e. without its associated package) is assumed to be part of the com.funnelback.common.filter package.
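The naming rule above can be sketched in a few lines of Java. This is an illustration only: FilterNameResolver and resolve are hypothetical names, not part of the Funnelback API, and the sketch assumes that "specified by itself" means the name contains no '.' character.

```java
// Hypothetical helper for illustration; not part of the Funnelback API.
final class FilterNameResolver {
    // Package assumed for names given without a package (taken from the text above).
    static final String DEFAULT_PACKAGE = "com.funnelback.common.filter";

    // Returns a fully qualified class name for a configured filter name.
    static String resolve(String configuredName) {
        String name = configuredName.trim();
        // Assumption: a name without any '.' has no package part.
        return name.contains(".") ? name : DEFAULT_PACKAGE + "." + name;
    }

    public static void main(String[] args) {
        // prints "com.funnelback.common.filter.TikaFilterProvider"
        System.out.println(resolve("TikaFilterProvider"));
        // prints "com.company.CustomFilterProvider"
        System.out.println(resolve("com.company.CustomFilterProvider"));
    }
}
```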

Choices and Chains

It is usual to provide different filter providers to filter different types of documents. If you specify your filter classes as a comma-separated list (a 'Choice' set), then one filter provider class from the set will be chosen to filter each individual document, based on the type information returned by determineDocumentType. The Funnelback-provided filter classes generally determine document type by checking the file extension.

It is also sometimes required to pass the output of one filter into the input of another. It is possible to 'Chain' filter classes together in this way. To do so, specify the filter classes as a colon (':') separated list. This will cause subsequent filters to modify the output of previous filters, as long as the output type of the previous filter matches the input type of the subsequent filter. Unusable filters in a chain will be skipped.
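The chaining behaviour described above can be pictured with a minimal Java sketch. Everything in it is an assumption made for illustration: Doc, SketchFilter, ChainSketch, converter and runChain are hypothetical names and do not mirror the real Funnelback filter interfaces. The sketch only models the idea that each filter sees the output of the previous one and is skipped when it cannot handle the current document type.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a document: a type label plus its content.
final class Doc {
    final String type;
    final String content;
    Doc(String type, String content) { this.type = type; this.content = content; }
}

// Hypothetical filter contract; NOT the Funnelback Filter or IFilterProvider interface.
interface SketchFilter {
    boolean accepts(String inputType); // can this filter handle the given type?
    Doc apply(Doc input);              // produce the filtered document
}

final class ChainSketch {
    // Builds a toy filter that accepts inType, emits outType and appends a label to the content.
    static SketchFilter converter(final String inType, final String outType, final String label) {
        return new SketchFilter() {
            @Override public boolean accepts(String inputType) { return inType.equals(inputType); }
            @Override public Doc apply(Doc input) { return new Doc(outType, input.content + ">" + label); }
        };
    }

    // Runs the filters in order; a filter that cannot handle the current type is skipped.
    static Doc runChain(List<SketchFilter> chain, Doc doc) {
        Doc current = doc;
        for (SketchFilter filter : chain) {
            if (filter.accepts(current.type)) {
                current = filter.apply(current);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<SketchFilter> chain = Arrays.asList(
                converter("pdf", "html", "toHtml"),     // runs: input is pdf
                converter("xml", "xml", "xmlOnly"),     // skipped: current type is html
                converter("html", "html", "fixTitle")); // runs: previous output is html
        Doc out = runChain(chain, new Doc("pdf", "raw"));
        // prints "html raw>toHtml>fixTitle"
        System.out.println(out.type + " " + out.content);
    }
}
```

In this toy run the middle filter is unusable for HTML input, so it is passed over and the third filter still receives the output of the first.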

Built-in filters

Funnelback ships with the following filters:

TikaFilterProvider

Default filter to convert binary files (Microsoft Office files, PDF files, etc.) to HTML.

CombinerFilterProvider

Filter to combine content with extra metadata files (.pan.txt or .fun.txt). Works with text and HTML content only.

CSVToXML

A filter which converts records in a CSV document to multiple XML documents. See CSV to XML filter.

ForceCSVMime

A filter which sets the mime type of all documents to CSV.

ForceJSONMime

A filter which sets the mime type of all documents to JSON.

ForceXMLMime

A filter which sets the mime type of all documents to XML.

JSoupProcessingFilterProvider

A filter which causes the filter.jsoup.classes filters to be run on any HTML documents.

JSONToXML

A filter which converts JSON documents to XML. The filter will only run if the document type is JSON; see changing document type for an example of changing the document type. If all documents are JSON, you can force the document type to JSON with the ForceJSONMime filter, e.g. ForceJSONMime:JSONToXML.

DocumentFixerFilterProvider

Filter that attempts to find a better title for a document by inspecting its content. Works with HTML content only.

MetadataNormaliser

Generic <meta> tag normaliser / replacer for HTML content.

ExternalFilterProvider

Uses an external filter (see textify.cfg for more details).

InjectNoIndexFilterProvider

Used to inject no-index tags into HTML content, marking regions (identified using CSS selectors) to exclude from indexing.

TextCleanupFilterProvider

Used to replace undesirable characters in documents with spaces, based on their Unicode blocks. See the filter.text-cleanup.ranges-to-replace config setting.

WorkflowFilter

Used for generic filtering workflows (e.g. inserting metadata based on URL patterns, performing string replacements, etc.).

TextDetectionFilterProvider

Used for detecting whether a URL contains textual content. Used by the Content Auditor.

Default value

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider

Examples

filter.classes=com.company.CustomFilterProvider
filter.classes=com.company.CustomFilterProvider2,com.company.CustomFilterProvider1
filter.classes=com.company.CustomChain1:com.company.CustomChain2:com.company.CustomChain3
filter.classes=TikaFilterProvider,ExternalFilterProvider:com.company.CustomChain2:com.company.CustomChain3

Disable the document title fixer filter:

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider

Caveats

  • Choice sets are checked in reverse order, i.e. filters that appear last in the list will be used first if they are capable of filtering a given document type.
  • When specifying a combination of Choice and Chain filters, ',' has higher precedence than ':'. In other words, it is possible to have a chain of choice filters, but it is not possible to have a choice between several chains of filters.
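
Both caveats can be illustrated with a small Java sketch. It is not Funnelback's own parser: FilterClassesSketch, parse and choose are hypothetical names, and the canFilter predicate stands in for whatever capability check the real framework performs. Splitting on ':' first and on ',' second is what makes ',' bind more tightly, and walking each choice set from the end models the reverse-order check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical illustration of how a filter.classes value could be interpreted.
final class FilterClassesSketch {
    // Splits on ':' into chain stages, then on ',' into the choice set of each stage.
    static List<List<String>> parse(String value) {
        List<List<String>> stages = new ArrayList<>();
        for (String stage : value.split(":")) {
            List<String> choices = new ArrayList<>();
            for (String name : stage.split(",")) {
                if (!name.trim().isEmpty()) {
                    choices.add(name.trim());
                }
            }
            stages.add(choices);
        }
        return stages;
    }

    // Checks a choice set from last to first and returns the first capable filter name.
    static Optional<String> choose(List<String> choiceSet, Predicate<String> canFilter) {
        for (int i = choiceSet.size() - 1; i >= 0; i--) {
            if (canFilter.test(choiceSet.get(i))) {
                return Optional.of(choiceSet.get(i));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<List<String>> stages = parse(
                "TikaFilterProvider,ExternalFilterProvider:com.company.CustomChain2:com.company.CustomChain3");
        // prints "[[TikaFilterProvider, ExternalFilterProvider], [com.company.CustomChain2], [com.company.CustomChain3]]"
        System.out.println(stages);
        // If every filter in the first choice set were capable, the last one listed would win.
        // prints "ExternalFilterProvider"
        System.out.println(choose(stages.get(0), name -> true).orElse("none"));
    }
}
```

The parsed result is always a chain whose stages are choice sets, which is why a choice between several whole chains cannot be expressed.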
