Filter Classes (collection.cfg)
The Funnelback filtering framework optionally allows separate Java classes to be specified to filter content. Funnelback provides several filters for common file formats. If specialised filtering is required, however, the filters must be specified here.
Any filters specified here must implement either the com.funnelback.filter.api.filters.Filter interface (see example filter) or the com.funnelback.filter.IFilterProvider interface (a summary of which is provided below).
The names given in this configuration option should be fully qualified Java class names. The only exception is that if a class name is specified by itself (i.e. without its associated package), it is assumed to be part of the
Choices and Chains
It is usual to provide different filter providers to filter different types of documents. If you specify your filter class as a comma-separated list (a 'Choice' set), then one filter provider class from the set will be chosen to filter each individual document, based on the type information returned by determineDocumentType. The filter classes provided with Funnelback generally determine document type by checking the file extension.
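To make the Choice behaviour concrete, the following is a minimal sketch of how one provider might be picked per document. It does not use the Funnelback API: `SimpleProvider`, `canFilter`, `choose` and the provider names are hypothetical stand-ins, and the extension check only imitates the type detection described above. The set is scanned from last to first, in line with the ordering caveat noted on this page.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

class ChoiceSketch {
    // Hypothetical stand-in for a filter provider: a name plus the file extensions it handles.
    record SimpleProvider(String name, Set<String> extensions) {
        // Imitates an extension-based document type check (assumption, not Funnelback code).
        boolean canFilter(String url) {
            int dot = url.lastIndexOf('.');
            if (dot < 0) {
                return false;
            }
            String ext = url.substring(dot + 1).toLowerCase(Locale.ROOT);
            return extensions.contains(ext);
        }
    }

    // Pick at most one provider for the document, scanning the Choice set from last to first.
    static Optional<SimpleProvider> choose(List<SimpleProvider> choiceSet, String url) {
        for (int i = choiceSet.size() - 1; i >= 0; i--) {
            SimpleProvider candidate = choiceSet.get(i);
            if (candidate.canFilter(url)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<SimpleProvider> set = List.of(
            new SimpleProvider("OfficeAndPdfProvider", Set.of("doc", "pdf")),
            new SimpleProvider("PdfOnlyProvider", Set.of("pdf")));

        // Both providers accept .pdf; the one listed last is chosen.
        System.out.println(choose(set, "http://example.com/a.pdf")
            .map(SimpleProvider::name).orElse("none"));
        // Only the first provider accepts .doc.
        System.out.println(choose(set, "http://example.com/a.doc")
            .map(SimpleProvider::name).orElse("none"));
    }
}
```

In this sketch a document with no matching provider is simply left unfiltered; how Funnelback itself treats that case is not covered here.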
It is sometimes necessary to pass the output of one filter into the input of another. Filter classes can be 'Chained' together in this way. To do so, specify the filter classes as a colon (':') separated list. This will cause subsequent filters to modify the output of previous filters, as long as the output type of the previous filter matches the input type of the subsequent filter. Unusable filters in a chain will be skipped.
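The Chain behaviour can be sketched in the same spirit. The types below (`Doc`, `Step`, `runChain`) and the step names are hypothetical and are not part of the Funnelback API; they only model the rule that each step receives the previous step's output and that a step whose input type does not match is skipped.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

class ChainSketch {
    // Hypothetical document: a type label plus text content.
    record Doc(String type, String content) {}

    // Hypothetical chain step: accepts one input type and transforms the document.
    record Step(String name, String inputType, UnaryOperator<Doc> transform) {}

    // Run the steps in order; a step that cannot accept the current document type is skipped.
    static Doc runChain(List<Step> chain, Doc doc) {
        Doc current = doc;
        for (Step step : chain) {
            if (step.inputType().equals(current.type())) {
                current = step.transform().apply(current);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<Step> chain = List.of(
            new Step("binaryToHtml", "binary",
                d -> new Doc("html", "<p>" + d.content() + "</p>")),
            new Step("jsonToXml", "json",
                d -> new Doc("xml", "<json>" + d.content() + "</json>")),
            new Step("upperCaseHtml", "html",
                d -> new Doc("html", d.content().toUpperCase(Locale.ROOT))));

        // jsonToXml is skipped because the document is never of type "json".
        Doc out = runChain(chain, new Doc("binary", "hello"));
        System.out.println(out.type() + " " + out.content()); // prints "html <P>HELLO</P>"
    }
}
```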
Funnelback ships with the following filters:
Default filter to convert binary files (Microsoft Office files, PDF files, etc.) to HTML.
Filter to combine content with extra metadata files (.fun.txt). Works with text and HTML content only.
A filter which converts records in a CSV document to multiple XML documents. See CSV to XML filter.
A filter which sets the mime type of all documents to CSV.
A filter which sets the mime type of all documents to JSON.
A filter which sets the mime type of all documents to XML.
A filter which causes the filter.jsoup.classes filters to be run on any HTML documents.
A filter which converts JSON documents to XML. The filter will only run if the document type is JSON; see changing document type for an example of changing the document type. If all documents are JSON, you can force the document type to JSON with the ForceJSONMime filter, e.g.
Filter that attempts to find a better title for a document by inspecting its content. Works with HTML content only.
<meta> tag normaliser / replacer for HTML content.
Use an external filter (See textify.cfg for more details)
Use to identify regions to exclude from indexing tags in HTML content using CSS selectors.
Use to replace undesirable characters in documents with spaces based on their Unicode blocks. See the filter.text-cleanup.ranges-to-replace config setting.
Used for generic filtering workflows (e.g. inserting metadata based on URL patterns, performing string replacements, etc.)
Used for detecting whether a URL contains textual content. Used by the Content Auditor.
Default value
filter.classes=com.company.CustomFilterProvider
filter.classes=com.company.CustomFilterProvider2,com.company.CustomFilterProvider1
filter.classes=com.company.CustomChain1:com.company.CustomChain2:com.company.CustomChain3
filter.classes=TikaFilterProvider,ExternalFilterProvider:com.company.CustomChain2:com.company.CustomChain3
Disable the document title fixer filter:
- Choice sets are checked in reverse order, i.e. filters that appear last in the list will be used first if they are capable of filtering a given document type.
- When specifying a combination of Choice and Chain filters, ',' has higher precedence than ':'. In other words, it is possible to have a chain of choice filters, but it is not possible to have a choice between several chains of filters.
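The precedence rule can be illustrated by splitting a filter.classes value into its structure. This is only a sketch of the rule as stated above, not Funnelback's actual parser; `FilterSpecSketch` and `parse` are hypothetical names. Because ',' binds tighter than ':', the value is split on ':' first to obtain the chain stages, and each stage is then split on ',' to obtain its Choice set.

```java
import java.util.ArrayList;
import java.util.List;

class FilterSpecSketch {
    // Returns the chain as a list of stages; each stage is a Choice set of class names.
    static List<List<String>> parse(String value) {
        List<List<String>> chain = new ArrayList<>();
        for (String stage : value.split(":")) {
            List<String> choices = new ArrayList<>();
            for (String name : stage.split(",")) {
                if (!name.isBlank()) {
                    choices.add(name.trim());
                }
            }
            chain.add(choices);
        }
        return chain;
    }

    public static void main(String[] args) {
        String value = "TikaFilterProvider,ExternalFilterProvider"
            + ":com.company.CustomChain2:com.company.CustomChain3";
        // Three stages: a Choice set of two providers, then two single-filter stages.
        System.out.println(parse(value));
    }
}
```

Under this reading, the first stage is a choice between TikaFilterProvider and ExternalFilterProvider, and its output is then passed through CustomChain2 and CustomChain3 in turn. A choice between two whole chains cannot be written in this syntax.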