Writing generic document filters

This guide covers generic document filters. See Writing custom HTML document (jsoup) filters for the equivalent guide for HTML documents, or Transform or analyze content before it is indexed if you are unsure which type of filter you need.

A document filter can be developed by writing a plugin that implements one of the plugin framework’s filtering interfaces.

A plugin can implement one or more document filters which enable documents to be manipulated before they are indexed by Funnelback.

For developers who are familiar with Funnelback’s deprecated custom Groovy filters, the process of implementing a plugin filter is almost identical; however, the filter must be written in pure Java rather than Groovy.

Plugin scopes

The plugin scope for a plugin that implements filtering must include the runs on datasource scope.

Maven archetype template options

A document filter plugin template can be generated using the Maven archetype template.

When using the interactive mode of the Maven archetype template:
- type true for Define value for property 'filtering'
- type true for Define value for property 'runs-on-datasource'

When using the non-interactive mode of the Maven archetype template:
- set the flag -Dfiltering=true
- set the flag -Druns-on-datasource=true

If you create a filter that extracts and adds metadata it is likely you will want to enable indexing templates to store metadata mappings.

Interface methods

Each filter implemented by a plugin must implement one of Funnelback’s three filter interfaces: StringDocumentFilter, BytesDocumentFilter or Filter. The interface to implement depends on whether the filter requires access to the document’s content, and on whether that content should be interpreted as text or as binary data.

StringDocumentFilter: Processes a text document as a UTF-8 encoded text string. For filtering string (text/non-binary) content e.g. plain text, XML, HTML, JSON, etc. See manipulating string documents for an example.
BytesDocumentFilter: Processes a binary document as a raw byte stream. For filtering binary content. See Converting raw byte (binary) documents to strings for an example.
Filter: Processes a document with no access to content. Used for filters that do not edit or read the document content but should run on all documents. Can read/write the document metadata object. See Removing a document for an example.

Providing configuration for a filter

Custom filters can be configured via configuration keys set in the data source configuration, or via custom configuration files which are saved with the data source configuration.

The plugin framework provides built-in support for configuration keys and configuration files. This information can be used to configure the filter so that it is tailored for the data source on which it is enabled.

See: Defining plugin configuration keys

Reading configuration keys

Configuration keys can be used to provide simple key/value pair configuration for a filter.
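As a rough illustration of how a filter might consume a key/value setting, the sketch below looks up a key and falls back to a default when the key is unset. The ConfigLookup interface, the key name plugin.example.title-prefix and the applyPrefix helper are hypothetical stand-ins invented for this example. They are not the plugin framework's API; see Defining plugin configuration keys for the real mechanism.

```java
import java.util.Map;
import java.util.Optional;

class ConfigKeySketch {

    /** Hypothetical stand-in for whatever object exposes data source configuration keys. */
    interface ConfigLookup {
        Optional<String> getValue(String key);
    }

    // Hypothetical key name; a real plugin defines its own keys.
    static final String PREFIX_KEY = "plugin.example.title-prefix";

    /** Prepends the configured prefix to the title, or returns the title unchanged if the key is unset. */
    static String applyPrefix(String title, ConfigLookup config) {
        String prefix = config.getValue(PREFIX_KEY).orElse("");
        return prefix + title;
    }

    public static void main(String[] args) {
        // Simulated data source configuration holding a single key/value pair.
        Map<String, String> settings = Map.of(PREFIX_KEY, "Docs: ");
        ConfigLookup config = key -> Optional.ofNullable(settings.get(key));
        System.out.println(applyPrefix("Getting started", config));
    }
}
```

Treating a missing key as "use the default" keeps the filter usable on data sources where the key has not been set.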

Reading configuration from a file

A configuration file can be used if your filter has advanced configuration needs, such as providing sets of rules to configure a filter. Configuration files can be specified as either plain text or JSON.
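To make the idea of rule-based configuration more concrete, here is a minimal sketch that parses a plain text rules file once its content is available as a string. The file format (one urlFragment=metadataValue rule per line, with # comments) and the parseRules helper are assumptions made for this example only. How the file content is obtained from the plugin framework is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RulesFileSketch {

    /**
     * Parses rules from the text of a configuration file.
     * Assumed (hypothetical) format: one "urlFragment=metadataValue" rule per line;
     * blank lines and lines starting with '#' are ignored.
     */
    static Map<String, String> parseRules(String fileContent) {
        Map<String, String> rules = new LinkedHashMap<>(); // preserves rule order
        for (String line : fileContent.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int separator = trimmed.indexOf('=');
            if (separator <= 0) {
                continue; // skip malformed lines rather than failing the whole filter
            }
            rules.put(trimmed.substring(0, separator).trim(),
                      trimmed.substring(separator + 1).trim());
        }
        return rules;
    }

    public static void main(String[] args) {
        String content = "# section rules\n/news/=News\n\n/events/=Events\n";
        parseRules(content).forEach((fragment, value) ->
                System.out.println(fragment + " -> " + value));
    }
}
```

Parsing the file once and reusing the resulting map is usually preferable to re-reading it for every document.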

Controlling pre-filter checks

For document filters that implement the StringDocumentFilter or BytesDocumentFilter interface, a pre-filter check is used as a pre-condition to determine if the filter should run. This is commonly used to ensure the filter is only run on a document of a specific type, though any custom logic can be implemented here.

Restricting a filter to a document type is commonly achieved by checking the document’s MIME type. Three built-in functions are available to assist with checking for HTML, XML or JSON documents:

document.getDocumentType().isHTML() returns true if the document is an HTML document.
document.getDocumentType().isJSON() returns true if the document is a JSON document.
document.getDocumentType().isXML() returns true if the document is an XML document.

You are not limited to these functions and can implement whatever conditional logic is appropriate for the filter. However, if you wish to limit your filter to these types you should use these built-in conditions so that consistent behavior applies across filters.

Example: Filter all documents

Returning ATTEMPT_FILTER as a pre-filter check will ensure the filter runs for all documents.

public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
    return PreFilterCheck.ATTEMPT_FILTER;
}

Example: Only run a generic document filter on XML documents

The following pre-filter check ensures that the filter applies only to XML documents.

public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
    // Only run this filter on XML documents
    if (document.getDocumentType().isXML()) { // (1)
        return PreFilterCheck.ATTEMPT_FILTER; // (2)
    }
    return PreFilterCheck.SKIP_FILTER; // (3)
}

1	If the document type is XML
2	Filter the document
3	Skip the filter - don’t run this filter on the document and pass it on to the next filter in the chain.
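
Example: Combine a type check with custom logic

Because any conditional logic is allowed in a pre-filter check, a type check can be combined with other conditions. The sketch below is self-contained so that it can be run without the plugin framework: the PreFilterCheck enum mirrors the two values used in this guide, while DocumentInfo, getUri() and the /feeds/ path rule are hypothetical stand-ins and not the framework's own types.

```java
import java.net.URI;

class PreFilterCheckSketch {

    /** Stand-in mirroring the two outcomes used in this guide. */
    enum PreFilterCheck { ATTEMPT_FILTER, SKIP_FILTER }

    /** Hypothetical, simplified stand-in for the document passed to canFilter(). */
    interface DocumentInfo {
        boolean isXML();
        boolean isJSON();
        URI getUri();
    }

    /** Runs the filter only for XML or JSON documents whose path starts with /feeds/. */
    static PreFilterCheck canFilter(DocumentInfo document) {
        boolean structured = document.isXML() || document.isJSON();
        String path = document.getUri().getPath();
        boolean inFeeds = path != null && path.startsWith("/feeds/");
        return (structured && inFeeds) ? PreFilterCheck.ATTEMPT_FILTER : PreFilterCheck.SKIP_FILTER;
    }

    /** Small helper that builds stand-in documents for the demonstration. */
    static DocumentInfo doc(boolean xml, boolean json, String url) {
        URI uri = URI.create(url);
        return new DocumentInfo() {
            public boolean isXML() { return xml; }
            public boolean isJSON() { return json; }
            public URI getUri() { return uri; }
        };
    }

    public static void main(String[] args) {
        System.out.println(canFilter(doc(true, false, "https://example.com/feeds/a.xml")));
        System.out.println(canFilter(doc(false, false, "https://example.com/feeds/a.html")));
    }
}
```

Keeping the check cheap matters, since it is evaluated for every document before the filter itself runs.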

Metadata handling within generic document filters

A very common use case of custom filters involves the extraction or modification of metadata associated with the document.

The main filter chain provides a metadata multimap that is passed between filters in the chain.

This metadata multimap holds metadata that is added via metadata.put() calls in any of the generic document filters and enables downstream filters to directly access metadata added via a previous filter.

The metadata multimap does not include any metadata that is embedded within the document, such as HTML <meta> tags, or metadata added via HTML document (jsoup) filters. At the conclusion of the filter chain, the additional metadata is written as WARC header information when the filtered document is stored on disk.

Initialize the metadata object

In order to access metadata within a filter, the filter code needs to request a copy of the metadata object.

The getCopyOfMetadata() method can be used within a filter to populate a metadata object with any metadata that has been added to the document in previous filters.

ListMultimap<String, String> metadata = document.getCopyOfMetadata();

Adding metadata to the metadata object

Once you have a metadata object, you can add metadata to it using the put() method.

For example, to add an author metadata value of John Smith:

metadata.put("author", "John Smith");
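
The following sketch illustrates two properties of this pattern: a multimap put() appends a value rather than replacing existing ones, and changes made to a copy do not affect the original. It assumes the ListMultimap above is Guava's, and swaps in a plain Map<String, List<String>> so that it runs on the standard library alone. The copyOf and put helpers are hypothetical stand-ins, not framework methods.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetadataSketch {

    /** Stand-in for requesting a copy of the metadata: each key maps to its own list of values. */
    static Map<String, List<String>> copyOf(Map<String, List<String>> source) {
        Map<String, List<String>> copy = new LinkedHashMap<>();
        source.forEach((key, values) -> copy.put(key, new ArrayList<>(values)));
        return copy;
    }

    /** Mimics a multimap put(): appends a value instead of replacing existing ones. */
    static void put(Map<String, List<String>> metadata, String key, String value) {
        metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        // Metadata added by an earlier filter in the chain (simulated).
        Map<String, List<String>> fromEarlierFilters = new LinkedHashMap<>();
        put(fromEarlierFilters, "author", "Jane Doe");

        // This filter works on a copy and adds a second author value.
        Map<String, List<String>> metadata = copyOf(fromEarlierFilters);
        put(metadata, "author", "John Smith");

        System.out.println(metadata.get("author"));            // both values are kept
        System.out.println(fromEarlierFilters.get("author"));  // the original is untouched
    }
}
```

Since the filter works on a copy, the updated metadata presumably has to be attached to the document that the filter returns before later filters can see it.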

Automatically setting up metadata mappings

When developing a filter plugin you can also implement the plugin indexing interface which provides methods for automatically configuring metadata mappings.

Use this if your filter is adding metadata that you need to map as part of the setup of your data source.

Logging

Log messages for filtering will appear in the gatherer’s filter logs.

See: debugging and logging guide

Determining the filter class to add to the filter chain

The filter class name to add to your filter chain is determined by concatenating the filter’s package and class names.

For example:

package com.example.pluginexamples; // (1)

public class ExampleFilter implements StringDocumentFilter { // (2)

    @Override
    public FilterResult filterAsStringDocument(StringDocument document, FilterContext filterContext) {
        // Filter logic
    }
}

1	The package name is `com.example.pluginexamples`
2	The public class name is `ExampleFilter`

The filter is added to the data source’s filter chain by adding com.example.pluginexamples.ExampleFilter to the filter.classes setting for the data source.

Generic document filter examples

Manipulating string (non binary) documents: A simple way of filtering a document as a string.
Filters which read collection configuration options: Demonstrates a filter which reads options from the data source configuration.
Filters which read a custom configuration file: Shows how to read from a custom configuration file for jsoup and general document filters.
Adding metadata based on document content: Demonstrates adding metadata to the document based on the document content.
Adding metadata to all documents: Demonstrates adding metadata to a document regardless of the document content.
Accessing the filtered metadata: Shows how to iterate over the multimap containing metadata added via filters.
Modifying document URLs: Demonstrates modifying the document URL for any document.
Filtering a document into multiple documents: Demonstrates splitting a single input document into multiple documents.
Filtering an HTML document into multiple documents: Demonstrates splitting a single HTML document into multiple HTML documents.
Removing a document: Demonstrates removing a document using the filters, typically resulting in that document not being available in the search index.
Altering the document type: Demonstrates fixing the document type based on the content of the document.
Converting raw byte (binary) documents to Strings: This might be done when converting from a binary format such as PDF to a text format such as HTML or XML.
Manipulating raw byte (binary) documents: Demonstrates how to base64 encode a binary document.
