Writing generic document filters

This guide covers generic document filters. For the equivalent guide for HTML documents, see Writing custom HTML document (jsoup) filters. If you are unsure which type of filter you need, see Transform or analyze content before it is indexed.

A custom filter can be developed by writing a plugin that implements one of the plugin framework’s filtering interfaces.

A plugin can implement one or more document filters which enable documents to be manipulated before they are indexed by Funnelback.

For developers familiar with Funnelback’s deprecated custom Groovy filters, the process of implementing a plugin filter is almost identical; however, the filter must be written in pure Java rather than in Groovy.

Interface methods

Each filter implemented by a plugin must implement one of Funnelback’s three filter interfaces: StringDocumentFilter, BytesDocumentFilter or Filter. The interface to implement depends on whether the filter requires access to the document’s content, and whether that content should be interpreted as text or binary data.

StringDocumentFilter

Processes a text document as a text string. Use this for filtering string (non-binary) content such as plain text, XML, HTML or JSON. See manipulating string documents for an example.

BytesDocumentFilter

Processes a binary document as a raw byte stream. Use this for filtering binary content. See Converting raw byte (binary) documents to strings for an example.

Filter

Processes a document with no access to content. Used for filters that do not edit or read the document content but should run on all documents. Can read/write the document metadata object. See Removing a document for an example.

Constructors

A constructor can be used to perform operations that should occur once, when the filter chain is initialized. For example, reading configuration into variables that can be accessed from the filter.

Generally, a no-argument constructor is sufficient; however, another constructor is available if you need access to the search home or data source ID.

The filter framework will automatically call one of the constructors listed below. For a filter class named MyFilter, the following constructors could be used:

No argument constructor

A constructor that takes no arguments.

public class MyFilter implements Filter {
    public MyFilter() {
       // Your constructor code here.
    }
}

Constructor given search home and data source ID

This constructor receives the search home as a java.io.File and the data source ID as a String. If both constructors are present, this one is called in preference to the no-argument constructor.

import java.io.File;

public class MyFilter implements Filter {
    public MyFilter(File searchHome, String dataSourceId) {
       // Your constructor code here.
    }
}
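
The constructor preference described above can be illustrated with a small, self-contained sketch. This is not the framework's actual loading code; FilterLoaderSketch, TwoConstructorFilter and NoArgOnlyFilter are hypothetical names, and the sketch only assumes that a loader tries the (File, String) constructor first and falls back to the no-argument one.

```java
import java.io.File;
import java.lang.reflect.Constructor;

// Hypothetical filter declaring both constructors; records which one ran.
class TwoConstructorFilter {
    final String createdBy;

    public TwoConstructorFilter() {
        this.createdBy = "no-arg";
    }

    public TwoConstructorFilter(File searchHome, String dataSourceId) {
        this.createdBy = "searchHome+dataSourceId";
    }
}

// Hypothetical filter declaring only the no-argument constructor.
class NoArgOnlyFilter {
    final String createdBy = "no-arg";

    public NoArgOnlyFilter() {
    }
}

// Illustrative loader (an assumption, NOT the real framework implementation).
class FilterLoaderSketch {
    static <T> T instantiate(Class<T> type, File searchHome, String dataSourceId) {
        try {
            try {
                // Prefer the (File, String) constructor when the class declares one.
                Constructor<T> preferred = type.getDeclaredConstructor(File.class, String.class);
                return preferred.newInstance(searchHome, dataSourceId);
            } catch (NoSuchMethodException e) {
                // Otherwise fall back to the no-argument constructor.
                return type.getDeclaredConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + type.getName(), e);
        }
    }
}
```

In this sketch a class that declares both constructors is always built through the two-argument one, so any setup placed only in the no-argument constructor would not run.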

Providing configuration for a filter

Custom filters can be configured via configuration keys set in the data source configuration, or via custom configuration files which are saved with the data source configuration.

The plugin framework provides built-in support for configuration keys and configuration files. This information can be used to configure the filter so that it is tailored for the data source on which it is enabled.

Configuration should be loaded from the filter’s constructor so that it is only loaded once for an update. If the configuration is loaded within the main filter method, it will be reloaded for every document that is processed.
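
The difference between the two loading strategies can be seen in the following self-contained sketch. CountingConfigSketch is a hypothetical stand-in for whatever configuration source the plugin framework provides, and the key name example.suffix is invented for illustration; only the load-once versus load-per-document pattern is the point.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a configuration source; counts how often it is read.
class CountingConfigSketch {
    final AtomicInteger reads = new AtomicInteger();
    private final Map<String, String> keys;

    CountingConfigSketch(Map<String, String> keys) {
        this.keys = keys;
    }

    String get(String key, String defaultValue) {
        reads.incrementAndGet();
        return keys.getOrDefault(key, defaultValue);
    }
}

// Reads configuration once, when the filter is constructed.
class ConstructorLoadedFilter {
    private final String suffix;

    ConstructorLoadedFilter(CountingConfigSketch config) {
        this.suffix = config.get("example.suffix", "");
    }

    String filter(String content) {
        return content + suffix;
    }
}

// Reads configuration again for every document: the pattern to avoid.
class PerDocumentLoadedFilter {
    private final CountingConfigSketch config;

    PerDocumentLoadedFilter(CountingConfigSketch config) {
        this.config = config;
    }

    String filter(String content) {
        return content + config.get("example.suffix", "");
    }
}
```

Filtering three documents triggers one configuration read with the first class and three with the second; with a real configuration file the extra reads would also mean extra disk access.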

Reading configuration keys

Configuration keys can be used to provide simple key/value pair configuration for a filter.

Reading configuration from a file

A configuration file can be used if your filter has advanced configuration needs, such as providing sets of rules that configure the filter. Configuration files can be specified as either plain text or JSON.
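
As a rough illustration of a plain text rules file, the sketch below parses one rule per line. The find=replace line format, the # comment convention and the RuleFileSketch class are all assumptions made for this example; how the plugin framework locates and supplies the file is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RuleFileSketch {
    // Parses lines of the hypothetical form "find=replace", keeping file order.
    static Map<String, String> parseRules(List<String> lines) {
        Map<String, String> rules = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int separator = line.indexOf('=');
            if (separator <= 0) {
                continue; // ignore lines without a key or separator
            }
            rules.put(line.substring(0, separator).trim(),
                      line.substring(separator + 1).trim());
        }
        return rules;
    }

    // Reads the rules file from disk; intended to be called once from a constructor.
    static Map<String, String> loadRules(Path file) {
        try {
            return parseRules(Files.readAllLines(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the parsing separate from the file access makes the rule format easy to unit test without touching the file system.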

Controlling pre-filter checks

For document filters that implement the StringDocumentFilter or BytesDocumentFilter interface, a pre-filter check is used as a pre-condition to determine if the filter should run. This is commonly used to ensure the filter is only run on a document of a specific type, though any custom logic can be implemented here.

Restricting a filter to a document type is commonly achieved by checking the document’s MIME type. Three built-in functions are available to assist with checking for HTML, XML or JSON documents.

  • document.getDocumentType().isHTML() returns true if the document is an HTML document.

  • document.getDocumentType().isJSON() returns true if the document is a JSON document.

  • document.getDocumentType().isXML() returns true if the document is an XML document.

Example: Only run a generic document filter on XML documents

The following pre-filter check ensures that the filter applies only to XML documents.

public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
  // Only run this filter on XML documents
  if (document.getDocumentType().isXML()) {
    return PreFilterCheck.ATTEMPT_FILTER;
  }
  return PreFilterCheck.SKIP_FILTER;
}
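
Custom logic in a pre-filter check can combine several conditions. The sketch below is deliberately detached from the plugin API so that it compiles on its own: PreFilterCheckSketch is a stand-in for the framework's PreFilterCheck, and the MIME type and URI are passed in as plain values rather than read from the document object. The CSV and /exports/ conditions are invented for illustration.

```java
import java.net.URI;
import java.util.Locale;

// Stand-in for the framework's PreFilterCheck values (assumption for this sketch).
enum PreFilterCheckSketch { ATTEMPT_FILTER, SKIP_FILTER }

class CsvExportCheckSketch {
    // Runs the filter only for CSV documents located under /exports/.
    static PreFilterCheckSketch canFilter(String mimeType, URI uri) {
        // Match on the media type prefix so parameters such as charset are ignored.
        boolean isCsv = mimeType != null
                && mimeType.toLowerCase(Locale.ROOT).startsWith("text/csv");
        String path = uri.getPath();
        boolean inScope = path != null && path.startsWith("/exports/");
        return (isCsv && inScope)
                ? PreFilterCheckSketch.ATTEMPT_FILTER
                : PreFilterCheckSketch.SKIP_FILTER;
    }
}
```

In a real filter the same decision logic would sit inside canFilter, using values obtained from the document passed to that method.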

Metadata handling within generic document filters

A very common use case of custom filters involves the extraction or modification of metadata associated with the document.

The main filter chain provides a metadata multimap that is passed between filters in the chain.

This metadata multimap holds metadata that is added via metadata.put() calls in any of the generic document filters and enables downstream filters to directly access metadata added via a previous filter.

The metadata multimap does not include any metadata that is embedded within the document, such as HTML <meta> tags, or metadata added via HTML document (jsoup) filters. At the conclusion of the filter chain the additional metadata is written as WARC header information when the filtered document is stored on disk.
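
The way metadata flows from one filter to the next can be pictured with the following self-contained sketch. It uses a Map of lists as a stand-in for the multimap so that it runs without any external library, and the two filter methods and the authorSurname key are hypothetical; it models the idea of a chain, not the framework's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetadataChainSketch {
    // Multimap-style put: appends the value to the list stored under the key.
    static void put(Map<String, List<String>> metadata, String key, String value) {
        metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // First filter in the chain: adds an author value.
    static Map<String, List<String>> authorFilter(Map<String, List<String>> incoming) {
        Map<String, List<String>> metadata = new LinkedHashMap<>(incoming);
        put(metadata, "author", "John Smith");
        return metadata;
    }

    // Downstream filter: reads the author added earlier and derives a new value.
    static Map<String, List<String>> surnameFilter(Map<String, List<String>> incoming) {
        Map<String, List<String>> metadata = new LinkedHashMap<>(incoming);
        for (String author : incoming.getOrDefault("author", new ArrayList<>())) {
            put(metadata, "authorSurname", author.substring(author.lastIndexOf(' ') + 1));
        }
        return metadata;
    }

    // Runs both filters in order, starting from empty metadata.
    static Map<String, List<String>> runChain() {
        return surnameFilter(authorFilter(new LinkedHashMap<>()));
    }
}
```

Because the second filter depends on a value written by the first, the order of filters in the chain matters in this sketch.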

Initialize the metadata object

In order to access metadata within a filter, the filter code needs to request a copy of the metadata object.

The getCopyOfMetadata() method can be used within a filter to populate a metadata object with any metadata that has been added to the document in previous filters.

ListMultimap<String, String> metadata = document.getCopyOfMetadata();

Adding metadata to the metadata object

Once you have a metadata object, you can add metadata to it using the put() method.

For example, to add an author metadata value of John Smith:

metadata.put("author", "John Smith");
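
Two properties of this pattern are worth illustrating: a multimap keeps every value put under the same key, and changes made to a copy do not alter the document the copy came from. The sketch below models both with plain Java collections. DocumentSketch and MultimapPutSketch are stand-ins, and withMetadata is a hypothetical method name; the framework call that attaches updated metadata to a document is not covered in this section.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for an immutable document that only hands out copies of its metadata.
final class DocumentSketch {
    private final Map<String, List<String>> metadata;

    DocumentSketch(Map<String, List<String>> metadata) {
        this.metadata = deepCopy(metadata);
    }

    Map<String, List<String>> getCopyOfMetadata() {
        return deepCopy(metadata);
    }

    // Hypothetical: returns a new document carrying the updated metadata.
    DocumentSketch withMetadata(Map<String, List<String>> updated) {
        return new DocumentSketch(updated);
    }

    private static Map<String, List<String>> deepCopy(Map<String, List<String>> source) {
        Map<String, List<String>> copy = new LinkedHashMap<>();
        source.forEach((key, values) -> copy.put(key, new ArrayList<>(values)));
        return copy;
    }
}

class MultimapPutSketch {
    // Multimap-style put: a second put with the same key adds a value, it does not replace.
    static void put(Map<String, List<String>> metadata, String key, String value) {
        metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }
}
```

In this model, putting two author values leaves the original document unchanged until a new document is created from the modified copy.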

Automatically setting up metadata mappings

When developing a filter plugin you can also implement the plugin indexing interface which provides methods for automatically configuring metadata mappings.

Use this if your filter is adding metadata that you need to map as part of the setup of your data source.

Logging

Log messages for filtering will appear in the gatherer’s filter logs.

Generic document filter examples