Writing generic document filters

This guide covers generic document filters. See Writing custom HTML document (jsoup) filters for the equivalent guide for HTML documents, or Transform or analyze content before it is indexed if you are unsure which type of filter you need.

A document filter can be developed by writing a plugin that implements one of the plugin framework’s filtering interfaces.

A plugin can implement one or more document filters which enable documents to be manipulated before they are indexed by Funnelback.

For developers who are familiar with Funnelback’s deprecated custom Groovy filters, the process of implementing a plugin filter is almost identical. However, the filter must be written in pure Java rather than in Groovy.

Plugin scopes

The plugin scope for a plugin that implements filtering must include the runs on datasource scope.

Maven archetype template options

A document filter plugin template can be generated using the Maven archetype template.

  • When using the interactive mode of Maven archetype template:

    • type true for Define value for property 'filtering'

    • type true for Define value for property 'runs-on-datasource'

  • When using the non-interactive mode of Maven archetype template:

    • set the flag -Dfiltering=true

    • set the flag -Druns-on-datasource=true

If you create a filter that extracts and adds metadata it is likely you will want to enable indexing templates to store metadata mappings.

Interface methods

Each filter implemented by a plugin must implement one of Funnelback’s three filter interfaces: StringDocumentFilter, BytesDocumentFilter or Filter. The interface to implement depends on whether the filter requires access to the document’s content, and whether that content should be interpreted as text or binary data.

StringDocumentFilter

Processes a text document as a UTF-8 encoded text string. For filtering string (text/non-binary) content e.g. plain text, XML, HTML, JSON, etc. See manipulating string documents for an example.

BytesDocumentFilter

Processes a binary document as a raw byte stream. For filtering binary content. See Converting raw byte (binary) documents to strings for an example.

Filter

Processes a document with no access to content. Used for filters that do not edit or read the document content but should run on all documents. Can read/write the document metadata object. See Removing a document for an example.

Providing configuration for a filter

Custom filters can be configured via configuration keys set in the data source configuration, or via custom configuration files which are saved with the data source configuration.

The plugin framework provides built-in support for configuration keys and configuration files. This information can be used to configure the filter so that it is tailored for the data source on which it is enabled.

Reading configuration keys

Configuration keys can be used to provide simple key/value pair configuration for a filter.
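The idea of reading a key with a fallback value can be sketched as follows. This is a minimal, self-contained illustration only: a plain Map stands in for the data source configuration, and the class name ConfigKeySketch, the helper readSetting and the key name are all hypothetical. The plugin framework's own configuration access is not reproduced here.

```java
import java.util.Map;
import java.util.Optional;

// Sketch only: a Map stands in for the data source configuration.
class ConfigKeySketch {

    // Hypothetical key name, used purely for illustration.
    static final String AUTHOR_KEY = "plugin.example.config.default-author";

    // Returns the configured value, or the fallback when the key is unset or blank.
    static String readSetting(Map<String, String> config, String key, String fallback) {
        return Optional.ofNullable(config.get(key))
                .map(String::trim)
                .filter(value -> !value.isEmpty())
                .orElse(fallback);
    }

    public static void main(String[] args) {
        Map<String, String> configured = Map.of(AUTHOR_KEY, "John Smith");
        Map<String, String> empty = Map.of();

        System.out.println(readSetting(configured, AUTHOR_KEY, "Unknown"));
        System.out.println(readSetting(empty, AUTHOR_KEY, "Unknown"));
    }
}
```

Treating a blank value the same as a missing key is a design choice made for this sketch; a real filter may prefer to log a warning instead.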

Reading configuration from a file

A configuration file can be used if your filter has advanced configuration needs, such as providing sets of rules to configure a filter. Configuration files can be specified as either plain text or JSON.

Controlling pre-filter checks

For document filters that implement the StringDocumentFilter or BytesDocumentFilter interface, a pre-filter check is used as a pre-condition to determine if the filter should run. This is commonly used to ensure the filter is only run on a document of a specific type, though any custom logic can be implemented here.

Restricting a filter to a document type is commonly achieved by checking the document’s MIME type. Three built-in functions are available to assist with checking for HTML, XML or JSON documents.

  • document.getDocumentType().isHTML() returns true if the document is an HTML document.

  • document.getDocumentType().isJSON() returns true if the document is a JSON document.

  • document.getDocumentType().isXML() returns true if the document is an XML document.

You are not limited to these functions and can implement whatever conditional logic is appropriate for the filter. However, if you wish to limit your filter to these types you should use these built-in conditions so that consistent behavior applies across filters.

Example: Filter all documents

Returning ATTEMPT_FILTER as a pre-filter check will ensure the filter runs for all documents.

public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
    return PreFilterCheck.ATTEMPT_FILTER;
}

Example: Only run a generic document filter on XML documents

The following pre-filter check ensures that the filter applies only to XML documents.

public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
  // Only run this filter on XML documents
  if (document.getDocumentType().isXML()) { (1)
    return PreFilterCheck.ATTEMPT_FILTER; (2)
  }
  return PreFilterCheck.SKIP_FILTER; (3)
}
1 If the document type is XML
2 Filter the document
3 Skip the filter - don’t run this filter on the document and pass it on to the next filter in the chain.
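Custom conditional logic in a pre-filter check might look something like the sketch below. To keep it runnable on its own, it does not use the plugin framework: the enum is a local stand-in that mirrors the two outcomes used in this guide, the MIME type is passed as a plain string, and the class name PreFilterSketch and the CSV condition are hypothetical.

```java
import java.util.Locale;

// Sketch only: local stand-ins replace the plugin framework types.
class PreFilterSketch {

    // Local stand-in mirroring the two outcomes used in this guide.
    enum PreFilterCheck { ATTEMPT_FILTER, SKIP_FILTER }

    // Hypothetical custom condition: attempt filtering only for CSV content.
    static PreFilterCheck canFilter(String mimeType) {
        if (mimeType == null) {
            return PreFilterCheck.SKIP_FILTER;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String baseType = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return baseType.equals("text/csv")
                ? PreFilterCheck.ATTEMPT_FILTER
                : PreFilterCheck.SKIP_FILTER;
    }

    public static void main(String[] args) {
        System.out.println(canFilter("text/csv; charset=UTF-8"));
        System.out.println(canFilter("application/pdf"));
    }
}
```

Normalizing the MIME type before comparing avoids skipping documents whose type carries a charset parameter or different letter case.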

Metadata handling within generic document filters

A very common use case of custom filters involves the extraction or modification of metadata associated with the document.

The main filter chain provides a metadata multimap that is passed between filters in the chain.

This metadata multimap holds metadata that is added via metadata.put() calls in any of the generic document filters and enables downstream filters to directly access metadata added via a previous filter.

The metadata multimap does not include any metadata that is embedded within the document, such as HTML <meta> tags, or metadata added via HTML document (jsoup) filters. At the conclusion of the filter chain, the additional metadata is written as WARC header information when the filtered document is stored on disk.

Initialize the metadata object

In order to access metadata within a filter, the filter code needs to request a copy of the metadata object.

The getCopyOfMetadata() method can be used within a filter to populate a metadata object with any metadata that has been added to the document in previous filters.

ListMultimap<String, String> metadata = document.getCopyOfMetadata();

Adding metadata to the metadata object

Once you have a metadata object, you can add metadata to it using the put() method.

e.g. add an author metadata value of John Smith:

metadata.put("author", "John Smith");
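Because the metadata object is a multimap, a key can hold several values. The sketch below illustrates list-multimap behavior using only the Java standard library, on the assumption that the metadata object behaves like a typical list multimap in which put() appends a value rather than replacing the existing one. The class MetadataMultimapSketch is a hypothetical stand-in, not the type used by the plugin framework.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a minimal list multimap built on the standard library.
class MetadataMultimapSketch {

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Appends the value to the list held for the key; earlier values are kept.
    void put(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Returns an immutable snapshot of the values for the key (empty if none).
    List<String> get(String key) {
        return List.copyOf(values.getOrDefault(key, List.of()));
    }

    public static void main(String[] args) {
        MetadataMultimapSketch metadata = new MetadataMultimapSketch();
        metadata.put("author", "John Smith");
        metadata.put("author", "Jane Doe");

        System.out.println(metadata.get("author"));
        System.out.println(metadata.get("publisher"));
    }
}
```

Under that assumption, a filter that needs a single value for a key would have to clear the existing values before adding the new one.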

Automatically setting up metadata mappings

When developing a filter plugin you can also implement the plugin indexing interface which provides methods for automatically configuring metadata mappings.

Use this if your filter is adding metadata that you need to map as part of the setup of your data source.

Logging

Log messages for filtering will appear in the gatherer’s filter logs.

Determining the filter class to add to the filter chain

The filter class name to add to your filter chain is the filter’s fully qualified class name: the package name and the class name joined by a dot.

e.g.

package com.example.pluginexamples; (1)

public class ExampleFilter implements StringDocumentFilter { (2)

    @Override
    public FilterResult filterAsStringDocument(StringDocument document, FilterContext filterContext) {
        // Filter logic
    }
}
1 The package name is com.example.pluginexamples
2 The public class name is ExampleFilter

This filter is added to the data source’s filter chain by adding com.example.pluginexamples.ExampleFilter to the filter.classes configuration key for the data source.

Generic document filter examples