Writing custom Groovy filters

This feature is not available to users of the Squiz Experience Cloud version of Funnelback. Equivalent functionality is available using plugins.

Filtering is the process of transforming gathered content into content suitable for indexing by Funnelback.

A filter takes a stream of data representing the document content and metadata and implements custom logic to transform or extract information from the content. The transformed content and metadata are then output and either fed into another filter or written to disk for indexing.

Before you start

Before writing a custom Groovy filter, make sure that you:

  • Understand how the filter chain functions.

  • Understand the difference between generic document filters and HTML document (jsoup) filters.

  • Ensure you are familiar with the Groovy programming language.

Writing custom Groovy filters

Custom Groovy filters must implement one of the following interfaces, which take the input document content in a number of forms:

Filter interface | Input content | Description
StringDocumentFilter | Document content as a string | For filtering string (non-binary) content, e.g. plain text, XML, HTML or JSON. See manipulating string documents for an example.
BytesDocumentFilter | Document content as raw bytes | For filtering binary content. See Converting raw byte (binary) documents to strings for an example.
Filter | Document with no access to content | Used for filters that do not edit or read the document content but should run on all documents. Can read/write the document metadata object. See Removing a document for an example.

For HTML document manipulation, use an HTML document (jsoup) filter that implements the following interface:

Filter interface | Input content | Description
IJSoupFilter | Content as a jsoup object and metadata | For filtering HTML documents. See jsoup filter.

Detailed filter framework documentation is included in the Javadoc documentation.

A series of filter examples is presented below to illustrate the implementation of various filter types.

Naming and locations

Collection-level Groovy filters should be stored in the collection’s @groovy folder ($SEARCH_HOME/conf/COLLECTION_NAME/@groovy) under a folder structure that corresponds to the filter’s package name.

The folder should contain the .groovy files, any compiled .class files and any required dependencies.

Example

If our filter looked like:

package com.example;

import com.funnelback.filter.api.filters.*;
import com.funnelback.common.filter.jsoup.*;

public class MyFilter implements Filter {
    // Filter implementation goes here.
}

Because the package name is com.example and the class name of the filter is MyFilter, the filter must be stored at:

$SEARCH_HOME/conf/COLLECTION_NAME/@groovy/com/example/MyFilter.groovy
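The mapping from package and class name to file location can be sketched in a few lines. The following is a minimal illustration in plain Java, not part of Funnelback: the class FilterPathSketch, its method names, and the /opt/funnelback and example-collection values are all hypothetical.

```java
// Hypothetical helper, for illustration only: not a Funnelback class.
final class FilterPathSketch {

    // Converts a fully qualified class name such as "com.example.MyFilter"
    // into the path of its source file relative to the @groovy folder.
    static String relativePath(String fullyQualifiedName) {
        return fullyQualifiedName.replace('.', '/') + ".groovy";
    }

    // Builds the full location under the collection's @groovy folder.
    // searchHome and collection are assumed example values supplied by the caller.
    static String absolutePath(String searchHome, String collection, String fullyQualifiedName) {
        return searchHome + "/conf/" + collection + "/@groovy/" + relativePath(fullyQualifiedName);
    }

    public static void main(String[] args) {
        System.out.println(absolutePath("/opt/funnelback", "example-collection", "com.example.MyFilter"));
    }
}
```

Each dot in the package name becomes a folder separator, and the class name becomes the file name with a .groovy extension.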

Configuring the filter to be used

Typically you can add your custom Groovy filter to the filter chain by using its fully qualified name, that is, its package name followed by the class name. For the above example, the fully qualified name is com.example.MyFilter. Add it to the filter chain in the collection configuration by appending a comma and the fully qualified name to the filter.classes option.

Example

filter.classes=TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider,com.example.MyFilter

See filter.classes for more information about configuring the filter chain.
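The append step amounts to a simple string operation. The sketch below, in plain Java, shows one way it could be done when editing the value programmatically; FilterChainSketch and appendFilter are hypothetical names, and the duplicate check only looks at comma-separated entries, which is an assumption made to keep the example short.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only: not a Funnelback class.
final class FilterChainSketch {

    // Returns the filter.classes value with the given fully qualified
    // class name appended after a comma. If the name already appears as a
    // comma-separated entry, the value is returned unchanged.
    static String appendFilter(String filterClasses, String fullyQualifiedName) {
        String current = filterClasses == null ? "" : filterClasses.trim();
        if (current.isEmpty()) {
            return fullyQualifiedName;
        }
        if (Arrays.asList(current.split(",")).contains(fullyQualifiedName)) {
            return current;
        }
        return current + "," + fullyQualifiedName;
    }

    public static void main(String[] args) {
        String chain = "TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider";
        System.out.println("filter.classes=" + appendFilter(chain, "com.example.MyFilter"));
    }
}
```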

Importing external dependencies for use with Groovy scripts

Java classes can be utilised from Groovy scripts via import statements.

Dependencies should be grabbed before they are imported to minimise any chance of the Groovy script breaking after an upgrade.

Example

Ensure that the com.twitter.twittertext.Extractor class can be imported from version 3.0.1 of the twitter-text library.

@Grab(group='com.twitter.twittertext', module='twitter-text', version='3.0.1')
@GrabExclude('org.jsoup:jsoup') // Don't want conflicting versions

import com.twitter.twittertext.Extractor

Ensuring a filter only runs on certain file types

For filters that implement StringDocumentFilter or BytesDocumentFilter, a pre-filter check is used to determine if the filter should run. This is commonly set to only run on a document of a specific type, though any custom logic can be implemented here.

Restricting a filter to a document type is commonly achieved by checking the document’s MIME type. Three built-in functions are available to assist with checking for HTML, XML or JSON documents.

  • document.getDocumentType().isHTML() returns true if the document is an HTML document.

  • document.getDocumentType().isJSON() returns true if the document is a JSON document.

  • document.getDocumentType().isXML() returns true if the document is an XML document.

Example

The following pre-filter check ensures that the filter applies only to HTML documents.

public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
  // Only run this filter on HTML documents
  if (document.getDocumentType().isHTML()) {
    return PreFilterCheck.ATTEMPT_FILTER;
  }
  return PreFilterCheck.SKIP_FILTER;
}

Logging

When developing and debugging a filter it’s usually convenient to be able to write debug messages. To have the debug messages written to the collection filter.log (or to the inline filter log, depending on your collection settings) you should use the Log4j logging framework. To do so:

  • Annotate your class with @groovy.util.logging.Log4j2

  • Use the log object that the annotation provides in your methods to output debug messages: log.info("Filtering content: [{}]", content)

By default, only messages at the WARN level or above are output. That means that your messages will only appear if:

  • You use log.warn(), log.error() or log.fatal(), or

  • You reconfigure the logging system for your specific namespace so that log.info(), log.debug() or log.trace() messages are output.

If the package of the class of the filter is within the "com.funnelback" namespace, the default log level of messages is set to INFO.

The logging system can be configured using SEARCH_HOME/conf/log4j2.xml.default as a starting point:

  • Either copy it to SEARCH_HOME/conf/log4j2.xml to have your configuration apply to all collections

  • Or copy it to SEARCH_HOME/conf/<collection>/log4j2.xml to apply it to this collection only.
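As a sketch of the reconfiguration described above, a logger entry along the following lines could be added inside the Loggers element of the copied log4j2.xml to lower the threshold to INFO for a filter's namespace. The layout of the shipped log4j2.xml.default is not reproduced here, so the surrounding elements are assumptions, and com.example stands in for your own package name. The Logger element with its name and level attributes is standard Log4j 2 configuration.

```xml
<Loggers>
  <!-- Assumption: com.example is the package of your custom filter -->
  <Logger name="com.example" level="info"/>
  <!-- Loggers already present in the copied file stay as they are -->
</Loggers>
```

With an entry like this in place, log.info() calls from classes in the com.example package should be written to the log, while other namespaces keep their existing levels.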

Testing filters

For jsoup filters, see testing jsoup filters, which support a simplified test system.

For general filters, tests can be added to the FilterTest inner class contained within the filter. Methods that are annotated with @Test will be run when executing the Groovy filter on the command line.

See: Basic filter example for an example of implementing and running tests on a filter.

Running filter tests on the command line

A filter should pass all of its tests before it is added to the collection’s filter chain. To run the tests for a Groovy filter, run:

$SEARCH_HOME/linbin/java/bin/java -cp $SEARCH_HOME/lib/java/all/*:$SEARCH_HOME/tools/groovy/bin/groovy groovy.ui.GroovyMain $SEARCH_HOME/conf/COLLECTION_NAME/@groovy/com/myfilters/ExampleGroovyFilter.groovy

The output should show the tests are passing.

Constructors

Generally you can use a no-argument constructor. However, other constructors are available if you need access to the search home or collection name. The filter framework will automatically call one of the constructors listed below. For the MyFilter example above, the following constructors could be used:

No argument constructor

A constructor that takes no arguments.

public class MyFilter implements Filter {
    public MyFilter() {
        // Your constructor code here.
    }

    // Filter methods go here.
}

Constructor given search home and collection name

This constructor is given the search home as a java.io.File and the collection name as a String. If both constructors are present, this one is called in preference to the no-argument constructor.

import java.io.File;

public class MyFilter implements Filter {
    public MyFilter(File searchHome, String collectionName) {
        // Your constructor code here.
    }

    // Filter methods go here.
}
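The constructor preference described above can be pictured with a small reflection sketch in plain Java. This is not the filter framework's code, whose internals are not shown in this document. The class ConstructorPreferenceSketch, its create method, and the two stand-in classes are hypothetical, and the only behaviour taken from the text is that the (File, String) constructor is tried before the no-argument one.

```java
import java.io.File;

// Hypothetical illustration of constructor preference; not the framework's code.
final class ConstructorPreferenceSketch {

    // Stand-in for a filter that only declares a no-argument constructor.
    static class NoArgOnly {
        final String builtWith;
        public NoArgOnly() { this.builtWith = "no-arg"; }
    }

    // Stand-in for a filter that declares both constructors.
    static class Both {
        final String builtWith;
        public Both() { this.builtWith = "no-arg"; }
        public Both(File searchHome, String collectionName) {
            this.builtWith = "searchHome+collection:" + collectionName;
        }
    }

    // Tries the (File, String) constructor first and falls back to the
    // no-argument constructor when the two-argument one does not exist.
    static <T> T create(Class<T> type, File searchHome, String collectionName) {
        try {
            try {
                return type.getConstructor(File.class, String.class)
                           .newInstance(searchHome, collectionName);
            } catch (NoSuchMethodException e) {
                return type.getConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not construct " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        File home = new File("/opt/funnelback"); // assumed example path
        System.out.println(create(Both.class, home, "example").builtWith);
        System.out.println(create(NoArgOnly.class, home, "example").builtWith);
    }
}
```

In practice this means a filter that needs the search home or collection name only has to declare the two-argument constructor. A filter that needs neither can rely on the no-argument one.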

Filter examples

© 2015- Squiz Pty Ltd