Groovy Filters

Introduction

If you need to modify data before it is indexed by Funnelback, you can write a Groovy script to do so. The script is called as part of the filtering framework. This document outlines the basics of Groovy filter development. See filter examples for a list of examples demonstrating other features of the filter framework.

Groovy

The Groovy programming language is very similar to Java, and in most cases a valid Java class is also a valid Groovy class. Groovy does, however, provide a lot of syntactic sugar to avoid some of the verbosity of Java. One good example is the Groovy syntax for regular expressions.
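
As a rough illustration of that regular expression syntax, the sketch below uses slashy strings together with the find (=~) and match (==~) operators. The helper names extractTitle and isPdfName are hypothetical and exist only for this example; they are not part of Funnelback.

```groovy
// Hypothetical helper: returns the text of the first <title> element, or null.
// The slashy string /.../ avoids Java-style double escaping of backslashes.
String extractTitle(String html) {
    // =~ builds a java.util.regex.Matcher, roughly equivalent to
    // Pattern.compile("<title>(.*?)</title>").matcher(html) in Java.
    def matcher = html =~ /<title>(.*?)<\/title>/
    return matcher.find() ? matcher.group(1) : null
}

// Hypothetical helper: ==~ is true only when the whole string matches.
boolean isPdfName(String name) {
    return name ==~ /.*\.pdf/
}

assert extractTitle("<html><head><title>Hello</title></head></html>") == "Hello"
assert isPdfName("report.pdf")
assert !isPdfName("report.pdf.bak")
```

The same logic in plain Java would need explicit Pattern and Matcher objects and escaped backslashes inside string literals.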

Groovy filter file locations

The Groovy filters should be placed in:

  • $SEARCH_HOME/lib/java/groovy/ or
  • $SEARCH_HOME/conf/<collection>/@groovy/ (v14.2 and above)

Placing filters in these locations will ensure that the filter is loaded as expected and will facilitate the configuration of the logging system.

A Groovy filter's location must reflect its package name, and the file name must be the same as the class name.

For example, if a filter is defined to be part of the com.myfilters package (i.e. package com.myfilters;), it must be stored at one of the following paths:

  • $SEARCH_HOME/lib/java/groovy/com/myfilters/ExampleGroovyFilter.groovy

or

  • $SEARCH_HOME/conf/<collection>/@groovy/com/myfilters/ExampleGroovyFilter.groovy

Filters placed under $SEARCH_HOME/lib/java/groovy/ will be available for all collections, whereas filters under the @groovy folder will only be available to the collection they belong to.
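
The relationship between a fully qualified class name and its location on disk can be sketched as follows. The filterPath helper is hypothetical and written only to illustrate the naming rule; the base directories are passed as literal strings and are not resolved against a real installation.

```groovy
// Hypothetical helper: maps a fully qualified class name to the expected
// .groovy file path below a given base directory.
String filterPath(String baseDir, String className) {
    // Each package segment becomes a directory; the class name becomes the file name.
    return baseDir + "/" + className.replace(".", "/") + ".groovy"
}

// Single quotes keep $SEARCH_HOME as literal text (no GString interpolation).
assert filterPath('$SEARCH_HOME/lib/java/groovy', 'com.myfilters.ExampleGroovyFilter') ==
    '$SEARCH_HOME/lib/java/groovy/com/myfilters/ExampleGroovyFilter.groovy'
```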

Basic Example

Here is a very simple example Groovy filter script to use as a starting point. Copy this filter into $SEARCH_HOME/conf/<collection>/@groovy/com/myfilters/ExampleGroovyFilter.groovy

// Filename:  ExampleGroovyFilter.groovy
//
// This package declaration should match the .groovy file location
// on disk as well as the name used for the filter.classes collection.cfg setting.
package com.myfilters;

import java.net.URI;
import org.junit.*;
import org.junit.Test;
import com.funnelback.filter.api.*;
import com.funnelback.filter.api.documents.*;
import com.funnelback.filter.api.filters.*;
import com.funnelback.filter.api.mock.*;

//This annotation provides a logger under the "log" name
@groovy.util.logging.Log4j2
public class ExampleGroovyFilter implements StringDocumentFilter {

    /*
     * The result of this determines if the filter is run. In this example the filter is 
     * only run if the document type, derived from the Content-Type returned by the web
     * server is HTML. If the document is not HTML the filter will be skipped.
     */
    @Override
    public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
        if(document.getDocumentType().isHTML()) {
            return PreFilterCheck.ATTEMPT_FILTER;
        }
        return PreFilterCheck.SKIP_FILTER;
    }

    /*
     * This contains the logic of the filtering. The first line uses the logger to log the URL
     * of the document we are filtering. After that we prefix the document's content with
     * 'Example: ' and create a new document with that content. We then return the new 
     * filtered document.
     */
    @Override
    public FilterResult filterAsStringDocument(StringDocument document, FilterContext context) {
        //Log what document we are filtering
        log.info("Filtering document: " + document.getURI());

        //Prepend Example to the document content.
        String newContent = "Example: " + document.getContentAsString();

        //Create a clone of the existing document with the new content.
        StringDocument newDocument = document.cloneWithStringContent(document.getDocumentType(), newContent);

        //Return the new document we created with the new content.
        return FilterResult.of(newDocument);
    }

    /*
     * This inner class contains tests for the filter.
     * 
     * Methods in this class annotated with @Test will be run by main.
     */
    public static class FilterTest {


        @Test
        public void exampleTest() throws Exception {
            //This creates the dummy input document. The input document has the URI set
            //to 'http://foo.com/', the document type is set to HTML and the content
            //of the document is set to 'hello'
            StringDocument inputDoc = MockDocuments.mockEmptyStringDoc()
                .cloneWithURI(new URI("http://foo.com/"))
                .cloneWithStringContent(DocumentType.MIME_HTML_TEXT, "hello");

            //This creates an instance of the filter and runs it with the input document we created earlier
            //Ignore the MockFilterContext for now.
            FilterResult filterResult = new ExampleGroovyFilter().filter(inputDoc, MockFilterContext.getEmptyContext());

            //Get the resulting document.
            //As filters can return zero, one or more documents we must get the resulting filtered
            //document from the list of filtered documents. Here we assume the list will contain one
            //document.
            StringDocument filteredDocument = (StringDocument) filterResult.getFilteredDocuments().get(0);

            //Finally we check that the filter has modified the content of the document using a JUnit
            //assert statement.
            Assert.assertEquals("'Example:' should be prepended to the document content.", 
                "Example: hello", filteredDocument.getContentAsString());
        }
    }

    //Running the main method will execute the test methods.
    public static void main(String[] args) throws Exception {
        FilterTestRunner.runTests(FilterTest.class);
    }
}

Running filter tests on the command line

Before adding your custom filter to your collection's filter chain, you should always check that the tests pass. To run the tests for our Groovy filter, run:

$SEARCH_HOME/linbin/java/bin/java -cp $SEARCH_HOME/lib/java/all/*:$SEARCH_HOME/tools/groovy/bin/groovy groovy.ui.GroovyMain $SEARCH_HOME/conf/<collection>/@groovy/com/myfilters/ExampleGroovyFilter.groovy

The output should show that the single test passes.

Adding the filter to the filter chain

To use a Groovy filtering script, you must first modify your collection's filter.classes setting to include the script in the list of filters to use for the collection. The default setting is currently...

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider

...and to allow our new filter to have the last word (i.e. be run after the document fixer) we should change this to...

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:com.myfilters.ExampleGroovyFilter

Modifying the filter

The example above is a solid starting point. Most filters will need to change two methods. The first is canFilter(), which decides whether the filter should run on a document based on its document type, URI or metadata. Document content should never be inspected in this method, to avoid expensive copy operations. The second is filterAsStringDocument(), which holds the filtering logic. In our example we modify the document content, but a filter can also modify the URI, metadata, document type and, in some filters, the charset. Filters can even split a document into multiple documents or remove documents entirely. See the filter examples for a list of examples demonstrating the different features the filter framework provides.
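
As a minimal sketch of a canFilter() decision that looks only at the URI and document type, the logic can be kept in a small helper that is easy to test on its own. The shouldFilter helper and the /news/ path prefix are assumptions made for this illustration; only the API names shown in the comment come from the example above.

```groovy
import java.net.URI

// Hypothetical helper: decides from the URI and document type alone,
// without ever touching the document content.
boolean shouldFilter(URI uri, boolean isHtml) {
    String path = uri.getPath()
    // Assumed rule for this sketch: only HTML documents under /news/ are filtered.
    return isHtml && path != null && path.startsWith("/news/")
}

// Inside a filter, canFilter() could delegate to the helper, for example:
//   return shouldFilter(document.getURI(), document.getDocumentType().isHTML())
//       ? PreFilterCheck.ATTEMPT_FILTER
//       : PreFilterCheck.SKIP_FILTER;

assert shouldFilter(new URI("http://foo.com/news/a.html"), true)
assert !shouldFilter(new URI("http://foo.com/about.html"), true)
```

Keeping the rule in a plain method means it can be covered by tests that need no mock documents.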

It is always best to write tests which exercise all methods in your filter. For simplicity we have included the tests within the filter itself; this is not required and may not suit your development environment.

Javadoc

Although the filter examples cover most of what filters can do, reference documentation is also provided; see the Filter Javadoc.

Logging

When developing and debugging a filter it's usually convenient to be able to write debug messages. To have the debug messages written to the collection filter.log (or to the inline filter log, depending on your collection settings) you should use the Log4j logging framework. To do so:

  • Annotate your class with @groovy.util.logging.Log4j2
  • Use the log object this annotation provides in your methods to output debug messages: log.info("Filtering content: [{}]", content)

Note: By default, only log messages at the INFO level or above are output. That means that your messages will only appear if:

  • You use log.info(), log.warn(), log.error() or log.fatal()
  • You re-configure the logging system for your specific namespace, if you want to use log.debug() or log.trace().

The logging system can be configured using $SEARCH_HOME/conf/log4j2.xml.default as a starting point:

  • Either copy it to $SEARCH_HOME/conf/log4j2.xml to have your configuration apply to all collections
  • Or copy it to $SEARCH_HOME/conf/<collection>/log4j2.xml to apply it to this collection only.

Caveats

Incremental crawls

During an incremental crawl, only the documents that are re-crawled go through the filter phase. Any document that has not changed will not be re-filtered. Custom filters need to account for this when the filter generates data that must apply to the whole index, such as a gscopes.cfg file. The usual pattern in that situation is to re-read the generated data from the previous crawl, update only the records/URLs that are actually filtered, and write the file back containing the previous and modified data.
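
The read, update and write-back pattern could look roughly like the sketch below. The mergeGenerated helper and its tab-separated "URL, value" line format are assumptions made for this illustration; the format is not that of gscopes.cfg, and the sketch does not guard against several threads or processes writing the file at the same time.

```groovy
// Hypothetical helper: merges records produced in this crawl into the data
// kept from the previous crawl, then writes the combined result back.
Map<String, String> mergeGenerated(File dataFile, Map<String, String> updates) {
    Map<String, String> merged = [:]   // a LinkedHashMap, so line order is kept

    // 1. Re-read the data generated by the previous crawl, if any.
    if (dataFile.exists()) {
        dataFile.eachLine { String line ->
            int sep = line.indexOf('\t')
            if (sep > 0) {
                merged.put(line.substring(0, sep), line.substring(sep + 1))
            }
        }
    }

    // 2. Overwrite or add only the records for documents filtered this time.
    merged.putAll(updates)

    // 3. Write back the previous and modified data together.
    dataFile.text = merged.collect { key, value -> "${key}\t${value}" }.join('\n') + '\n'
    return merged
}
```

Records for URLs that were not re-crawled are carried over unchanged, so the generated file still describes the whole index after an incremental update.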

Filter examples

See Filter examples for a list of example filters which explore the features provided by the filter framework.
