
Groovy Filters

Introduction

If you need to modify data before it is indexed by Funnelback, you can write a Groovy script to do so. The script will be called as part of the filtering framework.

Basic Example

Here is a simple example Groovy filtering script to serve as a starting point.

// This package declaration should match the .groovy file location
 // on disk, as well as the name used for the filter.classes parameter.
 package au.gov.example;
  
 @groovy.transform.InheritConstructors
 public class ExampleGroovyFilter extends com.funnelback.common.filter.ScriptFilterProvider {
   // We get a documentType (file extension) to decide if
   // we should filter this document.
   // This example filters all document types.
   public Boolean isDocumentFilterable(String documentType) {
     return true;
   }
  
   // The filter method performs the actual filtering work,
   // and returns the new document content.
   // This example just prefixes "Example" to all documents
   public String filter(String input, String documentType) {
     return "Example " + input;
   }
  
   // There's also an overload that receives the document URL
   // as an additional parameter.
   // Note that the two-argument filter method MUST be implemented
   // as well; it can simply delegate to this one with a null URL:
   // filter(input, documentType, null)
   public String filter(String input, String documentType, String url) {
     return "Example " + input;
   }
  
   // Utility method to unit-test the filter on an individual file
   // when developing, from the Groovy Console for example
   public static void main(String[] args) {
     // File to filter with test content
     def testFile = new File("C:\\Temp\\test-content.html")
  
     // Create new filter object, for a fake collection.
    def f = new ExampleGroovyFilter("collection-name", false)
  
    // Filter content using the main filter() method.
    // The second argument is the document type (file extension).
    def contentFiltered = f.filter(testFile.text, "html")
 
    // OPTIONAL: Filter content with a document URL
    contentFiltered = f.filter(testFile.text, "html", "http://fake.url/file.html")
  
     // print filtered content
     print contentFiltered
  }
 }

To use a groovy filtering script, you must first modify your collection's filter.classes setting to include the script in the list of filters to use for the collection. The default setting is currently...

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider

...and to allow our new filter to have the last word (i.e. be run after the document fixer) we could change this to...

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:au.gov.example.ExampleGroovyFilter

Groovy file location

The Groovy filters should be put in SEARCH_HOME/lib/java/groovy/. While you can use any folder structure, it is recommended to follow your package naming convention. That ensures the filter is loaded as expected and simplifies configuring the logging system. The file name should match the class name.

In our previous example, the filter should be placed in: SEARCH_HOME/lib/java/groovy/au/gov/example/ExampleGroovyFilter.groovy

Building Filters

The Groovy programming language is very similar to Java; in fact, a valid Java class is (perhaps with some obscure exceptions) also a valid Groovy class. Groovy does, however, provide many shortcuts that avoid some of Java's verbosity. One good example is the Groovy syntax for regular expressions.
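For instance, a match that needs Pattern and Matcher boilerplate in plain Java collapses to the =~ operator in Groovy. The sketch below is plain Java (and therefore also valid Groovy); it extracts the first h1 the same way the examples later on this page do, with the equivalent Groovy one-liner shown as a comment. The input string is illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExample {
    public static void main(String[] args) {
        String input = "<html><H1>Page heading</H1></html>";

        // Java's verbose form; in Groovy the same match is simply:
        //   def m = input =~ /(?is).*<h1>(.*?)<\/h1>.*/
        Matcher m = Pattern.compile("(?is).*<h1>(.*?)</h1>.*").matcher(input);
        if (m.matches()) {
            System.out.println(m.group(1)); // prints "Page heading"
        }
    }
}
```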

The example given above is a solid starting point; for most filters, the only sections that require changing are the specific implementations of the isDocumentFilterable() and filter() methods. The rest is likely to be boilerplate common to all filters.

Pre-defined variables

As a result of extending the ScriptFilterProvider class, your filter will have access to the following instance variables:

  • A fileBeingFiltered string containing the path of the file being filtered. It may be null if the filter is being run inline, and will also be null when you test your script from the Groovy Console instead of running a collection filter phase.
  • A config object, which allows access to collection.cfg and global.cfg settings with a call like config.value("service_name")

Note also that the script class will have access to all the existing Funnelback Java Common classes as well as any JAR files which are added to the SEARCH_HOME/lib/java/all/ directory.

Logging

When developing and debugging a filter it's usually convenient to be able to write debug messages. To have the debug messages written to the collection filter.log (or to the inline filter log, depending on your collection settings) you should use the Log4j logging framework. To do so:

  • Import the relevant package: import org.apache.log4j.Logger
  • Declare a Logger object inside your class (but outside any method): private static final Logger logger = Logger.getLogger(MyClassName.class), replacing MyClassName.class with the actual name of your class.
  • Use this object in your methods to output debug messages: logger.info("Filtering content: ["+content+"]")

Note: The default configuration outputs log messages at the INFO level or above. That means your messages will only appear if:

  • You use logger.info(), logger.warn(), logger.error() or logger.fatal()
  • You re-configure the logging system for your specific namespace, if you want to use logger.debug() or logger.trace().

The logging system can be configured using SEARCH_HOME/conf/log4j.properties.default as a starting point:

  • Either copy it to SEARCH_HOME/conf/log4j.properties to have your configuration apply to all collections
  • Or copy it to SEARCH_HOME/conf/<collection>/log4j.properties to apply it to this collection only.
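For example, to enable DEBUG output for the au.gov.example package used earlier, your copied log4j.properties might gain a namespace-level override like the fragment below. This is a hypothetical fragment: the root logger line and appender names depend on the contents of your copied default file.

```properties
# Hypothetical fragment -- merge into your copied log4j.properties.
# Leave the root logger at its existing level, e.g.:
# log4j.rootLogger=INFO, ...

# Enable DEBUG (and above) for your filter's package only
log4j.logger.au.gov.example=DEBUG
```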

Caveats

Incremental crawls

During an incremental crawl, only the documents that are re-crawled go through the filter phase; any document that has not changed will not be filtered again. Custom filters need to account for this when the filter generates data that must cover the whole index, such as a gscopes.cfg file. The usual pattern in that situation is to re-read the data generated by the previous crawl, update only the records/URLs that are actually filtered, and write the file back containing both the previous and the modified data.
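The re-read/update/write-back pattern can be sketched with plain Java collections (also valid Groovy). The file format and URLs below are illustrative, not taken from a real collection: previously generated lines are keyed by URL, and only the URLs filtered in this incremental run overwrite their old entries.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeExternalMetadata {
    // Merge newly generated per-URL lines into the data kept from the
    // previous crawl, so URLs that were not re-filtered keep their entries.
    public static Map<String, String> merge(List<String> previous, Map<String, String> updated) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (String line : previous) {
            int space = line.indexOf(' ');
            if (space > 0) {
                merged.put(line.substring(0, space), line.substring(space + 1));
            }
        }
        merged.putAll(updated); // re-filtered URLs overwrite their old entries
        return merged;
    }

    public static void main(String[] args) {
        List<String> previous = List.of(
            "http://example.com/a x:\"Old A\"",
            "http://example.com/b x:\"Old B\"");
        // Only /b was re-crawled and re-filtered in this incremental update
        Map<String, String> updated = Map.of("http://example.com/b", "x:\"New B\"");

        merge(previous, updated).forEach((url, meta) -> System.out.println(url + " " + meta));
        // prints:
        // http://example.com/a x:"Old A"
        // http://example.com/b x:"New B"
    }
}
```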

Non-inline filtering

Filtering can take place inline, i.e. during the crawl, or as a post-gather phase. When run as a post-gather phase, the document will be prepended with a <DOCHDR> block containing crawl metadata. When run inline, this metadata block is not available. Custom filters need to account for that condition, especially to ensure that the <DOCHDR> block is preserved and not transformed into HTML tags by HTML parsers such as JSoup.
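One way to keep the header safe is to split it off before parsing and re-attach it afterwards. The following plain-Java sketch (the header content shown is illustrative) separates an optional leading <DOCHDR> block from the body, so only the body is handed to an HTML parser:

```java
public class DochdrSplitter {
    // Separate an optional <DOCHDR>...</DOCHDR> crawl-metadata block
    // (present in post-gather filtering) from the document body, so an
    // HTML parser never sees, and cannot mangle, the header.
    public static String[] split(String input) {
        int end = input.indexOf("</DOCHDR>");
        if (input.startsWith("<DOCHDR>") && end >= 0) {
            int bodyStart = end + "</DOCHDR>".length();
            return new String[] { input.substring(0, bodyStart), input.substring(bodyStart) };
        }
        return new String[] { "", input }; // inline filtering: no header
    }

    public static void main(String[] args) {
        String doc = "<DOCHDR>\ncrawl metadata here\n</DOCHDR><html><body>hi</body></html>";
        String[] parts = split(doc);
        // Filter only parts[1] (the HTML), then re-attach the header:
        System.out.println(parts[0] + parts[1]);
    }
}
```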

No URL provided

In some cases the method public String filter(String input, String documentType, String url) will be called with the url parameter set to null. This can happen with gather components for which a URL is not available at gather time (e.g. on non-web collections). It will also happen when the filter is run in a post-gather phase.

Examples

Extract Metadata Groovy Filter

Below is an example filter which takes the content of the first H1 tag in the page and the base href URL, and creates an external metadata file mapping the content of the first H1 to the x metadata class.

package com.funnelback.common.filter;
  
 import com.funnelback.common.Environment;
  
 public class ExtractMetadataExample extends com.funnelback.common.filter.ScriptFilterProvider {
  
   PrintWriter output;
  
   public ExtractMetadataExample(String collectionName, boolean inlineFiltering) {
     super(collectionName, inlineFiltering);
  
     File outputFile = new File(Environment.getValidSearchHome(), "conf" + File.separator + collectionName + File.separator + "external_metadata.cfg");
     output = new PrintWriter(new FileWriter(outputFile));
   }
  
   // We filter all documents
   public Boolean isDocumentFilterable(String documentType) {
     return true;
   }
  
   // Take first h1 tag content and put it into an external metadata file
   public String filter(String input, String documentType) {
     // Look for h1 tags case insensitive and ignoring newlines
     def h1Matcher = input =~ /(?is).*<h1>(.*?)<\/h1>.*/;
  
     if (h1Matcher.matches()) {
       // Look for the document's URL
  
       // (!) in inline filtering the DOCHDR is not available. In that
       // case you need to implement a different method, instead of
       // filter(input, documentType) you can implement filter(input, documentType, url)
       def urlMatcher = input =~ /(?is).*<base href="(.*?)">.*/;
  
       if (urlMatcher.matches()) {
         String line = urlMatcher[0][1] + " x:\"" + h1Matcher[0][1] + "\"";
         output.println(line);
       }
     }
  
     // No content changes - return the input
     return input;
   }
  
   public void cleanup() {
     if (output != null) {
       output.flush();
       output.close();
     }
     super.cleanup();
   }
  
   // A main method to allow very basic testing
   public static void main(String[] args) {
     def f = new ExtractMetadataExample("dummy-collection", false);
     println(f.filter("<html><base href=\"example_url\"> \n"
       + "<h1>example metadata</H1>", ".html"));
   }
 }

You can run this script on the command line using:

$SEARCH_HOME/tools/groovy-1.8.6/bin/groovy -cp "$SEARCH_HOME/lib/java/all/*" ExtractMetadataExample.groovy

Another method you may find useful is:

import com.funnelback.common.utils.HTMLUtils;
 
String modifiedHTML = HTMLUtils.insertMetadata(String html, String metadataName, String metadataValue);

which allows you to insert a given metadata name/value pair into some HTML.
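If the Funnelback classes are not on your classpath while experimenting, the same idea can be sketched with the standard library alone. Note this is an illustrative stand-in, not the actual HTMLUtils implementation:

```java
public class MetaInserter {
    // Stdlib-only sketch: insert a <meta> element just before the
    // closing </head> tag, or prepend it if no head element is present.
    public static String insertMetadata(String html, String name, String value) {
        String meta = "<meta name=\"" + name + "\" content=\"" + value + "\">";
        int head = html.indexOf("</head>");
        if (head >= 0) {
            return html.substring(0, head) + meta + html.substring(head);
        }
        return meta + html; // no head element: prepend
    }

    public static void main(String[] args) {
        System.out.println(insertMetadata("<html><head></head><body></body></html>", "x", "example"));
        // prints <html><head><meta name="x" content="example"></head><body></body></html>
    }
}
```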

JSoup Groovy Filter

Below is an example filter which takes the content of the first H1 tag in the page, and replaces the title element with it using JSoup rather than manipulating the text directly.

JSoup is a library for parsing and manipulating HTML and supports CSS/jQuery style selectors for finding elements in the parsed HTML document, which may be substantially simpler than trying to work with regular expressions.

package com.funnelback.common.filter;
  
 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.jsoup.nodes.Element;
 import org.jsoup.select.Elements;
  
 @groovy.transform.InheritConstructors
 public class JsoupExample extends com.funnelback.common.filter.ScriptFilterProvider {
  
   // We filter all documents
   public Boolean isDocumentFilterable(String documentType) {
     return true;
   }
  
   // Take first h1 tag content and put it into the title
   public String filter(String input, String documentType) {
     Document doc = Jsoup.parse(input);
  
   Elements h1s = doc.select("h1");
   Element h1 = h1s.first();

   // Replace the title with the content of the first h1,
   // guarding against documents with no h1 or no title
   if (h1 != null && doc.select("title").first() != null) {
     doc.select("title").first().html(h1.text());
   }

   return doc.outerHtml();
   }
  
   // A main method to allow very basic testing
   public static void main(String[] args) {
     def f = new JsoupExample("dummy-collection", false);
     println(f.filter("<html>\n <head>\n  <title>bad title</title>\n </head>\n <body>\n  foo  and  \n  <h1>good title</h1> end\n </body>\n</html>", "html"));
   }
 }

You can run this script on the command line using:

$SEARCH_HOME/tools/groovy-1.8.6/bin/groovy -cp "$SEARCH_HOME/lib/java/all/*" JsoupExample.groovy
