Custom gatherer (custom_gather.groovy) development

This feature is not available to users of the Squiz Experience Cloud version of Funnelback. Equivalent functionality is available using plugins.

A custom gatherer implements the gathering logic for a custom collection. It can be used to gather content from an arbitrary data source using custom logic such as interacting with an API or using an SDK or Java libraries.

A custom gatherer is implemented in one of the following ways:

  • Writing Java code that implements the PluginGatherer interface of a plugin.

    Writing a plugin is the recommended method of implementing a custom gatherer.
  • Writing Groovy code to directly implement a custom_gather.groovy configuration file for a custom collection.

Custom gatherer design

  • Consider what a search result should be and investigate the available APIs to see what steps will be required to get to the result level data. This will often involve several calls to get the required data. It may also involve writing code to follow paginated API responses.

  • Consider what type of store is suitable for the custom gatherer. Custom gatherers can operate in a similar manner to a normal collection, storing the data in a WARC file, or can be configured to use a push collection to store the data. Using a push collection is suitable for collections that will be incrementally updated, whereas the more standard WARC stores are better if you fully update the data each time, as you then have both an offline and a live view of the data.

  • If working with a push collection consider how the initial gather will be done and how this might differ from the incremental updates. When working with push collections it is vital to think about how items will be updated and removed.

  • Start by implementing basic functionality in your custom gatherer and then iteratively enhance this.
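The pagination point above can be sketched as a simple loop. Note that `fetchPage()` and the response shape (`items`, `hasNext`) are hypothetical stand-ins for whatever your repository's real API provides:

    // Hypothetical paginated fetch loop: fetchPage() and the
    // items/hasNext response fields stand in for your real API.
    def page = 0
    def more = true
    while (more) {
        def response = fetchPage(page) // e.g. GET /items?page=N
        response.items.each { item ->
            // Convert each item to a record and store it here
        }
        more = response.hasNext
        page++
    }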

custom_gather.groovy

The custom_gather.groovy file allows a custom collection’s gathering logic to be implemented. The following example provides a basic template for new custom_gather.groovy files.

Please note that the libraries used in this example are subject to change between versions of Funnelback, so some effort may be required to upgrade custom_gather.groovy scripts between Funnelback versions. The libraries available from within these scripts are the ones in $SEARCH_HOME/lib/java/all/.

import com.funnelback.common.*;
import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.bytes.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;
import com.google.common.collect.*;
import java.io.File;

// Create a configuration object to read collection.cfg
Config config = new NoOptionsConfig(new File(args[0]), args[1]);

// Create a Store instance to store gathered data
def store = new RawBytesStoreFactory(config)
                .withFilteringEnabled(true) // enable filtering.
                .newStore();

store.open();
try {
    // Loop here to fetch and store each desired record

    def record = new RawBytesRecord(
            "<html><p>Hello, world</p></html>".getBytes("UTF-8"), // Convert the content to utf-8 bytes
            "http://example.com/"); // set the URI to store the document as.

    def metadata = ArrayListMultimap.create();
    // Set the correct Content-Type of the record.
    metadata.put("Content-Type", "text/html; charset=UTF-8");

    store.add(record, metadata);
} finally {
    // close() required for the store to be flushed
    store.close();
}

custom_gather.groovy should expect to be called with two command line arguments, first the location of the Funnelback installation (e.g. /opt/funnelback) and second the name of the collection for which gathering should occur.

As shown in the example, a store object representing the offline data directory can be created and new records added to it. Normally this would be done within a custom loop that gathers new records from the appropriate source repository.

Reading configuration settings

The config object created in the example above represents both the collection’s configuration and the global Funnelback configuration. The config.value(settingName, defaultValue) call will return the value of the settingName from collection.cfg or global.cfg, allowing the gathering process to be configured for the specific collection.
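For example, a gatherer might read its own settings from collection.cfg. Both the setting names and the defaults below are illustrative only:

    // Hypothetical setting names, shown for illustration only
    def apiBase  = config.value("custom.api-base-url", "https://api.example.com")
    def pageSize = Integer.parseInt(config.value("custom.page-size", "50"))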

Halting gathering

The config.isUpdateStopped() call will report whether the user has requested that Funnelback stop the currently running update. It is good practice to monitor this value regularly during gathering and to gracefully stop the gathering process if the user requests it. If this value is not monitored and handled appropriately the custom_gather.groovy script will continue uninterrupted and the update will instead be halted after the script completes.

e.g. check every 100 iterations and exit if a stop is requested; otherwise print a status update indicating how many records have been processed.

if ((i % 100) == 0)
{
    // check to see if the update has been stopped by a user
    if (config.isUpdateStopped()) {
            store.close()
            throw new RuntimeException("Update stop requested by user.");
    }
    // Print status message
    config.setProgressMessage("Processed "+i+" records");
}

Preserving old content

In some cases it may be useful to have a custom collection which always begins with the content from the previous successful update (rather than always gathering everything from scratch). The easiest way to achieve this is to copy all the content from the live view to the offline view at the beginning of the script. Example code for doing so is included below.

def offlineData = new File(args[0], "data" + File.separator + args[1] + File.separator + "offline" + File.separator + "data");
def liveData = new File(args[0], "data" + File.separator + args[1] + File.separator + "live" + File.separator + "data");
org.apache.commons.io.FileUtils.copyDirectory(liveData, offlineData);

Administration interface update status messages

The message that is displayed in the administration interface can be updated from a custom gatherer by calling functions that are part of the configuration object that is created when starting your custom gatherer:

def config = new NoOptionsConfig(new File(args[0]), args[1]);

Set the update status

The update status message is set using the public void setProgressMessage(String message) method. e.g.

// Update the progress message for every 100 records processed
if ((i % 100) == 0)
{
    config.setProgressMessage("Processed "+i+" records");
}

Get the current status

The current status message is read using the public String getProgressMessage() method. e.g.

currentState = config.getProgressMessage();

Clear the status message

The status message can be cleared using the public void clearProgressMessage() method. e.g.

config.clearProgressMessage();

Document filtering

Custom collections can use the filter framework to convert and manipulate documents before they are stored.

When working with a push collection store, recommended practice is to follow the process below and filter within the custom collection (rather than specifying a filter parameter when pushing content).

In order to use the filter framework:

  1. Ensure that the withFilteringEnabled(true) is specified when creating the store.

     // Create a Store instance to store gathered data
     def store = new RawBytesStoreFactory(config)
                     .withFilteringEnabled(true) // enable filtering.
                     .newStore();
  2. Set the content type value and charset in the metadata. This is used by the filter framework to make decisions on whether or not to apply a filter.

     def metadata = ArrayListMultimap.create();
     // Set the Content-Type with a charset for the item.
     metadata.put("Content-Type", "application/json; charset=utf-8");
  3. Configure the filters to run by setting filter.classes in collection.cfg

    e.g. to apply custom filtering to the JSON, then the JSON to XML filter, to the stored content:

     filter.classes=SomeCustomJSONFilter:JSONToXML

Using external dependencies

External dependencies can be imported via Grapes/Grab as for the filter framework. See: Importing external dependencies for use with Groovy scripts
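As a sketch, a Grab annotation at the top of custom_gather.groovy pulls a dependency from a repository at run time. The library and version shown are illustrative:

    // Fetch Apache Commons Lang at run time via Grape (version is illustrative)
    @Grab(group='org.apache.commons', module='commons-lang3', version='3.12.0')
    import org.apache.commons.lang3.StringUtils

    assert StringUtils.capitalize("hello world") == "Hello world"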

Using additional/custom jar files

If the jar file is available via the Grapes/Grab framework then use this to import the dependency (see above).

The contents of additional jar files can be added to the collection’s @groovy folder.

The jar files must be unpacked into:

$SEARCH_HOME/conf/COLLECTION/@groovy

The classes will then be automatically detected on the classpath.

To unpack a .jar file, rename it to .zip and then unzip it into the @groovy folder.
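For example, assuming a jar named my-library.jar (a hypothetical name, as is the COLLECTION placeholder), the unpacking might look like:

    # COLLECTION and my-library.jar are placeholders for your own values
    cd "$SEARCH_HOME/conf/COLLECTION/@groovy"
    cp /path/to/my-library.jar my-library.zip
    unzip -o my-library.zip
    rm my-library.zip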

Implementing reusable logic

Groovy classes to be used across an installation of Funnelback can be placed in the $SEARCH_HOME/lib/java/groovy directory. Any classes within that directory will be available to be imported into the custom_gather.groovy script and used from within it.

Recommended practice for custom_gather.groovy scripts is to keep the script itself as small/simple as possible, creating separate reusable classes as described above to perform the main gathering tasks.
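For example, a small helper class (hypothetical name, path, and endpoint) placed under $SEARCH_HOME/lib/java/groovy might look like this:

    // $SEARCH_HOME/lib/java/groovy/com/example/ExampleApiClient.groovy (hypothetical)
    package com.example

    class ExampleApiClient {
        String baseUrl

        // Fetch raw content for a single item; the endpoint is illustrative
        byte[] fetchItem(String id) {
            new URL("${baseUrl}/items/${id}").bytes
        }
    }

custom_gather.groovy can then simply import com.example.ExampleApiClient and loop over item IDs, keeping the script itself minimal.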

Storing XML content

It is possible to store XML content when using the org.w3c.dom.Document object. Here is an example of doing that:

import com.funnelback.common.*;
import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.bytes.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;
import com.google.common.collect.*;
import java.io.File;

// Create a configuration object to read collection.cfg
Config config = new NoOptionsConfig(new File(args[0]), args[1]);

// Create a Store instance to store gathered data
def store = new RawBytesStoreFactory(config)
                .withFilteringEnabled(true) // enable filtering.
                .newStore()
                .asXmlStore(); // change the store to accept XML org.w3c.dom.Document objects

store.open();
try {
    // Loop here to fetch and store each desired record

    org.w3c.dom.Document xml = XMLUtils.fromString("<doc><url>http://example.com</url><title>Example</title></doc>");
    def record = new XmlRecord(xml, "http://example.com/");

    // No need to set Content-Type as it will be set by the store.
    def metadata = ArrayListMultimap.create();

    store.add(record, metadata);
} finally {
    // close() required for the store to be flushed
    store.close();
}

Storing content in a push collection

Storage of content within a push collection requires the following options to be set in collection.cfg. No specific updates to the custom_gather.groovy are required. The push collection must exist and doesn’t have to be on the same Funnelback server.

The following settings are sufficient to store documents into a push collection on the local machine.

store.raw-bytes.class=com.funnelback.common.io.store.bytes.Push2Store
store.push.collection=<Name of the push collection>

If the push API is available on another server you may specify the location using:

store.push.url=https://SERVER:<admin port>/push-api/

By default the push service user will be used. If the remote server does not share the same server secret, the account details of a Funnelback user on the remote server must be specified instead. This is done with:

store.push.user=<Remote user name>
store.push.password=<Remote user password>

Several update phases should be disabled, as the custom collection is only used to gather content. If these are not disabled, the collection update will fail because no content is stored within the collection.

Add the following to the collection.cfg to disable all the update phases except for the gather phase:

index=false
report=false
archive=false
meta_dependencies=false
swap=false

When working with a push collection the gather code must account for the following:

  • The code must handle any errors that are returned by the push collection (such as the service being unavailable) and appropriately retry or queue requests.

  • The custom gatherer may need to delete documents, because content in a push collection persists between updates and is only ever added to when an update is run. This can be done by calling store.delete(String key) to remove a specific document.
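Deletion of removed documents can be sketched as a diff between previously stored keys and the keys seen in the current gather. Both sets (previouslyStoredUrls, currentUrls) are hypothetical and must be tracked by your own code:

    // previouslyStoredUrls and currentUrls are hypothetical sets
    // maintained by your gatherer (e.g. persisted between updates).
    previouslyStoredUrls.each { url ->
        if (!currentUrls.contains(url)) {
            store.delete(url) // remove documents no longer present at the source
        }
    }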

See: push collections for general information on push collections.

Troubleshooting

Cache copies don’t work

For cache copies to work collection.cfg should have:

store.record.type=RawBytesRecord

You must also be using the RawBytesStoreFactory. If you are using XmlStoreFactory you can generally replace:

def store = new XmlStoreFactory(config).newStore();

with

def store = new RawBytesStoreFactory(config)
                .withFilteringEnabled(true) // You may want to disable filtering.
                .newStore()
                .asXmlStore(); // change the store to accept XML org.w3c.dom.Document objects

© 2015- Squiz Pty Ltd