Custom gatherer (custom_gather.groovy) development
- Custom gatherer design
- custom_gather.groovy
- Troubleshooting
This feature is not available to users of the Squiz Experience Cloud version of Funnelback. Equivalent functionality is available using plugins.
A custom gatherer implements the gathering logic for a custom collection. It can be used to gather content from an arbitrary data source using custom logic, such as interacting with an API or using an SDK or Java libraries.
A custom gatherer is implemented in one of the following ways:
- Writing Java code that implements the PluginGatherer interface of a plugin. Writing a plugin is the recommended method of implementing a custom gatherer.
- Writing Groovy code to directly implement a custom_gather.groovy configuration file for a custom collection.
Custom gatherer design
- Consider what a search result should be and investigate the available APIs to see what steps will be required to get to the result-level data. This will often involve several calls to get the required data. It may also involve writing code to follow paginated API responses (see the sketch after this list).
- Consider what type of store is suitable for the custom gatherer. Custom gatherers can operate in a similar manner to a normal collection, storing the data in a WARC file, or can be configured to use a push collection to store the data. A push collection is suitable for collections that will be incrementally updated, whereas the standard WARC store is better if you fully update the data each time, as you have an offline and live view of the data.
- If working with a push collection, consider how the initial gather will be done and how this might differ from the incremental updates. When working with push collections it is vital to think about how items will be updated and removed.
- Start by implementing basic functionality in your custom gatherer and then iteratively enhance it.
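The following is a minimal sketch of following paginated API responses. The API URL and the items/nextPage fields are illustrative assumptions, not part of any real API.
// Minimal sketch of following paginated API responses.
// The URL and the "items"/"nextPage" fields are illustrative assumptions.
import groovy.json.JsonSlurper;

def slurper = new JsonSlurper();
def nextPage = "https://api.example.com/records?page=1"; // illustrative starting URL
while (nextPage != null) {
    def response = slurper.parse(new URL(nextPage));
    response.items.each { item ->
        // Convert each item to a record and store it (see the template below).
    }
    nextPage = response.nextPage; // null when there are no more pages
}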
custom_gather.groovy
The custom_gather.groovy file allows a custom collection’s gathering logic to be implemented. The following example provides a basic template for new custom_gather.groovy files.
Please note that the libraries used in this example are subject to change between versions of Funnelback, so some effort may be required to upgrade custom_gather.groovy scripts between Funnelback versions. The libraries available from within this script are the ones in $SEARCH_HOME/lib/java/all/.
import com.funnelback.common.*;
import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.bytes.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;
import com.google.common.collect.*;
import java.io.File;
// Create a configuration object to read collection.cfg
Config config = new NoOptionsConfig(new File(args[0]), args[1]);
// Create a Store instance to store gathered data
def store = new RawBytesStoreFactory(config)
    .withFilteringEnabled(true) // enable filtering.
    .newStore();
store.open();
try {
    // Loop here to fetch and store each desired record
    def record = new RawBytesRecord(
        "<html><p>Hello, world</p></html>".getBytes("UTF-8"), // Convert the content to UTF-8 bytes
        "http://example.com/"); // set the URI to store the document as.
    def metadata = ArrayListMultimap.create();
    // Set the correct Content-Type of the record.
    metadata.put("Content-Type", "text/html; charset=UTF-8");
    store.add(record, metadata);
} finally {
    // close() required for the store to be flushed
    store.close();
}
custom_gather.groovy should expect to be called with two command line arguments: first, the location of the Funnelback installation (e.g. /opt/funnelback) and second, the name of the collection for which gathering should occur.
As shown in the example, a store object representing the offline data directory can be created and new records added to it. This would normally be implemented within a custom loop which gathers new records from the appropriate source repository.
Reading configuration settings
The config object created in the example above represents both the collection’s configuration and the global Funnelback configuration. The config.value(settingName, defaultValue) call will return the value of settingName from collection.cfg or global.cfg, allowing the gathering process to be configured for the specific collection.
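For example, a minimal sketch of reading settings (the custom.* setting names here are illustrative, not standard Funnelback options):
// Minimal sketch: read illustrative custom.* settings from collection.cfg,
// falling back to defaults when they are not set.
def apiUrl = config.value("custom.api-url", "https://api.example.com/");
def pageSize = Integer.parseInt(config.value("custom.page-size", "100"));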
Halting gathering
The config.isUpdateStopped() call will report whether the user has requested that Funnelback stop the currently running update. It is good practice to monitor this value regularly during gathering and to gracefully stop the gathering process if the user requests it. If this value is not monitored and handled appropriately the custom_gather.groovy script will continue uninterrupted and the update will instead be halted after the script completes.
e.g. Check every 100 iterations and exit if a stop is requested, otherwise just print out a status update indicating how many records have been processed.
if ((i % 100) == 0) {
    // check to see if the update has been stopped by a user
    if (config.isUpdateStopped()) {
        store.close()
        throw new RuntimeException("Update stop requested by user.");
    }
    // Print status message
    config.setProgressMessage("Processed "+i+" records");
}
Preserving old content
In some cases it may be useful to have a custom collection which always begins with the content from the previous successful update (rather than always gathering everything from scratch). The easiest way to achieve this is to copy all the content from the live view to the offline view at the beginning of the script. Example code for doing so is included below.
// Paths to this collection's offline and live data directories
def offlineData = new File(args[0], "data" + File.separator + args[1] + File.separator + "offline" + File.separator + "data");
def liveData = new File(args[0], "data" + File.separator + args[1] + File.separator + "live" + File.separator + "data");
// Copy the content from the previous successful update into the offline view
org.apache.commons.io.FileUtils.copyDirectory(liveData, offlineData);
Administration interface update status messages
The message that is displayed in the administration interface can be updated from a custom gatherer by calling methods on the configuration object that is created when the gatherer starts:
def config = new NoOptionsConfig(new File(args[0]), args[1]);
Set the update status
The update status message is set using the public void setProgressMessage(String message) method. e.g.
// Update the progress message for every 100 records processed
if ((i % 100) == 0) {
    config.setProgressMessage("Processed "+i+" records");
}
Document filtering
Custom collections can use the filter framework to convert and manipulate documents before they are stored.
When working with a push collection store, recommended practice is to follow the process below and filter within the custom collection (rather than specifying a filter parameter when pushing content).
In order to use the filter framework:
- Ensure that withFilteringEnabled(true) is specified when creating the store:
// Create a Store instance to store gathered data
def store = new RawBytesStoreFactory(config)
    .withFilteringEnabled(true) // enable filtering.
    .newStore();
- Set the content type value and charset in the metadata. This is used by the filter framework to make decisions on whether or not to apply a filter:
def metadata = ArrayListMultimap.create();
// Set the Content-Type with a charset for the item.
metadata.put("Content-Type", "application/json; charset=utf-8");
- Configure the filters to run by setting filter.classes in collection.cfg, e.g. to apply custom filtering to the JSON, then the JSON to XML filter, to the stored content:
filter.classes=SomeCustomJSONFilter:JSONToXML
Using external dependencies
External dependencies can be imported via Grapes/Grab as for the filter framework. See: Importing external dependencies for use with Groovy scripts
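For example, a minimal sketch using Groovy’s @Grab annotation (the Apache Commons Lang coordinates are just an illustrative dependency):
// Minimal sketch: pull in an external library with Grape's @Grab annotation.
// The commons-lang3 coordinates are illustrative only.
@Grab(group='org.apache.commons', module='commons-lang3', version='3.12.0')
import org.apache.commons.lang3.StringUtils;

assert StringUtils.isBlank("   ");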
Using additional/custom jar files
If the jar file is available via the Grapes/Grab framework then use this to import the dependency (see above).
The contents of additional jar files can be added to the collection’s @groovy folder.
The jar files must be unpacked into:
$SEARCH_HOME/conf/COLLECTION/@groovy
They should then be automatically detected within the class path.
To unpack the .jar file, rename it to .zip and then unzip the file into the @groovy folder.
Implementing reusable logic
Groovy classes to be used across an installation of Funnelback can be placed in the $SEARCH_HOME/lib/java/groovy directory. Any classes within that directory will be available to be imported into the custom_gather.groovy script and used from within it.
Recommended practice for custom_gather.groovy scripts is to keep the script itself as small/simple as possible, creating separate reusable classes as described above to perform the main gathering tasks.
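For example, a minimal sketch of a reusable class placed under $SEARCH_HOME/lib/java/groovy (the package, class and method names are illustrative assumptions):
// Minimal sketch of a reusable class, saved as
// $SEARCH_HOME/lib/java/groovy/com/example/gather/ExampleApiClient.groovy
// (package, class and method names are illustrative assumptions).
package com.example.gather;

class ExampleApiClient {
    // Fetch the raw content for a single record from a hypothetical API.
    String fetchRecord(String id) {
        return new URL("https://api.example.com/records/" + id).getText("UTF-8");
    }
}
Such a class can then be imported into custom_gather.groovy (e.g. import com.example.gather.ExampleApiClient) and used to perform the main gathering work.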
Storing XML content
It is possible to store XML content when using the org.w3c.dom.Document object. Here is an example of doing that:
import com.funnelback.common.*;
import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.bytes.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;
import com.google.common.collect.*;
import java.io.File;
// Create a configuration object to read collection.cfg
Config config = new NoOptionsConfig(new File(args[0]), args[1]);
// Create a Store instance to store gathered data
def store = new RawBytesStoreFactory(config)
    .withFilteringEnabled(true) // enable filtering.
    .newStore()
    .asXmlStore(); // change the store to accept XML org.w3c.dom.Document objects
store.open();
try {
    // Loop here to fetch and store each desired record
    org.w3c.dom.Document xml = XMLUtils.fromString("<doc><url>http://example.com</url><title>Example</title></doc>");
    def record = new XmlRecord(xml, "http://example.com/");
    // No need to set Content-Type as it will be set by the store.
    def metadata = ArrayListMultimap.create();
    store.add(record, metadata);
} finally {
    // close() required for the store to be flushed
    store.close();
}
Storing content in a push collection
Storage of content within a push collection requires the following options to be set in collection.cfg. No specific changes to custom_gather.groovy are required. The push collection must exist, but doesn’t have to be on the same Funnelback server.
The following settings are sufficient to store documents into a push collection on the local machine.
store.raw-bytes.class=com.funnelback.common.io.store.bytes.Push2Store
store.push.collection=<Name of the push collection>
If the push API is available on another server you may specify the location using:
store.push.url=https://SERVER:<admin port>/push-api/
By default the push service user will be used. If the remote server does not share the same server secret, the account details of a Funnelback user on the remote server must be specified. This is done with:
store.push.user=<Remote user name>
store.push.password=<Remote user password>
Several update phases should be disabled as the custom collection is only used to gather content. If these are not disabled, the collection update will fail due to no content being stored within the collection.
Add the following to the collection.cfg to disable all the update phases except for the gather phase:
index=false
report=false
archive=false
meta_dependencies=false
swap=false
When working with a push collection the gather code must account for the following:
- The code must handle any errors that are returned by the push collection (such as the service being unavailable) and appropriately retry or queue requests.
- The custom gatherer may need to delete documents, as push collections are always added to when an update is run. This can be done by calling store.delete(String key) to remove a specific document, as shown in the sketch below.
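A minimal sketch of removing documents that are no longer present in the source (removedUrls is an illustrative placeholder for whatever change tracking the gatherer performs):
// Minimal sketch, assuming the gatherer has built a list of URLs that were
// previously pushed but no longer exist in the source (removedUrls is illustrative).
removedUrls.each { url ->
    store.delete(url);
}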
See: push collections for general information on push collections.
Troubleshooting
Cache copies don’t work
For cache copies to work, collection.cfg should have:
store.record.type=RawBytesRecord
You must also be using the RawBytesStoreFactory. If you are using XmlStoreFactory you can generally replace:
def store = new XmlStoreFactory(config).newStore();
with
def store = new RawBytesStoreFactory(config)
    .withFilteringEnabled(true) // You may want to disable filtering.
    .newStore()
    .asXmlStore(); // change the store to accept XML org.w3c.dom.Document objects