Gather and index content from an unsupported data source

There are three supported methods of extending Funnelback to index content from an unsupported data source.

  • Custom gatherer plugin: Java code that connects to and gathers content from the unsupported data source.

  • Squiz Connect: Uses a Squiz Connect recipe to gather content from the unsupported system and submit it to a push index.

  • Push API: Enables content to be submitted directly from the unsupported system to a push index.

Implementing a custom gatherer plugin

A custom gatherer implements the gathering logic for a custom data source. It can be used to gather content using custom logic such as interacting with an API or using Java libraries.

A custom gatherer is created by writing Java code that implements a custom gatherer plugin.

What does a custom gatherer do?

At a basic level, a custom gatherer is responsible for performing the following functions:

  • Connecting to the repository that houses the data that you wish to index (including any authentication requirements).

  • Requesting and fetching the raw data that you wish to index.

  • Scanning this data for security risks, if the gathered data is in the form of a file (e.g. obtained via SFTP, FTP or file download).

  • Storing this data along with a unique URI.
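
The four responsibilities above can be sketched in plain Java. This is a minimal, self-contained illustration that does not use the plugin framework: Fetcher, ContentScanner and DocumentStore are hypothetical stand-ins for the data source client, the file scanner and the plugin store, and skipping unsafe content is only one possible policy.

```java
import java.net.URI;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the plugin framework types.
class GatherFlowSketch {

    interface Fetcher { byte[] fetch(URI uri); }                     // connect and fetch raw data
    interface ContentScanner { boolean isSafe(byte[] content); }     // security scan
    interface DocumentStore { void store(URI uri, byte[] content); } // store against a unique URI

    /** Fetches each URI, scans the content and stores only content that passes the scan. */
    static int gather(List<URI> uris, Fetcher fetcher, ContentScanner scanner, DocumentStore store) {
        int stored = 0;
        for (URI uri : uris) {
            byte[] content = fetcher.fetch(uri);
            if (!scanner.isSafe(content)) {
                // Assumption: unsafe content is skipped. A gatherer could instead fail the update.
                continue;
            }
            store.store(uri, content);
            stored++;
        }
        return stored;
    }
}
```

Passing the fetcher, scanner and store in as parameters keeps the gather loop free of connection details, which also makes it straightforward to replace each piece with a mock in tests.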

A custom gatherer should also consider and implement the following:

  • Provide indexer configuration for setting up default metadata mappings.

  • Provide faceted navigation configuration for any useful default facets.

See: Gather and index update flows for background information on how the update process works.

A custom gatherer should not perform the following operations:

  • Data manipulation or conversion (this should be handled using a filter plugin), unless the manipulation is so specific to the custom gatherer that it is never going to be useful as a standalone filter for other searches.

Custom gatherer design

  • Consider what information constitutes a search result, and investigate the available APIs or methods to see what steps are required to reach the result-level data. This often involves several calls to get the required data. It may also involve writing code to follow paginated API responses.

  • Start by implementing basic functionality in your custom gatherer and then iteratively enhance this.
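
Following paginated API responses, as mentioned above, usually comes down to a loop that keeps requesting pages until the API reports that there are no more. The sketch below is a minimal illustration under assumed conditions: the Page class, the token-based paging scheme and the fetchPage function are all hypothetical, and a real API may page by offset, cursor or link header instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class PaginationSketch {

    /** One page of a hypothetical API response: its items and the token of the next page (null when done). */
    static final class Page {
        final List<String> items;
        final String nextToken;

        Page(List<String> items, String nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    /**
     * Collects items from every page, starting at firstToken.
     * maxPages is a safety limit in case the API never stops returning a next token.
     */
    static List<String> fetchAll(Function<String, Page> fetchPage, String firstToken, int maxPages) {
        List<String> all = new ArrayList<>();
        String token = firstToken;
        for (int i = 0; token != null && i < maxPages; i++) {
            Page page = fetchPage.apply(token);
            all.addAll(page.items);
            token = page.nextToken;
        }
        return all;
    }
}
```

Starting with a loop like this against a single endpoint, and adding further calls per item later, fits the iterative approach suggested above.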

Custom gatherer should only implement gather logic

When writing a custom gatherer ensure that it only implements the logic required to gather the data from the data source.

If your plugin requires any data transformation, this can be implemented using one of the plugin filter interfaces.

This should be encapsulated in the same plugin if the transformation is very specific to the data source that your plugin is integrating with.

However, if the transformation work is generic in nature (such as modifying metadata values) and there isn’t a pre-existing plugin that performs the transformation, then you should move it into its own plugin so that it can be reused as a standalone plugin for other searches.

Custom gatherer implementation

A custom gatherer plugin is developed by writing Java code that implements the PluginGatherer interface of the plugin framework.

Plugin scopes

The plugin scope for a plugin that implements custom gathering must include the runs on datasource scope.

Maven archetype template options

A custom gatherer plugin template can be generated using the Maven archetype template.

  • When using the interactive mode of Maven archetype template:

    • type true for Define value for property 'gatherer'

    • type true for Define value for property 'runs-on-datasource'

  • When using the non-interactive mode of Maven archetype template:

    • set the flag -Dgatherer=true

    • set the flag -Druns-on-datasource=true

When creating a custom gatherer it is likely you will want to enable indexing templates to store metadata mappings.

Interface methods

The PluginGatherer interface must be implemented for a plugin to provide gathering functionality.

The PluginGatherer interface requires one of the following two methods to be implemented. Which one to choose depends on the type of data being gathered and the trustworthiness of that data.

If the data gathered is in the form of a file e.g. SFTP or file download, or the data is gathered using an untrustworthy API, then the following method should be implemented to ensure the virus scanner is called.

void gather(PluginGatherContext pluginGatherContext, PluginStore store, FileScanner fileScanner) throws Exception;

Where:

  • pluginGatherContext provides access to SEARCH_HOME and data source configuration settings.

  • store defines where the gathered documents are stored.

  • fileScanner can be used to scan the processed file(s), for example to check for viruses if the Virus Scanning plugin is enabled.

If the data is gathered using a trustworthy API the following method should be implemented. This method does not apply the virus scanner.

void gather(PluginGatherContext pluginGatherContext, PluginStore store) throws Exception;

Where:

  • pluginGatherContext provides access to SEARCH_HOME and data source configuration settings.

  • store defines where the gathered documents are stored.

Configuring a custom gatherer

Custom gatherer plugins can be configured by:

  • setting custom configuration keys in the data source configuration.

  • defining a custom configuration file for use with the plugin.

Accessing configuration from a custom gatherer

Configuration keys and files are accessed via the PluginGatherContext object available within the custom gatherer plugin’s gather() method.

The following code fragment reads a configuration key CALENDAR_ID from configuration and performs some validation on the key.

@Override
public void gather(PluginGatherContext pluginGatherContext, PluginStore store) throws Exception {
    ...

    // Read a required String configuration key
    String calendarId = Optional.ofNullable(pluginGatherContext.getConfigSetting(pluginUtils.CALENDAR_ID.getKey()))
        .orElseThrow(() -> new IllegalArgumentException(
            "Plugin configuration required parameter missing - " + pluginUtils.CALENDAR_ID.getKey()));
    if (calendarId.isBlank()) {
        throw new IllegalArgumentException(
            GoogleCalendarPluginGatherer.PLUGIN_CONFIGURATION_ERROR + pluginUtils.CALENDAR_ID.getKey() + " value is required");
    }

    // Read an integer configuration key, defaulting to -1 when the key is not set
    int maxPastDay;
    try {
        maxPastDay = Integer.parseInt(
            Optional.ofNullable(pluginGatherContext.getConfigSetting(pluginUtils.FROM_DAYS_BACK.getKey()))
                .orElse("-1"));
    } catch (NumberFormatException ex) {
        throw new IllegalArgumentException(
            GoogleCalendarPluginGatherer.PLUGIN_CONFIGURATION_ERROR + pluginUtils.FROM_DAYS_BACK.getKey() + " must be an integer");
    }
    ...
}
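
When a gatherer reads several keys, the validation shown above can be factored into small helpers. The following is a minimal sketch, not part of the plugin framework: ConfigReaderSketch and its method names are hypothetical, and the lookup function stands in for whatever returns the raw setting value (null when the key is unset). Unlike the fragment above, this sketch also treats a blank optional value as unset.

```java
import java.util.function.Function;

// Hypothetical helper class for illustration; not part of the plugin framework.
class ConfigReaderSketch {

    /** Returns a required setting, trimmed, or throws if it is missing or blank. */
    static String requireString(Function<String, String> lookup, String key) {
        String value = lookup.apply(key);
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("Required configuration key is missing or blank: " + key);
        }
        return value.trim();
    }

    /** Returns an optional integer setting, or defaultValue when the key is unset or blank. */
    static int optionalInt(Function<String, String> lookup, String key, int defaultValue) {
        String value = lookup.apply(key);
        if (value == null || value.isBlank()) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException ex) {
            throw new IllegalArgumentException(key + " must be an integer, but was: " + value, ex);
        }
    }
}
```

Inside a gather() method, the lookup could be supplied as pluginGatherContext::getConfigSetting, so that each key is read and validated in a single line.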

Usage

A plugin’s gatherer is used by creating a custom data source that enables the custom gathering plugin.

Example: MockGatherer plugin

The MockGatherer plugin implements a PluginGatherer that fetches a number of URLs via HTTP and stores the documents.

MockGatherer plugin source code: MockGatheringPluginGatherer.java
package com.example.mockgathering;

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Stack;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import com.example.mockgathering.remotefetch.RealRemoteFetcher;
import com.example.mockgathering.remotefetch.RemoteFetcher;
import com.funnelback.plugin.gatherer.FileScanner;
import com.funnelback.plugin.gatherer.PluginGatherContext;
import com.funnelback.plugin.gatherer.PluginGatherer;
import com.funnelback.plugin.gatherer.PluginStore;
import com.google.common.collect.ArrayListMultimap;

/**
 * The gatherer uses the object in the field "remoteFetcher" to fetch
 * documents. The test for this class shows how to replace "remoteFetcher"
 * with our own mock class, which does not make real HTTP requests.
 */
public class MockGatheringPluginGatherer implements PluginGatherer {

    private static final Logger log = LogManager.getLogger(MockGatheringPluginGatherer.class);

    // By default we use the real fetcher; under test we will change this.
    public RemoteFetcher remoteFetcher = new RealRemoteFetcher();

    @Override
    public void gather(PluginGatherContext pluginGatherContext, PluginStore store, FileScanner fileScanner) throws Exception {

        // Get the security-token from config, this will be supplied in each request.
        Map<String, String> headers = new HashMap<>();
        headers.put("security-token", pluginGatherContext.getConfigSetting(PluginUtils.KEY_PREFIX + "security-token"));

        Stack<URI> urlsToFetch = new Stack<>();

        // Get the start URLs, which are supplied as a space-separated list.
        Arrays.stream(pluginGatherContext.getConfigSetting(PluginUtils.KEY_PREFIX + "start-urls").split(" "))
            .filter(s -> !s.isBlank()) // remove blank/empty strings
            .map(URI::create) // convert each string to a URI
            .forEach(urlsToFetch::add); // add each URI to the urlsToFetch stack

        while(!urlsToFetch.isEmpty()) {
            ArrayListMultimap<String, String> metadata = ArrayListMultimap.create();

            URI url = urlsToFetch.pop();
            String fetchedContent = remoteFetcher.get(url, headers);
            // Scan the document for viruses before storing it
            if (fileScanner.checkbytes(fetchedContent.getBytes(StandardCharsets.UTF_8))) {
                // Store the document
                store.store(
                        url,
                        fetchedContent.getBytes(StandardCharsets.UTF_8),
                        metadata
                );

                List<URI> moreUrlsToFetch = new FetchedResultParser().parseFetchedContent(url, fetchedContent, store);
                urlsToFetch.addAll(moreUrlsToFetch);
            } else {
                throw new RuntimeException("File " + url + " has failed virus scanning");
            }
        }
        log.debug("Finished gathering documents");
    }
}
MockGatherer plugin test code: MockGatheringPluginGathererTest.java
package com.example.mockgathering;

import java.util.HashMap;
import java.util.Map;

import org.junit.Assert;
import org.junit.Test;

import java.net.URI;

import com.example.mockgathering.remotefetch.RemoteFetcher;
import com.funnelback.plugin.gatherer.mock.MockPluginGatherContext;
import com.funnelback.plugin.gatherer.mock.MockPluginStore;
import com.funnelback.plugin.gatherer.mock.MockFileScanner;

public class MockGatheringPluginGathererTest {

    /**
     * This is our mock remote fetcher, this is what we will use to stop making
     * HTTP requests in our test.
     *
     * This could have been in its own file like a normal class.
     */
    public static class MockRemoteFetcher implements RemoteFetcher {

        /**
         * When a URI is requested we will look for it in this map.
         */
        public Map<URI, String> uriToResult = new HashMap<>();

        @Override
        public String get(URI url, Map<String, String> headers) {
            // In our mock fetcher we ensure that the security-token is set.
            Assert.assertEquals("The security token was not set.", "password2", headers.get("security-token"));

            if(uriToResult.containsKey(url)) {
                return uriToResult.get(url);
            }

            throw new RuntimeException("404 could not find: " + url);
        }
    }

    /**
     * This shows how we mock making real web requests.
     */
    @Test
    public void testCustomGatherPlugin() throws Exception {
        MockPluginGatherContext mockContext = new MockPluginGatherContext();
        MockPluginStore mockStore = new MockPluginStore();
        MockFileScanner mockFileScanner = new MockFileScanner();

        // Configure the required config settings for the test:
        mockContext.setConfigSetting("plugin.mock-gathering.security-token", "password2");
        mockContext.setConfigSetting("plugin.mock-gathering.start-urls", "http://example.com/1");

        // Here we also create our mock RemoteFetcher, this way we don't make HTTP requests in our test.
        MockRemoteFetcher mockRemoteFetcher = new MockRemoteFetcher();

        // Let's add some data to our remote fetcher:
        mockRemoteFetcher.uriToResult.put(URI.create("http://example.com/1"), "{\n" +
            "   \"content\":\"hello1\",\n" +
            "   \"next\":\"http://example.com/2\"\n" +
            "}");

        mockRemoteFetcher.uriToResult.put(URI.create("http://example.com/2"), "{\n" +
            "   \"content\":\"hello2\",\n" +
            "   \"next\": null\n" +
            "}");

        // Set file scanner to return valid file
        mockFileScanner.setCheckResults(true);

        MockGatheringPluginGatherer underTest = new MockGatheringPluginGatherer();

        // It is important to tell our gatherer which RemoteFetcher to use;
        // here we use our mock RemoteFetcher.
        underTest.remoteFetcher = mockRemoteFetcher;

        // Now run it. It will use our mock remote fetcher, so we can test it without needing a real server.
        underTest.gather(mockContext, mockStore, mockFileScanner);

        Assert.assertEquals("Check how many documents were stored.", 2, mockStore.getStored().size());
    }
}

Example: Custom gatherer plugin

The custom gatherer plugin generates a set of documents from data source configuration settings.

Custom gatherer plugin source code: CustomGathererPluginGatherer.java
package com.example.customgatherer;

import com.funnelback.plugin.gatherer.FileScanner;
import com.funnelback.plugin.gatherer.PluginGatherContext;
import com.funnelback.plugin.gatherer.PluginGatherer;
import com.funnelback.plugin.gatherer.PluginStore;
import com.google.common.collect.ArrayListMultimap;

import java.net.URI;
import java.nio.charset.StandardCharsets;

/**
* Demonstrates using a plugin to create documents in gathering phase.
* This plugin reads from `collection.cfg`, creates documents and adds some metadata to those documents.
*/
public class CustomGathererPluginGatherer implements PluginGatherer {

    @Override
    public void gather(PluginGatherContext pluginGatherContext, PluginStore store, FileScanner fileScanner) throws Exception {

        // Read from collection.cfg
        int docsToMake = Integer.parseInt(pluginGatherContext.getConfigSetting(PluginUtils.KEY_PREFIX + "number-of-documents-to-make"));
        String documentUrl = pluginGatherContext.getConfigSetting(PluginUtils.KEY_PREFIX + "document-url");

        for (int i = 0; i < docsToMake; i++) {
            ArrayListMultimap<String, String> metadata = ArrayListMultimap.create();
            // Add metadata
            metadata.put("Content-Type", "text/html; charset=UTF-8");
            metadata.put("total-docs", String.valueOf(docsToMake));
            metadata.put("this-doc-number", String.valueOf(i));
            // Store the documents
            store.store(
                new URI(documentUrl + i),
                "Hello world!".getBytes(StandardCharsets.UTF_8),
                metadata
            );
        }
    }
}
Custom gatherer plugin test code: CustomGathererPluginGathererTest.java
package com.example.customgatherer;

import com.funnelback.plugin.gatherer.mock.MockPluginGatherContext;
import com.funnelback.plugin.gatherer.mock.MockPluginStore;
import com.funnelback.plugin.gatherer.mock.MockFileScanner;
import org.junit.Assert;
import org.junit.Test;

import java.util.List;

public class CustomGathererPluginGathererTest {

    @Test
    public void testCustomGatherPlugin() throws Exception {

        MockPluginGatherContext mockContext = new MockPluginGatherContext();
        MockPluginStore mockStore = new MockPluginStore();
        MockFileScanner mockFileScanner = new MockFileScanner();

        CustomGathererPluginGatherer underTest = new CustomGathererPluginGatherer();

        // Set the collection.cfg settings
        mockContext.setConfigSetting(PluginUtils.KEY_PREFIX + "number-of-documents-to-make", "2");
        mockContext.setConfigSetting(PluginUtils.KEY_PREFIX + "document-url", "http://www.example.com/");

        // As the plugin gatherer is likely to interact with an external system you may need
        // to mock those interactions out. Until that is done you can still use this test to
        // try out your gatherer locally.
        underTest.gather(mockContext, mockStore, mockFileScanner);

        // This list holds the documents that the plugin stored.
        List<MockPluginStore.MockPluginStoreResult> resultList = mockStore.getStored();

        // Now check each of the results ensuring that the correct values were set.
        Assert.assertEquals("Check how many documents were stored.", 2, resultList.size());
        Assert.assertEquals("Document URL should be taken from the collection.cfg","http://www.example.com/" + 1, resultList.get(1).getUri().toString());
        Assert.assertEquals("Document should contain some content","Hello world!", new String(resultList.get(1).getContent()));
        Assert.assertEquals("Content-Type should be set properly","text/html; charset=UTF-8", resultList.get(1).getMetadata().get("Content-Type").get(0));
        Assert.assertEquals("total-docs metadata should be set to 2","2", resultList.get(1).getMetadata().get("total-docs").get(0));
        Assert.assertEquals("this-doc-number metadata should be set to 1",String.valueOf(1), resultList.get(1).getMetadata().get("this-doc-number").get(0));
    }
}

Logging

Log messages from the gather method will appear in the data source’s gatherer logs.