Gather and index content from an unsupported data source
There are three supported methods of extending Funnelback to index content from an unsupported data source.
- Custom gather plugin: Java code that connects to and gathers content from the unsupported data source.
- Squiz Connect: Uses a Squiz Connect recipe to gather content from the unsupported system and submit it to a push index.
- Push API: Enables content to be submitted directly from the unsupported system to a push index.
Implementing a custom gatherer plugin
A custom gatherer implements the gathering logic for a custom data source. It can be used to gather content using custom logic such as interacting with an API or using Java libraries.
A custom gatherer is created by writing Java code that implements a custom gatherer plugin.
What does a Funnelback custom gatherer do?
At a basic level, a custom gatherer is responsible for performing the following functions:
- Connecting to the repository that houses the data that you wish to index (including any authentication requirements).
- Requesting and fetching the raw data that you wish to index.
- Storing this data along with a unique URI.
A custom gatherer should also consider and implement the following:
- Provide indexer configuration for setting up default metadata mappings.
- Provide faceted navigation configuration for any useful default facets.
See: Gather and index update flows for background information on how the update process works.
A custom gatherer should not perform the following operations:
- Data manipulation or conversion (this should be handled using a filter plugin).
Custom gatherer design
- Consider what information constitutes a search result, and investigate the available APIs or methods to see what steps will be required to get to the result level data. This will often involve several calls to get the required data. It may also involve writing code to follow paginated API responses.
- Start by implementing basic functionality in your custom gatherer and then iteratively enhance this.
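Following paginated API responses, as mentioned above, usually comes down to a loop that keeps requesting the "next" page until none is returned. The following is a minimal sketch of that loop and is not part of the Funnelback plugin framework: the Page and PaginatedFetch classes, the fetchPage function and the maxPages limit are all hypothetical names introduced for illustration, and a real API may signal the next page differently (for example with an offset or a cursor).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** One page of a hypothetical paginated API response. */
final class Page {
    final List<String> items;
    final Optional<String> nextPageUrl;

    Page(List<String> items, Optional<String> nextPageUrl) {
        this.items = items;
        this.nextPageUrl = nextPageUrl;
    }
}

final class PaginatedFetch {
    /**
     * Follows "next" links starting at startUrl and collects every item.
     * maxPages guards against an API that never stops returning a next link.
     */
    static List<String> fetchAll(String startUrl, Function<String, Page> fetchPage, int maxPages) {
        List<String> allItems = new ArrayList<>();
        Optional<String> nextUrl = Optional.of(startUrl);
        int pagesFetched = 0;
        while (nextUrl.isPresent() && pagesFetched < maxPages) {
            Page page = fetchPage.apply(nextUrl.get());
            allItems.addAll(page.items);
            nextUrl = page.nextPageUrl;
            pagesFetched++;
        }
        return allItems;
    }
}
```

In a gatherer, fetchPage would wrap the real API call, and each collected item would typically be stored as its own document rather than accumulated in memory.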
Custom gatherer implementation
A custom gatherer plugin is developed by writing Java code that implements the PluginGatherer interface of the Funnelback plugin framework.
Plugin scopes
The plugin scope for a plugin that implements custom gathering must include the runs on datasource scope.
Maven archetype template options
A custom gatherer plugin template can be generated using the Maven archetype template.
- When using the interactive mode of the Maven archetype template:
  - type true for "Define value for property 'gatherer'"
  - type true for "Define value for property 'runs-on-datasource'"
- When using the non-interactive mode of the Maven archetype template:
  - set the flag -Dgatherer=true
  - set the flag -Druns-on-datasource=true
When creating a custom gatherer it is likely you will want to enable indexing templates to store metadata mappings.
Interface methods
The PluginGatherer interface must be implemented for a plugin to provide gathering functionality.
The PluginGatherer interface has a single method:
void gather(PluginGatherContext pluginGatherContext, PluginStore store) throws Exception;
Where:
- pluginGatherContext provides access to SEARCH_HOME and data source configuration settings.
- store defines where the gathered documents are stored.
Configuring a custom gatherer
Custom gatherer plugins can be configured by:
- setting custom configuration keys in the data source configuration.
- defining a custom configuration file for use with the plugin.
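Custom configuration keys are read as strings, so a gatherer usually needs a little code to apply defaults and parse values. The sketch below shows one way to do that. The GathererSettings class and the "plugin.example-gatherer." key prefix are hypothetical names introduced for illustration. The sketch also assumes that a lookup for an unset key returns null, which should be checked against the plugin framework documentation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class GathererSettings {
    // Hypothetical key prefix; a real plugin would use its own plugin ID.
    static final String KEY_PREFIX = "plugin.example-gatherer.";

    // Looks up a raw setting by its full key (assumed to return null when unset).
    private final Function<String, String> lookup;

    GathererSettings(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    /** Returns the trimmed value for a key, or the fallback when it is missing or blank. */
    String get(String name, String fallback) {
        String value = lookup.apply(KEY_PREFIX + name);
        return (value == null || value.isBlank()) ? fallback : value.trim();
    }

    /** Parses an integer setting, failing with a message that names the offending key. */
    int getInt(String name, int fallback) {
        String value = get(name, String.valueOf(fallback));
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    KEY_PREFIX + name + " must be an integer, got: " + value, e);
        }
    }

    /** Splits a whitespace-separated setting into a list, dropping empty entries. */
    List<String> getList(String name) {
        return Arrays.stream(get(name, "").split("\\s+"))
                .filter(s -> !s.isBlank())
                .collect(Collectors.toList());
    }
}
```

Inside a gather method this could be constructed as new GathererSettings(pluginGatherContext::getConfigSetting), since the examples in this document call getConfigSetting with a string key and use the result as a string.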
Usage
A plugin’s gatherer is used by creating a custom data source that enables the custom gathering plugin.
Example: MockGatherer plugin
The MockGatherer plugin implements a PluginGatherer that fetches a number of URLs via HTTP and stores the documents.
MockGatheringPluginGatherer.java
package com.example.mockgathering;

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Stack;
import java.net.URI;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import com.example.mockgathering.remotefetch.RealRemoteFetcher;
import com.example.mockgathering.remotefetch.RemoteFetcher;
import com.funnelback.plugin.gatherer.PluginGatherContext;
import com.funnelback.plugin.gatherer.PluginGatherer;
import com.funnelback.plugin.gatherer.PluginStore;

/**
 * The gatherer here uses the object in the field "remoteFetcher" to actually
 * fetch documents. The test for this class shows how to change the
 * implementation of "remoteFetcher" so that it uses our own mock class, which
 * does not make real HTTP requests.
 */
public class MockGatheringPluginGatherer implements PluginGatherer {

    private static final Logger log = LogManager.getLogger(MockGatheringPluginGatherer.class);

    // By default we use the real fetcher; under test we will change this.
    public RemoteFetcher remoteFetcher = new RealRemoteFetcher();

    @Override
    public void gather(PluginGatherContext pluginGatherContext, PluginStore store) throws Exception {
        // Get the security-token from config; this will be supplied in each request.
        Map<String, String> headers = new HashMap<>();
        headers.put("security-token", pluginGatherContext.getConfigSetting(PluginUtils.KEY_PREFIX + "security-token"));

        Stack<URI> urlsToFetch = new Stack<>();

        // Get the start URLs, which are a space-separated list.
        Arrays.stream(pluginGatherContext.getConfigSetting(PluginUtils.KEY_PREFIX + "start-urls").split(" "))
            .filter(s -> !s.isBlank()) // remove blank/empty strings.
            .map(URI::create) // convert each string to a URI.
            .forEach(urlsToFetch::add); // push each URI onto the urlsToFetch stack.

        // Fetch each URL; the parser stores the document and returns any further URLs to fetch.
        while (!urlsToFetch.isEmpty()) {
            URI url = urlsToFetch.pop();
            String fetchedContent = remoteFetcher.get(url, headers);
            List<URI> moreUrlsToFetch = new FetchedResultParser().parseFetchedContent(url, fetchedContent, store);
            urlsToFetch.addAll(moreUrlsToFetch);
        }

        log.debug("Gathering documents");
    }
}
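The RemoteFetcher and RealRemoteFetcher types used by the gatherer are not listed in this document. The following is a guess at what they might look like, based only on how remoteFetcher.get(url, headers) is called above and on the Java 11 java.net.http client. The buildRequest helper, the status check and the declared exceptions are assumptions, and the types are package-private here only to keep the sketch in one file (the originals live in com.example.mockgathering.remotefetch).

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;

/** Abstraction over the HTTP call so that tests can substitute a mock. */
interface RemoteFetcher {
    String get(URI url, Map<String, String> headers) throws IOException, InterruptedException;
}

/** Assumed implementation that performs a real HTTP GET. */
final class RealRemoteFetcher implements RemoteFetcher {
    private final HttpClient client = HttpClient.newHttpClient();

    /** Builds a GET request carrying every supplied header (header values must not be null). */
    static HttpRequest buildRequest(URI url, Map<String, String> headers) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(url).GET();
        headers.forEach(builder::header);
        return builder.build();
    }

    @Override
    public String get(URI url, Map<String, String> headers) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(url, headers), HttpResponse.BodyHandlers.ofString());
        // Treat anything other than 200 as a failure so the gatherer does not store error pages.
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }
}
```

Keeping the HTTP call behind a small interface is what allows a test to swap in a mock fetcher and run without a real server.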
MockGatheringPluginGathererTest.java
package com.example.mockgathering;

import java.util.HashMap;
import java.util.Map;

import org.junit.Assert;
import org.junit.Test;

import java.net.URI;

import com.example.mockgathering.remotefetch.RemoteFetcher;
import com.funnelback.plugin.gatherer.mock.MockPluginGatherContext;
import com.funnelback.plugin.gatherer.mock.MockPluginStore;

public class MockGatheringPluginGathererTest {

    /**
     * This is our mock remote fetcher; it is what we use to avoid making
     * HTTP requests in our test.
     *
     * This could have been in its own file like a normal class.
     */
    public static class MockRemoteFetcher implements RemoteFetcher {

        /**
         * When a URI is requested we will look for it in this map.
         */
        public Map<URI, String> uriToResult = new HashMap<>();

        @Override
        public String get(URI url, Map<String, String> headers) {
            // In our mock fetcher we ensure that the security-token is set.
            Assert.assertEquals("The security token was not set.", "password2", headers.get("security-token"));
            if (uriToResult.containsKey(url)) {
                return uriToResult.get(url);
            }
            throw new RuntimeException("404 could not find: " + url);
        }
    }

    /**
     * This shows how we mock making real web requests.
     */
    @Test
    public void testCustomGatherPlugin() throws Exception {
        MockPluginGatherContext mockContext = new MockPluginGatherContext();
        MockPluginStore mockStore = new MockPluginStore();

        // Configure the required config settings for the test:
        mockContext.setConfigSetting("plugin.mock-gathering.security-token", "password2");
        mockContext.setConfigSetting("plugin.mock-gathering.start-urls", "http://example.com/1");

        // Here we also create our mock RemoteFetcher; this way we don't make HTTP requests in our test.
        MockRemoteFetcher mockRemoteFetcher = new MockRemoteFetcher();

        // Let's add some data to our remote fetcher:
        mockRemoteFetcher.uriToResult.put(URI.create("http://example.com/1"), "{\n" +
            " \"content\":\"hello1\",\n" +
            " \"next\":\"http://example.com/2\"\n" +
            "}");
        mockRemoteFetcher.uriToResult.put(URI.create("http://example.com/2"), "{\n" +
            " \"content\":\"hello2\",\n" +
            " \"next\": null\n" +
            "}");

        MockGatheringPluginGatherer underTest = new MockGatheringPluginGatherer();

        // It is important to tell our gatherer which RemoteFetcher to use:
        // here, our mock RemoteFetcher.
        underTest.remoteFetcher = mockRemoteFetcher;

        // Now run it; it will use our mock remote fetcher so we can test it without needing a real server.
        underTest.gather(mockContext, mockStore);

        Assert.assertEquals("Check how many documents were stored.", 2, mockStore.getStored().size());
    }
}
Example: Custom gatherer plugin
The custom gatherer plugin generates a set of documents from data source configuration settings.
CustomGathererPluginGatherer.java
package com.example.customgatherer;

import com.funnelback.plugin.gatherer.PluginGatherContext;
import com.funnelback.plugin.gatherer.PluginGatherer;
import com.funnelback.plugin.gatherer.PluginStore;
import com.google.common.collect.ArrayListMultimap;

import java.net.URI;
import java.nio.charset.StandardCharsets;

/**
 * Demonstrates using a plugin to create documents in gathering phase.
 * This plugin reads from `collection.cfg`, creates documents and adds some metadata to those documents.
 */
public class CustomGathererPluginGatherer implements PluginGatherer {

    @Override
    public void gather(PluginGatherContext pluginGatherContext, PluginStore store) throws Exception {
        // Read from collection.cfg
        int docsToMake = Integer.parseInt(pluginGatherContext.getConfigSetting(PluginUtils.KEY_PREFIX + "number-of-documents-to-make"));
        String documentUrl = pluginGatherContext.getConfigSetting(PluginUtils.KEY_PREFIX + "document-url");

        for (int i = 0; i < docsToMake; i++) {
            ArrayListMultimap<String, String> metadata = ArrayListMultimap.create();

            // Add metadata
            metadata.put("Content-Type", "text/html; charset=UTF-8");
            metadata.put("total-docs", String.valueOf(docsToMake));
            metadata.put("this-doc-number", String.valueOf(i));

            // Store the documents
            store.store(
                new URI(documentUrl + i),
                "Hello world!".getBytes(StandardCharsets.UTF_8),
                metadata
            );
        }
    }
}
CustomGathererPluginGathererTest.java
package com.example.customgatherer;

import com.funnelback.plugin.gatherer.mock.MockPluginGatherContext;
import com.funnelback.plugin.gatherer.mock.MockPluginStore;

import org.junit.Assert;
import org.junit.Test;

import java.util.List;

public class CustomGathererPluginGathererTest {

    @Test
    public void testCustomGatherPlugin() throws Exception {
        MockPluginGatherContext mockContext = new MockPluginGatherContext();
        MockPluginStore mockStore = new MockPluginStore();
        CustomGathererPluginGatherer underTest = new CustomGathererPluginGatherer();

        // Set the collection.cfg settings
        mockContext.setConfigSetting(PluginUtils.KEY_PREFIX + "number-of-documents-to-make", "2");
        mockContext.setConfigSetting(PluginUtils.KEY_PREFIX + "document-url", "http://www.example.com/");

        // As the plugin gatherer is likely to interact with an external system you may need
        // to mock those interactions out. Until that is done you can still use this test to
        // try out your gatherer locally.
        underTest.gather(mockContext, mockStore);

        // This list holds the documents that the plugin stored.
        List<MockPluginStore.MockPluginStoreResult> resultList = mockStore.getStored();

        // Check that both documents were stored, then check the values set on the second one.
        Assert.assertEquals("Check how many documents were stored.", 2, resultList.size());
        Assert.assertEquals("Document URL should be taken from the collection.cfg", "http://www.example.com/" + 1, resultList.get(1).getUri().toString());
        Assert.assertEquals("Document should contain some content", "Hello world!", new String(resultList.get(1).getContent()));
        Assert.assertEquals("Content-Type should be set properly", "text/html; charset=UTF-8", resultList.get(1).getMetadata().get("Content-Type").get(0));
        Assert.assertEquals("total-docs metadata should be set to 2", "2", resultList.get(1).getMetadata().get("total-docs").get(0));
        Assert.assertEquals("this-doc-number metadata should be set to 1", String.valueOf(1), resultList.get(1).getMetadata().get("this-doc-number").get(0));
    }
}