Upgrading custom and local collections

Custom and local collections were used in previous versions of Funnelback to index unsupported data source types, or simple data sources where a single static file, or a small set of static files, was downloaded and processed.

This is not permitted in the DXP because running arbitrary custom code is a security risk, and it also prevents automatic upgrades of the search.

This guide outlines the process you need to follow when upgrading custom or local collections.

Local collections

Local collections are used to index data in place, where the data is stored on the local filesystem. The data is not processed in any way, so local collections don’t run any filters.

Historically, local collections were often used to download content using custom workflow (e.g. bash or Perl scripts) that gathered and processed the downloaded content.

Local collections can usually be replaced with a web data source, but the replacement will depend on what the custom workflow is doing.

If the custom workflow is just downloading some files, then this should be set up as a web data source with appropriate start URLs and, if applicable, include/exclude rules.
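For example, a workflow that downloaded a handful of report files over HTTP could be replaced with a web data source configured along these lines. This is a minimal sketch using legacy-style collection.cfg option names (start_url, include_patterns, exclude_patterns); the URLs and patterns are invented for illustration, and the exact option names should be checked against the current product documentation.

    # Seed the crawl from the page the old workflow downloaded
    start_url=https://example.com/reports/index.html

    # Restrict the crawl to the reports area (patterns are illustrative)
    include_patterns=example.com/reports
    exclude_patterns=/archive/,/drafts/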

If custom processing is performed, then appropriate filters or plugins should be configured to replace it.
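As a hedged illustration, custom processing is typically re-implemented as an entry in the filter chain. The legacy filter.classes option chained filters with colons; in the sketch below the first two entries are standard Funnelback filters and com.example.plugin.CustomScrubFilter is a hypothetical plugin-supplied filter, so treat the exact syntax and names as indicative only.

    # Chain of filters applied to each gathered document
    # (CustomScrubFilter is a hypothetical plugin-supplied filter)
    filter.classes=TikaFilterProvider:JSoupProcessingFilterProvider:com.example.plugin.CustomScrubFilter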

In some cases a local collection will need to be replaced with a custom collection that uses a plugin-supplied gatherer. This is typically the case when the local collection indexes a known but unsupported data source type, such as Instagram or Google Calendar.
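Enabling a plugin on a data source is normally done through the administration interface, but it corresponds to configuration along these lines. The plugin id and version below are placeholders for whichever gathering plugin replaces the local collection.

    # Enable a gathering plugin on the data source
    # (plugin id and version are illustrative)
    plugin.example-gatherer.enabled=true
    plugin.example-gatherer.version=1.0.0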

Custom collections

Custom collections in v15 had to implement custom gathering logic using a custom_gather.groovy script. The purpose of this script was to implement whatever logic was required to connect to and download content from the desired data source. Sometimes the gatherer also performed processing that would normally be handled by filters.

In the DXP, custom gatherers supplied in a Groovy file are not supported and must be replaced with a custom gatherer plugin, such as the Instagram or SFTP gathering plugin.
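The overall shape of a gatherer plugin is sketched below in Java. This is a hedged sketch only: PluginGatherer, PluginGatherContext, and PluginStore reflect the Funnelback plugin framework, but the exact package names and method signatures should be verified against the plugin development documentation, and the REST endpoint is invented for illustration.

    import com.funnelback.plugin.gatherer.PluginGatherContext;
    import com.funnelback.plugin.gatherer.PluginGatherer;
    import com.funnelback.plugin.gatherer.PluginStore;

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.Map;

    /**
     * Sketch of a gatherer plugin replacing a custom_gather.groovy
     * script: fetch records from a hypothetical REST endpoint and
     * store them so they can be filtered and indexed.
     */
    public class ExampleGatherer implements PluginGatherer {

        private final HttpClient client = HttpClient.newHttpClient();

        @Override
        public void gather(PluginGatherContext context, PluginStore store) throws Exception {
            // Hypothetical endpoint; a real plugin would read this from
            // the data source configuration via the context object.
            URI endpoint = URI.create("https://api.example.com/records");

            HttpRequest request = HttpRequest.newBuilder(endpoint).GET().build();
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());

            // Store the raw response against its URI; metadata handling
            // is simplified for the purposes of this sketch.
            store.store(endpoint,
                    response.body(),
                    Map.of("Content-Type", List.of("application/json")));
        }
    }

Like the Groovy scripts it replaces, the gatherer is only responsible for connecting and downloading; any content processing should be left to filters.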

Some custom gatherers implemented complicated logic for gathering from custom APIs. These should be replaced with a web data source if possible, which may require extended link extraction to be enabled so the crawler can follow links within the API responses.
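If the API responses are JSON or XML rather than HTML, the crawler will not find links in them by default. The legacy crawler options below illustrate one way extended link extraction was configured, using a regular expression to pull follow-on URLs out of JSON responses; the option names and pattern are indicative and should be checked against the current documentation.

    # Extract follow-on links from JSON responses using a regex
    # (legacy option names; pattern is illustrative)
    crawler.link_extraction_regular_expression="url":\s*"([^"]+)"
    crawler.link_extraction_group=1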

Some very old custom collections are used to gather content from supported social media sources such as YouTube. These should be converted to the corresponding supported data source type.

If a web data source can’t be used (e.g. because you need multiple API calls to fetch a single page, or you need to use an unsupported type of authentication), then you will need to consider either writing a new plugin, or using the DXP integrations or Job Runner service.