Filecopy data sources

This feature is not available to users of the Squiz Experience Cloud version of Funnelback.

A filecopy data source is used to index documents from a file share or a local disk. The index is built from a copy of the documents held in a local or remote filesystem directory.


An update will copy new or changed files from the source folder into the data source’s offline data directory from where the update will proceed as normal. Binary documents are converted into text, text content is indexed, and the offline view is swapped with the live view.

Create the data source

Filecopy data sources are created by following the data source creation steps and selecting filecopy from the list of data source types.

A filecopy data source is defined by the properties described under Configuration options.

Supported directories

Funnelback supports indexing the following types of directory:

Local directories

These are located on the search server and are addressed as local paths.

Windows file shares

These are file shares that are served using the SMB or CIFS protocols, as is standard in most Windows environments. They can be addressed as UNC paths. How the data source is specified will depend on where the data is located. For example, a filecopy data source might have:

  • For a local disk: filecopy.source=/var/documents/shared/

  • For a Windows file share: filecopy.source=\\fileserver\documents\ or filecopy.source=smb://fileserver/documents/

Note that on Linux operating systems, the default firewall rules may need to be altered to allow for SMB / CIFS name resolution.

Red Hat Linux provides instructions for mounting NFS file shares and also comes with SMB/CIFS support.

File shares mounted on a Windows machine can be indexed in a similar way, and will provide SMB/CIFS support. Note that drive letter mappings are made on a per-user basis, so remote file shares must be specified as UNC paths (e.g. \\fileserver\directory).
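The two address forms shown above for a Windows file share (a UNC path and an smb:// URL) differ only in their prefix and separators. The following is a minimal sketch of that mapping. The helper name unc_to_smb_url is hypothetical and is not part of Funnelback.

```python
def unc_to_smb_url(unc_path: str) -> str:
    """Convert a UNC path to the equivalent smb:// URL form."""
    # A UNC path starts with two backslashes followed by the server name.
    if not unc_path.startswith("\\\\"):
        raise ValueError(f"not a UNC path: {unc_path!r}")
    # Drop the two leading backslashes, then swap the remaining separators.
    return "smb://" + unc_path[2:].replace("\\", "/")


# The UNC example from this page, written with escaped backslashes.
print(unc_to_smb_url("\\\\fileserver\\documents\\"))
# prints: smb://fileserver/documents/
```

Either form can be used as the value of filecopy.source. The sketch only shows that both forms identify the same share.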

Document level security

Document level security is supported on Windows to ensure that users can only access the files they are authorized to see.

Serving fileshare results

Fileshare results are served by the user interface layer, which contacts the file share to retrieve the requested file and downloads it to the search user's browser. As part of this operation, it performs all required access checks to ensure that a user only sees documents they are authorized to see.

Document filtering

Apache Tika is used to convert binary document formats to text. Additional custom filtering can be applied through custom filters.

Additional file types (if supported by Tika) can be filtered by adding the types to both filecopy.filetypes and filter.tika.types.
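As an illustration of adding a type to both settings, the sketch below appends an extension to each value and prints the resulting configuration lines. It assumes that both settings hold comma-separated lists of file extensions. The starting values and the helper name add_filetype are hypothetical, so check the existing values of your data source before changing them.

```python
def add_filetype(current: str, extension: str) -> str:
    """Append an extension to a comma-separated list if it is not already present."""
    types = [t.strip() for t in current.split(",") if t.strip()]
    ext = extension.lstrip(".").lower()
    if ext not in types:
        types.append(ext)
    return ",".join(types)


# Hypothetical starting values, not Funnelback defaults.
settings = {
    "filecopy.filetypes": "pdf,docx,html",
    "filter.tika.types": "pdf,docx",
}

# Add the same extension to both settings so the file is copied and filtered.
for key, value in settings.items():
    print(f"{key}={add_filetype(value, 'odt')}")
# prints:
# filecopy.filetypes=pdf,docx,html,odt
# filter.tika.types=pdf,docx,odt
```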

Configuration options

The filecopy data source is defined by configuring the following properties:

Option Description


Enable/disable using the live view as a cache directory where pre-filtered text content can be copied from.


Whether to index the file names of files that failed to filter.


Filecopy sources that require a username to access files will use this setting as a domain for the user.


Filecopy data sources will exclude files which match this regular expression.


The list of filetypes (i.e. file extensions) that will be included by a filecopy data source.


If specified, filecopy data sources will only include files which match this regular expression.


If set, this limits the number of documents a filecopy data source will gather when updating.


Number of fetcher threads for interacting with the file share in a filecopy data source.


Number of worker threads for filtering and storing files in a filecopy data source.


Filecopy sources that require a password to access files will use this setting as a password.


Optional parameter to specify how long to delay between copy requests in milliseconds.


Sets the plugin used to collect security information on files (early binding document level security).


This is the file system path or URL that describes the source of data files.


If specified, this option is set to a file which contains a list of other files to copy, rather than using the filecopy.source. NOTE: Specifying this option will cause the filecopy.source to be ignored.


Specifies which storage class to be used by a filecopy data source (e.g. WARC, Mirror).


Filecopy sources that require a username to access files will use this setting as a username.


Main class used by the filecopier to walk a file tree.


Specifies which java classes should be used for filtering documents.


Specifies which file types to filter using the TikaFilterProvider.
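The file type list and the include and exclude regular expressions described above all narrow the set of files that is gathered. The sketch below shows one plausible way these three checks could combine. It is an illustration only, not Funnelback's implementation. It assumes that the patterns are tested against the full path with a partial match and that the exclude pattern takes precedence. The function name is_selected is hypothetical, and Python's re syntax may differ from the regular expression dialect Funnelback uses.

```python
import re
from pathlib import PurePosixPath


def is_selected(path, filetypes, include_pattern=None, exclude_pattern=None):
    """Return True if a file passes the extension, include and exclude checks."""
    # Extension check: compare the lower-cased suffix against the allowed types.
    ext = PurePosixPath(path).suffix.lstrip(".").lower()
    if ext not in filetypes:
        return False
    # Include check: when a pattern is given, the path must match it.
    if include_pattern and not re.search(include_pattern, path):
        return False
    # Exclude check: a matching path is always rejected.
    if exclude_pattern and re.search(exclude_pattern, path):
        return False
    return True


allowed = {"pdf", "docx"}
print(is_selected("/var/documents/shared/report.pdf", allowed))  # prints: True
print(is_selected("/var/documents/shared/archive/old.pdf", allowed,
                  exclude_pattern=r"/archive/"))  # prints: False
```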

Filecopier log level

This functionality is only available to Funnelback system administrators.

The steps below set the log level for a filecopy collection.

  1. Copy $SEARCH_HOME/conf/log4j2.xml.default to $SEARCH_HOME/conf/<collection>/log4j2.xml

  2. Edit the file and update the line below to the desired level.

    <Logger name="com.funnelback" level="info"/>
    <!-- eg. increase to debug level: -->
    <Logger name="com.funnelback" level="debug"/>
  3. Save the file and start an update. Debug messages should now appear in filecopier.log.
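The edit in step 2 can also be scripted. The following is a minimal sketch, assuming the Logger element is written with the name attribute followed by the level attribute, as in the lines shown above. The helper name set_logger_level is hypothetical. The sketch works on the file contents as a string, so reading and writing the copied log4j2.xml is left to the caller.

```python
import re


def set_logger_level(xml_text: str, logger_name: str, level: str) -> str:
    """Return xml_text with the level attribute of the named Logger replaced."""
    pattern = re.compile(
        r'(<Logger\s+name="' + re.escape(logger_name) + r'"\s+level=")[^"]*(")'
    )
    # Keep the text before and after the level value and swap only the value.
    return pattern.sub(lambda m: m.group(1) + level + m.group(2), xml_text)


sample = '<Logger name="com.funnelback" level="info"/>'
print(set_logger_level(sample, "com.funnelback", "debug"))
# prints: <Logger name="com.funnelback" level="debug"/>
```

A string substitution is used rather than an XML parser so that comments and formatting in the copied file are left untouched.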