File system (filecopy) data sources

This feature is not available in the Squiz DXP.

A file system (filecopy) data source is used for indexing documents from a file system. The file system can be accessed locally or remotely.

An update copies new or changed files from the source folder into the data source’s offline data directory, and the update then proceeds as normal: binary documents are converted into text, the text content is indexed, and the offline view is swapped with the live view.
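The copy step described above can be pictured with a short Python sketch. This is an illustration only, not Funnelback's implementation: the function name copy_new_or_changed and the rule used to detect a change (file size and modification time) are assumptions made for the example.

```python
import shutil
from pathlib import Path


def copy_new_or_changed(source: Path, offline: Path) -> list[Path]:
    """Copy files that are missing from `offline`, or that differ from
    their existing copy in size or have a newer modification time.

    Hypothetical helper for illustration; the change-detection rule is
    an assumption, not the product's documented behaviour.
    """
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        dest = offline / src.relative_to(source)
        if dest.exists():
            s, d = src.stat(), dest.stat()
            if s.st_size == d.st_size and s.st_mtime_ns <= d.st_mtime_ns:
                continue  # same size and not newer: treat as unchanged
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves the modification time
        copied.append(dest)
    return copied
```

Calling the function twice in a row on an unchanged source folder copies nothing the second time, which mirrors the "new or changed files" behaviour described above.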

Create the data source

File system data sources are created by following the data source creation steps and selecting filecopy from the list of data source types.

A file system data source is defined by the following properties:

Supported directories

Funnelback supports indexing several types of directory. These include:

Local directories

These are located on the search server and are addressed as local paths.

Windows file shares

These are file shares that are served using the SMB or CIFS protocols, as is standard in most Windows environments. They can be addressed as UNC paths. How the data source is specified will depend on where the data is located. For example, a file system data source might have:

  • For a local disk: filecopy.source=/var/documents/shared/

  • For a Windows file share: filecopy.source=\\fileserver\documents\ or filecopy.source=smb://fileserver/documents/
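The filecopy.source forms listed above can be told apart by their prefix. The Python sketch below is a minimal illustration: the helper names classify_source and unc_to_smb_url are hypothetical, and it assumes that a UNC path and its smb:// equivalent address the same share, as the two examples above suggest.

```python
def classify_source(source: str) -> str:
    """Return 'smb-url', 'unc' or 'local' for a filecopy.source value.

    Hypothetical helper; the labels are made up for this example.
    """
    if source.lower().startswith("smb://"):
        return "smb-url"
    if source.startswith("\\\\"):
        return "unc"
    return "local"


def unc_to_smb_url(unc: str) -> str:
    """Rewrite a UNC path such as \\\\fileserver\\documents\\ as an smb:// URL."""
    if not unc.startswith("\\\\"):
        raise ValueError(f"not a UNC path: {unc!r}")
    # Drop the leading double backslash and flip the remaining separators.
    return "smb://" + unc[2:].replace("\\", "/")
```

For example, classify_source("/var/documents/shared/") returns "local", and unc_to_smb_url applied to \\fileserver\documents\ returns smb://fileserver/documents/.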

Note that on Linux operating systems, the default firewall rules may need to be altered to allow for SMB / CIFS name resolution.

Red Hat Linux provides instructions for mounting NFS file shares and also comes with SMB/CIFS support.

File shares mounted on a Windows machine can be indexed in a similar way, and will provide SMB/CIFS support. Please note that drive letter mappings are done on a per-user basis, so paths must be specified as UNC paths (e.g. \\fileserver\directory) for remote file shares.

Serving file system results

File system results are served by the user interface layer. It contacts the file system to retrieve the requested file and downloads it to the search user's browser.

Document filtering

Apache Tika is used to convert binary document formats to text. Additional filtering can be applied using Funnelback plugins.

Additional file types (if supported by Tika) can be filtered by adding the types to filecopy.filetypes and filter.tika.types.

If you’ve updated the filter chain or how a filter works, you may need to disable the filecopy.cache to ensure the changes are applied to any previously processed documents.

Configuration options

The file system data source is defined by configuring the following properties:

Standard options

Option Description

Filter options

Option Description

Advanced options

Option Description

Filecopier log level

This functionality is only available to Funnelback system administrators.

The steps below set the log level for a filecopy data source.

  1. Copy $SEARCH_HOME/conf/log4j2.xml.default to $SEARCH_HOME/conf/<collection>/log4j2.xml

  2. Edit the file and update the line below to the desired level.

    <Logger name="com.funnelback" level="info"/>
    
    <!-- e.g. increase to debug level: -->
    <Logger name="com.funnelback" level="debug"/>

  3. Save the file and start an update. Debug messages should now appear in filecopier.log.
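Step 2 can also be scripted rather than edited by hand. The Python sketch below is one possible approach, not a Funnelback-provided tool: the function name set_logger_level is hypothetical, and it assumes the Logger elements in log4j2.xml carry no XML namespace.

```python
import xml.etree.ElementTree as ET


def set_logger_level(config_xml: str, logger_name: str, level: str) -> str:
    """Return the configuration text with the named Logger set to `level`.

    Hypothetical helper. The default ElementTree parser discards XML
    comments, and the returned text has no XML declaration.
    """
    root = ET.fromstring(config_xml)
    matched = 0
    for logger in root.iter("Logger"):
        if logger.get("name") == logger_name:
            logger.set("level", level)
            matched += 1
    if matched == 0:
        raise ValueError(f"no Logger named {logger_name!r} found")
    return ET.tostring(root, encoding="unicode")
```

The function works on the file's text, so the caller reads the copied log4j2.xml, passes its contents along with "com.funnelback" and "debug", and writes the result back. Because comments are dropped on the way through, hand editing is the safer choice when the file's comments need to be kept.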

See also