Filecopy data sources

This feature is not available to users of the Squiz Experience Cloud version of Funnelback.

A filecopy data source indexes documents from a file share or a local disk. The index is built from a copy of the documents taken from a local or remote filesystem directory.

An update copies new or changed files from the source folder into the data source’s offline data directory, and then proceeds as normal: binary documents are converted into text, text content is indexed, and the offline view is swapped with the live view.
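The copy step can be pictured with a minimal Python sketch. It is an illustration only and not Funnelback's implementation: the function name `copy_new_or_changed` is hypothetical, and treating a file as unchanged when its size and modification time match is an assumption made for the example.

```python
import shutil
from pathlib import Path
from typing import List


def copy_new_or_changed(source: Path, offline: Path) -> List[Path]:
    """Copy files that are missing from, or differ from, the offline copy.

    Returns the relative paths of the files that were copied.
    """
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = offline / rel
        if dst.exists():
            s, d = src.stat(), dst.stat()
            # Assumption: same size and same modification time (to the
            # second) means the file has not changed since the last copy.
            if s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime):
                continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied.append(rel)
    return copied
```

Running the function twice against an unchanged source folder copies nothing the second time, which is the behaviour the paragraph above describes for unchanged files.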

Create the data source

Filecopy data sources are created by following the data source creation steps and selecting filecopy from the list of data source types.

A filecopy data source is defined by the properties listed under Configuration options.

Supported directories

Funnelback supports indexing several types of directory:

Local directories

These are located on the search server and are addressed as local paths.

Windows file shares

These are file shares served using the SMB or CIFS protocols, as is standard in most Windows environments. They can be addressed as UNC paths.

How the data source is specified depends on where the data is located. For example, a filecopy data source might have:

  • For a local disk: filecopy.source=/var/documents/shared/

  • For a Windows file share: filecopy.source=\\fileserver\documents\ or filecopy.source=smb://fileserver/documents/
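The two forms shown for a Windows file share differ only in their prefix and separators. The following Python sketch rewrites one into the other. The helper name `unc_to_smb_url` is hypothetical, and the sketch assumes the path contains no characters that need URL escaping (such as spaces).

```python
def unc_to_smb_url(unc_path: str) -> str:
    """Rewrite a UNC path (two leading backslashes) as an smb:// URL."""
    if not unc_path.startswith("\\\\"):
        raise ValueError("not a UNC path: " + unc_path)
    # Drop the leading double backslash, then swap the separators.
    return "smb://" + unc_path[2:].replace("\\", "/")


# Backslashes are doubled here only because of Python string escaping.
print(unc_to_smb_url("\\\\fileserver\\documents\\"))
```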

Note that on Linux operating systems, the default firewall rules may need to be altered to allow SMB/CIFS name resolution.

Red Hat Linux provides instructions for mounting NFS file shares and also comes with SMB/CIFS support.

File shares mounted on a Windows machine can be indexed in a similar way, and will provide SMB/CIFS support. Note that drive letter mappings are made on a per-user basis, so remote file shares must be specified as UNC paths (e.g. \\fileserver\directory).

Document level security

Document level security is supported on Windows to ensure that users can only access the files they are authorized to see.

Serving fileshare results

Fileshare results are served by the user interface layer, which contacts the fileshare to retrieve the requested file and downloads it to the search user's browser. As part of this operation it performs all required access checks to ensure that a user only sees documents they are authorized to see.

Document filtering

Apache Tika is used to convert binary document formats to text. Additional custom filtering can be applied through custom filters.

Additional file types (if supported by Tika) can be filtered by adding the types to both filecopy.filetypes and filter.tika.types.
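Because a new type has to be added to both settings, it is easy to update one and forget the other. The Python sketch below keeps the two in step. It is hedged on two assumptions that this page does not state: that each setting holds a comma-separated list of extensions, and that the settings are available as a plain dictionary. The helper name `add_file_type` and the sample values are hypothetical.

```python
def add_file_type(settings: dict, extension: str) -> dict:
    """Return a copy of settings with extension present in both type lists."""
    updated = dict(settings)
    for key in ("filecopy.filetypes", "filter.tika.types"):
        # Assumption: each value is a comma-separated list of extensions.
        types = [t.strip() for t in updated.get(key, "").split(",") if t.strip()]
        if extension not in types:
            types.append(extension)
        updated[key] = ",".join(types)
    return updated


# Hypothetical starting values, for illustration only.
current = {"filecopy.filetypes": "doc,pdf", "filter.tika.types": "doc,pdf"}
for key, value in add_file_type(current, "odt").items():
    print(key + "=" + value)
```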

Configuration options

A filecopy data source is defined by configuring the following properties:

Option Description

filecopy.cache

Enable/disable using the live view as a cache directory where pre-filtered text content can be copied from.

filecopy.discard_filtering_errors

Whether or not to index the file names of files that failed to filter.

filecopy.domain

Filecopy sources that require a username to access files will use this setting as a domain for the user.

filecopy.exclude_pattern

Filecopy data sources will exclude files which match this regular expression.

filecopy.filetypes

The list of filetypes (i.e. file extensions) that will be included by a filecopy data source.

filecopy.include_pattern

If specified, filecopy data sources will only include files which match this regular expression.

filecopy.max_files_stored

If set, this limits the number of documents a filecopy data source will gather when updating.

filecopy.num_fetchers

Number of fetcher threads for interacting with the fileshare in a filecopy data source.

filecopy.num_workers

Number of worker threads for filtering and storing files in a filecopy data source.

filecopy.passwd

Filecopy sources that require a password to access files will use this setting as a password.

filecopy.request_delay

Optional parameter to specify how long to delay between copy requests in milliseconds.

filecopy.security_model

Sets the plugin used to collect security information on files (early binding document level security).

filecopy.source

This is the file system path or URL that describes the source of data files.

filecopy.source_list

If specified, this option is set to a file which contains a list of other files to copy, rather than using the filecopy.source. NOTE: Specifying this option will cause the filecopy.source to be ignored.

filecopy.store_class

Specifies which storage class is used by a filecopy data source (e.g. WARC, Mirror).

filecopy.user

Filecopy sources that require a username to access files will use this setting as a username.

filecopy.walker_class

Main class used by the filecopier to walk a file tree.

filter.classes

Specifies which Java classes should be used for filtering documents.

filter.tika.types

Specifies which file types to filter using the TikaFilterProvider.
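As a rough illustration of how filecopy.filetypes, filecopy.include_pattern and filecopy.exclude_pattern could combine to select files, consider the Python sketch below. It is not Funnelback's implementation. The function name `should_copy` is hypothetical, and three details are assumptions made for the example: patterns are searched against the full path, extensions are compared case-insensitively, and an exclude match overrides an include match.

```python
import re
from pathlib import PurePosixPath
from typing import Iterable, Optional


def should_copy(path: str,
                filetypes: Iterable[str],
                include_pattern: Optional[str] = None,
                exclude_pattern: Optional[str] = None) -> bool:
    """Decide whether a file would be gathered under the given settings."""
    # Assumption: file types are matched on the extension, ignoring case.
    extension = PurePosixPath(path).suffix.lstrip(".").lower()
    if extension not in {t.lower() for t in filetypes}:
        return False
    # Assumption: when set, the include pattern must match somewhere in the path.
    if include_pattern and not re.search(include_pattern, path):
        return False
    # Assumption: an exclude match always wins.
    if exclude_pattern and re.search(exclude_pattern, path):
        return False
    return True
```

For example, with file types `pdf` and `doc` and an exclude pattern of `/archive/`, the path `/var/documents/shared/report.pdf` is selected while `/var/documents/shared/archive/old.doc` is not.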

© 2015- Squiz Pty Ltd