Data sources

A data source contains configuration and indexes relating to a set of information resources such as web pages, documents or social media posts.

Each data source implements a gatherer (or connector) that is responsible for connecting to the content repository and gathering the information to include in the index.

A data source is similar to a non-meta collection in older versions of Funnelback.

Data source types

Funnelback includes a number of data source types:

  • Web

    Web content gathered using a web crawler. Includes websites and other resources made available via HTTP.

  • Acalog

    Acalog content gathered via the Acalog APIs, using a custom data source and the Acalog gatherer plugin.

  • Custom

    Information gathered using a custom gatherer. Commonly used to index content obtained via APIs or an SDK. The custom gatherer is implemented via a Funnelback plugin (a minimal gatherer sketch follows this list).

  • Database

    Information gathered from the result of running an SQL query against a compatible database.

  • Directory

    Information gathered from the result of running a directory (e.g. LDAP) query against a compatible directory.

  • Facebook

    Facebook content gathered using Facebook’s APIs.

  • Fileshare (filecopy)

    Documents gathered from a fileshare.

  • Flickr

    Flickr content gathered using Flickr’s APIs.

  • Instagram

    Instagram content gathered from Instagram’s APIs, using a custom data source and the Instagram gatherer plugin.

  • SFTP

    Documents gathered from an SFTP server.

  • Twitter

    Twitter content gathered using Twitter’s APIs.

  • Vimeo

    Vimeo content gathered using Vimeo’s APIs.

  • YouTube

    YouTube content gathered using YouTube’s APIs.

  • Index only (Push)

    An index-only data source allows the indexing of content that is added via an API, and does not include any gathering logic (a hedged Push API example follows this list).
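The gathering step differs by type, but two of the cases above are worth illustrating. For a custom data source, the gathering logic is supplied by a plugin. The sketch below shows the general shape of such a gatherer: it stores a single hard-coded document under a URI together with some metadata. The package names, the PluginGatherer, PluginGatherContext and PluginStore types, and the Guava ListMultimap metadata parameter are recalled from the plugin development documentation and should be treated as assumptions to verify against the current plugin SDK; a real gatherer would call an external API or SDK and store each record it retrieves.

import java.net.URI;
import java.nio.charset.StandardCharsets;

import com.funnelback.plugin.gatherer.PluginGatherContext;
import com.funnelback.plugin.gatherer.PluginGatherer;
import com.funnelback.plugin.gatherer.PluginStore;
import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.ListMultimap;

// Minimal sketch of a plugin gatherer (interface and package names assumed;
// verify against the plugin SDK). Stores one hard-coded "document". A real
// gatherer would read its configuration (API keys, endpoints, etc.) from the
// context and store every item fetched from the remote system.
public class ExampleGatherer implements PluginGatherer {

    @Override
    public void gather(PluginGatherContext context, PluginStore store) {
        String body = "<html><body><h1>Hello from a custom gatherer</h1></body></html>";

        ListMultimap<String, String> metadata = ArrayListMultimap.create();
        metadata.put("Content-Type", "text/html; charset=UTF-8");

        // Each stored record becomes a document in the data source's index.
        store.store(URI.create("https://example.com/hello.html"),
                body.getBytes(StandardCharsets.UTF_8),
                metadata);
    }
}

An index only (push) data source is populated differently: content is added directly through the Push API rather than gathered. The following sketch sends one HTML document using Java's built-in HTTP client. The server address, collection name and endpoint path are placeholders, and authentication is omitted; check the Push API documentation for the exact URL scheme and credentials required by your installation.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Illustrative only: pushes a single HTML document into a push data source.
// The base URL, collection name and endpoint path below are placeholders.
public class PushDocument {
    public static void main(String[] args) throws Exception {
        String serverBase = "https://funnelback.example.com";    // placeholder server
        String collection = "example-push";                      // placeholder data source
        String documentKey = "https://example.com/hello.html";   // key the document is stored under

        String html = "<html><body><h1>Hello from the Push API</h1></body></html>";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(serverBase + "/push-api/v2/collections/" + collection
                        + "/documents?key=" + URLEncoder.encode(documentKey, StandardCharsets.UTF_8)))
                .header("Content-Type", "text/html")
                .PUT(HttpRequest.BodyPublishers.ofString(html))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}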

Populating a data source

Figure: data source update steps.

A (non-push) data source update follows these high-level steps:

  1. The data is gathered. For example, for a web data source the websites are crawled to download all HTML files and other documents.

  2. All "binary" documents are filtered to extract plain text. For example, PDF files are processed to extract their text (a conceptual sketch of this step follows the list).

  3. The documents are indexed: word lists and other information are processed into Funnelback indexes. The index is then used to answer user queries.
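As a conceptual illustration of step 2, the snippet below extracts plain text from a binary document using Apache Tika, a widely used text-extraction library. This is not Funnelback's filter framework, just the general idea behind it; it assumes the Tika libraries (tika-core and tika-parsers) are on the classpath and a sample PDF is available.

import java.io.File;

import org.apache.tika.Tika;

// Conceptual illustration of the filtering step only: turning a binary
// document into plain text so it can be indexed. Requires Apache Tika on the
// classpath; this is not Funnelback's own filter framework.
public class ExtractText {
    public static void main(String[] args) throws Exception {
        File document = new File(args.length > 0 ? args[0] : "example.pdf");
        String text = new Tika().parseToString(document);  // binary in, plain text out
        System.out.println(text);
    }
}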

All of this gathering, filtering and indexing work occurs in an offline area to avoid disrupting the current live view, which is being used for query processing. If the update process completes successfully, the live and offline views are swapped, making the new indexes available for querying.
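The offline/live arrangement can be pictured as two sibling directories: an update writes into the offline one, and only a successful update swaps the two. The sketch below models that idea with plain directory renames. It is a conceptual illustration, not Funnelback's actual view-swapping code, and the directory names and runUpdate placeholder are made up.

import java.io.IOException;
import java.nio.file.*;

// Conceptual sketch only: build new indexes in an "offline" directory, then
// swap it with the "live" directory if (and only if) the update succeeded,
// so queries keep hitting a consistent index throughout the update.
public class OfflineLiveSwap {
    public static void main(String[] args) throws IOException {
        Path live = Paths.get("indexes/live");
        Path offline = Paths.get("indexes/offline");
        Path temp = Paths.get("indexes/swap-tmp");

        Files.createDirectories(live);
        Files.createDirectories(offline);

        boolean updateSucceeded = runUpdate(offline);

        if (updateSucceeded) {
            // Three-way rename so the previous live view becomes the new offline view.
            Files.move(live, temp);
            Files.move(offline, live);
            Files.move(temp, offline);
            System.out.println("Swapped: new indexes are now live.");
        } else {
            System.out.println("Update failed: live view left untouched.");
        }
    }

    private static boolean runUpdate(Path offline) {
        // Placeholder for the gather -> filter -> index steps described above.
        return true;
    }
}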

See: Gather and index update flows for detailed information on all the update steps.

Manage a data source

The data source management screen is accessed in the following way:

  1. From the search dashboard home page, locate your search package in the main listing.

  2. Click on the data sources tab for the search package.

  3. Click on the name of the data source that you wish to edit, or select configuration from the settings menu.

For details on how to manage Funnelback data sources, see the following: