Data sources

A data source contains configuration and indexes relating to a set of information resources such as web pages, documents or social media posts.

Each data source implements a gatherer (or connector) that is responsible for connecting to the content repository and gathering the information to include in the index.

A data source is similar to a non-meta collection in older versions of Funnelback.

Data source types

Funnelback includes a number of data source types:

  • Web: Web content gathered using a web crawler. Includes websites and other resources made available via HTTP.

  • Custom: Information gathered using a custom gatherer. Commonly used to index content obtained via APIs or an SDK. The custom gatherer is implemented via a Funnelback plugin.

  • Database: Information gathered from the result of running an SQL query against a compatible database.

  • Directory: Information gathered from the result of running a directory (e.g. LDAP) query against a compatible directory.

  • Facebook: Facebook content gathered using Facebook’s APIs.

  • Filecopy: Documents gathered from a fileshare.

  • Flickr: Flickr content gathered using Flickr’s APIs.

  • Twitter: Twitter content gathered using Twitter’s APIs.

  • YouTube: YouTube content gathered using YouTube’s APIs.

  • Index only (Push): An index-only data source that allows the indexing of content that is added via an API. An index-only data source does not include gathering logic.
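As an illustration of what a gatherer does, the following is a minimal sketch of the idea behind the Database type: run an SQL query and treat each result row as one document. It is not Funnelback's implementation. The function name `gather_from_database`, the `staff` table and the use of an in-memory SQLite database are all assumptions made for the example.

```python
import sqlite3


def gather_from_database(connection, query):
    """Run an SQL query and yield one document (a dict) per result row."""
    cursor = connection.execute(query)
    # cursor.description holds one tuple per column; the first item is the name.
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))


# Hypothetical in-memory database standing in for a real content repository.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE staff (id INTEGER, name TEXT, role TEXT)")
connection.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [(1, "Ada", "Engineer"), (2, "Grace", "Admiral")],
)

documents = list(
    gather_from_database(connection, "SELECT id, name, role FROM staff ORDER BY id")
)
```

Each dictionary in `documents` corresponds to one row, which a later step could filter and index as a single record.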

Populating a data source

[Figure: Funnelback update steps (Fb-update-steps.png)]

A data source is populated in the following order:

  1. The data is gathered. For example, for a web data source the websites are crawled to download all HTML files and other documents.

  2. All "binary" documents are filtered to extract plain text. For example, PDF files will be processed to extract the text.

  3. The documents will be indexed: word lists and other information will be processed into Funnelback indexes. The index is then used to answer user queries.

All of this work occurs in an offline area so that it does not disrupt the current live view, which is being used for query processing. If the update completes successfully, the live and offline views are swapped, making the new indexes available for querying. If the update fails, the live view is left unchanged.
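The update cycle described above can be sketched conceptually as follows. This is a simplified model under stated assumptions, not Funnelback's actual code: the `DataSource` class, the helper functions, and the in-memory dictionaries used as "views" are all hypothetical, and the filtering step simply decodes bytes as a stand-in for real text extraction from formats such as PDF.

```python
def gather(repository):
    """Step 1: collect raw documents (URL -> content) from the repository."""
    return dict(repository)


def filter_documents(raw_documents):
    """Step 2: reduce each document to plain text.

    Binary content (bytes here) is decoded as a stand-in for text extraction.
    """
    return {
        url: content.decode("utf-8") if isinstance(content, bytes) else content
        for url, content in raw_documents.items()
    }


def build_index(texts):
    """Step 3: build a simple inverted index (word -> set of URLs)."""
    index = {}
    for url, text in texts.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index


class DataSource:
    """Hypothetical data source with a live view and an offline view."""

    def __init__(self):
        self.views = {"live": {}, "offline": {}}

    def update(self, repository):
        """Build a new index offline; swap views only if every step succeeds."""
        try:
            new_index = build_index(filter_documents(gather(repository)))
        except Exception:
            return False  # update failed: the live view is untouched
        self.views["offline"] = new_index
        self.views["live"], self.views["offline"] = (
            self.views["offline"],
            self.views["live"],
        )
        return True

    def query(self, word):
        """Queries are always answered from the live view."""
        return sorted(self.views["live"].get(word.lower(), set()))


source = DataSource()
source.update({
    "https://example.com/a": "Hello world",
    "https://example.com/b.pdf": b"hello PDF",
})
results = source.query("hello")
```

Because the new index is built separately and the swap happens only at the end, a query issued during an update, or after a failed one, still sees the previous live index.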

Manage a data source

For details on how to manage Funnelback data sources, see the following:

© 2015- Squiz Pty Ltd