A collection is a set of data that has been gathered from a data source, indexed and made available for searching. Collection types are based on their data sources:
- A web site or set of sites. The data is gathered using HTTP or HTTPS, and plain text is extracted from binary document types such as MS Word and PDF.
- A local file-system. The data is not gathered, but indexed in place.
- A file-system. The data is gathered by copying files and the text is extracted from binary documents.
- A database. The data is gathered using a JDBC driver to connect to the database and selecting one or more tables. The data is stored locally as XML.
- A directory (generally of people). The data is gathered using a JNDI driver to access an Active Directory or LDAP directory. The data is stored locally as XML.
- A TRIM collection that can be searched while it's crawled.
- A collection specifically designed for searching content from the Squiz Matrix CMS.
- A collection where data is 'pushed' into the index through an API rather than being gathered by Funnelback.
- A collection where data is gathered by a custom script.
- A custom collection where data is gathered from a social media website.
- A grouping of one or more collections, which allows a single query to search the data in all of them.
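The idea of querying a group of collections as one can be illustrated with a small sketch. This is a toy model only, not Funnelback's implementation: the functions `search` and `search_group`, and the representation of a collection as a dictionary of document id to text, are assumptions made for illustration.

```python
def search(collection, term):
    """Return ids of documents in one collection whose text contains the term."""
    term = term.lower()
    return [doc_id for doc_id, text in collection.items() if term in text.lower()]


def search_group(group, term):
    """Query every collection in the group and tag each hit with its collection name.

    `group` maps a (hypothetical) collection name to that collection's documents.
    """
    hits = []
    for name, collection in group.items():
        hits.extend((name, doc_id) for doc_id in search(collection, term))
    return hits


# Two hypothetical component collections and a group that spans both.
web = {"index.html": "Welcome to the library"}
staff = {"jsmith": "Jane Smith, library services"}
group = {"web": web, "staff": staff}
results = search_group(group, "library")
```

Each hit keeps the name of the collection it came from, so a caller can still tell the sources apart after the results are combined.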
Populating a collection
A collection is populated in the following order:
- The data is gathered. For example, for a web collection the web sites will be crawled to download all HTML files and other documents.
- All "binary" documents are filtered to extract plain text. For example, PDF files will be processed to extract the text.
- The documents will be indexed: word lists and other information will be processed into Funnelback indexes. The index is then used to answer user queries.
All of this work occurs in an offline area so that it does not disrupt the current live view, which is being used for query processing. If the update process completes successfully, the live and offline views are swapped, making the new indexes available for querying.
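The sequence above (gather, filter, index, then swap views on success) can be sketched as follows. This is a simplified illustration under stated assumptions, not Funnelback's actual code: the `Collection` class and the `gather`, `filter_text`, `build_index`, `update` and `query` functions are hypothetical names, the data source is modelled as a dictionary, and bytes stand in for binary documents.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Toy model of a collection with a live view and an offline view."""
    live: dict = field(default_factory=dict)     # index used to answer queries
    offline: dict = field(default_factory=dict)  # index being rebuilt


def gather(source):
    """Step 1: fetch raw documents (the source is modelled as id -> raw content)."""
    return dict(source)


def filter_text(raw):
    """Step 2: extract plain text; bytes stand in for 'binary' documents."""
    return raw.decode("utf-8", errors="ignore") if isinstance(raw, bytes) else raw


def build_index(docs):
    """Step 3: build a word -> set of document ids mapping."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def update(collection, source):
    """Run all steps against the offline view; swap views only on success."""
    try:
        docs = {doc_id: filter_text(raw) for doc_id, raw in gather(source).items()}
        collection.offline = build_index(docs)
    except Exception:
        return False  # the live view is untouched, so queries keep working
    collection.live, collection.offline = collection.offline, collection.live
    return True


def query(collection, word):
    """Answer a query from the live view only."""
    return sorted(collection.live.get(word.lower(), set()))
```

Because queries read only the live view, a failed update leaves search results exactly as they were before the update started.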
For details on how to manage Funnelback collections, see the following:
- Creating collections
- Editing collections
- Updating collections
- Deleting collections
- Scheduling collection updates
- Viewing collection status