Collection.cfg

Introduction

The file $SEARCH_HOME/conf/$COLLECTION_NAME/collection.cfg is the main configuration file for a collection. The following tables describe the options that are used in the configuration file. Note that some options are specific to the collection's type, while others are used for every collection.

The format of the file is a simple name=value pair per line. The values $SEARCH_HOME and $COLLECTION_NAME are automatically expanded to the Funnelback installation path and the name of the current collection respectively.
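
For illustration, a minimal collection.cfg might look like the following sketch. The option names all appear in the tables below, but the values are invented placeholders (example hostnames, names and addresses), not recommended settings:

    collection=intranet
    collection_root=$SEARCH_HOME/data/$COLLECTION_NAME
    gather=crawl
    start_url=http://intranet.example.com/
    service_name=Intranet Search
    admin_email=search.admin@example.com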

The collection.cfg file is created when a collection is created, and it may be modified whenever the collection is updated.

Standard Funnelback default values for each configuration option are defined in $SEARCH_HOME/conf/collection.cfg.default. Server-wide default values may be configured by adding them to the file $SEARCH_HOME/conf/collection.cfg.

Configuration options

A

Option Description
access_alternate Switch the user to an alternate collection if access_restriction applies.
access_restriction Restricts access by listing allowable hostnames or IP addresses. Only users with matching hostname or IP address can search.
admin.undeletable If set to "true" this collection cannot be deleted from the Administration interface.
admin_email Email address of administrator to whom an email is sent after each collection update.
analytics.data_miner.range_in_days Length of time range (in days) the analytics data miner will go back from the current date when mining query and click log records.
analytics.outlier.exclude_collection Disable query spike detection for a collection
analytics.reports.max_day_resolution_daterange Length of time range (in days) to allow in a custom daterange in the query reports UI.
analytics.reports.max_facts_per_dimension_combination Advanced setting: controls the amount of data that is stored by query reports.
analytics.reports.checkpoint_rate Advanced setting: controls the rate at which the query reports system checkpoints data to disk.
analytics.reports.disable_incremental_reporting Disable incremental reports database updates. If set all existing query and click logs will be processed for each reports update.
annie.index_opts Specify options for the "annie-a" annotation indexing program.

C

Option Description
changeover_percent The new crawl only goes live if the ratio of new vs. old documents gathered is greater than this amount (e.g. 50%).
click_data.archive_dirs The directories that contain archives of click logs to be included in producing indexes.
click_data.num_archived_logs_to_use The number of archived click logs to use from each archive directory
click_data.use_click_data_in_index A boolean value indicating whether or not click information should be included in the index.
click_data.week_limit Optional restriction of click data to a set number of weeks into the past.
click_tracking Enable or disable click tracking.
collection The internal name of a collection.
collection_root Location of a collection's crawl, index, query logs etc
collection_type Type of collection.
connector.additional.max_files_stored Maximum number of entities the connector should gather before stopping (default is unlimited)
connector.classname Type of connector class to use.
connector.credentials.domain Domain for specified user when using connector to interact with a repository.
connector.credentials.Password Password for specified user when using connector to interact with a repository.
connector.credentials.UserName Username used by connector when interacting with a repository server.
connector.custom_action_java_class Allows a custom java class to modify data extracted from an enterprise repository before indexing.
connector.discover Enables repository discovery instead of targeting a specific repository. For example, with the Exchange connector each mailbox is a different repository.
connector.change_detection Enable/disable gathering of only changed content.
connector.mapping.filename Sets which property of gathered items to use as filenames.
connector.maximum_concurrent_connections Maximum number of concurrent connections allowed to the repository
connector.repository.exclude_pattern Repositories with a name matching this pattern will be ignored when discovering repositories.
connector.repository.include_pattern Repositories with a name matching this pattern will be included when discovering repositories.
connector.selection_policy.class Class used to control selection of content to gather
contentoptimiser.non.admin.access Controls non-admin access to the content optimiser
continous-updating.access-port Specifies network port to access the restful API of continuous updating
continous-updating.secure-access-port Specifies the https network port to access the restful API of continuous updating
continous-updating.enable-secure-access Specifies if the restful api should use https
continous-updating.configuration-refresh-interval Controls automatic refreshing of configuration options
continous-updating.logfile-pattern Specifies format of archived log files
continous-updating.syslog-facility Specifies the syslog facility type.
continous-updating.log-generation-limit Specifies number of log generations to retain
continous-updating.log-pattern Specifies log message format
continous-updating.syslog-host Specifies the syslog host
continous-updating.log-interval Specifies log rolling over interval
continous-updating.log-level Specifies the reporting detail level of log messages
continous-updating.log-max-size Specifies the maximum size of a log file
continous-updating.log-name Specifies the collection log file name
continous-updating.log-type Specifies the log type
continous-updating.admin.auto-commit-timeout Specifies the timeout before an automatic commit of content changes.
continous-updating.consolidation-mgmt.consolidation-max-threads Specifies the maximum number of concurrent consolidations for a collection.
continous-updating.restful.allow-remote Specifies whether restful connections from other systems are allowed.
continous-updating.restful.refresh-interval Specifies time to live of cached realm entries.
crawler.accept_cookies Cookie policy. Default is false i.e. do not accept cookies. Requires HTTPClient if true.
crawler.accept_files Only crawl files with these extensions. Not normally used - default is to accept all valid content.
crawler.cache.DNSCache_max_size Maximum size of internal DNS cache. Upon reaching this size the cache will drop old elements.
crawler.cache.LRUCache_max_size Maximum size of LRUCache. Upon reaching this size the cache will drop old elements.
crawler.cache.URLCache_max_size Maximum size of URLCache. May be ignored by some cache implementations.
crawler.check_alias_exists Check if aliased URLs exist - if not, revert to the original URL
crawler.checkpoint_to Location of crawler checkpoint files.
crawler.classes.AreaCache Java class used for storing information about "areas" (directories, generators)
crawler.classes.Crawler Java class used by crawler - defines top level behaviour, which protocols are supported etc.
crawler.classes.Frontier Java class used for the frontier (a list of URLs not yet visited)
crawler.classes.Policy Java class used for enforcing the include/exclude policy for URLs
crawler.classes.RevisitPolicy Java class used for enforcing the revisit policy for URLs
crawler.classes.Scanner Java class used when parsing HTML pages for new links
crawler.classes.ServerCache Java class used to cache server signature information
crawler.classes.ServerInfoCache Java class used for storing information about servers
crawler.classes.SignatureCache Signature cache class to use (store page checksums)
crawler.classes.StoreCache Java class used for storing information about stored content
crawler.classes.URLCache Java class used for the URL cache, the set of all URLs seen so far during a crawl
crawler.classes.URLStore Java class used to store pages on disk e.g. create a mirror of files crawled
crawler.eliminate_duplicates Whether to eliminate duplicate documents while crawling (default is true)
crawler.extract_links_from_javascript Whether to extract links from Javascript while crawling (default is true)
crawler.follow_links_in_comments Whether to follow links in HTML comments while crawling (default is false)
crawler.frontier_use_ip_mapping Whether to map hosts to frontiers based on IP address. (default is false)
crawler.form_interaction_file Path to optional file which configures interaction with form-based authentication
crawler.header_logging Option to control whether HTTP headers are written out to a separate log file (default is false)
crawler.incremental_logging Option to control whether a list of new and changed URLs should be written to a log file during incremental crawling
crawler.inline_filtering_enabled Option to control whether text extraction from binary files is done "inline" during a web crawl
crawler.link_extraction_group The group in the crawler.link_extraction_regular_expression which should be extracted as the link/URL.
crawler.link_extraction_regular_expression The expression used to extract links from each document. This must be a Perl compatible regular expression.
crawler.logfile The crawler's log path and filename.
crawler.lowercase_iis_urls Whether to lowercase all URLs from IIS web servers (default is false)
crawler.max_dir_depth A URL with more than this many sub directories will be ignored (too deep, probably a crawler trap)
crawler.max_download_size Maximum size of files crawler will download (in MB)
crawler.max_files_per_area Maximum files per "area" e.g. number of files in one directory or generated by one dynamic generator e.g. index.asp?doc=123. This parameter used to be called crawler.max_dir_size
crawler.max_files_per_server Maximum files per server (default is unlimited)
crawler.max_files_stored Maximum number of files to download (default, and less than 1, is unlimited)
crawler.max_individual_frontier_size Maximum size of an individual frontier (unlimited if not defined)
crawler.max_link_distance How far to crawl from the start_url (default is unlimited). e.g. if crawler.max_link_distance = 1, only crawl the links on start_url. NB: Turning this on drops crawler to single-threaded operation.
crawler.max_parse_size Crawler will not parse documents beyond this many megabytes in size
crawler.max_timeout_retries Maximum number of times to retry after a network timeout (default is 0)
crawler.max_url_length A URL with more characters than this will be ignored (too long, probably a crawler trap)
crawler.monitor_checkpoint_interval Time interval at which to checkpoint (seconds)
crawler.monitor_delay_type Type of delay to use during crawl (dynamic or fixed)
crawler.monitor_halt Checked during a crawl - if set to "true" then crawler will cleanly shutdown
crawler.monitor_preferred_servers_list Optional list of servers to prefer during crawl
crawler.monitor_time_interval Time interval at which to output monitoring information (seconds)
crawler.monitor_url_reject_list Optional parameter listing URLs to reject during a running crawl
crawler.non_html Which non-html file formats to crawl (e.g. pdf, doc, xls etc.)
crawler.num_crawlers Number of crawler threads which simultaneously crawl different hosts
crawler.overall_crawl_timeout Maximum crawl time after which the update continues with indexing and changeover. The units of this parameter depend on the value of the crawler.overall_crawl_units parameter.
crawler.overall_crawl_units The units for the crawler.overall_crawl_timeout parameter. A value of hr indicates hours and min indicates minutes.
crawler.packages.httplib Java library for HTTP/HTTPS support.
crawler.parser.mimeTypes Extract links from these comma-separated or regexp: content-types.
crawler.protocols Crawl URLs via these protocols (comma separated list)
crawler.reject_files Do not crawl files with these extensions
crawler.remove_parameters Optional list of parameters to remove from URLs
crawler.request_delay Milliseconds between HTTP requests (for a particular thread)
crawler.request_header Optional additional header to be inserted in HTTP(S) requests made by the webcrawler.
crawler.request_header_url_prefix Optional URL prefix to be applied when processing the crawler.request_header parameter
crawler.request_timeout Timeout for HTTP page GETs (milliseconds)
crawler.revisit.edit_distance_threshold Threshold for edit distance between two versions of a page when deciding whether it has changed or not
crawler.revisit.num_times_revisit_skipped_threshold Threshold for number of times a page revisit has been skipped when deciding whether to revisit it.
crawler.revisit.num_times_unchanged_threshold Threshold for number of times a page has been unchanged when deciding whether to revisit it.
crawler.robotAgent The robot agent name the crawler matches against entries in a robots.txt file. Matching is case-insensitive over the length of the name.
crawler.secondary_store_root Location of secondary (previous) store - used in incremental crawling
crawler.server_alias_file Path to optional file containing server alias mappings e.g. www.daff.gov.au=www.affa.gov.au
crawler.sslClientStore Path to a SSL Client certificate store (absolute or relative). Empty/missing means no client certificate store. Certificate stores can be managed by Java's keytool
crawler.sslClientStorePassword Password for the SSL Client certificate store. Empty/missing means no password, and may prevent client certificate validation. Certificate stores can be managed by Java's keytool
crawler.sslTrustEveryone Trust ALL Root Certificates and ignore server hostname verification if true. This bypasses all certificate and server validation by the HTTPS library, so every server and certificate is trusted. It can be used to overcome problems with unresolvable external certificate chains and poor certificates for virtual hosts, but will allow server spoofing.
crawler.sslTrustStore Path to a SSL Trusted Root store (absolute or relative). Empty/missing means use those provided with Java. Certificate stores can be managed by Java's keytool
crawler.start_urls_file Path to a file that contains a list of URLs (one per line) that will be used as the starting point for a crawl. Note that this setting overrides the start_url that the crawler is passed on startup (usually stored in the crawler.start_url configuration option).
crawler.store_all_types If true, override accept/reject rules and crawl and store all file types encountered
crawler.store_headers Write HTTP header information at top of HTML files if true. Header information is used by indexer.
crawler.user_agent The browser ID that the crawler uses when making HTTP requests. We use a browser signature so that web servers will return framed content etc. to us.
crawler.use_sitemap_xml Optional parameter specifying whether to process sitemap.xml files during a web crawl.
crawler.verbosity Verbosity level (0-6) of crawler logs. Higher number results in more messages.
crawler The name of the crawler binary.
crawler_binaries Location of the crawler files.
custom.use_xsl Use an XSLT to display cached documents.
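
As a hedged illustration of how the crawler options above are typically combined, the following sketch shows a handful of web crawl settings with invented example values (the numbers are placeholders, not recommendations):

    crawler.num_crawlers=10
    crawler.request_delay=250
    crawler.max_download_size=10
    crawler.max_files_per_server=10000
    crawler.overall_crawl_timeout=24
    crawler.overall_crawl_units=hr

Here crawler.request_delay is in milliseconds and crawler.max_download_size is in megabytes, as described in the table above.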

D

Option Description
data_report A switch that can be used to enable or disable the data report stage during a collection update.
data_root The directory under which the documents to index reside
db.bundle_storage_enabled Allows storage of data extracted from a database in a compressed form.
db.custom_action_java_class Allows a custom java class to modify data extracted from a database before indexing.
db.full_sql_query The SQL query to perform on a database to fetch all records for searching.
db.incremental_sql_query The SQL query to perform to fetch new or changed records from a database.
db.incremental_update_type Allows the selection of different modes for keeping database collections up to date.
db.jdbc_class The name of the Java JDBC driver to connect to a database.
db.jdbc_url The URL specifying database connection parameters such as the server and database name.
db.password The password for connecting to the database.
db.primary_id_column The primary id (unique identifier) column for each database record.
db.xml_root_element The top level element for records extracted from the database.
db.single_item_sql An SQL command for extracting an individual record from the database
db.update_table_name The name of a table in the database which provides a record of all additions, updates and deletes.
db.username The username for connecting to the database.
document_level_security.action Sets the type of security check that should be performed on this collection when using a custom document level security filter.
document_level_security.custom_command Sets the script that performs the security check when using a custom document level security filter. Mainly included for extensibility of the document level security model.
document_level_security.max2check Sets the maximum number of documents to perform document level security checking on. Unchecked documents are never returned.
document_level_security.mode Sets the document level security mode that will be used when searching through this collection (Windows Only).
duplicate_detection Post-crawl duplicate detection setting. The default is off, delete means delete duplicates, check means check only (don't delete). Not normally used any more, as crawler performs its own duplicate detection.
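
As an illustration of how the db.* options above fit together, the following sketch shows a database collection configured against a hypothetical PostgreSQL database. The driver class, URL, credentials, SQL and column names are example values only:

    db.jdbc_class=org.postgresql.Driver
    db.jdbc_url=jdbc:postgresql://dbhost.example.com/products
    db.username=funnelback_ro
    db.password=secret
    db.full_sql_query=SELECT * FROM products
    db.primary_id_column=product_id
    db.xml_root_element=product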

E

Option Description
enable_clean_html Public results pages will have whitespace and comments removed if this option is enabled
exclude_patterns The crawler will ignore a URL if it matches any of these exclude patterns
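
Exclude patterns are normally used together with include_patterns and start_url (described in later sections) to define the scope of a web crawl. A hedged sketch with placeholder values:

    start_url=http://www.example.com/
    include_patterns=example.com
    exclude_patterns=/calendar/,/admin/

The exact pattern and list-separator syntax accepted by these options is described in the documentation for each option; the values above are illustrative only.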

F

Option Description
faceted_navigation.white_list Include only a list of specific values for a facet (Modern UI only).
faceted_navigation.black_list Exclude specific values for a facet (Modern UI only).
filecopy.cache Enable/disable using the live view as a cache directory where pre-filtered text content can be copied from.
filecopy.domain Filecopy sources that require a username to access files will use this setting as a domain for the user.
filecopy.discard_filtering_errors Whether or not to index the file names of files that failed to filter.
filecopy.exclude_pattern Filecopy collections will exclude files which match this regular expression.
filecopy.filetypes The list of filetypes (i.e. file extensions) that will be included by a filecopy collection.
filecopy.include_pattern If specified, filecopy collections will only include files which match this regular expression.
filecopy.inline_filtering Enable/disable inline filtering of files during gathering.
filecopy.max_files_stored If set, this limits the number of documents a filecopy collection will gather when updating.
filecopy.novell.mount_point The volume path on the NetWare server on which the NetWare fileshare is mounted.
filecopy.novell.server The name of the NetWare server on which the NetWare fileshare is mounted.
filecopy.passwd Filecopy sources that require a password to access files will use this setting as a password.
filecopy.request_delay Optional parameter to specify how long to delay between copy requests in milliseconds.
filecopy.source This is the file system path or URL that describes the source of data files.
filecopy.store_class The local data cache storage class
filecopy.security_model Sets the plugin to use to collect security information on files (Early binding Document Level Security).
filecopy.source_list If specified, this option is set to a file which contains a list of other files to copy, rather than using the filecopy.source. NOTE: Specifying this option will cause the filecopy.source to be ignored.
filecopy.user Filecopy sources that require a username to access files will use this setting as a username.
filter Whether to filter (convert) files such as doc,rtf,pdf,ps to plain text (for indexing). This operation will delete the original files. Do not use with collection_type=local
filter.classes Optionally specify which java classes should be used for filtering documents.
filter.discard_filtering_errors Controls whether files that failed to filter should be deleted or not.
filter.isolated.classes Specify which java classes should be used in isolated filter mode.
filter.num_worker_threads Specify number of parallel threads to use in document filtering (text extraction)
form_security.allow_exec_tags Used for search form security. Enables or disables the use of "execute" (eg: EvalPerl) tags in search forms.
form_security.allow_read_tags Used for search form security. Enables or disables the use of "read" (eg: IncludeFile) tags in search forms.
form_security.allow_query_transforms Used for search form security. Enables or disables the use of query transformations.
form_security.allow_result_transforms Used for search form security. Enables or disables the use of result transforms.
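
For the filecopy.* options above, a minimal fileshare configuration might look like the following sketch. The path, extensions, account details and list separators shown are placeholders rather than authoritative syntax:

    filecopy.source=\\fileserver\share\documents
    filecopy.filetypes=doc,pdf,xls,txt
    filecopy.exclude_pattern=.*\.tmp$
    filecopy.user=svc_funnelback
    filecopy.domain=EXAMPLE
    filecopy.passwd=secret
    filecopy.request_delay=100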

G

Option Description
gather The mechanism used to gather documents for indexing. "crawl" indicates Web retrieval whereas "filecopy" indicates a local or remote file copy.

H

Option Description
http_passwd Password used for accessing password protected content during a crawl
http_proxy The hostname (e.g. proxy.company.com) of the HTTP proxy to use during crawling. This hostname should not be prefixed with 'http://'.
http_proxy_passwd The proxy password to be used during crawling
http_proxy_port Port of HTTP proxy used during crawling
http_proxy_user The proxy user name to be used during crawling
http_source_host IP address or hostname used by crawler, on a machine with more than one available
http_user Username used for accessing password-protected content during a crawl

I

Option Description
include_patterns URLs matching this are included in crawl (unless exclude_patterns) e.g. usyd.edu.au, anu.edu.au, www.anutech.com.au/ELC/
index A switch that can be used to enable or disable the indexing stage during a collection update.
indexer The name of the indexer program to be used for this collection.
indexer_options Indexer command line options, separated by whitespace; individual options therefore cannot contain embedded whitespace characters.
indexing.use_manifest Flag to turn on use of a manifest file for indexing

J

Option Description
java_libraries The path where the Java libraries are located.
java_options Command line options to pass to the Java virtual machine when the crawler is launched.

M

Option Description
mail.on_failure_only Whether to always send collection update emails or only when an update fails.
max_heap_size Heap size used by Funnelback Java processes (in megabytes)

N

Option Description
noindex_expression Optional regular expression to specify content that should not be indexed

P

Option Description
post_gather_command Optional command to execute after gathering phase finishes.
post_index_command Command to execute after indexing finishes.
post_update_command Command to execute once an update has finished (update email will already have been sent).
pre_gather_command Command to execute before gathering starts.
pre_index_command Command to execute before indexing commences.

Q

Option Description
query_completion Enable or disable query completion.
query_completion.alpha Adjust balance between relevancy and length for query completion suggestions.
query_completion.delay Delay to wait (ms) before triggering query completion.
query_completion.length Minimum length of query term before triggering query completion.
query_completion.program Program to use for query completion.
query_completion.show Maximum number of query completion suggestions to show.
query_completion.sort Sets the query completion suggestions sort order.
query_completion.source Sets the source of the data for query completion suggestions
query_completion.source.extra Sets extra sources of data for query completion suggestions
query_processor The name of the query processor executable to use.
query_processor_options Query processor command line options.

R

Option Description
related.api_enabled Enables/disables API access to Related Information (Classic UI) via related.cgi
result_transform A Perl command (or a set of semi-colon separated Perl commands) that is applied to each result in a query result set.

S

Option Description
schedule.incremental_crawl_ratio The number of scheduled incremental crawls that are performed between each full crawl (e.g. a value of '10' results in an update schedule consisting of every ten incremental crawls being followed by a full crawl).
search_user Name of user who runs collection updates
secure_dirs A security breach warning system.
security.earlybinding.reader-permissions A comma separated list of permissions which permit a user to read documents
security.earlybinding.user-to-key-mapper Selected security plugin for translating usernames into lists of document keys
security.earlybinding.user-to-key-mapper.cache-seconds Number of seconds for which a user's list of keys may be cached
security.earlybinding.locks-keys-matcher.name Name of security plugin library that matches user keys with document locks at query time
security.earlybinding.locks-keys-matcher.ldlibrarypath Full path to security plugin library
service_name Name of service as displayed to users e.g. Intellectual Property Portal
spelling.suggestion_lexicon_weight Specify weighting to be given to suggestions from the lexicon (list of words from indexed documents) relative to other sources (e.g. annotations)
spelling.suggestion_sources Specify sources of information for generating spelling suggestions.
spelling.suggestion_threshold Threshold which controls how suggestions are made.
spelling_enabled Whether to enable spell checking in the search interface (true or false).
start_url Crawler seed URL. Crawler follows links in this page, and then the links of those pages and so on.

T

Option Description
tagging.enable_tagging Turn result tagging on or off
tagging.use_tag_data_in_index Control whether to use tag data during indexing
trim.cleanup_webserverworkpath TRIM web server work path / temporary directory cleanup interval
trim.database The 2-digit identifier of the TRIM database to index
trim.default_live_links Whether search results links should point to a copy of TRIM document, or launch TRIM client.
trim.extracted_file_types A list of file extensions that will be extracted from TRIM databases.
trim.gather_start_date The date from which newly registered or modified documents will be gathered.
trim.gather_end_date The date at which to stop the gather process.
trim.license_number TRIM license number as found in the TRIM client system information panel.
trim.initial_gather Selects the way records are selected from TRIM, depending on whether it is the very first gather or not.
trim.limit The maximum number of records to extract.
trim.slice_size The number of records to extract before closing and re-opening the TRIM database.
trim.slice_sleep How many milliseconds to sleep between slices.
trim.sub_folders How many sub-folders to create in the 'data' directory.
trim.verbose Define how verbose the TRIM crawl is.
trim.workgroup_port The port on the TRIM workgroup server to connect to when gathering content from TRIM.
trim.workgroup_server The name of the TRIM workgroup server to connect to when gathering content from TRIM.

U

Option Description
ui.null_query_enabled Control whether "null query" support is enabled. If it is and no query is specified in a search request then the system will try to display a list of "important" documents.
ui.modern.click_link References the URL used to log result clicks (Modern UI only)
ui.modern.extra_searches Configure extra searches to be aggregated with the main result data, when using the Modern UI.
ui.modern.form.content_type Specify a custom content type header for a form file (Modern UI only).
ui.modern.form.headers.count Specify the count of custom headers for a form file (Modern UI only).
ui.modern.form.headers Specify custom headers for a form file (Modern UI only).
ui.modern.freemarker.display_errors Whether or not to display form file error messages in the browser (Modern UI only).
ui.modern.freemarker.error_format Format of the form file error messages displayed in the browser (Modern UI only).
ui.modern.search_link Base URL used by search.html to link to itself e.g. the next page of search results. Allows search.html (or a pass-through script) to have a name other than search.html.
ui.secure_users Can be used to restrict search interface access to a list of users.
ui_cache_disabled Prevent cache.cgi from accessing any cached documents.
ui_cache_link Base URL used by PADRE to link to the cached copy of a search result. Can be an absolute URL.
ui_click_link References the URL used to log result clicks (Classic UI only)
ui_cookie_domain Specifies the domain to use if saving cookies.
ui_hit_first HTML anchor text to the first match in a document (cache.cgi).
ui_hit_last HTML anchor text to the last match in a document (cache.cgi).
ui_hit_next HTML anchor text to the next match in a document (cache.cgi).
ui_hit_prev HTML anchor text to the previous match in a document (cache.cgi).
ui_search_link Base URL used by search.cgi to link to itself e.g. the next page of search results. Allows search.cgi (or a pass-through script) to have a name other than search.cgi. Can be an absolute URL.
ui_type Which User interface is used on the collection.
update.restrict_to_host Specify that collection updates should be restricted to only run on a specific host.
userid_to_log Controls how logging of IP addresses is performed.

V

Option Description
vital_servers Changeover only happens if vital_servers exist in the new crawl.
