Data source and search package configuration options (collection.cfg)

Data source and search package configuration options are used to configure various features within Funnelback.

The configuration options can be edited using the data source or search package configuration editor.

The special variables $SEARCH_HOME and $COLLECTION_NAME contained within a key value are automatically expanded to the Funnelback installation path and the ID of the current data source or search package.
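
For example, a hypothetical value such as the following uses both variables, which would be expanded when the option is read (the path shown is illustrative):

  crawler.start_urls_file=$SEARCH_HOME/conf/$COLLECTION_NAME/collection.cfg.start.urls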

Configuration options

The following table outlines the available data source and search package configuration options.

accessibility-auditor.check

Turns modern accessibility checks on or off.

accessibility-auditor.min-time-between-recording-history-in-seconds

Specifies how much time must have passed since the last time Accessibility Auditor data was recorded before new data will be recorded.

admin.undeletable

(deprecated) This option controls whether a search package or data source can be deleted from the administration dashboard.

admin_email

Specifies an email address that will be notified after each collection update.

analytics.data_miner.range_in_days

Length of time range (in days) the analytics data miner will go back from the current date when mining query and click log records.

analytics.email.addresses

List of email addresses to which reports should be sent.

analytics.email.frequency

How often to send query report summaries.

analytics.email.from

Reports email sender address (From field).

analytics.email.outliers_confidence_threshold

Minimum trend alert confidence value for which emails should be sent.

analytics.email.outliers_enabled

Enables or disables trend alert notifications.

analytics.email.send_hour_spikes

Enable or disable hourly trend alert notifications.
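
For example, a search package could send weekly report summaries and trend alerts to a nominated address with settings along these lines (addresses and value formats are illustrative only; check each option's documentation for accepted values):

  analytics.email.addresses=search-reports@example.com
  analytics.email.from=funnelback@example.com
  analytics.email.frequency=weekly
  analytics.email.outliers_enabled=true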

analytics.max_heap_size

Set Java heap size used for analytics.

analytics.outlier.day.minimum_average_count

Control the minimum number of occurrences of a query required before a day pattern can be detected.

analytics.outlier.day.threshold

Control the day pattern detection threshold.

analytics.outlier.exclude_collection

Disable query spike detection (trend alerts) for a collection.

analytics.outlier.exclude_profiles

Disable query spike detection (trend alerts) for a profile.

analytics.outlier.hour.minimum_average_count

Control the minimum number of occurrences of a query required before an hour pattern can be detected.

analytics.outlier.hour.threshold

Control the hour pattern detection threshold.

analytics.reports.checkpoint_rate

Controls the rate at which the query reports system checkpoints data to disk.

analytics.reports.disable_incremental_reporting

Disable incremental reports database updates. If set, all existing query and click logs will be processed for each reports update.

analytics.reports.max_facts_per_dimension_combination

Specifies the amount of data that is stored by query reports.

analytics.scheduled_database_update

Control whether reports for the search package are updated on a scheduled basis.

annie.index_opts

Specify options for the "annie-a" annotation indexing program.

build_autoc_options

Specifies additional configuration options that can be supplied when building auto-completion.

changeover_percent

Specifies a minimum ratio of documents that must be gathered for an update to succeed.

click_data.num_archived_logs_to_use

The number of archived click logs to use from each archive directory.

click_data.use_click_data_in_index

A boolean value indicating if click information should be included in the index.

click_data.week_limit

Optional restriction of click data to a set number of weeks into the past.

collection

The internal name of a data source or search package.

collection-update.step.[stepTechnicalName].run

Determines if an update step should be run or not.

collection_type

Specifies the type of the data source.

contextual-navigation.cannot_end_with

Exclude contextual navigation suggestions which end with specified words or phrases.

contextual-navigation.case_sensitive

List of words for which case should be preserved.

contextual-navigation.categorise_clusters

Group contextual navigation suggestions into types and topics.

contextual-navigation.enabled

Enable or disable contextual navigation.

contextual-navigation.kill_list

Exclude contextual navigation suggestions which contain any words or phrases in this list.

contextual-navigation.max_phrase_length

Limit the maximum length of suggestions to the specified number of words.

contextual-navigation.max_phrases

Limit the number of contextual navigation phrases that should be processed.

contextual-navigation.max_results_to_examine

Specify the maximum number of results to examine when generating suggestions.

contextual-navigation.site.granularity

Type of aggregation to be used for contextual navigation site suggestions.

contextual-navigation.site.max_clusters

Defines the maximum number of site suggestions to return in contextual navigation.

contextual-navigation.summary_fields

Metadata classes that are analysed for contextual navigation.

contextual-navigation.timeout_seconds

Timeout for contextual navigation processing.

contextual-navigation.topic.max_clusters

Defines the maximum number of topic suggestions returned by contextual navigation.

contextual-navigation.type.max_clusters

Defines the maximum number of type suggestions returned by contextual navigation.
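
As a sketch, a data source that enables contextual navigation and caps its processing might use settings like these (all values are illustrative, not defaults):

  contextual-navigation.enabled=true
  contextual-navigation.max_phrases=5000
  contextual-navigation.max_phrase_length=4
  contextual-navigation.max_results_to_examine=200
  contextual-navigation.timeout_seconds=10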

crawler

Specifies the name of the crawler binary.

crawler.accept_cookies

Enables or disables the crawler’s use of cookies.

crawler.accept_files

Restricts the file extensions the web crawler should crawl.

crawler.additional_link_extraction_pattern

Specifies an additional regular expression used to extract links based on URL features.

crawler.allow_concurrent_in_crawl_form_interaction

Enable/disable concurrent processing of in-crawl form interaction.

crawler.allowed_redirect_pattern

Specify a regex to allow crawler redirects that would otherwise be disallowed by the current include/exclude patterns.

crawler.cache.DNSCache_max_size

Maximum size of internal DNS cache. Upon reaching this size the cache will drop old elements.

crawler.cache.LRUCache_max_size

Maximum size of LRUCache. Upon reaching this size the cache will drop old elements.

crawler.cache.URLCache_max_size

Specifies the maximum size of URLCache.

crawler.check_alias_exists

Check if aliased URLs exist; if not, revert to the original URL.

crawler.classes.Frontier

Specifies the Java class used for the frontier (a list of URLs not yet visited).

crawler.classes.Policy

Specifies the Java class used for enforcing the include/exclude policy for URLs.

crawler.classes.RevisitPolicy

Specifies the Java class used for enforcing the revisit policy for URLs.

crawler.classes.URLStore

Specifies the Java class used to store content on disk (e.g. to create a mirror of crawled files).

crawler.distributor_timeout

Specifies the maximum time the crawler distributor is allowed to run (external crawler only). When exceeded, the crawl will stop and the update will throw an exception.

crawler.distributor_units

Specifies the units for the distributor timeout (external crawler only).

crawler.eliminate_duplicates

Whether to eliminate duplicate documents while crawling.

crawler.extract_links_from_javascript

Whether to extract links from JavaScript while crawling.

crawler.follow_links_in_comments

Whether to follow links in HTML comments while crawling.

crawler.form_interaction.in_crawl.[groupId].cleartext.[urlParameterKey]

Specifies a clear text form parameter for in-crawl authentication.

crawler.form_interaction.in_crawl.[groupId].encrypted.[urlParameterKey]

Specifies an encrypted form parameter for in-crawl authentication.

crawler.form_interaction.in_crawl.[groupId].url_pattern

Specifies a URL of an HTML web form action for an in-crawl form interaction rule.

crawler.form_interaction.pre_crawl.[groupId].cleartext.[urlParameterKey]

Specifies a clear text form parameter for pre-crawl authentication.

crawler.form_interaction.pre_crawl.[groupId].encrypted.[urlParameterKey]

Specifies an encrypted form parameter for pre-crawl authentication.

crawler.form_interaction.pre_crawl.[groupId].form_number

Specifies which form element at the specified URL should be processed.

crawler.form_interaction.pre_crawl.[groupId].url

Specifies a URL of the page containing the HTML web form for a pre-crawl form interaction rule.
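
As a sketch, a pre-crawl form interaction rule groups its settings under a shared [groupId]. In the hypothetical example below the group is called "login" and the parameter names match the form's input fields; the encrypted value is shown as a placeholder rather than a literal value:

  crawler.form_interaction.pre_crawl.login.url=https://www.example.com/login
  crawler.form_interaction.pre_crawl.login.form_number=1
  crawler.form_interaction.pre_crawl.login.cleartext.username=crawl-user
  crawler.form_interaction.pre_crawl.login.encrypted.password=<encrypted value>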

crawler.frontier_hosts

List of hosts running crawlers when performing a distributed web crawl.

crawler.frontier_num_top_level_dirs

Specifies the number of top-level directories in which to store disk-based frontier files.

crawler.frontier_port

Port on which the DistributedFrontier will listen.

crawler.frontier_use_ip_mapping

Whether to map hosts to frontiers based on IP address.

crawler.header_logging

Option to control whether HTTP headers are written out to a separate log file.

crawler.ignore_canonical_links

Whether to ignore the canonical link(s) on a page.

crawler.ignore_nofollow

Configures the crawler to ignore robots meta tag "nofollow" and rel="nofollow" directives during a crawl.

crawler.ignore_robots_txt

Enables or disables the web crawler’s robots.txt support.

crawler.incremental_logging

Option to control whether a list of new and changed URLs should be written to a log file during incremental crawling.

crawler.inline_filtering_enabled

Option to control whether text extraction from binary files is done "inline" during a web crawl.

crawler.link_extraction_group

The group in the 'crawler.link_extraction_regular_expression' option which should be extracted as the link/URL.

crawler.link_extraction_regular_expression

Specifies the regular expression used to extract links from each document.

crawler.lowercase_iis_urls

Whether to lowercase all URLs from IIS web servers.

crawler.max_dir_depth

Specifies the maximum number of subdirectories a URL may have before it will be ignored.

crawler.max_download_size

Specifies the maximum size of files the crawler will download (in MB).

crawler.max_files_per_area

Specifies a limit on the number of files from a single directory or dynamically generated URLs that will be crawled.

crawler.max_files_per_server

Specifies the maximum number of files that will be crawled per server.

crawler.max_files_stored

Specifies the maximum number of files to download.

crawler.max_individual_frontier_size

Specifies the maximum size of an individual frontier.

crawler.max_link_distance

Specifies the maximum distance a URL can be from a start URL for it to be downloaded.

crawler.max_parse_size

Sets the maximum size of documents parsed by the crawler.

crawler.max_timeout_retries

Maximum number of times to retry after a network timeout.

crawler.max_url_length

Specifies the maximum length a URL can be in order for it to be crawled.

crawler.max_url_repeating_elements

A URL with more than this many repeating elements (directories) will be ignored.
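
For example, a web data source might bound the size of a crawl with limits such as the following (values are illustrative only):

  crawler.max_download_size=10
  crawler.max_files_per_server=10000
  crawler.max_files_stored=500000
  crawler.max_url_length=256
  crawler.max_url_repeating_elements=5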

crawler.monitor_authentication_cookie_renewal_interval

Specifies the time interval at which to renew crawl authentication cookies.

crawler.monitor_checkpoint_interval

Time interval at which to checkpoint (seconds).

crawler.monitor_delay_type

Type of delay to use during crawl (dynamic or fixed).

crawler.monitor_preferred_servers_list

Specifies an optional list of servers to prefer during crawling.

crawler.monitor_time_interval

Specifies a time interval at which to output monitoring information (seconds).

crawler.monitor_url_reject_list

Optional parameter listing URLs to reject during a running crawl.

crawler.non_html

Which non-HTML file formats to crawl (e.g. pdf, doc, xls).

crawler.ntlm.domain

NTLM domain to be used for web crawler authentication.

crawler.ntlm.password

NTLM password to be used for web crawler authentication.

crawler.ntlm.username

NTLM username to be used for web crawler authentication.

crawler.num_crawlers

Number of crawler threads which simultaneously crawl different hosts.

crawler.overall_crawl_timeout

Specifies the maximum time the crawler is allowed to run. When exceeded, the crawl will stop and the update will continue.

crawler.overall_crawl_units

Specifies the units for the crawl timeout.

crawler.parser.mimeTypes

Extract links from a list of content-types.

crawler.predirects_enabled

Enable crawler predirects.

crawler.protocols

Crawl URLs via these protocols.

crawler.reject_files

Do not crawl files with these extensions.

crawler.remove_parameters

Optional list of parameters to remove from URLs.

crawler.request_delay

Milliseconds between HTTP requests per crawler thread.

crawler.request_header

Optional additional header to be inserted in HTTP(S) requests made by the webcrawler.

crawler.request_header_url_prefix

Optional URL prefix to be applied when processing the 'crawler.request_header' parameter.

crawler.request_timeout

Timeout for HTTP page GETs (milliseconds).

crawler.revisit.edit_distance_threshold

Threshold for edit distance between two versions of a page when deciding whether it has changed or not.

crawler.revisit.num_times_revisit_skipped_threshold

Threshold for number of times a page revisit has been skipped when deciding whether to revisit it.

crawler.revisit.num_times_unchanged_threshold

Threshold for the number of times a page has been unchanged when deciding whether to revisit it.

crawler.robotAgent

Specifies the robot agent name to look for in robots.txt files; matching is case-insensitive over the length of the name.

crawler.send-http-basic-credentials-without-challenge

Specifies whether HTTP basic credentials should be sent without the web server sending a challenge.

crawler.sslClientStore

Specifies a path to an SSL client certificate store.

crawler.sslClientStorePassword

Password for the SSL client certificate store.

crawler.sslTrustEveryone

Trust all root certificates and ignore server hostname verification.

crawler.sslTrustStore

Specifies the path to an SSL trusted root store.

crawler.start_urls_file

Path to a file that contains a list of URLs (one per line) that will be used as the starting point for a crawl.

crawler.store_all_types

If true, overrides the accept/reject rules so that all encountered file types are crawled and stored.

crawler.store_empty_content_urls

Specifies if URLs that contain no content after filtering should be stored.

crawler.store_headers

Whether HTTP header information should be written at the top of HTML files.

crawler.type

Specifies the crawler type to use.

crawler.use_additional_link_extraction

Whether to extract links based on the URL features while crawling.

crawler.use_sitemap_xml

Specifies whether to process sitemap.xml files during a web crawl.

crawler.user_agent

The user agent string that the crawler presents when making HTTP requests.

crawler.verbosity

Verbosity level (0-7) of crawler logs. Higher number results in more messages.

crawler_binaries

Specifies the location of the crawler files.

custom.base_template

The template used when a custom data source was created.

data_report

A switch that can be used to enable or disable the data report stage during a data source update.

db.bundle_storage_enabled

Allows storage of data extracted from a database in a compressed form.

db.custom_action_java_class

(DEPRECATED) Allows a custom Java class to modify data extracted from a database before indexing.

db.full_sql_query

The SQL query to perform on a database to fetch all records for searching.

db.incremental_sql_query

The SQL query to perform to fetch new or changed records from a database.

db.incremental_update_type

Allows the selection of different modes for keeping database collections up to date.

db.jdbc_class

The name of the Java JDBC driver to connect to a database.

db.jdbc_url

The URL specifying database connection parameters such as the server and database name.

db.password

The password for connecting to the database.

db.primary_id_column

The primary id (unique identifier) column for each database record.

db.single_item_sql

An SQL command for extracting an individual record from the database.

db.update_table_name

The name of a table in the database which provides a record of all additions, updates and deletes.

db.use_column_labels

Flag to control whether column labels are used in JDBC calls in the database gatherer.

db.username

The username for connecting to the database.

db.xml_root_element

The top level element for records extracted from the database.
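
Taken together, the db.* options describe how a database data source connects to and extracts records from a database. A minimal hypothetical configuration (driver, URL, credentials and query are examples only) might look like:

  db.jdbc_class=org.postgresql.Driver
  db.jdbc_url=jdbc:postgresql://dbserver.example.com:5432/products
  db.username=funnelback_read
  db.password=examplePassword
  db.full_sql_query=SELECT * FROM products
  db.primary_id_column=product_id
  db.xml_root_element=product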

directory.context_factory

Sets the java class to use for creating directory connections.

directory.domain

Sets the domain to use for authentication in a directory data source.

directory.exclude_rules

Sets the rules for excluding content from a directory data source.

directory.page_size

Sets the number of documents to fetch from the directory in each request.

directory.password

Sets the password to use for authentication in a directory data source.

directory.provider_url

Sets the URL for accessing the directory in a directory data source.

directory.search_base

Sets the base from which content will be gathered in a directory data source.

directory.search_filter

Sets the filter for selecting content to gather in a directory data source.

directory.username

Sets the username to use for authentication in a directory data source.
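
For example, a directory data source gathering people records from a hypothetical LDAP server might be configured as follows (all values are illustrative):

  directory.provider_url=ldap://ldap.example.com:389
  directory.search_base=ou=people,dc=example,dc=com
  directory.search_filter=(objectClass=person)
  directory.username=ldap-reader
  directory.password=examplePassword
  directory.page_size=500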

exclude_patterns

The crawler will ignore a URL if it matches any of these exclude patterns.

facebook.access-token

Specifies an optional Facebook access token.

facebook.app-id

Specifies the Facebook application ID.

facebook.app-secret

Specifies the Facebook application secret.

facebook.debug

Enable debug mode to preview Facebook fetched records.

facebook.event-fields

Specify a list of Facebook event fields as described in the Facebook event API documentation.

facebook.page-fields

Specify a list of Facebook page fields as described in the Facebook page API documentation.

facebook.page-ids

Specifies a list of IDs of the Facebook pages/accounts to gather.

facebook.post-fields

Specify a list of Facebook post fields as described in the Facebook post API documentation.

faceted_navigation.black_list

Exclude specific values for facets.

faceted_navigation.black_list.[facet]

Exclude specific values for a specific facet.

faceted_navigation.white_list

Include only a list of specific values for facets.

faceted_navigation.white_list.[facet]

Include only a list of specific values for a specific facet.

filecopy.cache

Enable/disable using the live view as a cache directory where pre-filtered text content can be copied from.

filecopy.discard_filtering_errors

Whether to index the file names of files that failed to be filtered.

filecopy.domain

Filecopy sources that require a username to access files will use this setting as a domain for the user.

filecopy.exclude_pattern

File system data sources will exclude files which match this regular expression.

filecopy.filetypes

The list of filetypes (i.e. file extensions) that will be included by a file system data source.

filecopy.include_pattern

If specified, file system data sources will only include files which match this regular expression.

filecopy.max_files_stored

If set, this limits the number of documents a file system data source will gather when updating.

filecopy.num_fetchers

Number of fetcher threads for interacting with the source file system when running a file system data source update.

filecopy.num_workers

Number of worker threads for filtering and storing files in a file system data source.

filecopy.passwd

File system data sources that require a password to access files will use this setting as a password.

filecopy.request_delay

Specifies how long to delay between copy requests in milliseconds.

filecopy.security_model

Sets the plugin to use to collect security information on files.

filecopy.source

This is the file system path or URL that describes the source of data files.

filecopy.source_list

If specified, this option points to a file containing a list of files to copy, used instead of filecopy.source.

filecopy.store_class

Specifies which storage class to be used by a file system data source (e.g. WARC, Mirror).

filecopy.user

Filecopy sources that require a username to access files will use this setting as a username.

filecopy.walker_class

Main class used by the file system data source to walk a file tree.
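
A hypothetical file system data source that copies office documents from a local share using a service account could combine these options like so (paths, patterns and credentials are examples only):

  filecopy.source=/data/shared-documents
  filecopy.filetypes=doc,docx,pdf,xls,xlsx
  filecopy.exclude_pattern=.*/archive/.*
  filecopy.domain=EXAMPLE
  filecopy.user=svc-funnelback
  filecopy.passwd=examplePassword
  filecopy.num_fetchers=2
  filecopy.num_workers=4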

filter

Specifies the name of the filter binary (external crawler only).

filter.classes

Specifies which java classes should be used for filtering documents.

filter.csv-to-xml.custom-header

Defines a custom header to use for the CSV.

filter.csv-to-xml.format

Sets the CSV format to use when filtering a CSV document.

filter.csv-to-xml.has-header

Controls if the CSV file has a header or not.

filter.csv-to-xml.url-template

The template to use for the URLs of the documents created in the CSVToXML Filter.

filter.document_fixer.timeout_ms

Controls the maximum amount of time the document fixer may spend on a document.

filter.ignore.mimeTypes

Specifies a list of MIME types for the filter to ignore.

filter.jsoup.classes

Specifies which Java/Groovy classes will be used for filtering; these operate on Jsoup objects rather than byte streams.

filter.jsoup.undesirable_text-separate-lists

Defines whether undesirable text sources are reported as separate lists or merged into a single list.

filter.jsoup.undesirable_text-source.[key_name]

Define sources of undesirable text to detect and present within content auditor.

filter.jsoup.undesirable_text.[key_name]

Specify words or expressions of undesirable text to detect and present within content auditor.

filter.md_normaliser.keys

Defines the metadata normalizer rules that will be run by the MetadataNormaliser filter.

filter.noindex.[keyName]

Defines rules for hiding content from the Funnelback indexer when using the inject no-index filter.

filter.text-cleanup.ranges-to-replace

Specify Unicode blocks for replacement during filtering (to avoid 'corrupt' character display).

filter.tika.types

Specifies which file types to filter using the TikaFilterProvider.

filter_binaries

Specifies the location of the filter jar files.

flickr.api-key

Flickr API key.

flickr.api-secret

Flickr API secret.

flickr.auth-secret

Flickr authentication secret.

flickr.auth-token

Flickr authentication token.

flickr.debug

Enable debug mode to preview Flickr fetched records.

flickr.groups.private

List of Flickr group IDs to crawl within a "private" view.

flickr.groups.public

List of Flickr group IDs to crawl within a "public" view.

flickr.user-ids

Comma-delimited list of Flickr user account IDs to crawl.

ftp_passwd

Password to use when gathering content from an FTP server.

ftp_user

Username to use when gathering content from an FTP server.

gather

Specifies if gathering is enabled or not.

gather.max_heap_size

Set Java heap size used for gathering documents.

gather.slowdown.days

Days on which gathering should be slowed down.

gather.slowdown.hours.from

Start hour for slowdown period.

gather.slowdown.hours.to

End hour for slowdown period.

gather.slowdown.request_delay

Request delay to use during slowdown period.

gather.slowdown.threads

Number of threads to use during slowdown period.
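
For instance, gathering could be slowed down during business hours with settings along these lines (values are illustrative; see each option for accepted formats):

  gather.slowdown.hours.from=9
  gather.slowdown.hours.to=17
  gather.slowdown.request_delay=1000
  gather.slowdown.threads=1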

groovy.extra_class_path

Specify extra class paths to be used by Groovy when using $GROOVY_COMMAND.

group.customer_id

The customer group under which the collection will appear. Useful for multi-tenant systems.

group.project_id

The project group under which the collection will appear in the selection drop-down menu on the main administration page.

gscopes.options

Specifies options for the "padre-gs" gscopes program.

gscopes.other_gscope

Specifies the gscope to set when no other gscopes are set.

http_passwd

Password for accessing websites that use HTTP basic authentication.

http_proxy

This option sets the hostname of a proxy server to use for crawling.

http_proxy_passwd

This option sets the password (if required) used to authenticate with a proxy server used for crawling.

http_proxy_port

This option sets the port of a proxy server used for crawling.

http_proxy_user

This option sets the username (if required) used to authenticate with a proxy server used for crawling.
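
A crawl that must go through a corporate proxy might use a hypothetical configuration such as the following (hostname, port and credentials are examples only):

  http_proxy=proxy.example.com
  http_proxy_port=3128
  http_proxy_user=crawl-user
  http_proxy_passwd=examplePassword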

http_source_host

IP address or hostname used by the crawler, on a machine with more than one available.

http_user

Username for accessing websites that use HTTP basic authentication.

include_patterns

Specifies the patterns that URLs must match in order to be crawled.

index

A switch that can be used to enable or disable the indexing stage during a data source update.

index.target

Used to indicate an alternate data source.

indexer

The name of the indexer program to be used for this data source.

indexer_options

Options for configuring the Funnelback indexer, controlling what is indexed and how the index is built.

indexing.additional-metamap-source.[key_name]

Declares additional sources of metadata mappings to be used when indexing HTML documents.

indexing.collapse_fields

Defines the fields used for result collapsing.

indexing.use_manifest

Specifies if a manifest file should be used for indexing.

java_libraries

The path where the Java libraries are located when running most gatherers.

java_options

Command line options to pass to the Java virtual machine.

knowledge-graph.max_heap_size

Set Java heap size used for Knowledge Graph update process.

logging.hostname_in_filename

Control whether hostnames are used in log filenames.

logging.ignored_x_forwarded_for_ranges

Defines the IP ranges in the X-Forwarded-For header to be ignored by Funnelback when choosing the IP address to log.

mail.on_failure_only

Specifies whether to always send data source update emails or only when an update fails.

matrix_password

Password for logging into Matrix and the Squiz Suite Manager.

matrix_username

Username for logging into Matrix and the Squiz Suite Manager.

mcf.authority-url

URL for contacting a ManifoldCF authority.

mcf.domain

Default domain for users in the ManifoldCF authority.

meta.components

List of data sources included in a search package.

meta.components.[component].weight

Sets the relative importance of the different data sources in a search package.
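
As an illustration, a search package made up of two data sources, one weighted more heavily than the other, might be configured as follows (the component IDs are hypothetical, and the comma separator shown is an assumption):

  meta.components=website-web,intranet-web
  meta.components.website-web.weight=0.7
  meta.components.intranet-web.weight=0.3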

noindex_expression

(DEPRECATED) Optional regular expression to specify content that should not be indexed.

plugin.[pluginId].enabled

Specifies if a plugin is enabled or not.

plugin.[pluginId].encrypted.[secretKey]

Plugin configuration fields for storing secret information such as a password in an encrypted format.

plugin.[pluginId].version

Specifies the version of the plugin to use.

plugin.gather-with

Defines the plugin which will provide the custom gatherer to fetch documents with.
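
For example, enabling a hypothetical plugin with the ID "example-connector", pinning its version and using it as the custom gatherer might look like this (the plugin ID, version and secret key name are illustrative, and the encrypted value is a placeholder):

  plugin.example-connector.enabled=true
  plugin.example-connector.version=1.2.0
  plugin.example-connector.encrypted.api-password=<encrypted value>
  plugin.gather-with=example-connector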

post_archive_command

Command to run after archiving query and click logs.

post_collection_create_command

Command to run after a collection has been created.

post_delete-list_command

Command to run after deleting documents during an instant delete update.

post_delete-prefix_command

Command to run after deleting documents during an instant delete update.

post_gather_command

Command to run after the gathering phase during a collection update.

post_index_command

Command to run after the index phase during a collection update.

post_instant-gather_command

Command to run after the gather phase during an instant update.

post_instant-index_command

Command to run after the index phase during an instant update.

post_meta_dependencies_command

Command to run after a component collection updates its meta parents during a collection update.

post_push_index_command

Command to run after the Push Index phase during a collection update.

post_recommender_command

Command to run after the recommender phase during a collection update.

post_reporting_command

Command to run after query analytics runs.

post_swap_command

Command to run after live and offline views are swapped during a collection update.

post_update_command

Command to run after an update has successfully completed.

pre_archive_command

Command to run before archiving query and click logs.

pre_delete-list_command

Command to run before deleting documents during an instant delete update.

pre_delete-prefix_command

Command to run before deleting documents during an instant delete update.

pre_gather_command

Command to run before the gathering phase during a data source update.

pre_index_command

Command to run before the index phase during a data source update.

pre_instant-gather_command

Command to run before the gather phase during an instant update.

pre_instant-index_command

Command to run before the index phase during an instant update.

pre_meta_dependencies_command

Command to run before a data source updates parent search packages during a data source update.

pre_push_index_command

Command to run before the push index phase during a data source update.

pre_recommender_command

Command to run before the recommender phase during a data source update.

pre_report_command

Command to run before query or click logs are used during an update.

pre_reporting_command

Command to run before query analytics runs.

pre_swap_command

Command to run before live and offline views are swapped during a data source update.

progress_report_interval

Interval (in seconds) at which the gatherer will update the progress message in the administration dashboard.

push.auto-start

Specifies whether a push data source will automatically start with the web server.

push.commit-type

The type of commit that a push data source should use by default.

push.commit.index.parallel.max-index-thread-count

The maximum number of threads that can be used during a commit for indexing.

push.commit.index.parallel.min-documents-for-parallel-indexing

The minimum number of documents required in a single commit for parallel indexing to be used during that commit.

push.commit.index.parallel.min-documents-per-thread

The minimum number of documents each thread must have when using parallel indexing in a commit.

push.initial-mode

The initial mode in which push should start.

push.max-generations

Sets the maximum number of generations a push data source can create.

push.merge.index.parallel.max-index-thread-count

The maximum number of threads that can be used during a merge for indexing.

push.merge.index.parallel.min-documents-for-parallel-indexing

The minimum number of documents required in a single merge for parallel indexing to be used during that merge.

push.merge.index.parallel.min-documents-per-thread

The minimum number of documents each thread must have when using parallel indexing in a merge.

push.replication.compression-algorithm

The compression algorithm to use when transferring compressible files to push data source slaves.

push.replication.delay.error

Delay in checking the master node for changes after a check that returned an error.

push.replication.delay.fetched-data

Delay in checking the master node for changes after a successful data fetch.

push.replication.delay.no-new-data

Delay in checking the master node for changes after a check that detected no changes.

push.replication.delay.out-of-generations

Delay in checking the master node for changes after a check that returned an out of generations error.

push.replication.ignore.data

When set, query processors will ignore the 'data' section in snapshots, which is used for serving cached copies.

push.replication.ignore.delete-lists

When set, query processors will ignore the delete lists.

push.replication.ignore.index-redirects

When set, query processors will ignore the index redirects file in snapshots.

push.replication.master.webdav.port

The WebDAV port of the master node.

push.run

Controls if a Push data source is allowed to run or not.

push.scheduler.auto-click-logs-processing-timeout-seconds

Number of seconds before a push data source will automatically trigger processing of click logs.

push.scheduler.auto-commit-timeout-seconds

Number of seconds a push data source should wait before a commit automatically triggers.

push.scheduler.changes-before-auto-commit

Number of changes to a push data source before a commit automatically triggers.

push.scheduler.delay-between-content-auditor-runs

Minimum time in milliseconds between each execution of the content auditor summary generation task.

push.scheduler.delay-between-meta-dependencies-runs

Minimum time in milliseconds between each update of the push data source’s parent search package.

push.scheduler.generation.re-index.killed-percent

The percentage of killed documents in a single generation for it to be considered for re-indexing.

push.scheduler.generation.re-index.min-doc-count

The minimum number of documents in a single generation for it to be considered for re-indexing.

push.scheduler.killed-percentage-for-reindex

Percentage of killed documents before automatic re-indexing of a push data source.

push.store.always-flush

Used to stop a push data source from performing caching on PUT or DELETE calls.

qie.default_weight

Specifies the default weighting for query independent evidence (QIE).

query_processor

The name of the query processor executable to use.

quicklinks

Turn quick links functionality on or off.

quicklinks.blacklist_terms

List of words to ignore as the link title.

quicklinks.depth

The number of sub-pages to search for link titles.

quicklinks.domain_searchbox

Turn on or off the inline domain restricted search box on the search result page.

quicklinks.max_len

Maximum character length for the link title.

quicklinks.max_words

Maximum number of link titles to display.

quicklinks.min_len

Minimum character length for the link title.

quicklinks.min_links

Minimum number of links to display.

quicklinks.rank

The number of search results to enable quick links on.

quicklinks.total_links

Total number of links to display.
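
A hypothetical quick links setup might look like the following (all values, including the on/off form of the quicklinks key, are illustrative rather than defaults):

  quicklinks=enabled
  quicklinks.depth=2
  quicklinks.rank=1
  quicklinks.total_links=8
  quicklinks.domain_searchbox=enabled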

recommender

Specifies if the recommendations system is enabled.

retry_policy.max_tries

Maximum number of times to retry a file copy operation that has failed.

rss.copyright

Sets the copyright element in the RSS feed.

rss.ttl

Sets the ttl element in the RSS feed.

schedule.[taskType].auto.desired-time-between-updates

Specifies the desired time between tasks of the given type running.

schedule.[taskType].auto.no-update-window.duration

Specifies the duration of a window during which tasks of the given type will not be automatically scheduled.

schedule.[taskType].auto.no-update-window.start-time

Specifies the start time of a window during which tasks of the given type will not be automatically scheduled.

schedule.[taskType].fixed.permitted-days-of-week

Specifies a set of days of the week on which fixed start-time tasks for the given type will be automatically scheduled.

schedule.[taskType].fixed.start-times

Specifies a set of times at which tasks of the given type will be automatically scheduled.

schedule.incremental_crawl_ratio

The number of scheduled incremental crawls that are performed between each full crawl.

schedule.timezone

Specifies the timezone that applies to the scheduler configuration.
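
For example, updates could be pinned to fixed overnight start times on selected days, with one full crawl for every five incremental crawls (the task type "update" and the value formats shown are assumptions, not documented defaults):

  schedule.update.fixed.start-times=02:00
  schedule.update.fixed.permitted-days-of-week=MONDAY,WEDNESDAY,FRIDAY
  schedule.incremental_crawl_ratio=5
  schedule.timezone=Australia/Sydney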

search_user

The email address to use for administrative purposes.

security.earlybinding.locks-keys-matcher.ldlibrarypath

Full path to the security plugin library.

security.earlybinding.locks-keys-matcher.locking_model

Locking model used by the security plugin library.

security.earlybinding.locks-keys-matcher.name

Name of the security plugin library that matches user keys with document locks at query time.

security.earlybinding.user-to-key-mapper

Selected security plugin for translating usernames into lists of document keys.

security.earlybinding.user-to-key-mapper.cache-seconds

Number of seconds for which a user’s list of keys may be cached.

security.earlybinding.user-to-key-mapper.groovy-class

Name of a custom Groovy class to use to translate usernames into lists of document keys.

service_name

Human-readable name of the search package or data source.

slack.channel-names-to-exclude

List of Slack channel names to exclude from search.

slack.hostname

The hostname of the Slack instance.

slack.target-collection

Specify the push data source which messages from a Slack data source should be stored into.

slack.target-push-api

The push API endpoint to which Slack messages should be added.

slack.user-names-to-exclude

Slack usernames to exclude from search.

spelling.suggestion_lexicon_weight

Specify weighting to be given to suggestions from the lexicon relative to other sources.

spelling.suggestion_sources

Specify sources of information for generating spelling suggestions.

spelling.suggestion_threshold

Threshold which controls how suggestions are made.

spelling_enabled

Whether to enable spell checking in the search interface.

squizapi.target_url

URL of the Squiz Suite Manager for a Matrix data source.

start_url

A list of URLs from which the crawler will start crawling.

store.push.collection

Name of a push data source to push content into when using a PushStore or Push2Store.

store.push.host

Hostname of the machine to push documents to if using a PushStore or Push2Store.

store.push.password

The password to use when authenticating against push if using a PushStore or Push2Store.

store.push.port

Port that Push is configured to listen on (if using a PushStore).

store.push.url

The URL that the push API is located at (if using a Push2Store).

store.push.user

The username to use when authenticating against push if using a PushStore or Push2Store.
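
For example, a data source storing its gathered content into a push data source via a PushStore might use a hypothetical configuration such as this (host, port, credentials and push data source name are examples only):

  store.push.collection=push-store
  store.push.host=push.example.com
  store.push.port=8443
  store.push.user=push-user
  store.push.password=examplePassword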

store.raw-bytes.class

Fully qualified classname of a raw bytes Store class to use.

store.record.type

This parameter defines the type of store that Funnelback uses to store its records.

store.temp.class

Fully qualified classname of a class to use for temporary storage.

store.xml.class

Fully qualified classname of an XML storage class to use.

trim.collect_containers

Whether to collect the container of each TRIM record or not.

trim.database

The 2-digit identifier of the TRIM database to index.

trim.default_live_links

Whether search result links should point to a copy of the TRIM document, or launch the TRIM client.

trim.domain

Windows domain for the TrimPush crawl user.

trim.extracted_file_types

A list of file extensions that will be extracted from TRIM databases.

trim.filter_timeout

Timeout to apply when filtering binary documents.

trim.free_space_check_exclude

Volume letters to exclude from free space disk check.

trim.free_space_threshold

Minimum amount of free space on disk under which a TRIMPush crawl will stop.

trim.gather_direction

Whether to go forward or backward when gathering TRIM records.

trim.gather_end_date

The date at which to stop the gather process.

trim.gather_mode

Date field to use when selecting records (registered date or modified date).

trim.gather_start_date

The date from which newly registered or modified documents will be gathered.

trim.license_number

TRIM license number as found in the TRIM client system information panel.

trim.max_filter_errors

The maximum number of filtering errors to tolerate before stopping the crawl.

trim.max_size

The maximum size of record attachments to process.

trim.max_store_errors

The maximum number of storage errors to tolerate before stopping the gather.

trim.passwd

Password for the TRIMPush crawl user.

trim.properties_blacklist

List of properties to ignore when extracting TRIM records.

trim.push.collection

Specifies the push data source to store the extracted TRIM records in.

trim.request_delay

Milliseconds between TRIM requests (for a particular thread).

trim.stats_dump_interval

Interval (in seconds) at which statistics will be written to the monitor.log.

trim.store_class

Class to use to store TRIM records.

trim.threads

Number of simultaneous TRIM database connections to use.

trim.timespan

Interval to split the gather date range into.

trim.timespan.unit

Number of time spans to split the gather date range into.

trim.user

Username for the TRIMPush crawl user.

trim.userfields_blacklist

List of user fields to ignore when extracting TRIM records.

trim.verbose

Defines how verbose the TRIM crawl is.

trim.version

Configure the version of TRIM to be crawled.

trim.web_server_work_path

Location of the temporary folder used by TRIM to extract binary files.

trim.workgroup_port

The port on the TRIM workgroup server to connect to when gathering content from TRIM.

trim.workgroup_server

The name of the TRIM workgroup server to connect to when gathering content from TRIM.

twitter.debug

Enable debug mode to preview Twitter fetched records.

twitter.oauth.access-token

Twitter OAuth access token.

twitter.oauth.consumer-key

Twitter OAuth consumer key.

twitter.oauth.consumer-secret

Twitter OAuth consumer secret.

twitter.oauth.token-secret

Twitter OAuth token secret.

twitter.users

Comma-delimited list of Twitter usernames to crawl.

ui.modern.content-auditor.count_urls

Define how deep into URLs Content Auditor users can navigate using facets.

ui.modern.content-auditor.date-modified.ok-age-years

Define how many years old a document may be before it is considered problematic.

ui.modern.content-auditor.duplicate_num_ranks

Define how many results should be considered in detecting duplicates for content auditor.

ui.modern.content-auditor.reading-grade.lower-ok-limit

Define the reading grade below which documents are considered problematic.

ui.modern.content-auditor.reading-grade.upper-ok-limit

Define the reading grade above which documents are considered problematic.

ui.modern.extra_searches

Configure extra searches to be aggregated with the main result data, when using the Modern UI.

ui.modern.extra_searches.[extraSearchId].query_processor_options

Defines additional query processor options to apply when running the specified extra search.

ui.modern.extra_searches.[extraSearchId].source

Configures an extra search sourced from a data source or search package.
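
As a sketch, an extra search with the hypothetical ID "related-news", sourced from a separate data source and limited to a few results, might be configured as follows (the ID, source and query processor option values are illustrative):

  ui.modern.extra_searches=related-news
  ui.modern.extra_searches.related-news.source=news-web
  ui.modern.extra_searches.related-news.query_processor_options=-num_ranks=3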

ui.modern.form.rss.content_type

Sets the content type of the RSS template.

ui.modern.padre_packet_compression_size

The number of bytes above which the padre-sw search packet is internally compressed to save memory within the Jetty JVM.

ui.modern.padre_response_size_limit_bytes

Sets the maximum size of padre-sw responses to process.

ui_cache_disabled

Disable the cache controller from accessing any cached documents.

ui_cache_link

Base URL used by PADRE to link to the cached copy of a search result. Can be an absolute URL.

update-pipeline-groovy-pre-post-commands.max_heap_size

Set Java heap size used for groovy scripts in pre/post update commands.

update-pipeline.max_heap_size

Set Java heap size used for update pipelines.

update.restrict_to_host

Specify that data source updates should be restricted to only run on a specific host.

userid_to_log

Controls how logging of IP addresses is performed.

vector_search.synonym

Enable/disable the vector synonym setup (i.e., execute padre-vector-synonym) in the indexing program.

vector_search.synonym_max_score

Maximum score for synonym search.

vector_search.synonym_topk

The maximum number of words to return per entry when generating the vector search synonym list.

vital_servers

Changeover only happens if vital servers exist in the new crawl.

warc.compression

Control how content is compressed in a WARC file.

workflow.publish_hook

Name of the publish hook Perl script.

workflow.publish_hook.batch

Name of the publish hook Perl script for batch transfer of files.

workflow.publish_hook.meta

Name of the publish hook Perl script that will be called each time a meta collection is modified.

youtube.api-key

YouTube API key retrieved from the Google API console.

youtube.channel-ids

YouTube channel IDs to crawl.

youtube.debug

Enable debug mode to preview YouTube fetched records.

youtube.liked-videos

Enables fetching of YouTube videos liked by a channel ID.

youtube.playlist-ids

YouTube playlist IDs to crawl.