Data source and search package configuration options (collection.cfg)
Data source and search package configuration options control a wide range of features within Funnelback.
The configuration options can be edited using the data source or search package configuration editor.
The special variables $SEARCH_HOME and $COLLECTION_NAME contained within a key value are automatically expanded to the Funnelback installation path and the ID of the current data source or search package, respectively.
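For example, assuming Funnelback is installed at /opt/funnelback and the data source ID is example-web, a value containing these variables resolves as sketched below; the key name shown is purely illustrative and is not one of the documented options.

```properties
# Hypothetical key, used only to illustrate variable expansion.
# $SEARCH_HOME      expands to /opt/funnelback
# $COLLECTION_NAME  expands to example-web
# so the value resolves to /opt/funnelback/data/example-web/log
example.log-path=$SEARCH_HOME/data/$COLLECTION_NAME/log
```

A broader sketch of the key=value syntax these options share appears after the table.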
Configuration options
The following table outlines the available data source and search package configuration options.
Option | Description |
---|---|
Turns modern accessibility checks on or off. |
|
|
Specifies how much time must have passed since the last time Accessibility Auditor data was recorded before new data will be recorded. |
(deprecated) This option controls whether a search package or data source can be deleted from the administration dashboard. |
|
Specifies an email address to which a notification is sent after each collection update. |
|
Length of time range (in days) the analytics data miner will go back from the current date when mining query and click log records. |
|
List of email addresses to which reports should be sent. |
|
How often to send query report summaries. |
|
Reports email sender address ( |
|
Minimum trend alert confidence value for which emails should be sent. |
|
Enables or disables trend alert notifications. |
|
Enable or disable hourly trend alert notifications. |
|
Set Java heap size used for analytics. |
|
Control the minimum number of occurrences of a query required before a day pattern can be detected. |
|
Control the day pattern detection threshold. |
|
Disable query spike detection (trend alerts) for a collection. |
|
Disable query spike detection for a profile. |
|
Control the minimum number of occurrences of a query required before an hour pattern can be detected. |
|
Control the hour pattern detection threshold. |
|
Controls the rate at which the query reports system checkpoints data to disk. |
|
Disable incremental reports database updates. If set, all existing query and click logs will be processed for each reports update. |
|
Specifies the amount of data that is stored by query reports. |
|
Control whether reports for the search package are updated on a scheduled basis. |
|
Specify options for the "annie-a" annotation indexing program. |
|
Specifies additional configuration options that can be supplied when building auto-completion. |
|
Specifies a minimum ratio of documents that must be gathered for an update to succeed. |
|
The number of archived click logs to use from each archive directory. |
|
A boolean value indicating if click information should be included in the index. |
|
Optional restriction of click data to a set number of weeks into the past. |
|
The internal name of a data source or search package. |
|
Determines if an update step should be run or not. |
|
Specifies the type of the data source. |
|
Exclude contextual navigation suggestions which end with specified words or phrases. |
|
List of words to preserve case sensitivity. |
|
Group contextual navigation suggestions into types and topics. |
|
Enable or disable contextual navigation. |
|
Exclude contextual navigation suggestions which contain any words or phrases in this list. |
|
Limit the maximum length of suggestions to the specified number of words. |
|
Limit the number of contextual navigation phrases that should be processed. |
|
Specify the maximum number of results to examine when generating suggestions. |
|
Type of aggregation to be used for contextual navigation. |
|
Defines the maximum number of site suggestions to return in contextual navigation. |
|
Metadata classes that are analysed for contextual navigation. |
|
Timeout for contextual navigation processing. |
|
Defines the maximum number of |
|
Defines the maximum number of |
|
Specifies the name of the crawler binary. |
|
This option enables or disables the crawler’s use of cookies. |
|
Restricts the file extensions the web crawler should crawl. |
|
Specifies the regular expression used to extract links based on URL features. |
|
Enable/disable concurrent processing of in-crawl form interaction. |
|
Specify a regex to allow crawler redirects that would otherwise be disallowed by the current include/exclude patterns. |
|
Maximum size of internal DNS cache. Upon reaching this size the cache will drop old elements. |
|
Maximum size of LRUCache. Upon reaching this size the cache will drop old elements. |
|
Specifies the maximum size of URLCache. |
|
Check if aliased URLs exist; if not, revert to the original URL. |
|
Specifies the Java class used for the frontier (a list of URLs not yet visited). |
|
Specifies the Java class used for enforcing the include/exclude policy for URLs. |
|
Specifies the Java class used for enforcing the revisit policy for URLs. |
|
Specifies the Java class used to store content on disk (e.g. to create a mirror of crawled files). |
|
Whether to eliminate duplicate documents while crawling. |
|
Whether to extract links from Javascript while crawling. |
|
Whether to follow links in HTML comments while crawling. |
|
|
Specifies a clear text form parameter for in-crawl authentication. |
|
Specifies an encrypted form parameter for in-crawl authentication. |
Specifies a URL of an HTML web form action for an in-crawl form interaction rule. |
|
|
Specifies a clear text form parameter for pre-crawl authentication. |
|
Specifies an encrypted form parameter for pre-crawl authentication. |
Specifies which form element at the specified URL should be processed. |
|
Specifies a URL of the page containing the HTML web form for a pre-crawl form interaction rule. |
|
List of hosts running crawlers when performing a distributed web crawl. |
|
Specifies the number of top level directories to store disk based frontier files in. |
|
Port on which the DistributedFrontier will listen. |
|
Whether to map hosts to frontiers based on IP address. |
|
Option to control whether HTTP headers are written out to a separate log file. |
|
Whether to ignore the canonical link(s) on a page. |
|
Configures the crawler to ignore robots meta tag "nofollow" and rel="nofollow" directives during a crawl. |
|
Enables or disables the web crawler’s robots.txt support. |
|
Option to control whether a list of new and changed URLs should be written to a log file during incremental crawling. |
|
Option to control whether text extraction from binary files is done "inline" during a web crawl. |
|
The group in the |
|
Specifies the regular expression used to extract links from each document. |
|
Whether to lowercase all URLs from IIS web servers. |
|
Specifies the maximum number of subdirectories a URL may have before it will be ignored. |
|
Specifies the maximum size of files the crawler will download (in MB). |
|
Specifies a limit on the number of files from a single directory or dynamically generated URLs that will be crawled. |
|
Specifies the maximum number of files that will be crawled per server. |
|
Specifies the maximum number of files to download. |
|
Specifies the maximum size of an individual frontier. |
|
Specifies the maximum distance a URL can be from a start URL for it to be downloaded. |
|
Sets the maximum size of documents parsed by the crawler. |
|
Maximum number of times to retry after a network timeout. |
|
Specifies the maximum length a URL can be in order for it to be crawled. |
|
A URL with more than this many repeating elements (directories) will be ignored. |
|
Specifies the time interval at which to renew crawl authentication cookies. |
|
Time interval at which to checkpoint (seconds). |
|
Type of delay to use during crawl (dynamic or fixed). |
|
Specifies an optional list of servers to prefer during crawling. |
|
Specifies a time interval at which to output monitoring information (seconds). |
|
Optional parameter listing URLs to reject during a running crawl. |
|
Which non-html file formats to crawl (e.g. pdf, doc, xls etc.). |
|
NTLM domain to be used for web crawler authentication. |
|
NTLM password to be used for web crawler authentication. |
|
NTLM username to be used for web crawler authentication. |
|
Number of crawler threads which simultaneously crawl different hosts. |
|
Specifies the maximum time the crawler is allowed to run. When exceeded, the crawl will stop and the update will continue. |
|
Specifies the units for the crawl timeout. |
|
Extract links from a list of content-types. |
|
Enable crawler predirects. |
|
Crawl URLs via these protocols. |
|
Do not crawl files with these extensions. |
|
Optional list of parameters to remove from URLs. |
|
Milliseconds between HTTP requests per crawler thread. |
|
Optional additional header to be inserted in HTTP(S) requests made by the webcrawler. |
|
Optional URL prefix to be applied when processing the |
|
Timeout for HTTP page GETs (milliseconds). |
|
Threshold for edit distance between two versions of a page when deciding whether it has changed or not. |
|
Threshold for number of times a page revisit has been skipped when deciding whether to revisit it. |
|
Threshold for the number of times a page has been unchanged when deciding whether to revisit it. |
|
Matching is case-insensitive over the length of the name in a robots.txt file. |
|
Specifies whether HTTP basic credentials should be sent without the web server sending a challenge. |
|
Specifies a path to an SSL client certificate store. |
|
Password for the SSL client certificate store. |
|
Trust all root certificates and ignore server hostname verification. |
|
Specifies the path to a SSL trusted root store. |
|
Path to a file that contains a list of URLs (one per line) that will be used as the starting point for a crawl. |
|
If true, override accept/reject rules and crawl and store all file types encountered. |
|
Specifies if URLs that contain no content after filtering should be stored. |
|
Whether HTTP header information should be written at the top of HTML files. |
|
Specifies the crawler type to use. |
|
Whether to extract links based on the URL features while crawling. |
|
Specifies whether to process sitemap.xml files during a web crawl. |
|
The browser ID that the crawler uses when making HTTP requests. |
|
Verbosity level (0-7) of crawler logs. Higher number results in more messages. |
|
Specifies the location of the crawler files. |
|
The template used when a custom data source was created. |
|
A switch that can be used to enable or disable the data report stage during a data source update. |
|
Allows storage of data extracted from a database in a compressed form. |
|
(DEPRECATED) Allows a custom java class to modify data extracted from a database before indexing. |
|
The SQL query to perform on a database to fetch all records for searching. |
|
The SQL query to perform to fetch new or changed records from a database. |
|
Allows the selection of different modes for keeping database collections up to date. |
|
The name of the Java JDBC driver to connect to a database. |
|
The URL specifying database connection parameters such as the server and database name. |
|
The password for connecting to the database. |
|
The primary id (unique identifier) column for each database record. |
|
An SQL command for extracting an individual record from the database. |
|
The name of a table in the database which provides a record of all additions, updates and deletes. |
|
Flag to control whether column labels are used in JDBC calls in the database gatherer. |
|
The username for connecting to the database. |
|
The top level element for records extracted from the database. |
|
Sets the java class to use for creating directory connections. |
|
Sets the domain to use for authentication in a directory data source. |
|
Sets the rules for excluding content from a directory collection. |
|
Sets the number of documents to fetch from the directory in each request. |
|
Sets the password to use for authentication in a directory data source. |
|
Sets the URL for accessing the directory in a directory data source. |
|
Sets the base from which content will be gathered in a directory data source. |
|
Sets the filter for selecting content to gather in a directory data source. |
|
Sets the username to use for authentication in a directory data source. |
|
The crawler will ignore a URL if it matches any of these exclude patterns. |
|
Specify an optional access token. |
|
Specifies the Facebook application ID. |
|
Specifies the Facebook application secret. |
|
Enable debug mode to preview Facebook fetched records. |
|
Specify a list of Facebook event fields as specified in the Facebook event API documentation. |
|
Specify a list of Facebook page fields as specified in the Facebook page API documentation. |
|
Specifies a list of IDs of the Facebook pages/accounts to gather. |
|
Specify a list of Facebook post fields as specified in the Facebook post API documentation. |
|
Exclude specific values for facets. |
|
Exclude specific values for a specific facet. |
|
Include only a list of specific values for facets. |
|
Include only a list of specific values for a specific facet. |
|
Enable/disable using the live view as a cache directory where pre-filtered text content can be copied from. |
|
Whether to index the file names of files that failed to be filtered. |
|
Filecopy sources that require a username to access files will use this setting as a domain for the user. |
|
File system data sources will exclude files which match this regular expression. |
|
The list of filetypes (i.e. file extensions) that will be included by a file system data source. |
|
If specified, file system data sources will only include files which match this regular expression. |
|
If set, this limits the number of documents a file system data source will gather when updating. |
|
Number of fetcher threads for interacting with the source file system when running a file system data source update. |
|
Number of worker threads for filtering and storing files in a file system data source. |
|
File system data sources that require a password to access files will use this setting as a password. |
|
Specifies how long to delay between copy requests in milliseconds. |
|
Sets the plugin to use to collect security information on files. |
|
This is the file system path or URL that describes the source of data files. |
|
If specified, this option is set to a file which contains a list of other files to copy, rather than using the |
|
Specifies which storage class to be used by a file system data source (e.g. WARC, Mirror). |
|
Filecopy sources that require a username to access files will use this setting as a username. |
|
Main class used by the file system data source to walk a file tree. |
|
Specifies which java classes should be used for filtering documents. |
|
Defines a custom header to use for the CSV. |
|
Sets the CSV format to use when filtering a CSV document. |
|
Controls if the CSV file has a header or not. |
|
The template to use for the URLs of the documents created in the CSVToXML Filter. |
|
Controls the maximum amount of time the document fixer may spend on a document. |
|
Specifies a list of MIME types for the filter to ignore. |
|
Specify which Java/Groovy classes will be used for filtering; these operate on JSoup objects rather than byte streams. |
|
Defines if undesirable text should be merged into a single list. |
|
Define sources of undesirable text to detect and present within content auditor. |
|
Specify words or expressions of undesirable text to detect and present within content auditor. |
|
Defines the metadata normalizer rules that will be run by the MetadataNormaliser filter. |
|
Defines rules for hiding content from the Funnelback indexer when using the inject no-index filter. |
|
Specify Unicode blocks for replacement during filtering (to avoid 'corrupt' character display). |
|
Specifies which file types to filter using the TikaFilterProvider. |
|
Flickr API key. |
|
Flickr API secret. |
|
Flickr authentication secret. |
|
Flickr authentication token. |
|
Enable debug mode to preview Flickr fetched records. |
|
List of Flickr group IDs to crawl within a "private" view. |
|
List of Flickr group IDs to crawl within a "public" view. |
|
Comma delimited list of Flickr user account IDs to crawl. |
|
Password to use when gathering content from an FTP server. |
|
Username to use when gathering content from an FTP server. |
|
Specifies if gathering is enabled or not. |
|
Set Java heap size used for gathering documents. |
|
Days on which gathering should be slowed down. |
|
Start hour for slowdown period. |
|
End hour for slowdown period. |
|
Request delay to use during slowdown period. |
|
Number of threads to use during slowdown period. |
|
Specify extra class paths to be used by Groovy when using $GROOVY_COMMAND. |
|
The customer group under which the collection will appear. Useful for multi-tenant systems. |
|
The project group under which the collection will appear in the selection drop-down menu on the main administration page. |
|
Specifies options for the "padre-gs" gscopes program. |
|
Specifies the gscope to set when no other gscopes are set. |
|
Password for accessing websites that use HTTP basic authentication. |
|
This option sets the hostname of a proxy server to use for crawling. |
|
This option sets the password (if required) used to authenticate with a proxy server used for crawling. |
|
This option sets the port of a proxy server used for crawling. |
|
This option sets the username (if required) used to authenticate with a proxy server used for crawling. |
|
IP address or hostname used by crawler, on a machine with more than one available. |
|
Username for accessing websites that use HTTP basic authentication. |
|
Specifies the pattern that URLs must match in order to be crawled. |
|
A switch that can be used to enable or disable the indexing stage during a data source update. |
|
Used to indicate an alternate data source. |
|
The name of the indexer program to be used for this data source. |
|
Options for configuring the Funnelback indexer, controlling what is indexed and how the index is built. |
|
Declares additional sources of metadata mappings to be used when indexing HTML documents. |
|
Defines the fields used for result collapsing. |
|
Specifies if a manifest file should be used for indexing. |
|
The path where the Java libraries are located when running most gatherers. |
|
Command line options to pass to the Java virtual machine. |
|
Set Java heap size used for Knowledge Graph update process. |
|
Control whether hostnames are used in log filenames. |
|
Defines all IP ranges in the |
|
Specifies whether to always send data source update emails or only when an update fails. |
|
Username for logging into Matrix and the Squiz Suite Manager. |
|
Password for logging into Matrix and the Squiz Suite Manager. |
|
URL for contacting a ManifoldCF authority. |
|
Default domain for users in the ManifoldCF authority. |
|
List of data sources included in a search package. |
|
Sets the relative importance of the different data sources in a search package. |
|
(DEPRECATED) Optional regular expression to specify content that should not be indexed. |
|
Specifies if a plugin is enabled or not. |
|
Plugin configuration fields for storing secret information such as a password in an encrypted format. |
|
Specifies the version of the plugin to use. |
|
Defines the plugin which will provide the custom gatherer to fetch documents with. |
|
Command to run after archiving query and click logs. |
|
Command to run after the collection was created. |
|
Command to run after deleting documents during an instant delete update. |
|
Command to run after deleting documents during an instant delete update. |
|
Command to run after the gathering phase during a collection update. |
|
Command to run after the index phase during a collection update. |
|
Command to run after the gather phase during an instant update. |
|
Command to run after the index phase during an instant update. |
|
Command to run after a component collection updates its meta parents during a collection update. |
|
Command to run after the Push Index phase during a collection update. |
|
Command to run after the recommender phase during a collection update. |
|
Command to run after query analytics runs. |
|
Command to run after live and offline views are swapped during a collection update. |
|
Command to run after an update has successfully completed. |
|
Command to run before archiving query and click logs. |
|
Command to run before deleting documents during an instant delete update. |
|
Command to run before deleting documents during an instant delete update. |
|
Command to run before the gathering phase during a data source update. |
|
Command to run before the index phase during a data source update. |
|
Command to run before the gather phase during an instant update. |
|
Command to run before the index phase during an instant update. |
|
Command to run before a data source updates parent search packages during a data source update. |
|
Command to run before the push index phase during a data source update. |
|
Command to run before the recommender phase during a data source update. |
|
Command to run before query or click logs are used during an update. |
|
Command to run before query analytics runs. |
|
Command to run before live and offline views are swapped during a data source update. |
|
Interval (in seconds) at which the gatherer will update the progress message in the administration dashboard. |
|
Specifies whether a push data source will automatically start with the web server. |
|
The type of commit that a push data source should use by default. |
|
The maximum number of threads that can be used during a commit for indexing. |
|
|
The minimum number of documents required in a single commit for parallel indexing to be used during that commit. |
The minimum number of documents each thread must have when using parallel indexing in a commit. |
|
The initial mode in which push should start. |
|
Sets the maximum number of generations a push data source can create. |
|
The maximum number of threads that can be used during a merge for indexing. |
|
|
The minimum number of documents required in a single merge for parallel indexing to be used during that merge. |
The minimum number of documents each thread must have when using parallel indexing in a merge. |
|
The compression algorithm to use when transferring compressible files to push data source slaves. |
|
Delay in checking the master node for changes after a check that returned an error. |
|
Delay in checking the master node for changes after a successful data fetch. |
|
Delay in checking the master node for changes after a check that detected no changes. |
|
Delay in checking the master node for changes after a check that returned an out of generations error. |
|
When set, query processors will ignore the 'data' section in snapshots, which is used for serving cached copies. |
|
When set, query processors will ignore the delete lists. |
|
When set, query processors will ignore the index redirects file in snapshots. |
|
The WebDAV port of the master node. |
|
Controls if a Push data source is allowed to run or not. |
|
Number of seconds before a push data source will automatically trigger processing of click logs. |
|
Number of seconds a push data source should wait before a commit automatically triggers. |
|
Number of changes to a push data source before a commit automatically triggers. |
|
Minimum time in milliseconds between each execution of the content auditor summary generation task. |
|
Minimum time in milliseconds between each update of the push data source’s parent search package. |
|
The percentage of killed documents in a single generation for it to be considered for re-indexing. |
|
The minimum number of documents in a single generation for it to be considered for re-indexing. |
|
Percentage of killed documents before automatic re-indexing of a push data source. |
|
Used to stop a push data source from performing caching on PUT or DELETE calls. |
|
Specifies the default weighting for query independent evidence (QIE). |
|
The name of the query processor executable to use. |
|
Turn quick links functionality on or off. |
|
List of words to ignore as the link title. |
|
The number of sub-pages to search for link titles. |
|
Turn on or off the inline domain restricted search box on the search result page. |
|
Maximum character length for the link title. |
|
Maximum number of link titles to display. |
|
Minimum character length for the link title. |
|
Minimum number of links to display. |
|
The number of search results to enable quick links on. |
|
Total number of links to display. |
|
Specifies if the recommendations system is enabled. |
|
Maximum number of times to retry a file copy operation that has failed. |
|
Sets the copyright element in the RSS feed. |
|
Sets the ttl element in the RSS feed. |
|
Specifies the desired time between tasks of the given type running. |
|
Specifies the duration of a window during which tasks of the given type will not be automatically scheduled. |
|
Specifies the start time of a window during which tasks of the given type will not be automatically scheduled. |
|
Specifies a set of days of the week on which fixed start-time tasks for the given type will be automatically scheduled. |
|
Specifies a set of times at which tasks of the given type will be automatically scheduled. |
|
The number of scheduled incremental crawls that are performed between each full crawl. |
|
Specifies the timezone that applies to the scheduler configuration. |
|
The email address to use for administrative purposes. |
|
Full path to the security plugin library. |
|
Locking model used by the security plugin library. |
|
Name of the security plugin library that matches user keys with document locks at query time. |
|
Selected security plugin for translating usernames into lists of document keys. |
|
Number of seconds for which a user’s list of keys may be cached. |
|
Name of a custom Groovy class to use to translate usernames into lists of document keys. |
|
Human readable name of the search package or data source. |
|
List of Slack channel names to exclude from search. |
|
The hostname of the Slack instance. |
|
Specify the push data source which messages from a Slack data source should be stored into. |
|
The push API endpoint to which Slack messages should be added. |
|
Slack user names to exclude from search. |
|
Specify weighting to be given to suggestions from the lexicon relative to other sources. |
|
Specify sources of information for generating spelling suggestions. |
|
Threshold which controls how suggestions are made. |
|
Whether to enable spell checking in the search interface. |
|
URL of the Squiz Suite Manager for a Matrix data source. |
|
A list of URLs from which the crawler will start crawling. |
|
Name of a push data source to push content into when using a PushStore or Push2Store. |
|
Hostname of the machine to push documents to if using a PushStore or Push2Store. |
|
The password to use when authenticating against push if using a PushStore or Push2Store. |
|
Port that Push is configured to listen on (if using a PushStore). |
|
The URL that the push API is located at (if using a Push2Store). |
|
The username to use when authenticating against push if using a PushStore or Push2Store. |
|
Fully qualified classname of a raw bytes Store class to use. |
|
This parameter defines the type of store that Funnelback uses to store its records. |
|
Fully qualified classname of a class to use for temporary storage. |
|
Fully qualified classname of an XML storage class to use. |
|
Whether to collect the container of each TRIM record or not. |
|
The 2-digit identifier of the TRIM database to index. |
|
Whether search result links should point to a copy of the TRIM document, or launch the TRIM client. |
|
Windows domain for the TrimPush crawl user. |
|
A list of file extensions that will be extracted from TRIM databases. |
|
Timeout to apply when filtering binary documents. |
|
Volume letters to exclude from free space disk check. |
|
Minimum amount of free space on disk below which a TRIMPush crawl will stop. |
|
Whether to go forward or backward when gathering TRIM records. |
|
The date at which to stop the gather process. |
|
Date field to use when selecting records (registered date or modified date). |
|
The date from which newly registered or modified documents will be gathered. |
|
TRIM license number as found in the TRIM client system information panel. |
|
The maximum number of filtering errors to tolerate before stopping the crawl. |
|
The maximum size of record attachments to process. |
|
The maximum number of storage errors to tolerate before stopping the gather. |
|
Password for the TRIMPush crawl user. |
|
List of properties to ignore when extracting TRIM records. |
|
Specifies the push data source to store the extracted TRIM records in. |
|
Milliseconds between TRIM requests (for a particular thread). |
|
Interval (in seconds) at which statistics will be written to the |
|
Class to use to store TRIM records. |
|
Number of simultaneous TRIM database connections to use. |
|
Interval to split the gather date range into. |
|
Number of time spans to split the gather date range into. |
|
Username for the TRIMPush crawl user. |
|
List of user fields to ignore when extracting TRIM records. |
|
Defines how verbose the TRIM crawl is. |
|
Configure the version of TRIM to be crawled. |
|
Location of the temporary folder used by TRIM to extract binary files. |
|
The port on the TRIM workgroup server to connect to when gathering content from TRIM. |
|
The name of the TRIM workgroup server to connect to when gathering content from TRIM. |
|
Enable debug mode to preview Twitter fetched records. |
|
Twitter OAuth access token. |
|
Twitter OAuth consumer key. |
|
Twitter OAuth consumer secret. |
|
Twitter OAuth token secret. |
|
Comma delimited list of Twitter user names to crawl. |
|
Define how deep into URLs Content Auditor users can navigate using facets. |
|
Define how many years old a document may be before it is considered problematic. |
|
Define how many results should be considered in detecting duplicates for content auditor. |
|
Define the reading grade below which documents are considered problematic. |
|
Define the reading grade above which documents are considered problematic. |
|
Configure extra searches to be aggregated with the main result data, when using the Modern UI. |
|
|
Defines additional query processor options to apply when running the specified extra search. |
Configure an extra search sourced from a data source or search package. |
|
Sets the content type of the RSS template. |
|
The number of bytes before the |
|
Sets the maximum size of padre-sw responses to process. |
|
Disable the cache controller from accessing any cached documents. |
|
Base URL used by PADRE to link to the cached copy of a search result. Can be an absolute URL. |
|
Set Java heap size used for groovy scripts in pre/post update commands. |
|
Set Java heap size used for update pipelines. |
|
Specify that data source updates should be restricted to only run on a specific host. |
|
Controls how logging of IP addresses is performed. |
|
Changeover only happens if vital servers exist in the new crawl. |
|
Control how content is compressed in a WARC file. |
|
Name of the publish hook Perl script. |
|
Name of the publish hook Perl script for batch transfer of files. |
|
Name of the publish hook Perl script that will be called each time a meta collection is modified. |
|
YouTube API key retrieved from the Google API console. |
|
YouTube channel IDs to crawl. |
|
Enable debug mode to preview YouTube fetched records. |
|
Enables fetching of YouTube videos liked by a channel ID. |
|
YouTube playlist IDs to crawl. |
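All of the options listed above are set as plain key=value lines, one option per line, in the data source or search package configuration. As a rough sketch only, a small web data source configuration might look like the following; the key names and values are illustrative examples drawn from common web crawler options and should be confirmed against their individual reference pages before use.

```properties
# Illustrative settings for a web data source (assumed key names; verify before use).
start_url=https://www.example.com/
include_patterns=example.com
exclude_patterns=/login,/calendar
crawler.max_files_per_server=10000
vital_servers=www.example.com
```

In practice these values are usually maintained through the data source or search package configuration editor rather than by editing collection.cfg directly.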