Web crawler configuration options

Web crawler options

The web crawler provides a comprehensive set of configuration options that control how it operates.

General options

Option Description Default
crawler.num_crawlers Number of crawler threads which simultaneously crawl different hosts. 20
crawler.request_delay Milliseconds between HTTP requests (for a specific crawler thread). 250
crawler.user_agent The user agent that the web crawler identifies as when making HTTP requests. Mozilla/5.0 (compatible; Funnelback)
crawler.server_alias_file Path to optional file containing server alias mappings. See: server aliases
crawler.classes.RevisitPolicy Java class used for enforcing the revisit policy for URLs com.funnelback.common.revisit.AlwaysRevisitPolicy
crawler.revisit.edit_distance_threshold Threshold for edit distance between two versions of a page when deciding whether it has changed or not when using the SimpleRevisitPolicy. 20
crawler.revisit.num_times_revisit_skipped_threshold Threshold for number of times a page revisit has been skipped when deciding whether to revisit it when using the SimpleRevisitPolicy. 2
crawler.revisit.num_times_unchanged_threshold Threshold for number of times a page has been unchanged when deciding whether to revisit it when using the SimpleRevisitPolicy. 5
data_report Specifies if data reports should be generated for the crawl. true
vital_servers Specifies a list of servers that must be present in the crawl for a successful update.
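
The interaction between crawler.num_crawlers and crawler.request_delay can be illustrated with a little arithmetic. The following is a minimal sketch, not the crawler's actual scheduling logic: it assumes the delay is a fixed pause between requests on each thread and ignores download time, so the result is only an upper bound. The function name is hypothetical.

```python
def max_requests_per_second(num_crawlers: int, request_delay_ms: int) -> float:
    """Upper bound on the aggregate request rate across all crawler threads.

    Assumption: every thread waits exactly request_delay_ms between requests
    and spends no time downloading, so real rates will be lower.
    """
    if num_crawlers < 0:
        raise ValueError("num_crawlers must not be negative")
    if request_delay_ms <= 0:
        raise ValueError("request_delay_ms must be positive")
    per_thread_rate = 1000.0 / request_delay_ms  # requests per second, per thread
    return num_crawlers * per_thread_rate


# Documented defaults: 20 threads, 250 ms between requests per thread.
DEFAULT_RATE = max_requests_per_second(20, 250)
```

With the documented defaults each thread issues at most 4 requests per second, so the crawl as a whole issues at most 80. Because the threads crawl different hosts, a single host sees at most the per-thread rate under these assumptions.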

Options controlling what gets included

Option Description Default
include_patterns URLs matching this are included in crawl (unless they match any exclude_patterns).
exclude_patterns URLs matching this are excluded from the crawl. /cgi-bin,/vti,/_vti,calendar,SQ_DESIGN_NAME=print,SQ_ACTION=logout,SQ_PAINT_LAYOUT_NAME=,%3E%3C/script%3E,
crawler.use_sitemap_xml Specifies if sitemap.xml files should be processed during a web crawl. false
crawler.start_urls_file Path to a file that contains a list of URLs (one per line) that will be used as the starting point for a crawl. Note that this setting overrides the start_url that the crawler is passed on startup (usually stored in the crawler.start_url configuration option). collection.cfg.start.urls
start_url Crawler seed URL. Crawler follows links in this page, and then the links of those pages and so on. _disabled__see_start_urls_file
crawler.protocols Crawl URLs via these protocols (comma separated list). http,https
crawler.reject_files Do not crawl files with these extensions. asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip,Z
crawler.accept_files Only crawl files with these extensions. Not normally used - default is to accept all valid content.
crawler.store_all_types If true, override accept/reject rules and crawl and store all file types encountered. false
crawler.store_empty_content_urls If true, store URLs even if, after filtering, they contain no content. false
crawler.non_html Specifies non-html file formats to filter, based on the file extension (e.g. pdf, doc, xls) doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm
crawler.allowed_redirect_pattern Specify a regex to allow crawler redirections that would otherwise by disallowed by the current include/exclude patterns.
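
The precedence between include_patterns and exclude_patterns can be sketched as follows. This is an illustration only: it assumes both options are comma-separated lists matched as plain substrings of the URL, which the table above does not state, and the function names and example.com values are hypothetical.

```python
def split_patterns(value: str) -> list:
    """Split a comma-separated option value, dropping empty entries.

    Dropping empties matters: the default exclude_patterns value ends with a
    trailing comma, and an empty pattern would match every URL.
    """
    return [p.strip() for p in value.split(",") if p.strip()]


def is_included(url: str, include_patterns: str, exclude_patterns: str) -> bool:
    """Return True if the URL matches an include pattern and no exclude pattern."""
    includes = split_patterns(include_patterns)
    excludes = split_patterns(exclude_patterns)
    # Exclusion wins: a URL matching any exclude pattern is never crawled.
    if any(pattern in url for pattern in excludes):
        return False
    return any(pattern in url for pattern in includes)


INCLUDE = "example.com"          # hypothetical value
EXCLUDE = "/cgi-bin,calendar,"   # shortened from the documented default
```

For instance, `is_included("https://www.example.com/cgi-bin/form", INCLUDE, EXCLUDE)` is False because the exclude pattern takes priority over the matching include pattern.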

Options controlling link extraction

Option Description Default
crawler.parser.mimeTypes Extract links from documents with these content types, given as a comma-separated list or as a regexp: expression. text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml
crawler.extract_links_from_javascript Whether to extract links from JavaScript while crawling. false
crawler.follow_links_in_comments Whether to follow links in HTML comments while crawling. false
crawler.link_extraction_group The group in the crawler.link_extraction_regular_expression which should be extracted as the link/URL.
crawler.link_extraction_regular_expression The expression used to extract links from each document. This must be a Perl compatible regular expression.
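
The relationship between crawler.link_extraction_regular_expression and crawler.link_extraction_group can be sketched like this. It is a rough illustration under stated assumptions: it uses Python's re module rather than a Perl compatible engine, and the expression shown is a hypothetical example, not the crawler's built-in one.

```python
import re


def extract_links(document: str, expression: str, group: int) -> list:
    """Return the given capture group from every match of the expression."""
    pattern = re.compile(expression)
    return [match.group(group) for match in pattern.finditer(document)]


# Hypothetical expression: capture the value of double-quoted href attributes.
HREF_EXPRESSION = r'href\s*=\s*"([^"]+)"'

sample = '<a href="/a.html">A</a> <a href="https://example.com/b">B</a>'
links = extract_links(sample, HREF_EXPRESSION, 1)
```

Here group 1 selects the text inside the quotes, so the surrounding `href="..."` syntax is matched but not returned as part of the link.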

Options controlling size limits and timeouts

Option Description Default
crawler.max_dir_depth A URL with more than this many sub directories will be ignored (too deep, probably a crawler trap) 15
crawler.max_download_size Maximum size of files the crawler will download (in MB). 10
crawler.max_files_per_area Maximum number of files per area, e.g. the number of files in one directory or generated by one dynamic generator such as index.asp?doc=123. This parameter used to be called crawler.max_dir_size. 10000
crawler.max_files_per_server Maximum files per server (default (empty) is unlimited)
crawler.max_files_stored Maximum number of files to download (the default, or any value less than 1, means unlimited)
crawler.max_link_distance How far to crawl from the start_url (default is unlimited). e.g. if crawler.max_link_distance = 1, only crawl the links on start_url. NB: Turning this on drops crawler to single-threaded operation.
crawler.max_parse_size Crawler will not parse documents beyond this many megabytes in size 10
crawler.max_url_length A URL with more characters than this will be ignored (too long, probably a crawler trap) 256
crawler.max_url_repeating_elements A URL with more than this many repeating elements (directories) will be ignored (probably a crawler trap or incorrectly configured web server) 5
crawler.overall_crawl_timeout Maximum crawl time after which the update continues with indexing and changeover. The units of this parameter depend on the value of the crawler.overall_crawl_units parameter. 24
crawler.overall_crawl_units The units for the crawler.overall_crawl_timeout parameter. A value of hr indicates hours and min indicates minutes. hr
crawler.request_timeout Timeout for HTTP page GETs (milliseconds) 15000
crawler.max_timeout_retries Maximum number of times to retry after a network timeout (default is 0) 0
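
The crawler trap limits in this table (crawler.max_url_length, crawler.max_dir_depth and crawler.max_url_repeating_elements) can be illustrated with a small check. This is a sketch under assumptions, not the crawler's implementation: it treats every path segment before the final one as a directory, and it counts "repeating elements" as the highest number of times any single directory name occurs in the path. The function name is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Documented defaults.
MAX_DIR_DEPTH = 15
MAX_URL_LENGTH = 256
MAX_URL_REPEATING_ELEMENTS = 5


def looks_like_crawler_trap(url: str) -> bool:
    """Return True if the URL exceeds any of the three trap limits."""
    if len(url) > MAX_URL_LENGTH:
        return True
    # Assumption: everything before the last path segment is a directory.
    directories = [p for p in urlsplit(url).path.split("/")[:-1] if p]
    if len(directories) > MAX_DIR_DEPTH:
        return True
    # Assumption: "repeating elements" is the count of the most common directory name.
    if directories and max(Counter(directories).values()) > MAX_URL_REPEATING_ELEMENTS:
        return True
    return False
```

A URL such as `https://example.com/a/a/a/a/a/a/index.html` repeats the directory `a` six times and would be ignored under these assumptions, while the same URL with five repetitions would still be crawled.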

Authentication options

Option Description Default
crawler.allow_concurrent_in_crawl_form_interaction Enable/disable concurrent processing of in-crawl form interaction. true
crawler.form_interaction.pre_crawl.groupId.url Specify a URL of the page containing the HTML web form in pre_crawl authentication mode
crawler.form_interaction.in_crawl.groupId.url_pattern Specify a URL or URL pattern of the page containing the HTML web form in in_crawl authentication mode
crawler.ntlm.domain NTLM domain to be used for web crawler authentication.
crawler.ntlm.password NTLM password to be used for web crawler authentication.
crawler.ntlm.username NTLM username to be used for web crawler authentication.
ftp_passwd Password to use when gathering content from an FTP server.
ftp_user Username to use when gathering content from an FTP server.
http_passwd Password used for accessing password protected content during a crawl.
http_user Username used for accessing password protected content during a crawl.

Web crawler monitor options

Option Description Default
crawler.monitor_authentication_cookie_renewal_interval Optional time interval at which to renew crawl authentication cookies
crawler.monitor_checkpoint_interval Time interval at which to checkpoint (seconds). 1800
crawler.monitor_delay_type Type of delay to use during crawl (dynamic or fixed). dynamic
crawler.monitor_halt Checked during a crawl - if set to true then crawler will cleanly shutdown. false
crawler.monitor_preferred_servers_list Optional list of servers to prefer during crawl.
crawler.monitor_time_interval Time interval at which to output monitoring information (seconds). 30
crawler.monitor_url_reject_list Optional parameter listing URLs to reject during a running crawl.

HTTP options

Option Description Default
http_proxy The hostname of the HTTP proxy to use during crawling. This hostname should not be prefixed with 'http://'.
http_proxy_passwd The proxy password to be used during crawling.
http_proxy_port Port of HTTP proxy used during crawling.
http_proxy_user The proxy user name to be used during crawling.
http_source_host IP address or hostname used by crawler, on a machine with more than one available.
crawler.request_header Optional additional header to be inserted in HTTP(S) requests made by the webcrawler.
crawler.request_header_url_prefix Optional URL prefix to be applied when processing the crawler.request_header parameter.
crawler.store_headers Write HTTP header information at top of HTML files if true. Header information is used by indexer. true

Logging options

Option Description Default
crawler.verbosity Verbosity level (0-6) of crawler logs. Higher number results in more messages. 4
crawler.header_logging Option to control whether HTTP headers are written out to a separate log file (default is false). false
crawler.incremental_logging Option to control whether a list of new and changed URLs should be written to a log file during incremental crawling false
crawler.logfile The crawler's log path and filename. $SEARCH_HOME/data/$COLLECTION_NAME/offline/log/crawl.log

Web crawler advanced options

Option Description Default
crawler The name of the crawler binary. com.funnelback.crawler.FunnelBack
crawler_binaries Location of the crawler files.
crawler.accept_cookies Cookie policy: whether the crawler accepts cookies. Requires HTTPClient if true. true
crawler.cache.DNSCache_max_size Maximum size of internal DNS cache. Upon reaching this size the cache will drop old elements. 200000
crawler.cache.LRUCache_max_size Maximum size of LRUCache. Upon reaching this size the cache will drop old elements. 500000
crawler.cache.URLCache_max_size Maximum size of URLCache. May be ignored by some cache implementations. 50000000
crawler.check_alias_exists Check if aliased URLs exist - if not, revert back to original URL false
crawler.checkpoint_to Location of crawler checkpoint files. $SEARCH_HOME/data/$COLLECTION_NAME/offline/checkpoint
crawler.classes.Crawler Java class used by crawler - defines top level behaviour, which protocols are supported etc. com.funnelback.crawler.NetCrawler
crawler.classes.Frontier Java class used for the frontier (a list of URLs not yet visited).
crawler.classes.Policy Java class used for enforcing the include/exclude policy for URLs com.funnelback.crawler.StandardPolicy
crawler.classes.statistics List of statistics classes to use during a crawl in order to generate figures for data reports CrawlSizeStatistic,MIMETypeStatistic,BroadMIMETypeStatistic,FileSizeStatistic,FileSizeByDocumentTypeStatistic,SuffixTypeStatistic,ReferencedFileTypeStatistic,URLlengthStatistic,WebServerTypeStatistic,BroadWebServerTypeStatistic
crawler.classes.URLStore Java class used to store content on disk e.g. create a mirror of files crawled.
crawler.cookie_jar_file File containing cookies to be pre-loaded when a web crawl begins. $SEARCH_HOME/conf/$COLLECTION_NAME/cookies.txt
crawler.eliminate_duplicates Whether to eliminate duplicate documents while crawling (default is true) true
crawler.frontier_num_top_level_dirs Optional setting to specify number of top level directories to store disk based frontier files in.
crawler.frontier_use_ip_mapping Whether to map hosts to frontiers based on IP address. (default is false) false
crawler.frontier_hosts Lists of hosts running crawlers if performing a distributed web crawl.
crawler.frontier_port Port on which DistributedFrontier will listen.
crawler.max_individual_frontier_size Maximum size of an individual frontier (unlimited if not defined)
crawler.inline_filtering_enabled Option to control whether text extraction from binary files is done inline during a web crawl true
crawler.lowercase_iis_urls Whether to lowercase all URLs from IIS web servers (default is false) false
crawler.predirects_enabled Enable crawler predirects (boolean). See: crawler predirects
crawler.remove_parameters Optional list of parameters to remove from URLs.
crawler.robotAgent The agent name the crawler matches against entries in a robots.txt file. Matching is case-insensitive over the length of the name. Funnelback
crawler.secondary_store_root Location of secondary (previous) store - used in incremental crawling $SEARCH_HOME/data/$COLLECTION_NAME/live/data
crawler.sslClientStore Path to a SSL Client certificate store (absolute or relative). Empty/missing means no client certificate store. Certificate stores can be managed by Java's keytool.
crawler.sslClientStorePassword Password for the SSL Client certificate store. Empty/missing means no password, and may prevent client certificate validation. Certificate stores can be managed by Java's keytool.
crawler.sslTrustEveryone Trust ALL Root Certificates and ignore server hostname verification if true. This bypasses all certificate and server validation by the HTTPS library, so every server and certificate is trusted. It can be used to overcome problems with unresolvable external certificate chains and poor certificates for virtual hosts, but will allow server spoofing. true
crawler.sslTrustStore Path to a SSL Trusted Root store (absolute or relative). Empty/missing means use those provided with Java. Certificate stores can be managed by Java's keytool.
crawler.send-http-basic-credentials-without-challenge This option controls whether or not Funnelback sends any HTTP credentials along with every request. true
schedule.incremental_crawl_ratio The number of scheduled incremental crawls that are performed between each full crawl (e.g. a value of '10' results in an update schedule consisting of every ten incremental crawls being followed by a full crawl). 10
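
The update pattern described for schedule.incremental_crawl_ratio can be sketched as a simple generator. This is an illustration only: it assumes the cycle starts with the incremental crawls, and the function name is hypothetical.

```python
from itertools import islice


def update_schedule(incremental_crawl_ratio: int):
    """Yield update types forever: N incremental crawls, then one full crawl."""
    if incremental_crawl_ratio < 0:
        raise ValueError("incremental_crawl_ratio must not be negative")
    while True:
        for _ in range(incremental_crawl_ratio):
            yield "incremental"
        yield "full"


# With the documented default of 10, the 11th update is the full crawl.
first_cycle = list(islice(update_schedule(10), 11))
```

With a ratio of 10, eleven scheduled updates make up one complete cycle: ten incremental crawls followed by one full crawl.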
