Web crawler configuration options


The web crawler has a comprehensive set of configuration options that can be used to adjust how the web crawler operates.

General options

crawler.num_crawlers
Number of crawler threads which simultaneously crawl different hosts.
Default: 20

crawler.request_delay
Milliseconds between HTTP requests (for a specific crawler thread).
Default: 250

crawler.user_agent
The user agent string that the web crawler uses to identify itself when making HTTP requests.
Default: Mozilla/5.0 (compatible; Funnelback)

crawler.classes.RevisitPolicy
Java class used for enforcing the revisit policy for URLs.
Default: com.funnelback.common.revisit.AlwaysRevisitPolicy

crawler.revisit.edit_distance_threshold
Threshold for the edit distance between two versions of a page when deciding whether it has changed, when using the SimpleRevisitPolicy.
Default: 20

crawler.revisit.num_times_revisit_skipped_threshold
Threshold for the number of times a page revisit has been skipped when deciding whether to revisit it, when using the SimpleRevisitPolicy.
Default: 2

crawler.revisit.num_times_unchanged_threshold
Threshold for the number of times a page has been unchanged when deciding whether to revisit it, when using the SimpleRevisitPolicy.
Default: 5

data_report
Specifies whether data reports should be generated for the crawl.
Default: true

vital_servers
Specifies a list of servers that must be present in the crawl for a successful update.
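
For illustration, these general options are set as key=value pairs in collection.cfg. The sketch below assumes the standard key=value syntax; the values and hostnames shown are examples, not recommendations:

  crawler.num_crawlers=20
  crawler.request_delay=250
  crawler.user_agent=Mozilla/5.0 (compatible; Funnelback)
  data_report=true
  # Assumption: vital_servers takes a comma-separated list of hostnames
  vital_servers=www.example.com,intranet.example.com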

Options controlling what gets included

include_patterns
URLs matching these patterns are included in the crawl (unless they also match any exclude_patterns).

exclude_patterns
URLs matching these patterns are excluded from the crawl.
Default: /cgi-bin,/vti,/_vti,calendar,SQ_DESIGN_NAME=print,SQ_ACTION=logout,SQ_PAINT_LAYOUT_NAME=,%3E%3C/script%3E,google-analytics.com

crawler.use_sitemap_xml
Specifies whether sitemap.xml files should be processed during a web crawl.
Default: false

crawler.start_urls_file
Path to a file that contains a list of URLs (one per line) that will be used as the starting point for a crawl. Note that this setting overrides the start URL that the crawler is passed on startup (usually stored in the start_url configuration option).
Default: collection.cfg.start.urls

start_url
Crawler seed URL. The crawler follows the links on this page, then the links on those pages, and so on.
Default: _disabled__see_start_urls_file

crawler.protocols
Crawl URLs via these protocols (comma-separated list).
Default: http,https

crawler.reject_files
Do not crawl files with these extensions.
Default: asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip,Z

crawler.accept_files
Only crawl files with these extensions. Not normally used; the default is to accept all valid content.

crawler.store_all_types
If true, override the accept/reject rules and crawl and store all file types encountered.
Default: false

crawler.store_empty_content_urls
If true, store URLs even if they contain no content after filtering.
Default: false

crawler.non_html
Specifies non-HTML file formats to filter, based on the file extension (e.g. pdf, doc, xls).
Default: doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm

crawler.allowed_redirect_pattern
Specify a regex to allow crawler redirections that would otherwise be disallowed by the current include/exclude patterns.
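
As a hedged example of how the inclusion options might be combined in collection.cfg (the hostname, patterns and values below are placeholders, not defaults):

  start_url=https://www.example.com/
  include_patterns=example.com
  exclude_patterns=/cgi-bin,/calendar,SQ_ACTION=logout
  crawler.use_sitemap_xml=true
  crawler.non_html=doc,docx,pdf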

Options controlling link extraction

Option Description Default

crawler.parser.mimeTypes
Extract links from documents with these content types (a comma-separated list, or a regexp: pattern).
Default: text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml

crawler.extract_links_from_javascript
Whether to extract links from JavaScript while crawling.
Default: false

crawler.follow_links_in_comments
Whether to follow links in HTML comments while crawling.
Default: false

crawler.link_extraction_group
The group in the crawler.link_extraction_regular_expression which should be extracted as the link/URL.

crawler.link_extraction_regular_expression
The expression used to extract links from each document. This must be a Perl-compatible regular expression.
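
A sketch of a custom link-extraction setup, assuming a site that emits links in data-href attributes (the regular expression and attribute name are hypothetical, for illustration only):

  crawler.parser.mimeTypes=text/html,application/xhtml+xml
  crawler.extract_links_from_javascript=true
  # Hypothetical rule: capture the value of data-href attributes as the link
  crawler.link_extraction_regular_expression=data-href="([^"]+)"
  crawler.link_extraction_group=1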

Options controlling size limits and timeouts

crawler.max_dir_depth
A URL with more than this many subdirectories will be ignored (too deep, probably a crawler trap).
Default: 15

crawler.max_download_size
Maximum size of files the crawler will download (in MB).
Default: 10

crawler.max_files_per_area
Maximum number of files per area, e.g. files in one directory or generated by one dynamic generator such as index.asp?doc=123. This parameter was previously called crawler.max_dir_size.
Default: 10000

crawler.max_files_per_server
Maximum number of files per server. The default (empty) is unlimited.

crawler.max_files_stored
Maximum number of files to download. The default (empty, or any value less than 1) is unlimited.

crawler.max_link_distance
How far to crawl from the start_url (the default is unlimited). For example, if crawler.max_link_distance = 1, only the links on the start_url are crawled. Note: turning this on drops the crawler to single-threaded operation.

crawler.max_parse_size
The crawler will not parse documents beyond this many megabytes in size.
Default: 10

crawler.max_url_length
A URL with more characters than this will be ignored (too long, probably a crawler trap).
Default: 256

crawler.max_url_repeating_elements
A URL with more than this many repeating elements (directories) will be ignored (probably a crawler trap or an incorrectly configured web server).
Default: 5

crawler.overall_crawl_timeout
Maximum crawl time, after which the update continues with indexing and changeover. The units of this parameter depend on the value of the crawler.overall_crawl_units parameter.
Default: 24

crawler.overall_crawl_units
The units for the crawler.overall_crawl_timeout parameter. A value of hr indicates hours and min indicates minutes.
Default: hr

crawler.request_timeout
Timeout for HTTP page GETs (milliseconds).
Default: 15000

crawler.max_timeout_retries
Maximum number of times to retry after a network timeout.
Default: 0
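
For example, a crawl that should finish within 12 hours and tolerate one retry after a timeout might be limited as follows (values are illustrative only):

  crawler.max_download_size=10
  crawler.max_files_per_server=50000
  crawler.overall_crawl_timeout=12
  crawler.overall_crawl_units=hr
  crawler.request_timeout=15000
  crawler.max_timeout_retries=1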

Authentication options

crawler.allow_concurrent_in_crawl_form_interaction
Enable/disable concurrent processing of in-crawl form interaction.
Default: true

crawler.form_interaction.pre_crawl.groupId.url
Specify the URL of the page containing the HTML web form in pre_crawl authentication mode.

crawler.form_interaction.in_crawl.groupId.url_pattern
Specify a URL or URL pattern of the page containing the HTML web form in in_crawl authentication mode.

crawler.ntlm.domain
NTLM domain to be used for web crawler authentication.

crawler.ntlm.password
NTLM password to be used for web crawler authentication.

crawler.ntlm.username
NTLM username to be used for web crawler authentication.

ftp_passwd
Password to use when gathering content from an FTP server.

ftp_user
Username to use when gathering content from an FTP server.

http_passwd
Password used for accessing password-protected content during a crawl.

http_user
Username used for accessing password-protected content during a crawl.
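
A hedged sketch of form-based and HTTP authentication settings. The group identifier "portal", the URL and the credentials are placeholders; the groupId segment of the form_interaction keys is assumed to be a name chosen per site:

  crawler.form_interaction.pre_crawl.portal.url=https://www.example.com/login
  http_user=crawluser
  http_passwd=secret
  crawler.ntlm.domain=EXAMPLE
  crawler.ntlm.username=crawluser
  crawler.ntlm.password=secret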

Web crawler monitor options

crawler.monitor_authentication_cookie_renewal_interval
Optional time interval at which to renew crawl authentication cookies.

crawler.monitor_checkpoint_interval
Time interval at which to checkpoint (seconds).
Default: 1800

crawler.monitor_delay_type
Type of delay to use during a crawl (dynamic or fixed).
Default: dynamic

crawler.monitor_halt
Checked during a crawl; if set to true, the crawler will shut down cleanly.
Default: false

crawler.monitor_preferred_servers_list
Optional list of servers to prefer during a crawl.

crawler.monitor_time_interval
Time interval at which to output monitoring information (seconds).
Default: 30

crawler.monitor_url_reject_list
Optional parameter listing URLs to reject during a running crawl.
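
For example, to checkpoint every 30 minutes and report progress every 30 seconds (illustrative values):

  crawler.monitor_checkpoint_interval=1800
  crawler.monitor_time_interval=30
  crawler.monitor_delay_type=dynamic
  # Set to true during a running crawl to request a clean shutdown
  crawler.monitor_halt=false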

HTTP options

http_proxy
The hostname (e.g. proxy.company.com) of the HTTP proxy to use during crawling. This hostname should not be prefixed with 'http://'.

http_proxy_passwd
The proxy password to be used during crawling.

http_proxy_port
Port of the HTTP proxy used during crawling.

http_proxy_user
The proxy username to be used during crawling.

http_source_host
IP address or hostname used by the crawler, on a machine with more than one available.

crawler.request_header
Optional additional header to be inserted in HTTP(S) requests made by the web crawler.

crawler.request_header_url_prefix
Optional URL prefix to be applied when processing the crawler.request_header parameter.

crawler.store_headers
If true, write HTTP header information at the top of stored HTML files. The header information is used by the indexer.
Default: true
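
A sketch of crawling through an authenticated proxy; the proxy host, port and credentials are placeholders:

  http_proxy=proxy.example.com
  http_proxy_port=3128
  http_proxy_user=proxyuser
  http_proxy_passwd=proxypass
  crawler.store_headers=true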

Logging options

crawler.verbosity
Verbosity level (0-6) of crawler logs. A higher number results in more messages.
Default: 4

crawler.header_logging
Controls whether HTTP headers are written out to a separate log file.
Default: false

crawler.incremental_logging
Controls whether a list of new and changed URLs should be written to a log file during incremental crawling.
Default: false
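
For example, when debugging a crawl the logging output can be increased (illustrative values; higher verbosity produces much larger logs):

  crawler.verbosity=5
  crawler.header_logging=true
  crawler.incremental_logging=true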

Web crawler advanced options

crawler
The name of the crawler binary.
Default: com.funnelback.crawler.FunnelBack

crawler_binaries
Location of the crawler files.

crawler.accept_cookies
Cookie policy: whether the crawler accepts cookies. Requires HTTPClient if true.
Default: true

crawler.cache.DNSCache_max_size
Maximum size of the internal DNS cache. Upon reaching this size the cache will drop old elements.
Default: 200000

crawler.cache.LRUCache_max_size
Maximum size of the LRUCache. Upon reaching this size the cache will drop old elements.
Default: 500000

crawler.cache.URLCache_max_size
Maximum size of the URLCache. May be ignored by some cache implementations.
Default: 50000000

crawler.check_alias_exists
Check whether aliased URLs exist; if not, revert to the original URL.
Default: false

crawler.classes.Frontier
Java class used for the frontier (the list of URLs not yet visited).
Default: com.funnelback.common.frontier.MultipleRequestsFrontier:com.funnelback.common.frontier.DiskFIFOFrontier:1000

crawler.classes.Policy
Java class used for enforcing the include/exclude policy for URLs.
Default: com.funnelback.crawler.StandardPolicy

crawler.classes.URLStore
Java class used to store content on disk, e.g. to create a mirror of the files crawled.
Default: com.funnelback.common.store.WarcStore

crawler.eliminate_duplicates
Whether to eliminate duplicate documents while crawling.
Default: true

crawler.frontier_num_top_level_dirs
Optional setting to specify the number of top-level directories in which to store disk-based frontier files.

crawler.frontier_use_ip_mapping
Whether to map hosts to frontiers based on IP address.
Default: false

crawler.frontier_hosts
List of hosts running crawlers when performing a distributed web crawl.

crawler.frontier_port
Port on which the DistributedFrontier will listen.

crawler.max_individual_frontier_size
Maximum size of an individual frontier (unlimited if not defined).

crawler.inline_filtering_enabled
Controls whether text extraction from binary files is done inline during a web crawl.
Default: true

crawler.lowercase_iis_urls
Whether to lowercase all URLs from IIS web servers.
Default: false

crawler.predirects_enabled
Enable crawler predirects (boolean). See: crawler predirects.

crawler.remove_parameters
Optional list of parameters to remove from URLs.

crawler.robotAgent
The agent name matched against robots.txt rules. Matching is case-insensitive over the length of the name in a robots.txt file.
Default: Funnelback

crawler.sslClientStore
Path to an SSL client certificate store (absolute or relative). Empty/missing means no client certificate store. Certificate stores can be managed with Java's keytool.

crawler.sslClientStorePassword
Password for the SSL client certificate store. Empty/missing means no password, and may prevent client certificate validation. Certificate stores can be managed with Java's keytool.

crawler.sslTrustEveryone
Trust ALL root certificates and ignore server hostname verification if true. This bypasses all certificate and server validation by the HTTPS library, so every server and certificate is trusted. It can be used to overcome problems with unresolvable external certificate chains and poor certificates for virtual hosts, but will allow server spoofing.
Default: true

crawler.sslTrustStore
Path to an SSL trusted root store (absolute or relative). Empty/missing means use those provided with Java. Certificate stores can be managed with Java's keytool.

crawler.send-http-basic-credentials-without-challenge
Controls whether Funnelback sends HTTP basic credentials along with every request, rather than waiting for an authentication challenge.
Default: true

schedule.incremental_crawl_ratio
The number of scheduled incremental crawls performed between each full crawl (e.g. a value of 10 results in an update schedule where every ten incremental crawls are followed by a full crawl).
Default: 10
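
A hedged example combining a few advanced settings; the parameter names given to crawler.remove_parameters are placeholders and its value format is assumed to be a comma-separated list:

  crawler.eliminate_duplicates=true
  crawler.remove_parameters=PHPSESSID,jsessionid
  # Leave certificate checking enabled unless a broken certificate chain makes it unavoidable
  crawler.sslTrustEveryone=false
  schedule.incremental_crawl_ratio=10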
