Webcrawler

Introduction

This page is a guide to configuring the Funnelback web crawler. The web crawler gathers pages for indexing by following hypertext links and downloading documents. The main collection creation page for a web collection describes the standard parameters used to configure the web crawler, including its start point, how long to crawl for, which domain to stay within and which areas to avoid.

For most purposes the default settings for the other crawler parameters will give good performance. This page is for administrators who have particular performance or crawling requirements.

A full list of all crawler related parameters (plus default values) is given on the configuration page. All of the configuration parameters mentioned can be modified by editing the collection.cfg file for a collection in the Administration interface.

Speeding Up Web Crawling

The default behaviour of the web crawler is to be as polite as possible. This is enforced by ensuring that only one crawler thread accesses an individual web server at any one time, which prevents multiple crawler threads from overloading a web server. It is implemented by mapping individual servers to specific crawler threads.

If you have control over the web server(s) being crawled you may decide to relax this constraint, particularly if you know they can handle the load. This can be accomplished by using a site_profiles.cfg file, where you can specify how many parallel requests to use for particular web servers.

Two general parameters which can also be tuned for speed are:

crawler.num_crawlers=20
crawler.request_delay=250

Increasing the number of crawlers (threads) will increase throughput, as will decreasing the delay between requests. The latter is specified in milliseconds, with a default delay of one quarter of a second. We do not recommend decreasing this below 100ms.

WARNING: These parameters should be tuned with care to avoid overloading web servers and/or saturating your network link. If crawling a single large site we recommend starting with a small number of threads (e.g. 4) and working up until acceptable performance is reached. Similarly, work the request delay down until the overall crawl time has been satisfactorily reduced.
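
As a starting point, the conservative approach described above could be expressed in collection.cfg as follows (the values are illustrative only and should be adjusted based on measured crawl times and server load):

crawler.num_crawlers=4
crawler.request_delay=250

From here, increase the number of crawlers and/or decrease the request delay (but not below 100ms) until the crawl completes within your target window.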

Incremental Crawling

An incremental crawl updates an existing set of downloaded pages instead of starting the crawl from scratch. The crawler achieves this by comparing the document length provided by the web server (in response to an HTTP HEAD request) with that obtained in the previous crawl. This can reduce network traffic and storage requirements and speed up collection update times.

The ratio of incremental to full updates can be controlled by the following collection.cfg parameter:

schedule.incremental_crawl_ratio
The number of scheduled incremental crawls performed between each full crawl (e.g. a value of '10' results in an update schedule of ten incremental crawls followed by one full crawl).

This parameter is only referenced by the update system when no explicit update options are provided by the administrator.
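
For example, the schedule described above (ten incremental crawls between each full crawl) would be set in collection.cfg as:

schedule.incremental_crawl_ratio=10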

An additional configuration parameter used in incremental crawling is the crawler.secondary_store_root setting. The web crawler checks the secondary store specified by this parameter and avoids re-downloading content from the web which hasn't changed. When a web collection is created the Funnelback administration interface inserts the correct location for this parameter, so it will not normally need to be edited manually.

Crawling Critical Sites/Pages

If you have a requirement to keep your index of a particular web site (or sites) as up-to-date as possible, you could create a specific collection for this area. For example, if you have a news site which is regularly updated you could specify that the news collection be crawled at frequent intervals. Similarly, you might have a set of "core" critical pages which must be indexed when new content becomes available.

You could use some of the suggestions in this document on speeding up crawling and limiting crawl size to ensure that the update times and cycles for these critical collections meet your requirements.

You could then create a separate collection for the rest of your content which may not change as often or where the update requirements are not as stringent. This larger collection could be updated over a longer time period. By using a meta collection you can then combine these collections so that users can search all available information.

Alternatively, you could use an instant update. See Updating Collections for more details.

Specify Preferred Server Names

For some collections you may wish to control which server name the crawler uses when storing content. For example, a site may have been renamed from www.old.com to www.new.com, but because so many pages still link to the old name the crawler may store the content under the old name (unless HTTP or HTML redirects have been set up).

A simple text file can be used to specify which name to use e.g. www.new.com=www.old.com. This can also be used to control whether the crawler treats an entire site as a duplicate of another (based on the content of their home page). Details on how to set this up are given in the documentation for the crawler.server_alias_file parameter.
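
As a sketch, a server alias file based on the example above would contain one mapping per line (assuming, as in the example, that the preferred name appears on the left-hand side):

www.new.com=www.old.com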

Crawling Dynamic Web Sites

In most cases Funnelback will crawl dynamically generated web sites by default. However, some sites (e.g. e-commerce sites, product catalogues etc.) may enforce the use of cookies and session IDs. These are normally used to track a human user as they browse through a site.

The Funnelback web crawler is configured to accept cookies by default, by having the following parameters set:

   * crawler.accept_cookies=true
   * crawler.packages.httplib=HTTPClient

This turns on cookie storage in memory (and allows cookies to be sent back to the server), by using the appropriate HTTP library. Note that even if a site uses cookies it should still return valid content if a client (e.g. the crawler) does not make use of them.

It is also possible to strip session IDs and other superfluous parameters from URLs during the crawl. This can help reduce the amount of duplicate or near-duplicate content brought back. This is configured using the following optional parameter (with an example expression):

   * crawler.remove_parameters=regexp:&style(sheet)?=mediaRelease|&x=\d+

The example above strips off style and stylesheet parameters, as well as parameters like x=21037536 (e.g. session IDs). It uses regular expressions (Perl 5 syntax) and the regexp: flag is required at the start of the expression. Note that this parameter is normally empty by default.
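
As an illustration (using a hypothetical URL), the example expression above would strip the matching parameters from a crawled URL:

Before: http://www.example.com/page?id=5&stylesheet=mediaRelease&x=21037536
After:  http://www.example.com/page?id=5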

Finally, the last parameter which you may wish to modify when crawling dynamic web sites is:

   * crawler.max_files_per_area=10000

This parameter specifies the maximum number of files the crawler should download from a particular area on a web site; once the limit is reached for that area the crawler stops downloading from it. You may need to increase the limit if a lot of your content is served from a single point, e.g. site.com/index.asp?page_id=348927, to ensure all the content you require is downloaded.
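
For example, if a single script serves tens of thousands of pages you might raise the limit in collection.cfg (the value below is illustrative only):

crawler.max_files_per_area=50000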

Crawling Password Protected Websites

Crawling sites protected by HTTP Basic authentication or Windows Integrated authentication (NTLM) is covered in a separate document on crawling password protected sites.

Sending Custom HTTP Request Header Fields

In some circumstances you may want to send custom HTTP request header fields in the requests that the web crawler makes when contacting a web site. For example, you might want to send specific cookie information to allow the crawler to "log in" to a web site that uses cookies to store login information.

The following two parameters allow you to do this:

crawler.request_header
Optional additional header to be inserted in HTTP(S) requests made by the webcrawler.
crawler.request_header_url_prefix
Optional URL prefix to be applied when processing the crawler.request_header parameter.
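
As a hypothetical sketch only (the cookie name, value and URL below are made up, and the exact value formats accepted by these parameters should be checked against their documentation), sending a login cookie to a single site might look like:

crawler.request_header=Cookie: sessionid=abc123
crawler.request_header_url_prefix=http://intranet.example.com/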

Form-Based Authentication

Some websites require a login using an HTML form. If you need to crawl this type of content you can specify how to interact with the forms using a crawler.form_interaction_file.

Once the forms have been processed the webcrawler can use the resulting cookie to authenticate its requests to the site.

Note: Form-based authentication is different from HTTP basic authentication. Details on how to interact with this are described in a separate document on crawling password protected websites.

Crawling With Pre-Defined Cookies

In some situations you may need to crawl a site using a pre-defined cookie. This can be accomplished by going to the file manager and creating a "cookies.txt" file, which should be in the Netscape cookie file format, one cookie per line. Sample content from this file, with a comment describing the fields, is as follows:

# Domain          Tailmatch  Path  Secure  Expires  Name  Value
www.example.com   TRUE       /     FALSE   0        id    1234

So this line defines a cookie for the domain www.example.com with no expiry date and a name=value setting of id=1234. The contents of this file might have been created by running something like curl -k -c cookies.txt http://www.example.com/, which would populate the file with cookies received when requesting the given URL. This command could be run as a pre_gather_command in order to have the file automatically generated before the crawl starts.
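
As a sketch, the curl command above could be configured as a pre_gather command so that the cookies file is regenerated before each crawl starts (the output path shown is an assumption and would need to point at wherever the crawler expects to find cookies.txt for the collection):

pre_gather_command=curl -k -c cookies.txt http://www.example.com/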

To ensure that the cookies are read and used during the crawl the following collection.cfg default settings should be enabled:

   * crawler.accept_cookies=true
   * crawler.packages.httplib=HTTPClient

If the cookies.txt file is present it will be read at crawl start-up and any cookies parsed will then be used during the crawl. Any messages relating to errors parsing the cookies.txt file will be in the main crawl.log file.

Crawling HTTPS Websites

This is covered in a separate document on HTTPS support.

Crawling SharePoint Websites

  • If your SharePoint site is password protected you will need to use Windows Integrated Authentication when crawling - see details on this in the document on crawling password protected sites.
  • You may need to configure "alternate access mappings" in SharePoint so that it uses a fully qualified hostname when serving content e.g. serving content using http://sharepoint.example.com/ rather than http://sharepoint/. Please see your SharePoint administration manual for details on how to configure these mappings.

Limiting Crawl Size

In some cases you may wish to limit the amount of data brought back by the crawler. The usual approach would be to specify a time limit for the crawl:

   * crawler.overall_crawl_timeout=24
   * crawler.overall_crawl_units=hr

The default timeout is set at 24 hours. If you have a requirement to crawl a site within a certain amount of time (as part of an overall update cycle) you can set this to the desired value. You should give the crawler enough time to download the most important content, which will normally be found early on in the crawl. You can also try speeding up the crawler to meet your time limit.

Another parameter which can be used to limit crawl size is:

   * crawler.max_pages_stored=

This is the maximum number of files to store on disk (default is unlimited). Finally, you can specify the maximum link distance from the start point (default is unlimited):

   * crawler.max_link_distance=

For example, if max_link_distance is set to 1 the crawler will only follow the links found on the start URL. This could be used to restrict the crawl to a specific list of URLs generated by some other process (e.g. by a pre_gather command).

WARNING: Turning the max_link_distance parameter on drops the crawler down to single-threaded operation.
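
For example, to restrict the crawl to just the pages linked from the start URL (as described above), set the following in collection.cfg, bearing in mind the single-threaded warning above:

crawler.max_link_distance=1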
