Configuring the web crawler

This page is a guide to configuring the Funnelback web crawler. The web crawler gathers pages for indexing by following hypertext links and downloading documents. The data source creation page for a web data source sets the required options used to configure the web crawler, including where to start crawling, how long to crawl for, which domain to stay within and which areas to avoid.

For most purposes the default settings for the other crawler parameters will give good performance. This page is for administrators who have particular performance or crawling requirements.

A full list of all crawler-related parameters (plus default values) is given on the configuration page. Configuration parameters can be modified by editing the data source configuration.

Speeding up web crawling

The default behaviour of the web crawler is to be as polite as possible. This is enforced by requiring that only one crawler thread accesses an individual web server at any one time, which prevents multiple crawler threads from overloading a web server. It is implemented by mapping individual servers to specific crawler threads.

If you have control over the web server(s) being crawled you may decide to relax this constraint, particularly if you know they can handle the load. This can be accomplished by using site_profiles, where you can specify how many parallel requests to use for particular web servers.
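
As a rough sketch, a site profiles entry that allows several parallel requests to a single server might look like the following. The exact file layout and the parameter name max_simultaneous_requests shown here are assumptions and should be confirmed against the site_profiles documentation; www.example.com is a placeholder:

www.example.com max_simultaneous_requests=4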

Two general parameters which can also be tuned for speed are:

crawler.num_crawlers=20
crawler.request_delay=250

Increasing the number of crawlers (threads) will increase throughput, as will decreasing the delay between requests. The latter is specified in milliseconds, with a default delay of one quarter of a second. We do not recommend decreasing this below 100ms.

These parameters should be tuned with care to avoid overloading web servers and/or saturating your network link. If crawling a single large site we recommend starting with a small number of threads (e.g. 4) and working up until acceptable performance is reached. Similarly, when decreasing the request delay, work down gradually until the overall crawl time has been reduced to an acceptable level.
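
For instance, when tuning the crawl of a single large site that you control, a cautious starting point consistent with the guidance above might be the following (adjust and re-measure rather than treating these values as a recommendation):

crawler.num_crawlers=4
crawler.request_delay=100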

Incremental crawling

An incremental crawl updates an existing set of downloaded pages instead of starting the crawl from scratch. The crawler achieves this by comparing the document length provided by the web server (in response to an HTTP HEAD request) with that obtained in the previous crawl. This can reduce network traffic and storage requirements and speed up data source update times.

The ratio of incremental to full updates can be controlled by the schedule.incremental_crawl_ratio configuration option.

This option specifies the number of scheduled incremental crawls performed between each full crawl (e.g. a value of 10 results in an update schedule where every ten incremental crawls are followed by a full crawl). It is only consulted by the update system when no explicit update options are provided by the administrator.
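
For example, the following setting produces the schedule described above (ten incremental crawls followed by one full crawl):

schedule.incremental_crawl_ratio=10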

Revisit policies

Incremental crawls utilise a revisit policy to further refine their behaviour. The crawler uses the revisit policy to decide whether to revisit a site during an incremental crawl, based on how frequently the site's content has been found to change.

Funnelback supports two types of revisit policies:

  • Always revisit policy: This is the default behaviour. Funnelback will always revisit a URL and check the HTTP headers when performing an incremental crawl.

  • Simple revisit policy: Funnelback tracks how frequently a page changes and may decide to skip the URL for the current crawl based on configuration settings.

Crawling critical sites/pages

If you have a requirement to keep your index of a particular website (or sites) as up-to-date as possible, you could create a specific data source for this area. For example, if you have a news site which is regularly updated you could specify that the news data source be crawled at frequent intervals. Similarly, you might have a set of "core" critical pages which must be indexed when new content becomes available.

You could use some of the suggestions in this document on speeding up crawling and limiting crawl size to ensure that the update times and cycles for these critical data sources meet your requirements.

You could then create a separate data source for the rest of your content which may not change as often or where the update requirements are not as stringent. This larger data source could be updated over a longer time period. You would then attach both of these data sources to the same search package.

Alternatively, you could use an instant update. See updating a data source index for more details.

Adding additional file types

By default, the crawler will store and index HTML, PDF, Microsoft Office, RTF and text documents. Funnelback can be configured to store and index additional file types. See: configure Tika to index additional supported file types.

Specify preferred (canonical) server names

For some data sources you may decide you wish to control what server name the crawler uses when storing content. For example, a site may have been renamed from www.old.com to www.new.com, but because so many pages still link to the old name the crawler may store the content under the old name (unless HTTP or HTML redirects have been set up).

A simple text file can be used to specify which name to use e.g. www.new.com=www.old.com. This can also be used to control whether the crawler treats an entire site as a duplicate of another (based on the content of their home page). Details on how to set this up are given in the documentation for the server_alias_file.
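
For instance, a server alias file containing the mapping above, plus a second (hypothetical) mapping for another renamed host, would look like the following, with one mapping per line and the preferred name on the left as in the example above:

www.new.com=www.old.com
docs.example.com=documentation.example.com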

Memory requirements

The process which runs the web crawler will take note of the gather.max_heap_size setting in the data source configuration. This specifies the maximum heap size for the crawler process, in MB. The default is:

gather.max_heap_size=640

This should be suitable for most crawls of fewer than 250,000 URLs. For larger crawls you should expect to increase the heap size to at least 2000MB, subject to the amount of RAM available and what other large jobs might be running on the machine.
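
For example, for a crawl substantially larger than 250,000 URLs on a machine with enough free RAM, you might raise the limit along these lines (size the value to your own environment):

gather.max_heap_size=2000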

Crawling dynamic web sites

In most cases Funnelback will crawl dynamically generated websites by default. However, some sites (e.g. e-commerce sites, product catalogues etc.) may enforce the use of cookies and session IDs. These are normally used to track a human user as they browse through a site.

By default, the Funnelback web crawler is configured to accept cookies via the following configuration options:

crawler.accept_cookies=true
crawler.packages.httplib=HTTPClient

This turns on cookie storage in memory (and allows cookies to be sent back to the server) by using the appropriate HTTP library. Note that even if a site uses cookies, it should still return valid content if a client (e.g. the crawler) does not make use of them.

It is also possible to strip session IDs and other superfluous parameters from URLs during the crawl. This can help reduce the amount of duplicate or near-duplicate content brought back. This is configured using the following optional configuration option (with an example expression):

crawler.remove_parameters=regexp:&style(sheet)?=mediaRelease|&x=\d+

The example above will strip off style and stylesheet parameters, or x=21037536 type parameters (e.g. session IDs). It uses regular expressions (Perl 5 syntax) and the regexp: prefix is required at the start of the expression. Note that this parameter is empty by default.
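
As another illustrative pattern (the parameter name sessionid is an assumption about your site, not a Funnelback default), stripping a simple session ID query parameter would look like:

crawler.remove_parameters=regexp:&sessionid=\w+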

Finally, another parameter which you may wish to modify when crawling dynamic websites is:

crawler.max_files_per_area=10000

This configuration option specifies the maximum number of files the crawler should download from a particular area on a website. You may need to increase this if a lot of your content is served from a single point e.g. example.com/index.asp?page_id=348927. The crawler will stop downloading from an area once it reaches this limit, so you would need to increase the limit to ensure all the content you require is downloaded.
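
If the crawler is hitting this limit for a large dynamically generated area, you could raise it; the value below is purely illustrative and should be sized to the amount of content served from that area:

crawler.max_files_per_area=50000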

Crawling password-protected websites

Crawling sites protected by HTTP Basic authentication or Windows integrated authentication (NTLM) is covered in a separate document on crawling password-protected sites.

Sending custom HTTP request header fields

In some circumstances you may want to send custom HTTP request header fields in the requests that the web crawler makes when contacting a website. For example, you might want to send specific cookie information to allow the crawler to log in to a website that uses cookies to store login information.

The following two parameters allow you to do this:

Form-based authentication

Some websites require a login using an HTML form. If you need to crawl this type of content you can specify how to interact with the forms using form interaction.

Once the forms have been processed the web crawler can use the resulting cookie to authenticate its requests to the site.

Form-based authentication is different from HTTP basic authentication; the latter is described in a separate document on crawling password-protected websites.

Crawling with pre-defined cookies

In some situations you may need to crawl a site using a pre-defined cookie. Further information on this configuration option is available from the cookies.txt page.

Crawling HTTPS websites

This is covered in a separate document: Crawling HTTPS websites.

Crawling SharePoint websites

  • If your SharePoint site is password protected you will need to use Windows Integrated Authentication when crawling - see details on this in the document on crawling password-protected sites.

  • You may need to configure "alternate access mappings" in SharePoint so that it serves content using a fully qualified hostname e.g. http://sharepoint.example.com/ rather than http://sharepoint/. Please see your SharePoint administration manual for details on how to configure these mappings.

Limiting crawl size

In some cases you may wish to limit the amount of data brought back by the crawler. The usual approach would be to specify a time limit for the crawl:

crawler.overall_crawl_timeout=24
crawler.overall_crawl_units=hr

The default timeout is set at 24 hours. If you have a requirement to crawl a site within a certain amount of time (as part of an overall update cycle) you can set this to the desired value. You should give the crawler enough time to download the most important content, which will normally be found early on in the crawl. You can also try speeding up the crawler to meet your time limit.

Another configuration option which can be used to limit crawl size is crawler.max_files_stored. This is the maximum number of files to store on disk (the default is unlimited).

Finally, you can limit the maximum link distance from the start point using crawler.max_link_distance (the default is unlimited). For example, if crawler.max_link_distance=1 the crawler will only fetch the URLs listed as start URLs, which can be used to restrict the crawl to a specific list of URLs.

Note that setting the crawler.max_link_distance configuration option causes the crawler to run with a single crawler thread.
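
As an illustrative summary, a configuration limiting a crawl to 8 hours and at most 100,000 stored documents might look like the following (the values are examples only, to be sized to your own content and update window):

crawler.overall_crawl_timeout=8
crawler.overall_crawl_units=hr
crawler.max_files_stored=100000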

Redirects

The crawler stores information about redirects in a file called redirects.txt in the data source’s log directory. This records information on HTTP redirects, HTML meta-refresh directives, duplicates, canonical link directives etc.

This information is then processed by the indexer and used in ranking e.g. ensuring that anchor text is associated with the correct redirect target etc.

Changing configuration parameters during a running crawl

This feature is not available in the Squiz DXP.

The web crawler monitor options provide a number of settings that can be dynamically adjusted while a crawl is running.

See also