Gather optimisation - general advice

Background

This article provides general advice on optimising the gather process in Funnelback.

General collection configuration and gatherer settings

The Funnelback application includes a number of features that allow performance to be tuned across crawling, indexing and query processing.

Crawling/gathering

Crawl settings can be used to decrease the time required to download content from a target repository. Note that crawl times will still be limited by the maximum response rate of the targeted repository (e.g. if a TRIM repository can only serve one record per second, the gather cannot proceed any faster than that).

The available settings vary depending on the gatherer type.

Analyse the gatherer logs

Analysis of the gather logs can provide useful information on optimisations that can be implemented:

  • Exclude items that are not required for the search index or that are generating errors in the logs.

  • Look out for crawler traps and add appropriate exclusions.

  • Adjust the maximum allowed download size to an appropriate value (the default is 3MB).

  • For web collections, if the logs contain many repeated URLs returning 404 errors, a broken link in the site template is a likely cause. Fixing the template can resolve a large number of these errors at once; if you are unable to correct the template, add an exclude pattern instead (see the sketch after this list).
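
As an example of this kind of log analysis, the following sketch (Python) counts URLs that repeatedly return 404 in a crawl log, to suggest candidate exclude patterns. The log file name and line format are assumptions; adjust the parsing to match the actual log layout produced by your version of Funnelback.

    # Count repeated 404s in a crawl log to find candidate exclude patterns.
    # Assumed log format: whitespace-separated fields containing an HTTP
    # status code and a URL - adjust the parsing for your actual log layout.
    from collections import Counter

    def count_404s(log_path: str, top_n: int = 20) -> list[tuple[str, int]]:
        counts = Counter()
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                fields = line.split()
                if "404" in fields:
                    # Take the first field that looks like a URL.
                    url = next((f for f in fields if f.startswith("http")), None)
                    if url:
                        counts[url] += 1
        return counts.most_common(top_n)

    if __name__ == "__main__":
        for url, n in count_404s("crawl.log"):  # hypothetical log file name
            print(f"{n:6d}  {url}")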

Multithreaded crawling

When crawling web sites, the Funnelback crawler will by default only submit one request at a time per server (e.g. it waits for the first request to site A to complete before requesting another page from site A). If the web server has additional capacity, the crawler can be configured to send multiple simultaneous requests to the same server (e.g. request two pages at a time from site A). Site profiles (site_profiles.cfg) allow this to be controlled on a per-host basis.
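
The sketch below (Python) illustrates the general idea of a per-host request limit; it is not Funnelback's crawler code, and the limit of two simultaneous requests per host is just an example value.

    # Illustration of per-host request limits: each host gets its own
    # semaphore, so at most MAX_PER_HOST requests run against it at once,
    # while different hosts are still fetched in parallel.
    import asyncio
    from urllib.parse import urlparse

    MAX_PER_HOST = 2  # example value - the right number depends on the server

    host_limits: dict[str, asyncio.Semaphore] = {}

    async def fetch(url: str) -> None:
        host = urlparse(url).netloc
        limit = host_limits.setdefault(host, asyncio.Semaphore(MAX_PER_HOST))
        async with limit:
            # A real crawler would issue the HTTP request here.
            print(f"fetching {url} (host {host})")
            await asyncio.sleep(0.1)  # stand-in for network time

    async def crawl(urls: list[str]) -> None:
        await asyncio.gather(*(fetch(u) for u in urls))

    asyncio.run(crawl([
        "http://site-a.example/page1",
        "http://site-a.example/page2",
        "http://site-b.example/page1",
    ]))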

TRIM gathering has similar settings: during peak times the gatherer runs single-threaded, but it can be configured to use multiple threads during off-peak times (or all the time).

Incremental updates

When first updating a web collection, Funnelback downloads all files matching its include patterns. For subsequent updates, however, the crawler can issue HEAD requests to the target web server, retrieving only the HTTP headers (including the document's size in bytes) rather than the document itself. If the reported size matches the size recorded when the file was last downloaded, the crawler skips the full GET request.
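
The following sketch (Python) shows the general idea of this change check; it is not Funnelback's implementation, and the previously recorded size is assumed to come from whatever store the last update produced. Servers that omit or vary the Content-Length header would need a different check.

    # Skip the full download when the Content-Length reported by a HEAD
    # request matches the size recorded on the previous update.
    import requests

    def needs_download(url: str, previous_size: int | None) -> bool:
        """Return True if the document should be fetched with a full GET."""
        response = requests.head(url, allow_redirects=True, timeout=30)
        reported = response.headers.get("Content-Length")
        if previous_size is None or reported is None:
            return True  # no basis for comparison - download it
        return int(reported) != previous_size

    # Example: previous_size would come from the store built by the last update.
    if needs_download("http://www.example.com/page.html", previous_size=14523):
        document = requests.get("http://www.example.com/page.html", timeout=30)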

Other collection types employ similar techniques: TRIM collections only gather items that have changed since a defined time, and filecopy collections copy content from the local cache if it hasn't changed since the last update.

Web collections also provide the option to run a refresh crawl (which crawls whatever it can reach in a given time, over the top of the existing data) and to apply revisit policies, which control how often the crawler checks whether a page has changed.
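
As an illustration only (not Funnelback's actual revisit algorithm), an adaptive revisit policy might check pages less and less often as they keep coming back unchanged:

    # Illustrative adaptive revisit policy: pages that are repeatedly found
    # unchanged are rechecked less often, up to a maximum interval.
    from datetime import timedelta

    def next_revisit_interval(base: timedelta,
                              times_unchanged: int,
                              maximum: timedelta) -> timedelta:
        """Double the interval for each consecutive unchanged visit."""
        return min(base * (2 ** times_unchanged), maximum)

    # A page unchanged on 3 consecutive visits, with a 1-day base interval,
    # would next be checked after 8 days (capped at the 30-day maximum).
    print(next_revisit_interval(timedelta(days=1), 3, timedelta(days=30)))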

Disable the accessibility and content auditor reports

Accessibility and content auditing requires a significant amount of document processing during the crawl, as each document must be parsed and analysed.

Disabling these reports means that there is no need to perform this extra document filtering.

Request throughput

Web collections only. When requesting content from a web site, the Funnelback crawler waits between successive requests. The crawler can be configured to wait for longer or shorter periods, to treat the delay as fixed or dynamically adjusted, and to increase or decrease timeout periods.

In addition, the crawler can be configured to retry requests that time out. This may increase the likelihood of downloading certain pages (e.g. CMS pages that take a long time to generate), at the cost of increasing the overall update time. Site profiles can be used to define this on a per-host basis.
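
To make the trade-off concrete, here is a rough sketch (Python) of a request loop with a fixed inter-request delay and a retry for requests that time out. The delay, timeout and retry values are example figures only, not Funnelback defaults.

    # Throttled fetch loop with a retry for requests that time out.
    # The delay, timeout and retry count are example values only.
    import time
    import requests

    REQUEST_DELAY = 0.25   # seconds to wait between requests to the same site
    TIMEOUT = 30           # seconds before a request is considered timed out
    MAX_RETRIES = 2        # extra attempts for slow pages (e.g. slow CMS pages)

    def fetch_all(urls: list[str]) -> dict[str, bytes]:
        results: dict[str, bytes] = {}
        for url in urls:
            for attempt in range(1 + MAX_RETRIES):
                try:
                    results[url] = requests.get(url, timeout=TIMEOUT).content
                    break
                except requests.Timeout:
                    # Retrying raises the chance of getting slow pages,
                    # at the cost of a longer overall update.
                    continue
            time.sleep(REQUEST_DELAY)  # politeness delay before the next URL
        return results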

WARC format

Funnelback historically stored downloaded content in a folder structure that mirrored the data source. This resulted in a file on the filesystem for each document that was downloaded.

Recent versions of Funnelback are configured to use WARC storage, which stores all of the downloaded content within a single compressed file.

At indexing time, the Funnelback indexer only has to open one file rather than many, decreasing the time required. WARC storage also reduces disk space requirements, and copying the stored data is much more efficient because a single large file is transferred instead of numerous small files.
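
For comparison, reading content back out of WARC storage only requires opening that single file. The sketch below uses the third-party warcio library to iterate over the records in a WARC file; it illustrates the WARC format in general rather than Funnelback's own storage code, and the file name is hypothetical.

    # Iterate over the documents stored in a single (gzip-compressed) WARC
    # file, using the third-party warcio library (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator

    with open("funnelback.warc.gz", "rb") as stream:   # hypothetical file name
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                print(url, len(body), "bytes")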