Gather optimization - general advice

Background

This article provides general advice on optimizing the gather process in Funnelback.

General data source configuration and gatherer settings

A number of features within Funnelback allow performance to be tuned across crawling, indexing and query processing.

Crawling/gathering

Crawl settings can be used to decrease the time required to download content from a target repository. Note that crawl times will still be limited by the maximum response rate of the targeted repository (e.g. a TRIM repository that can only process one record per second will cap the gather rate regardless of crawler settings).

The available settings vary depending on the gatherer type.

Analyze the gatherer logs

Analysis of the gather logs can provide useful information on optimizations that can be implemented:

  • Exclude items that are not required for the search index, or that are generating errors in the logs (see the configuration sketch after this list).

  • Look out for crawler traps and add appropriate exclusions.

  • Adjust the maximum allowed download size to an appropriate value (the default is 3 MB).

  • For web data sources, lots of repeated URLs returning 404 errors usually indicate a broken link in the site template. Fixing the template resolves the errors at the source; if you are unable to correct the template, add an exclude pattern instead.
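
As a minimal sketch, this kind of log-driven tuning is applied through the data source configuration. The key names below are based on common Funnelback web crawler options; confirm the exact names, units and defaults against the documentation for your Funnelback version. The patterns and sizes shown are illustrative only.

    # Data source configuration sketch - illustrative values only.
    # Skip areas that are not wanted in the index or that generate
    # repeated errors in the crawl logs.
    exclude_patterns=/print/,/calendar/,logout.php

    # Maximum download size in MB; raise it if important documents are
    # being truncated, lower it to avoid storing very large files.
    crawler.max_download_size=10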

Multithreaded crawling

When crawling websites, the Funnelback crawler will by default only submit one request at a time per server (e.g. it waits for the first request to site A to complete before requesting another page from site A). If the web server has additional capacity, the crawler can be configured to send multiple simultaneous requests to that server (e.g. request two pages at a time from site A). Site profiles (site_profiles.cfg) allow this to be controlled on a per-host basis.
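
As a hedged sketch, overall crawler parallelism is raised in the data source configuration, while per-host limits live in site_profiles.cfg. The key name below is an assumption based on common Funnelback crawler options and should be checked against the documentation for your version.

    # Data source configuration sketch - confirm the key name for your version.
    # Overall number of crawler threads used during the crawl (assumed key).
    crawler.num_crawlers=20

    # Per-host parallelism (e.g. allowing two simultaneous requests to a
    # server with spare capacity) is configured in site_profiles.cfg;
    # see that file's documentation for the exact syntax.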

TRIM gathering has similar settings: during peak times the gatherer runs single-threaded, but it can be configured to use multiple threads during off-peak times (or all of the time).

Incremental updates

When first updating a web data source, Funnelback will download all files matching its include patterns. For subsequent updates, however, the crawler can issue HEAD requests to the target web server, which return only the response headers (including the document size in bytes). If the size is unchanged since the file was last downloaded, the crawler skips the full GET request.

Other data source types employ similar techniques. For example, TRIM data sources only gather items that have changed since a defined time, and filecopy data sources copy content from the local cache if it has not changed since the last update.

Web data sources also provide the option to run a refresh crawl (which crawls whatever it can find in a given time, on top of the existing data) and to apply revisit policies, which control how often the crawler checks whether a page has changed.
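
As a sketch of the related settings, the keys below cover the incremental crawl ratio and revisit thresholds. They are assumptions based on common Funnelback web crawler options; verify the exact names, units and defaults against the documentation for your version.

    # Data source configuration sketch - illustrative values, assumed key names.
    # Number of incremental/refresh crawls to run between full crawls.
    schedule.incremental_crawl_ratio=5

    # Revisit policy tuning: pages that were unchanged on several successive
    # visits can be revisited less often on later crawls.
    crawler.revisit.num_times_unchanged_threshold=5
    crawler.revisit.num_times_revisit_skipped_threshold=3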

Disable the accessibility and content auditor reports

Accessibility and content auditing require a significant amount of document processing during the crawl, as each document must be parsed and analyzed. If you are not interested in these reports, they can be disabled, which removes the need for this extra document filtering.

Request throughput

Web data sources only. When requesting content from a website, the Funnelback crawler waits between successive requests. The crawler can be configured to wait for longer or shorter periods, to treat the delay as static or dynamic, and to increase or decrease request timeouts.

In addition, the crawler can be configured to retry requests to pages that timed out. This may increase the likelihood of downloading certain pages (e.g. CMS pages that take a long time to generate), at the cost of increasing the overall update time. Site profiles can be used to define this on a per-host basis.
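
As a sketch, the relevant controls look like the following in the data source configuration. The key names and millisecond units are assumptions based on common Funnelback crawler settings and should be confirmed for your version.

    # Data source configuration sketch - illustrative values, assumed key names.
    # Delay between successive requests to the same server, in milliseconds;
    # lower values speed up the crawl but increase load on the target site.
    crawler.request_delay=250

    # How long to wait for a response before timing out, in milliseconds.
    crawler.request_timeout=30000

    # Number of times to retry a request that timed out.
    crawler.max_timeout_retries=2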