crawler.classes.RevisitPolicy

Background

This parameter controls which revisit policy the web crawler uses. A revisit is a network call (an HTTP HEAD and/or GET request) made when processing a URL.

For example, a revisit policy might look at a URL in the URL store and decide that, because the content has not changed the last 5 times it was downloaded, it can be assumed not to have changed this time, so no revisit is performed. Instead, the crawler reuses the copy from the previous crawl and makes no HEAD or GET requests for that URL.

The revisit policy is only used during incremental crawls.

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the crawler.classes.RevisitPolicy key, and set the value. This can be set to any valid String value.

Default value

Revisit every document on every update.

crawler.classes.RevisitPolicy=com.funnelback.common.revisit.AlwaysRevisitPolicy

Examples

crawler.classes.RevisitPolicy=com.funnelback.common.revisit.SimpleRevisitPolicy

This changes the crawler to a revisit policy that behaves as follows:

  1. Initially, everything is crawled.

  2. If a page has not changed after crawler.revisit.num_times_unchanged_threshold crawls, it is not crawled for the next crawler.revisit.num_times_revisit_skipped_threshold crawls.

  3. The URL must then be crawled crawler.revisit.num_times_unchanged_threshold more times without any change before it is skipped again.

  4. A full crawl forces everything to be crawled, but the values recorded for the number of revisits skipped and the number of times unchanged are preserved.
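
The four steps above can be sketched as a small simulation. This is only an illustration of the described behaviour, not Funnelback's implementation: the names UrlState, process_url and simulate are hypothetical, the threshold values 3 and 2 are chosen for the example, and the sketch assumes that every fetch which finds no change (including the first) counts towards the unchanged threshold.

```python
from dataclasses import dataclass


@dataclass
class UrlState:
    """Hypothetical per-URL counters kept between crawls."""
    unchanged_count: int = 0  # consecutive fetches that found no change
    skipped_count: int = 0    # revisits skipped in the current skip window


def process_url(state, page_changed, unchanged_threshold, skipped_threshold,
                full_crawl=False):
    """Return True if the URL is fetched in this crawl, False if it is skipped."""
    if full_crawl:
        # Step 4: always fetch, and leave the recorded counters untouched.
        return True
    if state.unchanged_count >= unchanged_threshold:
        if state.skipped_count < skipped_threshold:
            # Step 2: reuse the stored copy, with no HEAD or GET request.
            state.skipped_count += 1
            return False
        # Step 3: the skip window is used up, so the counters start over.
        state.unchanged_count = 0
        state.skipped_count = 0
    # Steps 1 and 3: fetch the URL and record whether it changed.
    state.unchanged_count = 0 if page_changed else state.unchanged_count + 1
    return True


def simulate(num_crawls, unchanged_threshold, skipped_threshold):
    """Fetch decisions for a page that never changes, over incremental crawls."""
    state = UrlState()
    return [process_url(state, False, unchanged_threshold, skipped_threshold)
            for _ in range(num_crawls)]


print(simulate(10, unchanged_threshold=3, skipped_threshold=2))
# [True, True, True, False, False, True, True, True, False, False]
```

With these example values, an unchanging page is fetched three times, skipped twice, and then fetched three more times before it is skipped again.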