crawler.classes.RevisitPolicy

Specifies the Java class used for enforcing the revisit policy for URLs.

Key: crawler.classes.RevisitPolicy
Type: String
Can be set in: collection.cfg

Description

This parameter controls which revisit policy the web crawler uses. A revisit is a network call (an HTTP HEAD and/or GET request) made when processing a URL.

For example, a revisit policy might look at a URL in the URL store and see that it has not changed in the last 5 times it was downloaded. The policy can then assume that the URL has not changed this time either and skip the revisit. In that case the crawler uses the copy from the previous crawl and makes no HEAD or GET requests for that URL.

The revisit policy is only used during incremental crawls.

Default Value

Revisit every document on every update.

crawler.classes.RevisitPolicy=com.funnelback.common.revisit.AlwaysRevisitPolicy

Examples

crawler.classes.RevisitPolicy=com.funnelback.common.revisit.SimpleRevisitPolicy

Switch to a revisit policy that behaves as follows:

  1. Initially, everything is crawled.

  2. If a page has not changed in any of its first crawler.revisit.num_times_unchanged_threshold crawls, it will not be crawled for the next crawler.revisit.num_times_revisit_skipped_threshold crawls.

  3. The URL will then have to be crawled crawler.revisit.num_times_unchanged_threshold times again without any changes before it will be skipped again.

  4. A full crawl will force everything to be crawled, but the values recorded for revisits skipped and num_times_unchanged will be preserved.
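
The four steps above can be read as a small per-URL state machine. The following Java sketch is an illustration only, not the source of com.funnelback.common.revisit.SimpleRevisitPolicy: the class RevisitSketch, its methods, and the UrlState fields are hypothetical names, and the thresholds 3 and 2 used in main are arbitrary example values. It also assumes that a full crawl leaves both counters untouched, which is one possible reading of step 4.

```java
public class RevisitSketch {
    /** Per-URL counters. Field names are illustrative, not Funnelback's. */
    static final class UrlState {
        int timesUnchanged; // consecutive revisits that found no change
        int timesSkipped;   // revisits skipped in the current skip window
    }

    private final int unchangedThreshold; // stands in for crawler.revisit.num_times_unchanged_threshold
    private final int skippedThreshold;   // stands in for crawler.revisit.num_times_revisit_skipped_threshold

    RevisitSketch(int unchangedThreshold, int skippedThreshold) {
        this.unchangedThreshold = unchangedThreshold;
        this.skippedThreshold = skippedThreshold;
    }

    /** Returns true if the URL should be fetched over the network in this crawl. */
    boolean shouldRevisit(UrlState s, boolean fullCrawl) {
        if (fullCrawl) {
            return true; // step 4: always crawl; counters are left as they are (assumption)
        }
        if (s.timesUnchanged < unchangedThreshold) {
            return true; // steps 1 and 3: not yet unchanged often enough to skip
        }
        if (s.timesSkipped < skippedThreshold) {
            s.timesSkipped++;
            return false; // step 2: reuse the copy from the previous crawl
        }
        // The skip window is used up: start counting unchanged crawls again (step 3).
        s.timesUnchanged = 0;
        s.timesSkipped = 0;
        return true;
    }

    /** Records the outcome of a revisit made during an incremental crawl. */
    void recordResult(UrlState s, boolean changed) {
        s.timesUnchanged = changed ? 0 : s.timesUnchanged + 1;
    }

    /** Simulates incremental crawls of a page that never changes; R = revisit, S = skip. */
    static String simulate(int unchangedThreshold, int skippedThreshold, int crawls) {
        RevisitSketch policy = new RevisitSketch(unchangedThreshold, skippedThreshold);
        UrlState state = new UrlState();
        StringBuilder trace = new StringBuilder();
        for (int i = 0; i < crawls; i++) {
            boolean revisit = policy.shouldRevisit(state, false);
            if (revisit) {
                policy.recordResult(state, false);
            }
            trace.append(revisit ? 'R' : 'S');
        }
        return trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(simulate(3, 2, 10)); // prints RRRSSRRRSS
    }
}
```

With these example thresholds, a page that never changes is revisited three times, skipped twice, and then has to be revisited three more times before it is skipped again.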