crawler.classes.RevisitPolicy
Specifies the Java class used for enforcing the revisit policy for URLs.
Key: crawler.classes.RevisitPolicy
Type: String
Can be set in: collection.cfg
Description
This parameter controls what revisit policy the web crawler uses, where revisit means using a network call (HTTP HEAD and/or GET request) when processing a URL.
For example, a revisit policy might look at a URL's record in the URL store and decide that, because the document was unchanged the last 5 times it was downloaded, it can be assumed to be unchanged this time as well. In that case the crawler does not perform a revisit: it reuses the copy from the previous crawl and avoids any HEAD or GET requests for that URL.
The revisit policy is only used during incremental crawls.
Default Value
Revisit every document on every update.
crawler.classes.RevisitPolicy=com.funnelback.common.revisit.AlwaysRevisitPolicy
Examples
crawler.classes.RevisitPolicy=com.funnelback.common.revisit.SimpleRevisitPolicy
Change to use a revisit policy which implements the following:
- Initially, everything is crawled.
- If, after crawler.revisit.num_times_unchanged_threshold crawls, the page has never changed, then the page will not be crawled for the next crawler.revisit.num_times_revisit_skipped_threshold crawls.
- The URL will then have to be crawled crawler.revisit.num_times_unchanged_threshold times again without any changes before it will be skipped again.
- A full crawl will force everything to be crawled, but the values recorded for revisits skipped and num_times_unchanged will be preserved.
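The rules above can be read as a small per-URL state machine. The following Java sketch illustrates that reading only; it is not the source of com.funnelback.common.revisit.SimpleRevisitPolicy. The class name RevisitSketch, its fields, and its methods are hypothetical. It also assumes that a detected change resets the unchanged counter, that both counters reset once the skip window is used up, and that a full crawl leaves the counters untouched.

```java
// Hypothetical illustration of the skip/revisit cycle described above.
// None of these names come from the Funnelback API.
final class RevisitSketch {
    // Per-URL counters, as they might be kept in the URL store (assumed).
    int timesUnchanged = 0;
    int timesSkipped = 0;

    // Stand-in for crawler.revisit.num_times_unchanged_threshold
    private final int unchangedThreshold;
    // Stand-in for crawler.revisit.num_times_revisit_skipped_threshold
    private final int skippedThreshold;

    RevisitSketch(int unchangedThreshold, int skippedThreshold) {
        this.unchangedThreshold = unchangedThreshold;
        this.skippedThreshold = skippedThreshold;
    }

    /** Returns true if the URL should be fetched over the network in this crawl. */
    boolean shouldRevisit(boolean fullCrawl) {
        if (fullCrawl) {
            // A full crawl fetches everything; the counters are left as they are.
            return true;
        }
        if (timesUnchanged >= unchangedThreshold) {
            if (timesSkipped < skippedThreshold) {
                // Still inside the skip window: reuse the stored copy.
                timesSkipped++;
                return false;
            }
            // Skip window used up: the URL must prove itself unchanged again.
            timesUnchanged = 0;
            timesSkipped = 0;
        }
        return true;
    }

    /** Records the outcome of an incremental fetch (assumption: a change resets the count). */
    void recordFetch(boolean changed) {
        timesUnchanged = changed ? 0 : timesUnchanged + 1;
    }
}
```

With thresholds of 2 and 3, a page that never changes would be fetched twice, skipped three times, fetched twice again, and so on, provided recordFetch is called after each incremental fetch.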