crawler.classes.RevisitPolicy
Background
This parameter controls which revisit policy the web crawler uses. A revisit means making a network call (an HTTP HEAD and/or GET request) when processing a URL.
For example, a revisit policy might look up a URL in the URL store, see that the content was unchanged the last 5 times it was downloaded, and assume that it is unchanged this time too. In that case no revisit is performed: the crawler reuses the copy from the previous crawl and makes no HEAD or GET requests for that URL.
The revisit policy is only used during incremental crawls.
Setting the key
Set this configuration key in the search package or data source configuration.
Use the configuration key editor to add or edit the crawler.classes.RevisitPolicy key, and set the value. This can be set to any valid String value.
Examples
crawler.classes.RevisitPolicy=com.funnelback.common.revisit.SimpleRevisitPolicy
This switches the crawler to a revisit policy that behaves as follows:
- Initially, everything is crawled.
- If after crawler.revisit.num_times_unchanged_threshold crawls, the page has never changed, then the page will not be crawled for the next crawler.revisit.num_times_revisit_skipped_threshold crawls.
- The URL will then have to be crawled crawler.revisit.num_times_unchanged_threshold times again without any changes before it will be skipped again.
- A full crawl will force everything to be crawled, but the values recorded for revisits skipped and num_times_unchanged will be preserved.
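The rules listed for this policy can be illustrated with a small simulation. This is only a sketch of the described behaviour, not the actual SimpleRevisitPolicy implementation: the names UrlRecord, SimpleRevisitSketch and process are hypothetical, the two constructor arguments stand in for crawler.revisit.num_times_unchanged_threshold and crawler.revisit.num_times_revisit_skipped_threshold, and the handling of a changed page (resetting both counters) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    """Hypothetical per-URL counters, standing in for what the URL store records."""
    times_unchanged: int = 0  # consecutive crawls where the page did not change
    times_skipped: int = 0    # consecutive crawls where the revisit was skipped


class SimpleRevisitSketch:
    """Illustrative model of the rules above; not the real policy class."""

    def __init__(self, unchanged_threshold: int, skipped_threshold: int):
        self.unchanged_threshold = unchanged_threshold
        self.skipped_threshold = skipped_threshold

    def process(self, rec: UrlRecord, changed: bool, full_crawl: bool = False) -> bool:
        """Return True if the URL is fetched in this crawl, False if it is skipped."""
        if full_crawl:
            # A full crawl fetches everything and leaves the counters as they are.
            return True

        seen_unchanged_enough = rec.times_unchanged >= self.unchanged_threshold
        if seen_unchanged_enough and rec.times_skipped < self.skipped_threshold:
            # Assume the page is still unchanged and reuse the previous copy.
            rec.times_skipped += 1
            return False

        if seen_unchanged_enough:
            # All allowed skips are used up: the URL must earn its skips again.
            rec.times_unchanged = 0
            rec.times_skipped = 0

        # The URL is fetched; update the unchanged counter from the result.
        if changed:
            rec.times_unchanged = 0  # assumption: any change resets the count
        else:
            rec.times_unchanged += 1
        return True


policy = SimpleRevisitSketch(unchanged_threshold=2, skipped_threshold=3)
record = UrlRecord()
# Eight incremental crawls of a page whose content never changes.
fetched = [policy.process(record, changed=False) for _ in range(8)]
print(fetched)
# [True, True, False, False, False, True, True, False]
```

With thresholds of 2 and 3, the page is fetched twice, skipped three times, and then has to be fetched twice more without changes before it is skipped again.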