This parameter controls which revisit policy the web crawler uses. A revisit is a network call (an HTTP HEAD and/or GET request) made when processing a URL.
For example, a revisit policy might look at a URL in the URL store and decide that, because the document was unchanged the last 5 times it was downloaded, it can be assumed to be unchanged this time and does not need a revisit. The crawler then uses the copy from the previous crawl and avoids any HEAD or GET requests for that URL.
The revisit policy is only used during incremental crawls.
Revisit every document on every update.
Use a revisit policy that implements the following:
Initially, everything is crawled.
If, after crawler.revisit.num_times_unchanged_threshold crawls, the page has never changed, then the page will not be crawled for a set number of subsequent crawls.
The URL will then have to be crawled crawler.revisit.num_times_unchanged_threshold times again without any changes before it will be skipped again.
A full crawl will force everything to be crawled, but the values recorded for revisits skipped and num_times_unchanged will be preserved.
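The behaviour described above can be illustrated with a minimal sketch. This is not the crawler's actual implementation: the `UrlRecord` class, the `process_url` function and the `skip_limit` argument are hypothetical names introduced for illustration, and resetting both counters once the skip allowance is used up is an assumption about how the "crawled again without any changes" rule works. Only `num_times_unchanged` and the idea of a skipped-revisit counter come from the text.

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    """Per-URL counters kept between crawls (hypothetical structure)."""
    num_times_unchanged: int = 0
    revisits_skipped: int = 0


def process_url(record, page_changed, unchanged_threshold, skip_limit,
                full_crawl=False):
    """Return True if the URL is fetched over the network in this crawl.

    unchanged_threshold plays the role of
    crawler.revisit.num_times_unchanged_threshold; skip_limit is a
    hypothetical stand-in for the number of crawls to skip.
    """
    if full_crawl:
        # A full crawl fetches everything and leaves the counters as they are.
        return True

    if record.num_times_unchanged >= unchanged_threshold:
        if record.revisits_skipped < skip_limit:
            # Reuse the stored copy: no HEAD or GET request for this URL.
            record.revisits_skipped += 1
            return False
        # Assumption: once the skips are used up, start counting again.
        record.num_times_unchanged = 0
        record.revisits_skipped = 0

    # The URL is revisited; update the unchanged counter from the result.
    if page_changed:
        record.num_times_unchanged = 0
    else:
        record.num_times_unchanged += 1
    return True


# A page that never changes, with a threshold of 2 and 3 skipped crawls.
rec = UrlRecord()
fetched = [process_url(rec, False, 2, 3) for _ in range(8)]
print(fetched)  # [True, True, False, False, False, True, True, False]
```

In this sketch the page is fetched twice, skipped three times, then fetched twice more before it is skipped again, which matches the sequence described above.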