Threshold for edit distance between two versions of a page when deciding whether it has changed or not.

Key: crawler.revisit.edit_distance_threshold
Type: Integer
Can be set in: collection.cfg


This parameter specifies the threshold used when deciding whether the content of a URL has changed compared to a previous version. The edit distance is the minimum number of operations (add, edit, delete) required to transform one string into the other.

If the edit distance is less than this threshold, the page is marked as "unchanged" and this information is fed into the crawler’s revisit policy. Pages that rarely change may be revisited less often, and a stored copy of their content may be used instead.
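
As an illustration of the comparison described above, here is a minimal Python sketch. It assumes a character-level Levenshtein distance; the crawler's actual algorithm and the unit it compares (characters, words, or lines) are not specified here. The names `edit_distance` and `is_unchanged` are hypothetical and are not part of the product.

```python
def edit_distance(old: str, new: str) -> int:
    """Minimum number of single-character adds, edits, and deletes
    needed to turn `old` into `new` (Levenshtein distance)."""
    # previous[j] holds the distance between the processed prefix of
    # `old` and the first j characters of `new`.
    previous = list(range(len(new) + 1))
    for i, ch_old in enumerate(old, start=1):
        current = [i]
        for j, ch_new in enumerate(new, start=1):
            cost = 0 if ch_old == ch_new else 1
            current.append(min(
                previous[j] + 1,         # delete ch_old
                current[j - 1] + 1,      # add ch_new
                previous[j - 1] + cost,  # edit ch_old into ch_new, or keep it
            ))
        previous = current
    return previous[-1]


def is_unchanged(old: str, new: str, threshold: int) -> bool:
    """Hypothetical helper: a page counts as unchanged when the
    distance is strictly less than the configured threshold."""
    return edit_distance(old, new) < threshold
```

Under these assumptions, "kitten" and "sitting" are three operations apart, so `is_unchanged("kitten", "sitting", 4)` is true, while a threshold of 3 would treat the page as changed because the comparison is strict.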

Default Value