crawler.revisit.edit_distance_threshold

Background

This parameter sets the threshold used to decide whether the content of a URL has changed since the previously stored version. The edit distance is the minimum number of operations (insertions, substitutions, deletions) required to transform one string into the other.

If the edit distance between the new and previous content is less than this threshold, the page is marked as "unchanged", and this information is fed into the crawler’s revisit policy. Pages that rarely change may be revisited less often, and a stored copy of their content may be used instead.
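
The comparison described above can be illustrated with a minimal sketch. It assumes a character-level Levenshtein distance; the crawler's actual algorithm and the units it measures are not specified here, and the function names edit_distance and is_unchanged are hypothetical, chosen only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (an assumption for this sketch)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def is_unchanged(old: str, new: str, threshold: int = 20) -> bool:
    """Hypothetical helper: True when the distance is strictly below the threshold."""
    return edit_distance(old, new) < threshold
```

With the default threshold of 20, two versions of a page that differ by only a few characters would count as unchanged in this sketch, while a distance equal to or greater than the threshold would count as changed.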

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the crawler.revisit.edit_distance_threshold key and set its value. The value can be any valid integer. A higher value means that larger differences between versions are still treated as unchanged.

Default value

crawler.revisit.edit_distance_threshold=20