crawler.eliminate_duplicates

Background

This parameter controls whether duplicate documents identified during a crawl are eliminated. The default value is true, i.e. all duplicates are deleted when they are found.

Duplicate detection is performed by extracting the human-readable text from a file, ignoring any markup and tags. The intent is to detect files that look the same to a human, which means differences in embedded metadata and other non-visible content are ignored during this process.
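
As an illustration of this text-based approach, the sketch below fingerprints the visible text of an HTML page and flags repeats. This is a hypothetical Python example, not the crawler's actual implementation: markup is stripped, whitespace is normalised, and a digest of what remains is compared against pages already seen.

import hashlib
import re

def text_fingerprint(html):
    """Digest the human-readable text of a page, ignoring markup."""
    # Remove script/style bodies, then strip the remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse whitespace so layout-only differences are ignored.
    text = " ".join(text.split()).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(html):
    """Return True if a page with the same visible text was already seen."""
    fingerprint = text_fingerprint(html)
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False

Two pages whose markup differs but whose visible text matches produce the same fingerprint, so the second is treated as a duplicate.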

The web crawler will detect most HTML duplicates during the crawl, but it will not detect duplicate binary files (e.g. PDF or Office documents).
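
Because the crawl-time check depends on extracted text, byte-identical binary files pass through undetected. If exact duplicates among stored PDF or Office files need to be found, one possible out-of-band approach (a sketch only, not a feature of this key) is to group files by a digest of their raw bytes; the directory argument below is hypothetical.

import hashlib
from pathlib import Path

def byte_digest(path):
    """Digest a file's raw bytes, suitable for PDF and Office files."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_binary_duplicates(root):
    """Group files under root by identical byte content."""
    groups = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups.setdefault(byte_digest(path), []).append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}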

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the crawler.eliminate_duplicates key, and set the value to any valid Boolean value.

Default value

crawler.eliminate_duplicates=true

Examples

Turn off in-crawl duplicate detection:

crawler.eliminate_duplicates=false
