crawler.monitor_url_reject_list

Background

This parameter can be modified during a running crawl to tell the crawler to ignore the specified list of URLs for the remainder of the crawl. If you know before the crawl starts which areas to avoid, add them to the exclude_patterns parameter instead. The value is a comma-separated list of URLs.

Matching URLs gathered before this configuration change are not affected. Matching URLs that are already in the crawl frontier (the list of known but uncrawled URLs) are not removed immediately; each one stays in the frontier until the crawler comes to process it.

Each URL must include a protocol (scheme): use http://www.example.com/ and not www.example.com.
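The format rules above (comma-separated, every entry carrying a protocol) can be checked before the value is pasted into the configuration. The following is a minimal Python sketch, not part of the product: the helper name build_reject_list is hypothetical, and the check only confirms that each entry has a protocol and a host.

```python
from urllib.parse import urlsplit


def build_reject_list(urls):
    """Return a comma-separated value for crawler.monitor_url_reject_list.

    Hypothetical helper: raises ValueError for entries that lack a
    protocol or host, or that contain a comma (the list separator).
    """
    cleaned = []
    for url in urls:
        url = url.strip()
        parts = urlsplit(url)
        # "www.example.com" parses with an empty scheme and empty netloc.
        if not parts.scheme or not parts.netloc:
            raise ValueError(f"URL must include a protocol: {url!r}")
        if "," in url:
            raise ValueError(f"URL must not contain a comma: {url!r}")
        cleaned.append(url)
    return ",".join(cleaned)


if __name__ == "__main__":
    value = build_reject_list(["http://abc.com/", "http://d.com/site/"])
    print(f"crawler.monitor_url_reject_list={value}")
```

Passing an entry such as www.example.com raises ValueError, which catches the missing-protocol mistake before the key is saved.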

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the crawler.monitor_url_reject_list key, and set the value. This can be set to any valid String value.

Default value

(Empty)

Examples

Reject any URLs from the given sites or sub-sites during a running crawl:

crawler.monitor_url_reject_list=http://abc.com/,http://d.com/site/
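To see which URLs a value like this would cover, the sketch below assumes that an entry rejects every URL beginning with it, which is what "sites or sub-sites" suggests. The crawler's actual matching rules are not described on this page, so treat the prefix comparison and the function name is_rejected as assumptions for illustration only.

```python
def is_rejected(url, reject_list_value):
    """Return True if url starts with any entry in the comma-separated value.

    Assumption: plain prefix matching; the real crawler may differ.
    """
    prefixes = [p.strip() for p in reject_list_value.split(",") if p.strip()]
    return any(url.startswith(prefix) for prefix in prefixes)


if __name__ == "__main__":
    value = "http://abc.com/,http://d.com/site/"
    for candidate in (
        "http://abc.com/news/today.html",
        "http://d.com/site/page.html",
        "http://d.com/other/page.html",
    ):
        print(candidate, is_rejected(candidate, value))
```

Under this assumption, everything on http://abc.com/ and everything under http://d.com/site/ is rejected, while the rest of http://d.com/ is still crawled.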

See also