include_patterns

Background

This option is a comma-separated list of URL patterns that the crawler uses to decide whether to process a page. If a page’s URL matches one of these patterns, the crawler processes it. URLs that match exclude_patterns are not crawled even if they also match an include pattern; the only exception is start URLs.
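
The rule described above can be sketched in Python as follows. This is only an illustration, not the crawler's actual implementation. It assumes that a pattern matches when it occurs as a plain substring of the URL, and that start URLs skip the exclude check but must still match an include pattern (the text above does not say either way). The function names parse_patterns and should_crawl are hypothetical.

```python
def parse_patterns(value: str) -> list[str]:
    """Split a comma-separated configuration value into a list of patterns."""
    return [p.strip() for p in value.split(",") if p.strip()]


def should_crawl(url, include_patterns, exclude_patterns, start_urls=()):
    """Return True if the URL would be processed under the rule sketched here.

    Assumption: a pattern matches when it is a substring of the URL.
    """
    # An exclude match blocks the URL unless it is one of the start URLs.
    excluded = any(p in url for p in exclude_patterns) and url not in start_urls
    if excluded:
        return False
    # Otherwise the URL is processed only if some include pattern matches.
    return any(p in url for p in include_patterns)


includes = parse_patterns("www.funnelback.com/support")
print(should_crawl("http://www.funnelback.com/support/faq.html", includes, []))
print(should_crawl("http://www.funnelback.com/products/", includes, []))
```

Under these assumptions the first call reports that the page would be processed and the second that it would be skipped, because only the first URL contains the include pattern.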

See: include and exclude patterns for a description of how include and exclude patterns work, and for details on using regular expressions if required.

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the include_patterns key and set its value. The value can be any valid List<String>.

Default value

(none, set when the collection is created)

Examples

If you were crawling http://www.funnelback.com and wanted to download just the support directory (and nothing else), then you would use the following include pattern:

include_patterns=www.funnelback.com/support

If you wanted to crawl the entire http://www.funnelback.com site then you would use:

include_patterns=www.funnelback.com/

You can include a protocol (http or https) in the pattern, but it is not usually necessary.
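
As a rough illustration of why the protocol is usually unnecessary, the sketch below compares a pattern without a protocol against one with it. It assumes plain substring matching, which is a simplification rather than a statement of how the crawler is implemented, and the helper name matches is hypothetical.

```python
def matches(url: str, pattern: str) -> bool:
    """Assumed matching rule: the pattern is a substring of the URL."""
    return pattern in url


urls = [
    "http://www.funnelback.com/about",
    "https://www.funnelback.com/about",
]

# A pattern without a protocol matches the page under both http and https.
without_protocol = [u for u in urls if matches(u, "www.funnelback.com/")]

# A pattern with a protocol only matches URLs that use that protocol.
with_protocol = [u for u in urls if matches(u, "https://www.funnelback.com/")]

print(without_protocol)
print(with_protocol)
```

Under this assumption, including the protocol narrows the pattern to one scheme, which is rarely what is wanted when a site is served over both.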

If you wanted to crawl every web server in the Australian National University and University of Sydney domains, then you would use:

include_patterns=anu.edu.au,usyd.edu.au

Always specify at least one include pattern for the web crawler. Without one, the crawler is not restricted to your own sites: it will start downloading content from the wider web and can fill up the hard disk.