Specifies the pattern that URLs must match in order to be crawled.

Key: include_patterns
Type: List<String>
Can be set in: collection.cfg


This option is a comma-separated list of URL patterns that the crawler uses to decide whether to process a page. If a page's URL matches at least one of these patterns, the crawler will process it. URLs that match exclude_patterns will not be crawled even if they also match an include pattern. Start URLs are the exception to this exclusion rule.

See: include and exclude patterns for a description of how include and exclude patterns work, and for details on using regular expressions if required.
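
The decision described above can be sketched roughly as follows. This is an illustrative Python sketch only, not the crawler's actual implementation. It assumes that a pattern "matches" when it appears as a plain substring of the URL, and that start URLs skip the exclude check but must still match an include pattern; neither detail is stated on this page. The function names, the example.com and example.org hosts, and the /private pattern are all hypothetical.

```python
def parse_patterns(value):
    """Split a comma-separated setting value into trimmed, non-empty patterns."""
    return [p.strip() for p in value.split(",") if p.strip()]


def should_crawl(url, include, exclude, start_urls=()):
    """Return True if the URL would be processed under the assumed rules."""
    # Assumption: start URLs are exempt from the exclude patterns only.
    is_start_url = url in start_urls
    # Assumption: a pattern matches when it occurs as a substring of the URL.
    if not is_start_url and any(p in url for p in exclude):
        return False
    # The URL must match at least one include pattern.
    return any(p in url for p in include)


# Hypothetical settings, written the way a comma-separated value might look.
include = parse_patterns("example.com/support, example.org")
exclude = parse_patterns("/private")
start_urls = {"http://example.com/support/private/index.html"}

for candidate in [
    "http://example.com/support/faq.html",
    "http://example.com/support/private/notes.html",
    "http://example.com/support/private/index.html",
    "http://example.net/",
]:
    print(candidate, should_crawl(candidate, include, exclude, start_urls))
```

Under these assumptions, the excluded start URL is still processed, while other URLs containing /private are skipped even though they match an include pattern.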

Default Value

(none, set when the collection is created)


If you were crawling a site and wanted to download only its support directory (and nothing else), you would use the following include pattern:

If you wanted to crawl the entire site then you would use:

You can include a protocol (http or https) in the pattern, but it is not usually necessary.

If you wanted to crawl every web server in the Australian National University and University of Sydney domains:

You should always specify some form of include pattern for the web crawler; otherwise it will start downloading content from the global web and fill up the hard disk.