Include Patterns (collection.cfg)


This option is a comma-separated list of URL patterns that the crawler uses to decide whether to process a page. If a page's URL matches any of these patterns, the crawler will process it.

See include and exclude patterns for a description of how include and exclude patterns work, and for details on using regular expressions if required.
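The matching rule described above can be sketched in a few lines of Python. This is only an illustration, not the crawler's actual implementation: it assumes a pattern matches when it occurs as a plain substring of the URL, it ignores regular-expression patterns and exclude patterns entirely, and the names parse_patterns and is_included, along with the www.example.com URLs, are hypothetical.

```python
def parse_patterns(value: str) -> list[str]:
    """Split a comma-separated pattern list into individual patterns."""
    return [p.strip() for p in value.split(",") if p.strip()]


def is_included(url: str, patterns: list[str]) -> bool:
    """Return True if any pattern occurs as a substring of the URL.

    Assumption: plain substring matching only. Regular-expression
    patterns and exclude patterns are not modelled here.
    """
    return any(pattern in url for pattern in patterns)


# Hypothetical example: only the support directory is included.
patterns = parse_patterns("www.example.com/support")
print(is_included("https://www.example.com/support/faq.html", patterns))  # True
print(is_included("https://www.example.com/products/", patterns))  # False
```

Under this substring assumption, a pattern without a protocol matches both the http and https forms of a URL.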

Default value

(none, set when the collection is created)


If you were crawling a site and wanted to download just its support directory (and nothing else), you would use an include pattern consisting of the site's host name followed by the path to the support directory.
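As a hedged illustration of the support-directory case, assume the site is the placeholder host www.example.com and the directory is /support (both hypothetical, since the original example is not shown here). The pattern might then look like this:

```
www.example.com/support
```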

If you wanted to crawl the entire site, you would use the site's host name on its own as the pattern.
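For the whole-site case, again assuming the hypothetical placeholder host www.example.com, the pattern would plausibly be just the host name:

```
www.example.com
```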

You can include a protocol (http or https) in the pattern, but it is not usually necessary.

If you wanted to crawl every webserver in the Australian National University and University of Sydney domains, you would list both domain names, separated by a comma.
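A sketch of the two-university case, assuming the domains in question are anu.edu.au and sydney.edu.au (the exact domains used in the original example are not shown here, so treat these as illustrative):

```
anu.edu.au,sydney.edu.au
```

Because each entry is a bare domain rather than a single host name, any webserver whose URL contains that domain would be expected to match.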

NB: You should specify some form of include pattern for the webcrawler, otherwise it will start downloading content from the global web and fill up the hard disk.

Note: The parameter is stored in collection.cfg as a single line, with the patterns separated by commas.
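A minimal sketch of that stored form, assuming the key is named include_patterns after the option and using the hypothetical placeholder patterns pattern1 and pattern2:

```
include_patterns=pattern1,pattern2
```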

See also