Include Patterns (collection.cfg)
This option is a comma-separated list of URL patterns that the crawler uses to decide whether to process a page. If a page's URL matches at least one of these patterns, the crawler will process it.
See include and exclude patterns for a description of how include and exclude patterns work, and for details on using regular expressions if required.
Default value: none (set when the collection is created).
If you were crawling http://www.funnelback.com and wanted to download just the support directory (and nothing else), then you would use the following include pattern:
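A minimal sketch of such a pattern, written as a collection.cfg line. It assumes the key is named include_patterns and that a pattern is matched as a substring of the URL; check the exact form against the include and exclude patterns page.

```ini
# Assumed form: keeps only URLs that contain this host and path prefix.
include_patterns=www.funnelback.com/support
```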
If you wanted to crawl the entire http://www.funnelback.com site then you would use:
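A sketch of a site-wide pattern, assuming the key is named include_patterns and that patterns are matched as substrings of the URL. Only the host name is given, so every path on that host would match.

```ini
# Assumed form: any URL containing this host name is included.
include_patterns=www.funnelback.com
```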
You can include a protocol (http or https) in the pattern, but it is not usually necessary.
If you wanted to crawl every webserver in the Australian National University and University of Sydney domains:
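A sketch for the two-university case. The domains anu.edu.au and sydney.edu.au are assumptions (neither is stated in the text), as are the include_patterns key name and the idea that a bare domain matches every host beneath it.

```ini
# Assumed domains; each comma-separated entry is a separate pattern.
include_patterns=anu.edu.au,sydney.edu.au
```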
NB: Always specify at least one include pattern for the web crawler; otherwise it will start downloading content from the wider web and may fill up the hard disk.
Note: The parameter is stored in the collection.cfg in the form:
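A sketch of that form, using placeholder patterns. The include_patterns key name is an assumption based on the option title and should be confirmed against an existing collection.cfg.

```ini
# Placeholder values; replace with real comma-separated URL patterns.
include_patterns=<pattern1>,<pattern2>
```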