Specifies the pattern that URLs must match in order to be crawled.
Can be set in: collection.cfg
This option is a comma-separated list of URL patterns used by the crawler to determine whether it will process a page. If a page's URL matches one of these patterns, the crawler will process it. URLs that match exclude_patterns will not be crawled even if they also match an include pattern, with the exception of start URLs.
See: include and exclude patterns for a description of how include and exclude patterns work, and details on using regular expressions if required.
If you were crawling http://www.funnelback.com and wanted to download just the support directory (and nothing else), then you would use the following include pattern:
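A pattern along the following lines would restrict the crawl to that directory (an illustrative sketch, assuming patterns are matched as substrings of the URL):

```
include_patterns=funnelback.com/support
```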
If you wanted to crawl the entire http://www.funnelback.com site then you would use:
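A broader pattern covering the whole site might look like this (illustrative; any URL containing this substring would match):

```
include_patterns=funnelback.com
```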
You can include a protocol (http or https) in the pattern, but it is not usually necessary.
If you wanted to crawl every webserver in the Australian National University and University of Sydney domains:
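Since the option accepts a comma-separated list, both domains can be given in one setting. A sketch of such a configuration (the domain suffixes shown are assumptions for illustration):

```
include_patterns=anu.edu.au,usyd.edu.au
```

Any URL containing either domain suffix would then be eligible for crawling, subject to the exclude patterns.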
|You should always specify some form of include pattern for the web crawler, otherwise it will start downloading content from the global web and fill up the hard disk.|