include_patterns
Background
This option is a comma-separated list of URL patterns that the crawler uses to decide whether to process a page. If a page's URL matches one of these patterns, the crawler processes it. URLs that match exclude_patterns are not crawled even if they also match an include pattern; the only exception is start URLs.
See: include and exclude patterns for a description of how include and exclude patterns work, and for details on using regular expressions if required.
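The rule described above can be sketched in a few lines of Python. This is only an illustration, not the crawler's actual implementation: it assumes a pattern matches when it appears anywhere in the URL as a plain substring, and it assumes start URLs are exempt from the exclude check but must still match an include pattern. The function names, the start_urls parameter and the example URLs are hypothetical.

```python
def split_patterns(value):
    """Split a comma-separated option value into a list of patterns."""
    return [p.strip() for p in value.split(",") if p.strip()]


def should_crawl(url, include_patterns, exclude_patterns="", start_urls=()):
    """Return True if the URL would be processed.

    Assumption: a pattern matches when it occurs anywhere in the URL
    (plain substring match). Regular expressions are not handled here.
    """
    includes = split_patterns(include_patterns)
    excludes = split_patterns(exclude_patterns)

    included = any(p in url for p in includes)
    # Assumption: start URLs skip the exclude check only.
    excluded = any(p in url for p in excludes) and url not in start_urls

    return included and not excluded


# Hypothetical URLs checked against one include and one exclude pattern.
print(should_crawl("http://www.funnelback.com/support/faq.html",
                   "www.funnelback.com/support"))
print(should_crawl("http://www.funnelback.com/support/old/a.html",
                   "www.funnelback.com/support", "/old/"))
```

Under the substring assumption, a pattern does not need a protocol, and a short pattern also matches any URL that merely contains that text, so more specific patterns narrow the crawl.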
Examples
If you were crawling http://www.funnelback.com and wanted to download just the support directory (and nothing else), then you would use the following include pattern:
include_patterns=www.funnelback.com/support
If you wanted to crawl the entire http://www.funnelback.com site then you would use:
include_patterns=www.funnelback.com/
You can include a protocol (http or https) in the pattern, but it is not usually necessary.
If you wanted to crawl every web server in the Australian National University and University of Sydney domains, then you would use:
include_patterns=anu.edu.au,usyd.edu.au
You should specify some form of include pattern for the web crawler; otherwise it will start downloading content from the global web and fill up the hard disk.