crawler.max_link_distance
Specifies the maximum distance a URL can be from a start URL for it to be downloaded.
Key: crawler.max_link_distance
Type: Integer
Can be set in: collection.cfg
Description
This option configures the crawler to follow links only up to a specific "distance" from the start URL(s), where distance is the number of links followed to reach a URL.
If this option is set, the crawler runs in single-threaded mode (i.e. only one web connection is made at a time) so that it can control which URLs are processed. This makes the crawl slower.
Examples
Limit the crawler to the URLs linked to from all URLs listed in the crawler.start_urls_file (set 1):
crawler.max_link_distance=1
Extend the limit to all pages linked to from set 1:
crawler.max_link_distance=2
Notes:
- A distance of zero (0) will limit the crawl to just the start_url or all the URLs listed in the crawler.start_urls_file.
- A redirect target is considered at the same distance as the original URL which generated the redirect.
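
The distance rules above can be illustrated with a small Python sketch. This is not the crawler's actual implementation; it is a minimal model that assumes distance is counted in link hops from the start URLs. The function name crawl_within_distance, the in-memory LINKS and REDIRECTS tables, and the paths in them are all hypothetical. The sketch processes one URL at a time and gives a redirect target the same distance as the URL that redirected to it.

```python
from collections import deque

# Hypothetical link graph: each key links to the URLs in its list.
LINKS = {
    "/": ["/a", "/old"],
    "/a": ["/b"],
    "/new": ["/c"],
    "/b": ["/d"],
}

# Hypothetical redirects: "/old" redirects to "/new".
REDIRECTS = {"/old": "/new"}


def crawl_within_distance(start_urls, links, redirects, max_distance):
    """Return {url: distance} for every URL within max_distance of a start URL."""
    distance = {}
    queue = deque()
    for url in start_urls:
        if url not in distance:
            distance[url] = 0
            queue.append(url)

    while queue:
        url = queue.popleft()
        d = distance[url]

        # A redirect target is treated as being at the same distance
        # as the URL that generated the redirect.
        target = redirects.get(url)
        if target is not None:
            if target not in distance or distance[target] > d:
                distance[target] = d
                # Same distance as the current URL, so process it next.
                queue.appendleft(target)
            continue

        # Do not follow links from URLs already at the maximum distance.
        if d >= max_distance:
            continue

        for linked in links.get(url, []):
            if linked not in distance:
                distance[linked] = d + 1
                queue.append(linked)

    return distance


if __name__ == "__main__":
    for limit in (0, 1, 2):
        print(limit, sorted(crawl_within_distance(["/"], LINKS, REDIRECTS, limit)))
```

In this model a limit of 0 keeps only the start URL, a limit of 1 adds the pages it links to (including the redirect target), and a limit of 2 adds the pages linked from those.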