crawler.use_sitemap_xml
Background
This parameter controls whether sitemap.xml files listed in robots.txt
are used during a web crawl.
Setting the key
Set this configuration key in the search package or data source configuration.
Use the configuration key editor to add or edit the crawler.use_sitemap_xml key, and set the value. This can be set to any valid Boolean value.
Examples
Specify that sitemap.xml files should be used:

crawler.use_sitemap_xml=true
With this setting, the Funnelback web crawler will check the robots.txt file for each web server that passes the crawl include patterns. If the robots.txt file contains any Sitemap: directives, these will be processed, including any sitemap index files and compressed sitemap.xml files.
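For reference, a robots.txt file that advertises its sitemaps in this way might look like the following sketch. The hostname and paths are illustrative only; the Sitemap: directive takes the absolute URL of a sitemap or sitemap index file, and more than one directive may appear in the same file.

User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-index.xml.gz

In this example the crawler would process both entries, expanding the index file into its constituent sitemaps and decompressing the gzipped file before extracting URLs.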
Notes:
- The Funnelback web crawler will not currently take note of any lastmod elements in the sitemap.xml files. Standard incremental crawling can still be used to avoid downloading any content which has not changed, based on its content length.
- If the crawler.max_individual_frontier_size parameter is defined and non-empty then this will be used as a limit on the total number of URLs that will be extracted from the sitemap file(s) for any individual site.
- All URLs extracted from sitemaps will be processed in the same way as links extracted while crawling normal web pages, e.g. they will be run through the loading policy, as illustrated by the sketch after this list. This means they will be checked against relevant robots.txt rules, include and exclude patterns etc.
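The following minimal Python sketch illustrates that last point conceptually. It is not Funnelback's implementation; the function names, patterns and user agent are hypothetical. It simply shows URLs taken from a sitemap being checked against robots.txt rules and include/exclude patterns before they would be admitted to a crawl.

# Illustrative only - not Funnelback's implementation. Shows sitemap URLs
# being filtered through robots.txt rules plus include/exclude patterns,
# analogous to a loading policy check.
import re
import urllib.robotparser
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(sitemap_xml):
    """Extract <loc> values from a urlset sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]

def passes_loading_policy(url, robots, includes, excludes, user_agent="ExampleCrawler"):
    """Apply robots.txt rules, then include/exclude patterns, to a single URL."""
    if not robots.can_fetch(user_agent, url):
        return False                                  # disallowed by robots.txt
    if includes and not any(re.search(p, url) for p in includes):
        return False                                  # matches no include pattern
    if any(re.search(p, url) for p in excludes):
        return False                                  # matches an exclude pattern
    return True

# Hypothetical sitemap content and robots.txt rules for a single site.
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/products/widget</loc></url>
  <url><loc>https://www.example.com/private/report</loc></url>
</urlset>"""

robots = urllib.robotparser.RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

for url in urls_from_sitemap(sitemap):
    ok = passes_loading_policy(url, robots,
                               includes=[r"example\.com/"],
                               excludes=[r"/private/"])
    print(url, "accepted" if ok else "rejected")

Running this prints the first URL as accepted and the second as rejected, since the latter is both disallowed by robots.txt and matched by an exclude pattern.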