Web data sources - controlling what information is included
Funnelback provides various controls that help to define what is included and excluded from a web crawl.
- Include/exclude rules
- Robots.txt, robots meta tags and sitemap.xml support
- Robots meta tags
- Funnelback noindex/endnoindex and Googleoff/Googleon directives
- See also
Include/exclude rules can be used to define a set of patterns that are compared to the document’s URL. These patterns define if a URL should be kept or rejected based on the URL itself.
A web page that must be visited in order to reach other pages must be included, otherwise any child or linked pages may not be discovered by the web crawler. If you wish to exclude a page (such as a home page) that must be crawled through then this can be removed using a
The crawler processes each URL it encounters against the various options in the data source configuration to determine if the URL will be included (+) or excluded (−) from further processing:
Assuming you had the following options:
Then the following URLs will be included or excluded…
passes include, passes exclude
passes include, fails exclude
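The include/exclude check can be sketched in Python as follows. This is a minimal illustration rather than Funnelback's implementation: it assumes that simple patterns are plain substrings matched anywhere in the URL, and the `is_included` helper and the pattern values are hypothetical.

```python
def is_included(url, include_patterns, exclude_patterns):
    """Return True if the URL passes the include check and the exclude check.

    Assumption: each simple pattern is a plain substring of the URL.
    """
    passes_include = any(pattern in url for pattern in include_patterns)
    passes_exclude = not any(pattern in url for pattern in exclude_patterns)
    return passes_include and passes_exclude


# Hypothetical pattern values, for illustration only.
include = ["example.com"]
exclude = ["/login", "/search"]

for url in (
    "http://www.example.com/about",        # passes include, passes exclude
    "http://www.example.com/login/reset",  # passes include, fails exclude
    "http://other.example.org/",           # fails include
):
    print(url, "+" if is_included(url, include, exclude) else "-")
```

A URL is kept only when both checks pass, which is why a page that links to wanted content has to pass them as well.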
To express more advanced include or exclude patterns you can use regular expressions for the exclude_patterns configuration options.
Regular expressions follow Perl 5 syntax and start with regexp: followed by a compound regular expression in which each distinct include/exclude pattern is separated by the | character.
|Regex and simple include/exclude patterns cannot be mixed within a single configuration option.|
An example of the more advanced regexp: form is:
which combines five alternative patterns into one overall pattern expression to match:
search?date= for example, to exclude calendars.
Dynamic content generated by URLs containing
Dynamic content from CGI scripts.
regex special characters that appear in patterns must be escaped (e.g.
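The effect of escaping can be seen in a small Python sketch. Python's `re` module is used here as a stand-in for Perl 5 syntax, which it closely resembles for patterns like these. The option value, the helper names and the assumption that a pattern may match anywhere in the URL are all hypothetical and not taken from the Funnelback configuration reference.

```python
import re

REGEXP_PREFIX = "regexp:"


def compile_regexp_option(value):
    """Compile an option value of the form 'regexp:<alt1>|<alt2>|...'."""
    if not value.startswith(REGEXP_PREFIX):
        raise ValueError("not a regexp: style option value")
    return re.compile(value[len(REGEXP_PREFIX):])


# Hypothetical literal fragments to exclude. re.escape() protects the
# special characters ('?' and '.') before the alternatives are joined.
literals = ["search?date=", "/cgi-bin/", ".pdf"]
exclude_option = REGEXP_PREFIX + "|".join(re.escape(text) for text in literals)
exclude_re = compile_regexp_option(exclude_option)


def is_excluded(url):
    # Assumption: a pattern may match anywhere in the URL.
    return exclude_re.search(url) is not None


# Without escaping, '?' makes the preceding 'h' optional, so the
# pattern no longer matches the literal text it was meant to match.
unescaped_match = re.search("search?date=", "/search?date=2020-01")

print(is_excluded("http://www.example.com/search?date=2020-01"))
print(is_excluded("http://www.example.com/about"))
print(unescaped_match)
```

Building the expression from escaped literals avoids having to escape each special character by hand.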
The crawler supports exclusion of URLs during a running crawl. The crawler.monitor_url_reject_list data source configuration parameter allows an administrator to specify additional URL patterns to exclude while the crawler is running. These URL patterns will apply from the next crawler checkpoint and should be converted to a regular exclude pattern once the crawl completes.
The Funnelback web crawler supports the following:
robots.txt directives as outlined at http://www.robotstxt.org/robotstxt.html
The FunnelBack user agent can be used to provide Funnelback specific
e.g. Prevent access to /search* and /login* for all crawlers but allow Funnelback to access everything.
User-agent: *
Disallow: /search
Disallow: /login

User-agent: Funnelback
Disallow:

Sitemap: http://www.example.com/sitemap.xml
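One way to check how rules like these are interpreted is to run them through a robots.txt parser. The sketch below uses Python's standard `urllib.robotparser` (the `site_maps()` call needs Python 3.8 or later). It illustrates the original robots.txt semantics only and is not Funnelback's own parser. The `SomeOtherBot` user agent and the `allowed` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /login

User-agent: Funnelback
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the given user agent may fetch the path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


# An empty Disallow for the Funnelback group permits everything,
# while every other crawler falls back to the '*' group.
print(allowed("Funnelback", "/login"))
print(allowed("SomeOtherBot", "/login"))
print(allowed("SomeOtherBot", "/about"))
print(parser.site_maps())
```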
Funnelback only supports the directives outlined in the original
In particular the following are not supported:
robots.txt directives that include a wildcard (the wildcard is only supported on the User-agent directive)
Funnelback honors robots meta tags as outlined at http://www.robotstxt.org/meta.html as well as the
The following directives can appear within a <meta name="robots"> tag:
follow / nofollow
index / noindex
nosnippet / noarchive
<!-- index this page but don't follow links -->
<meta name="robots" content="index, nofollow" />

<!-- index this page but don't follow links and don't allow caching or snippets -->
<meta name="robots" content="index, nofollow, nosnippet" />
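The way such a tag translates into crawl decisions can be sketched with Python's standard `html.parser`. This is an illustrative reading of the robots meta tag convention and not Funnelback's code. The class and function names are hypothetical, and the sketch assumes that a page is indexed and its links followed unless noindex or nofollow is present.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="index, nofollow, nosnippet" /></head></html>'
directives = robots_directives(page)

should_index = "noindex" not in directives    # assumed default: index
should_follow = "nofollow" not in directives  # assumed default: follow
print(sorted(directives), should_index, should_follow)
```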
|robots directives supplied via HTTP headers are not supported.|
nofollow directives provided in rel attribute of an HTML anchor <a> tag. e.g.
<!-- don't follow this link -->
<a href="mylink.html" rel="nofollow">My link</a>
Funnelback does not process
Funnelback supports the extraction of links from linked sitemap.xml files (including nested and compressed sitemaps) that are specified in the robots.txt. Please note that other directives within sitemap.xml files (such as priority) are ignored.
Links listed in sitemap.xml files are added to the list of URLs to crawl if they pass the configured include/exclude rules.
sitemap.xml support does not prevent the crawler from following links or processing the start URLs file. It simply provides an additional source of links for the crawl.
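The sitemap handling described above can be sketched in Python with the standard `xml.etree.ElementTree` module. This is a simplified illustration under stated assumptions: it reads a single uncompressed sitemap, ignores nested sitemap index files, and applies a hypothetical substring-based exclude list. The sample sitemap content and the helper names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/about</loc><priority>0.8</priority></url>
  <url><loc>http://www.example.com/login</loc></url>
</urlset>
"""


def sitemap_links(xml_text):
    """Return every <loc> value; other elements such as <priority> are ignored."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def passes_rules(url, exclude_patterns):
    # Hypothetical exclude check: plain substring match.
    return not any(pattern in url for pattern in exclude_patterns)


links = sitemap_links(SITEMAP_XML)
to_crawl = [url for url in links if passes_rules(url, ["/login"])]
print(to_crawl)
```

The filtered links would be added to the crawl alongside the URLs discovered by following links, not in place of them.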
Funnelback noindex tags (and the Google equivalents) are special HTML comments that can be used to denote parts of a web page as not containing content. This hides the text from the indexer and words included in a noindex region will not be included in the search index. However, any links contained within a noindex region will still be extracted and processed. e.g.
... This section is indexed ...
<!--noindex-->
... This section is not indexed ...
<!--endnoindex-->
... This section is indexed ...
Google HTML comment tags that are equivalent to the Funnelback noindex and endnoindex tags are also supported. The following are aliases of Funnelback’s native tags:
<!-- noindex --> == <!-- googleoff: index --> == <!-- googleoff: all -->
<!-- endnoindex --> == <!-- googleon: index --> == <!-- googleon: all -->
|other googleoff/on tags are not supported.|
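The effect of these tags on indexed text can be approximated with a short Python sketch. It is only an illustration of the idea and not how the Funnelback indexer is implemented. It assumes that regions are properly paired and do not nest, and the `indexable_text` helper is hypothetical. Link extraction is a separate step that would still see the original page.

```python
import re

# Matches a region opened by noindex (or a supported googleoff alias)
# and closed by endnoindex (or a supported googleon alias).
NOINDEX_REGION = re.compile(
    r"<!--\s*(?:noindex|googleoff:\s*(?:index|all))\s*-->"
    r".*?"
    r"<!--\s*(?:endnoindex|googleon:\s*(?:index|all))\s*-->",
    re.DOTALL,
)


def indexable_text(html):
    """Return the page with every noindex region removed."""
    return NOINDEX_REGION.sub("", html)


page = (
    "intro text "
    "<!--noindex-->site navigation<!--endnoindex--> "
    "main article "
    "<!-- googleoff: all -->footer links<!-- googleon: all --> "
    "closing text"
)
print(indexable_text(page))
```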
Noindex tags should be included within a site template to exclude headers, footers and navigation. Funnelback also provides a built-in inject no-index filter that can write these noindex tags into a downloaded web page based on rules. However, this filter should only be used if it is not possible to modify the source pages, because changes to the source page templates can stop the filter from working correctly.