Web data sources - controlling what information is included

Funnelback provides various controls that help to define what is included and excluded from a web crawl.

By default, Funnelback will exclude many URLs because they are not relevant to a set of search results. For example, it makes no sense to index linked files such as images, CSS and JavaScript files as they add no value to a search.

Include/exclude rules

Include/exclude rules can be used to define a set of patterns that are compared to each document’s URL. These patterns determine whether a URL should be kept or rejected based on the URL itself.

A web page that must be visited in order to reach other pages must be included, otherwise any child or linked pages may not be discovered by the web crawler. If you wish to exclude a page (such as a home page) that must be crawled through, it can be removed using a kill_exact.cfg or kill_pattern.cfg after the index is built.
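As a hedged illustration, a kill_exact.cfg might simply list the exact URLs to remove from the built index, one per line (the example.com URLs below are placeholders):

https://www.example.com/
https://www.example.com/news/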

The crawler processes each URL it encounters against the various options in the data source configuration to determine if the URL will be included (+) or excluded (-) from further processing.

Example

Assuming you had the following options:

include_patterns=/red,/green,/blue
exclude_patterns=/green/olive

Then the following URLs will be included or excluded…​

URL              Success?  Comments
/orange          FAIL      fails include
/green/emerald   PASS      passes include, passes exclude
/green/olive     FAIL      passes include, fails exclude

Regular expressions in include/exclude patterns

To express more advanced include or exclude patterns you can use regular expressions for the include_patterns and exclude_patterns configuration options.

Regular expressions follow Perl 5 syntax and start with regexp: followed by a compound regular expression in which each distinct include/exclude pattern is separated by the | character.

Regex and simple include/exclude patterns cannot be mixed within a single configuration option.

An example of the more advanced regexp: form is:

exclude_patterns=regexp:search\?date=|^https:|\?OpenImageResource|/cgi-bin/|\.pdf$

which combines five alternative patterns into one overall pattern expression to match:

  1. URLs containing search?date= (for example, to exclude calendar pages).

  2. HTTPS URLs.

  3. Dynamic content generated by URLs containing ?OpenImageResource.

  4. Dynamic content from CGI scripts.

  5. PDF files.

Regex special characters that appear in patterns must be escaped (e.g. \? and \.):

include_patterns=regexp:\.anu\.edu\.au
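As an illustrative sketch, the simple patterns from the earlier example could be written entirely in the regexp: form (both options must then use regular expressions, as the two forms cannot be mixed within a single option):

include_patterns=regexp:/red|/green|/blue
exclude_patterns=regexp:/green/olive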

Excluding URLs based on a content match

The exclude by content plugin can be used to define exclude rules based on a match within document content. For example, this enables you to exclude URLs where a specific metadata field is set to a specific value.

Excluding URLs during a running crawl

The crawler supports exclusion of URLs during a running crawl. The crawler.monitor_url_reject_list data source configuration parameter allows an administrator to specify additional URL patterns to exclude while the crawler is running. These URL patterns will apply from the next crawler checkpoint and should be converted to a regular exclude pattern once the crawl completes.
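For example (a hedged sketch; the calendar pattern shown is a placeholder and assumes the value is expressed in the same style as an exclude pattern):

crawler.monitor_url_reject_list=search\?date=

Once the crawl has finished, the same pattern should be moved into exclude_patterns as described above.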

Robots.txt, robots meta tags and sitemap.xml support

The Funnelback web crawler supports the following:

robots.txt

Funnelback honours robots.txt directives as outlined at http://www.robotstxt.org/robotstxt.html

The Funnelback user agent can be used to provide Funnelback-specific robots.txt directives.

e.g. prevent all crawlers from accessing URLs starting with /search and /login, but allow Funnelback to access everything:

User-agent: *
Disallow: /search
Disallow: /login

User-agent: Funnelback
Disallow:

Sitemap: http://www.example.com/sitemap.xml

Funnelback only supports the directives outlined in the original robots.txt standard. Subsequent extensions to this standard introduced by other vendors are not supported.

In particular the following are not supported:

  • robots.txt allow directives

  • robots.txt directives that include a wildcard (the wildcard is only supported on the User-agent directive) - see the example below
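For instance, the hypothetical robots.txt below uses both of these extensions; Funnelback will not apply them in the way other crawlers might (robots.txt end-of-line comments begin with #):

User-agent: *
Allow: /public/     # Allow directives are not supported by Funnelback
Disallow: /*.pdf$   # path wildcards are not supported by Funnelback
Disallow: /private/ # plain Disallow prefixes like this are supported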

Ignoring a site’s robots.txt

The crawler.ignore_robots_txt setting can be used to disable the web crawler’s robots.txt support.

This setting should only be used as a last resort as it disables the web crawler’s adherence to the robots.txt standard. It should only be used when it is not possible for the site owner to update the robots.txt file.

Before you enable this setting you must inform the site owner(s) and gain their permission to circumvent the robots.txt directives.

Using this setting can also have unwanted side effects for the web crawler (such as disabling support for sitemap.xml files) and for the sites that you are crawling. You should carefully check your web crawler logs to ensure you’re not storing or accessing content that you don’t wish to access, and add appropriate exclude patterns. (For example, you should ensure any search results pages and calendar feeds are explicitly added to your exclude patterns.)

Ignoring robots.txt could also result in Funnelback being blacklisted from accessing your site by the site owner.

Disable robots.txt adherence in the web crawler:

crawler.ignore_robots_txt=true

Robots meta tags

Funnelback honours robots meta tags as outlined at http://www.robotstxt.org/meta.html as well as the nosnippet and noarchive directives.

The following directives can appear within a <meta name="robots"> tag:

follow / nofollow
index / noindex
nosnippet / noarchive

e.g.

<!-- index this page but don't follow links -->
<meta name="robots" content="index, nofollow" />
<!-- index this page but don't follow links and don't allow caching or snippets -->
<meta name="robots" content="index, nofollow, nosnippet" />

Ignoring the noindex directive

The -ignore_noindex option can be added to the data source’s indexer_options to ignore the noindex directive when building the search index.

Disable the indexer’s adherence to noindex directives:

indexer_options= <other existing indexer options> -ignore_noindex

Ignoring the nofollow directive

The crawler.ignore_nofollow setting can be used to disable the web crawler’s robots nofollow support.

Disable the web crawler’s adherence to nofollow directives:

crawler.ignore_nofollow=true

These settings should only be used as a last resort as they disable Funnelback’s adherence to the robots noindex and nofollow directives. They should only be used when it is not possible for the site owner to enable Funnelback’s access using other standard mechanisms (such as a sitemap.xml file linked from robots.txt).

Before you enable either of these settings you must inform the site owner(s) and gain their permission to circumvent the noindex and nofollow directives.

Using these settings can also have unwanted side effects for Funnelback and for the sites that you are crawling and indexing. You should carefully check your logs to ensure you’re not storing, accessing or indexing content that you do not wish to, and add appropriate exclude patterns. (For example, you should ensure any search results pages and calendar feeds are explicitly added to your exclude patterns.)

Ignoring robots noindex and nofollow directives can also result in Funnelback being blacklisted from accessing your site by the site owner as these cause Funnelback to behave like a bad web robot.

X-Robots HTTP headers

Funnelback does not currently support robots directives supplied via X-Robots-Tag HTTP headers.

HTML <a rel="nofollow">

Funnelback honours nofollow directives provided in the rel attribute of an HTML anchor <a> tag. e.g.

<!-- don't follow this link -->
<a href="mylink.html" rel="nofollow">my link</a>

Sitemap.xml

Funnelback does not process sitemap.xml files by default - this must be enabled using the crawler.use_sitemap_xml configuration option.
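To enable sitemap processing, set this option in the data source configuration (assuming a true/false value):

crawler.use_sitemap_xml=true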

Funnelback supports the extraction of links from linked sitemap.xml files (including nested and compressed sitemaps) that are specified in the robots.txt. Please note that other directives within sitemap.xml files (such as lastmod and priority) are ignored.

Links listed in sitemap.xml files are added to the list of URLs to crawl if they pass the configured include/exclude rules.

Enabling sitemap.xml support does not prevent the crawler from following links or processing the start URLs file; it just adds an additional source of links to the crawl.
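A minimal sitemap.xml in the standard format is sketched below; Funnelback extracts the <loc> URLs (subject to the include/exclude rules) and ignores directives such as lastmod and priority:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/about/</loc>
    <lastmod>2020-01-01</lastmod>   <!-- ignored by Funnelback -->
    <priority>0.8</priority>        <!-- ignored by Funnelback -->
  </url>
</urlset>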

The following limits apply to the processing of sitemap.xml files:

  • The size of the sitemap.xml file is limited to 10MB (uncompressed)

  • The maximum number of links extracted from a sitemap.xml file is capped at 50,000.

Funnelback noindex/endnoindex and Googleoff/Googleon directives

Funnelback noindex tags (and the Google equivalents) are special HTML comments that can be used to mark parts of a web page that should not be treated as content. This hides the text from the indexer, and words included in a noindex region will not be included in the search index. However, any links contained within a noindex region will still be extracted and processed. e.g.

... This section is indexed ...
<!--noindex-->
... This section is not indexed ...
<!--endnoindex-->
... This section is indexed ...

Google HTML comment tags that are equivalent to the Funnelback noindex/endnoindex tags are also supported. The following are aliases of Funnelback’s native tags:

<!-- noindex --> == <!-- googleoff: index --> == <!-- googleoff: all -->
<!-- endnoindex --> == <!-- googleon: index --> == <!-- googleon: all -->

Other googleoff/googleon tags are not supported.

Noindex tags should be included within a site template to exclude headers, footers and navigation. Funnelback also provides a built-in inject no-index filter that can write these noindex tags into a downloaded web page based on rules. However, this should only be used if it is not possible to modify the source pages, as changes to the source page templates can result in the filter not working correctly.
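As a hedged sketch of how noindex tags might be placed in a site template (the element names and layout are placeholders):

<body>
  <!--noindex-->
  <header> ... site header and navigation ... </header>
  <!--endnoindex-->

  <main> ... page content that should be indexed ... </main>

  <!--noindex-->
  <footer> ... site footer ... </footer>
  <!--endnoindex-->
</body>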

See also