Controlling what is indexed

Background

This article looks at the different techniques that can be employed to control what is crawled and indexed by Funnelback when creating a searchable index of web content.

Robots.txt

Use robots.txt to prevent well-behaved crawlers from accessing entire sections of a site, based on the URL path.

Funnelback will honour robots.txt directives.

The FunnelBack user agent can be used to create Funnelback-specific rules. For example, the following robots.txt prevents web robots from crawling the /login and /search content but allows Funnelback full access.

robots.txt
User-agent: *
Disallow: /search
Disallow: /login

User-agent: FunnelBack
Disallow:

Note: the original robots.txt standard only allows wildcards in the User-agent directive, and has no Allow: directive (only Disallow:).


Robots meta tags

Page-level robots directives can be defined within the <head> of an HTML document.

These directives take the form of html <meta> elements and allow page-level control of well-behaved robots.

Funnelback will honour the following robots meta tags:

  • Index, follow: this tells the web crawler to index the page and also follow any links that are found on the page. This is the default behaviour.

  • Index, nofollow: this tells the web crawler to index the page but not to follow any links found on the page. This only applies to links found on this page, and doesn't prevent the linked page from being indexed if a link to it is found elsewhere on the site.

  • Noindex, follow: this tells the web crawler to not include the page in the search index, but to still extract and follow any links found within the page. This is useful if you wish to exclude a home page from the index, as you still need to crawl through the page to reach its child pages.

  • Noindex, nofollow: this tells the web crawler to not include the page in the index and also to not follow any links from the page. This has the same end effect as listing the page in the robots.txt file, but is controllable at the page level.

The following directives can be combined with the (no)index/(no)follow directives to provide further control of crawler behaviour.

  • Noarchive: this tells supporting web crawlers to index the page but to not include the document in the cache (so cache links won't work).

  • Nosnippet: this tells supporting web crawlers to index the page but to not return a snippet when rendering results. This also prevents the caching of the page (as defined by the noarchive directive above).
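
For example, a page that should be excluded from the index while still having its links followed could include the following tag in its <head> (the page itself is illustrative; the meta tag syntax is standard HTML):

<head>
  <meta name="robots" content="noindex, follow">
</head>

The additional directives are combined in the same way, e.g. <meta name="robots" content="index, follow, noarchive"> keeps the page in the index but suppresses the cached copy.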

Note: there are a number of additional robots meta tags that have been defined by third parties such as Google (e.g. noimageindex and nocache). These tags are not currently supported by Funnelback.

See also: Robots meta tags

A nofollow tag only means that Funnelback doesn't add the links found on the page to the crawl frontier. If a link to the same content is found elsewhere without any form of exclusion, the linked page will still appear in the index.

HTML rel="nofollow"

The HTML anchor (<a>) tag includes an attribute that can be used to prevent a web crawler from following a link. This is the same as the robots meta nofollow tag, but applies only to the link on which it is set.

To prevent Funnelback from following a specific link on a page, add a rel attribute to the anchor tag, e.g. to tell crawlers not to follow this link:

<a href="mylink.html" rel="nofollow">my link</a>

The rel="nofollow" attribute can be useful when you are using Funnelback to generate result listings that are embedded in a standard page on the site, and it's not practical to use robots.txt or robots meta tags to prevent Funnelback from crawling the search results.

A nofollow attribute only means that Funnelback doesn't add that specific link to the crawl frontier. If the link is found elsewhere without any form of exclusion, the linked page will still appear in the index.

Funnelback noindex tags

Controlling which parts of a web page are considered for search relevance is a simple process that can have a dramatic effect on the relevance of results returned by a query.

Noindex tags are HTML comment tags that should be included in site templates (and wherever else is appropriate) around areas of the page that don't contain real page content, so that these areas are excluded from consideration by the indexer and only the relevant content is included in the index.

Funnelback includes a built-in InjectNoIndexFilterProvider filter that can inject noindex tags into the crawled content based on CSS selectors. This is a good fall-back if you can't add the noindex tags to the site template; however, always try to add them to the template first, as that approach is less likely to break over time.
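
To illustrate the effect of the filter (the CSS selector and markup below are hypothetical, and the filter's configuration is not shown here), content matched by a selector such as nav.site-nav is effectively wrapped in noindex comments before it reaches the indexer:

<!-- Crawled markup -->
<nav class="site-nav"> ... repeated navigation links ... </nav>

<!-- Equivalent markup after filtering with the selector nav.site-nav -->
<!--noindex-->
<nav class="site-nav"> ... repeated navigation links ... </nav>
<!--endnoindex-->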

This has a few benefits. The most obvious is that search result relevance will immediately improve due to the removal of a lot of noise from the index. For example, a search for contact information won't return every page on the site just because "contact us" happens to appear in the site navigation.

A secondary benefit is that search result summaries also become much more relevant, as the snippet text will only include the indexed content.

Applying noindex tags is as simple as adding <!-- noindex --> and <!-- endnoindex --> comment tags to your site templates. For example:

<body>
... This section is indexed ...
<!--noindex-->
... Text in this section is not indexed, but the link information is recorded by Funnelback for ranking purposes ...
<!--endnoindex-->
... This section is indexed ...
</body>

For most sites noindex tags should be placed around the site navigation, headers and footers. This prevents Funnelback returning every page in response to queries such as about and contact, and also ensures that navigation and headers are excluded from contextual search summaries. Noindex tags can also be used to hide other parts of the page such as advertisements.

Noindex tags don't have to be specified in matching noindex/endnoindex pairs: the document is parsed from top to bottom, and indexing is switched off or on whenever a noindex or endnoindex tag is encountered.

Don't forget to put an <!--endnoindex--> tag before the start of the page content, otherwise Funnelback will have nothing to index.
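
Putting this together, an illustrative site template might wrap the header, navigation and footer in noindex tags, leaving only the main content area to be indexed:

<body>
<!--noindex-->
  <header> ... site banner ... </header>
  <nav> ... site navigation links ... </nav>
<!--endnoindex-->
  <main>
    ... page content (indexed) ...
  </main>
<!--noindex-->
  <footer> ... footer links and copyright ... </footer>
<!--endnoindex-->
</body>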

Google noindex tag support

There are some similar Google-specific tags that provide equivalent noindex functionality.

Funnelback sees the following as aliases for the noindex/endnoindex tags:

  • Funnelback native tag: <!-- noindex -->. Google equivalent tags: <!-- googleoff: index --> or <!-- googleoff: all -->

  • Funnelback native tag: <!-- endnoindex -->. Google equivalent tags: <!-- googleon: index --> or <!-- googleon: all -->

Note: the googleoff/googleon anchor and snippet tags are not supported by Funnelback.
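
For example, the following markup is equivalent to the Funnelback noindex/endnoindex example shown above:

<body>
... This section is indexed ...
<!--googleoff: index-->
... Text in this section is not indexed ...
<!--googleon: index-->
... This section is indexed ...
</body>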

Crawler include/exclude patterns

The web crawler can be configured with include and exclude rules that are applied to every URL that is encountered.

These include/exclude patterns are matched against the document's URL using either a substring or regular expression match. Items that fall outside the include scope, or inside the exclude scope, will be skipped by the crawler.

This is a good way to remove items from a search, however it also means that any sub-pages that are not separately linked elsewhere will be excluded, because the file is skipped entirely. For example, if you added a top-level home page to your exclude patterns it is likely that the rest of the site would be missed. If you wished to exclude only the home page and still crawl the rest of the site, a robots meta tag with a value of noindex,follow would be the appropriate approach to take.
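
As a rough sketch, include and exclude patterns are set in the collection (data source) configuration. The setting names and values below are illustrative assumptions, so verify them against the documentation for your Funnelback version:

collection.cfg
include_patterns=www.example.com
exclude_patterns=/login,/search,/print/

With a configuration like this, only URLs containing www.example.com are considered, and any URL containing /login, /search or /print/ is skipped.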

Crawler included file types and file size limits

The file types accepted by the web crawler are controlled by a few different crawler settings:

  • crawler.accept_files, crawler.reject_files and crawler.store_all_types can be used to control the different types of files that the crawler accepts.

  • crawler.parser.mimeTypes controls the files that are parsed by the crawler for the purpose of extracting further links to crawl.

The following settings control the size of documents accepted/processed:

  • crawler.max_download_size controls the maximum file size that will be accepted by a crawler (default is 3MB).

  • crawler.max_parse_size controls the maximum amount of a document that will be parsed (default is 3MB).
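
The following is a hedged sketch of how these settings might appear in the collection configuration; the file extensions, MIME types and sizes shown are illustrative, and units and defaults may differ between Funnelback versions:

collection.cfg
crawler.accept_files=html,htm,pdf,doc,docx
crawler.reject_files=exe,zip,css,js
crawler.parser.mimeTypes=text/html,text/plain
crawler.max_download_size=10
crawler.max_parse_size=10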

Indexer exclusion and kill patterns

The indexer can be configured to remove documents by creating a kill_exact.cfg or kill_partial.cfg file. Any documents that match the patterns defined in these files will be removed from the search index.
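
As an illustration (the URLs are hypothetical; confirm the exact matching behaviour in the Funnelback documentation), each file contains one URL per line. kill_exact.cfg removes documents whose URL matches an entry exactly, while kill_partial.cfg removes any document whose URL starts with an entry:

kill_exact.cfg
https://www.example.com/old-page.html

kill_partial.cfg
https://www.example.com/archive/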

There is also a pair of indexer options, check_url_exclusion and url_exclusion_pattern, that can be used to exclude documents based on URL (similar to the crawler include/exclude rules). When check_url_exclusion is enabled, documents with URLs matching the url_exclusion_pattern will also be removed from the index.

It is obviously better to exclude items before they reach the indexer (as time is wasted fetching and processing documents that are ultimately removed from the index), but these options can be useful in circumstances where the other approaches listed above are not possible.