Using robots.txt and robots meta tags to control what is indexed by search engines

Control what is indexed by a search crawler

Public search engines such as Google and Bing, and site-based search engines (such as the DXP search, if you index your website as a web data source), use a web crawler to gather and index the content from your website.

A web crawler is given one or more pages on your site where it will start a web crawl. It fetches these pages and then follows each link it finds on the page, repeating this process on each new page until it runs out of pages on your site, or it is told to stop (for example, by configuring your crawl to run for a maximum amount of time).

Robots.txt and the robots standard can be used to provide a web robot, such as a web crawler, with instructions on whether it should gather the page or follow links it finds in the page.

There are three ways of providing these instructions (at the site, page or link level), and you can choose to use one or more of these methods at the same time.

Robots.txt - control web robots at the website level

Use robots.txt to prevent well-behaved crawlers from accessing complete sections of a site, based on the URL path. The DXP search will honor robots.txt directives unless you have explicitly turned this off in your search crawler configuration.

When configuring robots.txt, always disallow the search results pages, and also consider preventing access to any pages where Funnelback is being used to generate the page content (such as document/file browse sections). This prevents potential crawler traps, which cause the crawler to continue fetching pages indefinitely.

The Funnelback web crawler identifies itself using the Funnelback robot agent, which can be used to create specific rules for the DXP search. For example, the following robots.txt prevents web robots from crawling the /search and /login content but allows full access to the Funnelback robot agent.

User-agent: *
Disallow: /search
Disallow: /login

User-agent: Funnelback
Disallow:

The robots.txt standard only allows wildcards in the User-agent directive, and there is no Allow: directive (only Disallow:). Some extensions to the standard, supported by large web crawlers such as Google and Bing, allow both of these to be used within your robots.txt file; however, many web robots will ignore these rules. The Funnelback web crawler only adheres to the directives outlined in the robots.txt documentation.
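
For example, the following robots.txt fragment uses these extensions (the paths shown are illustrative). Crawlers such as Googlebot and Bingbot will honor the Allow: rule and the path wildcard, but crawlers that only implement the base standard will ignore them, so don't rely on these rules for content that must stay out of every index.

User-agent: *
Disallow: /private/
Allow: /private/overview.html
Disallow: /*?print=true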

Tutorial: Create a robots.txt file

See: Creating a robots.txt file for a step-by-step tutorial on creating a robots.txt file in Matrix.

Robots meta tags - control web robots at the page level

The web robots standard also provides a set of HTML meta tags that can be used to provide instructions to web crawlers at the page level.

Page-level robots directives are defined within the <head> of an HTML document.

These directives take the form of HTML <meta> elements and allow page-level control of well-behaved robots.
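
For example, the following element, placed in a page's <head>, tells a web crawler to index the page but not follow any of its links:

<meta name="robots" content="index, nofollow" />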

The Funnelback web crawler will honor the following robots meta tags:

  • index, follow: this tells the web crawler to index the page and also follow any links that are found on the page. This is the default behavior.

  • index, nofollow: this tells the web crawler to index the page but to not follow any links that are found on the page. This only applies to links found on this page, and doesn’t prevent the linked page from being indexed if a link to it is found somewhere else on the site.

  • noindex, follow: this tells the web crawler to not include the page in the search index, but to still extract and follow any links that are found within the page. This is useful if you wish to exclude a home page from a crawl, because the crawler still needs to pass through the page to reach its child pages.

  • noindex, nofollow: this tells the web crawler to not include the page in the index and also to not follow any links from this page. This has the same end effect as listing the page in the robots.txt file, but is controllable at the page level.

A nofollow tag only means that the web crawler doesn’t add the links found on the page to the set of pages to be crawled. If a link is found elsewhere without any form of exclusion, the linked page will still appear in the index.

The following directives can be added to the (no)index/(no)follow directives to provide further control of crawler behavior. These are not currently supported by the Funnelback web crawler.

  • noarchive: this tells supporting web crawlers to index the page but to not include the document in the cache (so cache links won’t work).

  • nosnippet: this tells supporting web crawlers to index the page but to not return a snippet when rendering results. This also prevents the caching of the page (as defined by the noarchive directive above).
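
As a sketch, a page that supporting search engines may index, but should neither cache nor show a snippet for, would include the following tag:

<meta name="robots" content="index, noarchive, nosnippet" />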

It is also possible to provide page-level directives using X-Robots-Tag HTTP headers. These are not currently supported by the Funnelback web crawler.
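
For example, a server could apply the equivalent of noindex, nofollow to a PDF file (which has no <head> to carry a meta element) by sending an X-Robots-Tag response header. The surrounding response lines here are illustrative:

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, nofollow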

There are a number of additional robots meta tags that have been defined by third parties such as Google (for example, noimageindex and nocache).

Tutorial: Prevent a home page from appearing in the search index

  1. Add a meta robots noindex,follow tag to your site home page so that the site is indexed but the home page is excluded from the search results (the tag is shown after these steps). See: Metadata

  2. Crawl and index the site and observe that the home page is automatically excluded from the search results.
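
The meta tag added in step 1 takes the following form:

<meta name="robots" content="noindex, follow" />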

Robots nofollow attribute - control web robots at the link level

The HTML anchor <a> tag includes a rel attribute that can be used to prevent a web crawler from following a link. This works like the robots meta nofollow tag, but applies only to the link on which it is set.

To prevent the Funnelback crawler from following a specific link on a page, add a rel="nofollow" attribute to the anchor tag. For example:

<a href="mylink.html" rel="nofollow">Don’t follow this link</a>
A nofollow attribute means that the Funnelback web crawler doesn’t add the specific link to the set of pages that are to be crawled. If the link is found elsewhere without any form of exclusion, then it will still appear in the index.

Tutorial: Prevent the crawler from following a specific link

  1. Create a new standard page that doesn’t display in the site navigation (i.e. link type: hidden link).

  2. Add a link to this page from your home page and include a rel="nofollow" attribute on the link to prevent it from being followed.

  3. Crawl the site and observe that the new page is skipped by the crawler and is not included in the search index.