Ignore canonical links in web pages

Background

This article outlines the steps required to ignore canonical links that are embedded in web pages.

Details

Canonical links are useful when used correctly: they allow a web administrator to define the URL that should be used to reference a page (especially when many aliases resolve to the same page). By default, Funnelback always uses a canonical link if one is defined, replacing the URL of the crawled page with the URL defined in the canonical link tag.

However, it is fairly common for the canonical link tag to be misused and to contain an incorrect URL.

For example, if every page on a website had a canonical link pointing to the site's homepage, Funnelback would end up with only a single document in the index, because all the other pages would be marked as duplicates.
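The effect can be illustrated with a small simulation. This is a simplified sketch, not Funnelback's actual indexing logic: the URLs, the `crawled` mapping and the `index_urls` function are hypothetical names used only to show how replacing each crawled URL with its canonical URL collapses the index.

```python
# Hypothetical crawl results: crawled URL -> canonical URL declared in the page.
# Every page (incorrectly) declares the homepage as its canonical URL.
crawled = {
    "https://example.com/": "https://example.com/",
    "https://example.com/about": "https://example.com/",
    "https://example.com/contact": "https://example.com/",
}


def index_urls(pages, ignore_canonical=False):
    """Return the set of URLs under which the pages would be indexed."""
    urls = set()
    for crawled_url, canonical in pages.items():
        if canonical and not ignore_canonical:
            # Default behaviour: the canonical URL replaces the crawled URL.
            urls.add(canonical)
        else:
            # Canonical links ignored (or absent): keep the crawled URL.
            urls.add(crawled_url)
    return urls


print(len(index_urls(crawled)))
print(len(index_urls(crawled, ignore_canonical=True)))
```

With the canonical links honoured, the three pages share one URL and collapse to a single entry; with them ignored, each page keeps its own URL.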

Always try to get the website owner to fix the canonical URL at the source before ignoring the links.

Process

There are two ways to ignore the canonical links defined in web pages.

The easiest method is to set an indexer option that instructs the indexer to ignore the canonical link references.

  1. Add the following to the indexer_options in collection.cfg:

    indexer_options= -ignore_link_rel_canonical

  2. Update or reindex the collection.

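The indexer option applies to the collection as a whole. If canonical links should be ignored on some pages only, the only way to achieve this is to write a jsoup filter that removes the canonical links from the desired pages.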
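The transformation such a filter performs can be sketched in plain Python. This is an illustration of the logic only, not a Funnelback jsoup filter and not the Funnelback filter API: the `strip_canonical` function, the `url_prefixes` parameter and the example URLs are all hypothetical, and the regular expression is a simplification that a real filter would replace with proper HTML parsing.

```python
import re

# Simplified pattern for a <link> tag whose rel attribute is "canonical".
# Assumption: the tag is written on a single line in a conventional form.
CANONICAL_LINK = re.compile(
    r"<link\b[^>]*\brel\s*=\s*[\"']?canonical[\"']?[^>]*>\s*",
    re.IGNORECASE,
)


def strip_canonical(url, html, url_prefixes):
    """Remove canonical link tags from html if url starts with a listed prefix."""
    if any(url.startswith(prefix) for prefix in url_prefixes):
        return CANONICAL_LINK.sub("", html)
    # Pages outside the listed prefixes are returned unchanged.
    return html


page = '<head><link rel="canonical" href="https://example.com/"><title>A</title></head>'
print(strip_canonical("https://example.com/news/a", page, ["https://example.com/news/"]))
```

Limiting the removal by URL prefix keeps correctly configured canonical links working elsewhere on the site.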