crawler.ignore_canonical_links

Background

This parameter controls whether the web crawler should ignore the canonical link(s) on an HTML page.

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the crawler.ignore_canonical_links key, and set the value. This can be set to any valid Boolean value.

Default value

crawler.ignore_canonical_links=false

The default value is false, i.e., the crawler does NOT ignore the canonical URL set on an HTML page.

Notes

This parameter controls whether the crawler ignores the canonical link(s) set on an HTML page (e.g., <link rel="canonical" href="www.example.com"/>). By default, its value is false. In that case the crawler processes the content of the non-canonical page, including the canonical reference URL, but none of the content of the non-canonical page is stored for indexing.

If this flag is set to true, the crawler removes the canonical link from the crawled HTML page, which prevents the canonical reference URL from being extracted and stored, and then processes the rest of the page as normal. Because the canonical link has been removed from the crawled content, the Padre indexer option that enables or disables canonical link indexing (i.e., indexer_options=-ignore_link_rel_canonical) has no effect, even when "-ignore_link_rel_canonical" is not set.
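
As a rough illustration of what removing the canonical link from a crawled page means, the following Python sketch strips <link rel="canonical"> elements from an HTML string before any further processing. It is not the crawler's actual implementation: the function names strip_canonical_links and prepare_page, the ignore_canonical_links argument, and the sample page are all hypothetical, and a regular expression stands in for the HTML parsing a real crawler would do.

```python
import re

# Matches a <link> tag whose rel attribute is "canonical".
# Illustrative only: a regex is a simplification of real HTML parsing.
_CANONICAL_LINK = re.compile(
    r"""<link\b[^>]*\brel\s*=\s*["']?canonical["']?[^>]*>""",
    re.IGNORECASE,
)


def strip_canonical_links(html: str) -> str:
    """Return the HTML with every <link rel="canonical"> element removed."""
    return _CANONICAL_LINK.sub("", html)


def prepare_page(html: str, ignore_canonical_links: bool = False) -> str:
    """Hypothetical pre-processing step that mirrors the setting's value."""
    if ignore_canonical_links:
        # The canonical link is gone, so later stages never see it.
        return strip_canonical_links(html)
    # Default behaviour: the page is passed on unchanged.
    return html


# Hypothetical sample page used only for this sketch.
page = (
    '<html><head><link rel="canonical" href="https://www.example.com/"/>'
    "<title>T</title></head><body>Hi</body></html>"
)
cleaned = prepare_page(page, ignore_canonical_links=True)
```

Once the element has been stripped in this way, nothing downstream can recover it, which is why an indexer-side option for canonical links has nothing left to act on.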