Eliminating duplicate search results

This article provides advice on how to prevent duplicates from appearing in the search results.

Funnelback employs various techniques to remove duplicates (and near-duplicates), but it is still possible for duplicates to appear in the search results.

Check the search data model for the duplicate results

View the search.json response and look at the duplicated results. Check the collection parameter of each result, which indicates the data source index that returned it. If the same result is returned from different data sources, you will need to add exclude rules to ensure that only one data source indexes the content.

If the collection value is the same for the duplicated entries then the content is duplicated within the data source.
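The check described above can be sketched in Python. This is a minimal illustration rather than a Funnelback tool: it assumes you have already extracted the list of results from the search.json response, and it assumes that results sharing a title are duplicates. The title and liveUrl field names, the find_duplicates function and the sample data are hypothetical; only the collection field comes from the description above.

```python
from collections import defaultdict


def find_duplicates(results, key_field="title", source_field="collection"):
    """Group results by key_field and report keys that occur more than once.

    Returns {key: (kind, sources)} where kind is "within" when every copy
    came from one data source, and "across" when several sources returned it.
    """
    groups = defaultdict(list)
    for result in results:
        groups[result.get(key_field)].append(result.get(source_field))

    report = {}
    for key, sources in groups.items():
        if len(sources) > 1:
            distinct = sorted(set(sources))
            kind = "within" if len(distinct) == 1 else "across"
            report[key] = (kind, distinct)
    return report


# Hypothetical results, shaped like entries taken from a search.json response.
sample = [
    {"title": "Contact us", "liveUrl": "https://example.com/contact", "collection": "site-web"},
    {"title": "Contact us", "liveUrl": "https://example.com/contact?ref=nav", "collection": "site-web"},
    {"title": "Annual report", "liveUrl": "https://example.com/report", "collection": "site-web"},
    {"title": "Annual report", "liveUrl": "https://docs.example.com/report", "collection": "docs-web"},
    {"title": "Home", "liveUrl": "https://example.com/", "collection": "site-web"},
]

for title, (kind, sources) in find_duplicates(sample).items():
    print(title, kind, sources)
```

A "within" group points to duplication inside one data source, while an "across" group points to the same content being indexed by more than one data source.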

For web data sources:

  • The web crawler attempts to remove duplicates and near-duplicates based on the similarity of the page content.

  • However, if the pages are different enough (e.g. different titles with the same body content) then duplicates can slip through. This is particularly an issue with generated pages where you might get the same page generated with different URLs.

How can I prevent duplicate content?

Canonical URLs

If the duplication occurs within a single web data source and your site is generated or has variant pages with the same content, you can provide Funnelback (and other web crawlers like Google) with a canonical URL via a <link> tag included in the page header.

A canonical URL tells any web robot parsing the page to override the URL used to access the page with the one specified in the canonical URL. The canonical URL is then used to reference the page in the search index, which prevents duplicates as long as all variants of the page contain the same canonical URL.
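To illustrate the effect, the following Python sketch shows how a robot might resolve the URL to index from a page's canonical link. It is not Funnelback's crawler code; the CanonicalLinkParser class, the index_url function and the example.com URLs are hypothetical, and the sketch only assumes the standard rel="canonical" form of the <link> tag.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def index_url(fetched_url, html):
    """Return the canonical URL if the page declares one, else the fetched URL."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical or fetched_url


# Two variant URLs serving the same page, both declaring the same canonical URL.
page = """<html><head>
<title>Widget</title>
<link rel="canonical" href="https://example.com/products/widget">
</head><body>Widget details</body></html>"""

variants = [
    "https://example.com/products/widget?sort=price",
    "https://example.com/products/widget?session=abc123",
]

# Both variants resolve to a single URL, so only one entry would be indexed.
print({index_url(url, page) for url in variants})
```

A page without a canonical link falls back to the URL it was fetched from, which is why every variant needs to carry the same tag.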

Exclude the duplicates

If the duplicates occur in different data sources, choose one data source to index the content and add the duplicated URLs to the exclude patterns of the other data sources.

If the duplication occurs within a single data source you may be able to exclude URLs causing the duplication if they have a common URL pattern.
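Before adding an exclude pattern, it can help to check which URLs it would remove. The Python sketch below assumes plain substring matching, which may differ from how your data source evaluates its exclude patterns; the split_by_exclude function, the "?view=print" pattern and the URLs are hypothetical examples.

```python
def split_by_exclude(urls, exclude_patterns):
    """Split URLs into (kept, excluded) using simple substring matching."""
    kept, excluded = [], []
    for url in urls:
        if any(pattern in url for pattern in exclude_patterns):
            excluded.append(url)
        else:
            kept.append(url)
    return kept, excluded


# Hypothetical crawl: each article is also available as a print view.
urls = [
    "https://example.com/news/article-1",
    "https://example.com/news/article-1?view=print",
    "https://example.com/news/article-2",
    "https://example.com/news/article-2?view=print",
]

kept, excluded = split_by_exclude(urls, ["?view=print"])
print("kept:", kept)
print("excluded:", excluded)
```

Reviewing the kept list confirms that the pattern removes only the duplicate variants and not the pages you want to remain in the index.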

Result collapsing

You may also be able to configure result collapsing to suppress the duplicated results, but this requires a common URL, content signature, or metadata field combination to use as the key to collapse on.
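The idea of collapsing on a common key can be sketched in Python as follows. This illustrates the concept only and is not how Funnelback implements or configures result collapsing; the collapse_results function and the title and signature field names are hypothetical.

```python
def collapse_results(results, key_fields):
    """Keep the first result for each key and count the ones collapsed under it.

    The key is the combination of values found in key_fields.
    """
    collapsed = {}
    for result in results:
        key = tuple(result.get(field) for field in key_fields)
        if key not in collapsed:
            collapsed[key] = {"result": result, "collapsed_count": 0}
        else:
            collapsed[key]["collapsed_count"] += 1
    # Dictionaries preserve insertion order, so ranking order is retained.
    return list(collapsed.values())


# Hypothetical results: two URLs share the same title and content signature.
results = [
    {"url": "https://example.com/a", "title": "Widget", "signature": "9f2c"},
    {"url": "https://example.com/a?print=1", "title": "Widget", "signature": "9f2c"},
    {"url": "https://example.com/b", "title": "Gadget", "signature": "41ab"},
]

for entry in collapse_results(results, ["title", "signature"]):
    print(entry["result"]["url"], entry["collapsed_count"])
```

If the duplicated results share no stable value in any field, no key groups them together, which is why collapsing depends on finding a common URL, signature or metadata combination.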