Eliminating duplicate search results
This article provides advice on how to prevent duplicates from appearing in search results.
Funnelback employs various techniques to remove duplicates (and near-duplicates), but it is still possible for duplicates to appear in the search results.
Check the search data model for the duplicate results
Request the search.json response and inspect the duplicated results. Check the
collection value of each result - this indicates the data source index that returned it. If the same result is returned from different data sources, add exclude rules to ensure that only one data source indexes the content.
If the collection value is the same for the duplicated entries, the content is duplicated within a single data source.
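The check above can be sketched in code. The snippet below parses a search.json-style response and groups results by URL; the field names (`response.resultPacket.results`, `liveUrl`, `collection`) follow the Funnelback data model, but the sample response and collection names are illustrative assumptions:

```python
import json
from collections import defaultdict

# Illustrative fragment of a search.json response; URLs and collection
# names are made up for this example.
search_json = """
{
  "response": {
    "resultPacket": {
      "results": [
        {"liveUrl": "https://example.com/page",  "collection": "web-main"},
        {"liveUrl": "https://example.com/page",  "collection": "web-archive"},
        {"liveUrl": "https://example.com/other", "collection": "web-main"}
      ]
    }
  }
}
"""

def find_duplicates(raw):
    """Return URLs that appear more than once, mapped to the list of
    data sources (collections) that returned them."""
    results = json.loads(raw)["response"]["resultPacket"]["results"]
    by_url = defaultdict(list)
    for result in results:
        by_url[result["liveUrl"]].append(result["collection"])
    return {url: cols for url, cols in by_url.items() if len(cols) > 1}

for url, collections in find_duplicates(search_json).items():
    if len(set(collections)) > 1:
        # Same URL from different data sources: fix with exclude rules.
        print(f"{url}: duplicated across data sources {sorted(set(collections))}")
    else:
        # Same URL repeated within one data source: content is duplicated there.
        print(f"{url}: duplicated within data source {collections[0]}")
```

If the duplicated URLs come from different collections, exclude rules are the remedy; if they come from the same collection, read on for the in-data-source fixes below.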
For web data sources:
The web crawler attempts to remove duplicates and near-duplicates by analyzing page content and comparing its similarity across pages.
However, if the pages are different enough (e.g. different titles with the same body content) then duplicates can slip through. This is a particular issue with generated pages, where the same page may be served under different URLs.
How can I prevent duplicate content?
If the duplication occurs within a single web data source (for example, your site is generated or has variant pages with the same content), you can provide Funnelback (and other web crawlers such as Google) with a canonical URL, supplied via a
<link> tag included in the page header.
A canonical URL tells any web robot parsing the page to override the URL used to access the page with the URL specified in the canonical link. This is the URL used to reference the page in the search index, so duplicates are prevented as long as all variants of the page contain the same canonical URL.
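For example, every variant of a page can declare the same canonical URL in its `<head>` (the URL shown is illustrative):

```html
<!-- Included in the <head> of every variant of the page.
     The href is an illustrative URL, not a real one. -->
<link rel="canonical" href="https://www.example.com/products/widget">
```

All variants then resolve to the single canonical URL in the index, regardless of which URL the crawler used to fetch them.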
Exclude the duplicates
If the duplicates occur across different data sources, ensure that the duplicated content is added to the exclude patterns of all but one of the data sources.
If the duplication occurs within a single data source you may be able to exclude URLs causing the duplication if they have a common URL pattern.
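As a sketch, in Funnelback versions that use a collection.cfg, exclude patterns are comma-separated substrings (or regular expressions) matched against URLs. The key name and patterns below are illustrative; check the documentation for your version:

```
# collection.cfg (illustrative): skip print-friendly and session-ID
# URL variants that duplicate the canonical pages
exclude_patterns=/print/,sessionid=,?view=print
```

Each pattern should match only the duplicate variants, not the canonical URLs you want to keep in the index.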
You may also be able to configure result collapsing to suppress the duplicated results, but this requires a common URL, content signature or metadata field combination to use as a collapse key.
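Result collapsing is typically enabled through query processor options on the results page. The option names below are indicative only (an assumption, not verified syntax) and should be checked against the Funnelback documentation for your version:

```
# Illustrative query processor options: turn on collapsing and collapse
# on a shared signature, e.g. a metadata class common to the duplicates
-collapsing=on -collapsing_sig=[metaX]
```

Collapsed duplicates are folded under a single result rather than removed from the index, so this is a presentation-layer fix rather than a crawl-time one.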