Failed changeover conditions

Description

This error indicates that the number of documents contained within an index has decreased beyond an acceptable limit.

Error message

Displayed in the update-<DATA-SOURCE-ID>.log file

Error details:
Failure in: Checking the total number of documents indexed is sufficient. step name: 'ChangeOverIndexCountCheck'
Caused by:
Views will not be swapped because the changeover condition check failed. This data set has only 6.45% (5597/86775) as many docs as the last one. See the changeover_percent setting for details.
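When monitoring many data sources, it can be useful to pull the figures out of this message automatically. The following is a minimal sketch, assuming the message is worded exactly as in the sample above; the function name `parse_changeover_failure` and the regular expression are illustrative and not part of Funnelback.

```python
import re

# Pattern derived from the sample message above; other versions of the
# product may word the message differently, so treat this as an assumption.
PATTERN = re.compile(r"has only ([\d.]+)% \((\d+)/(\d+)\) as many docs")


def parse_changeover_failure(line):
    """Return (percent, new_count, previous_count), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2)), int(match.group(3))


sample = (
    "Views will not be swapped because the changeover condition check failed. "
    "This data set has only 6.45% (5597/86775) as many docs as the last one."
)
print(parse_changeover_failure(sample))  # (6.45, 5597, 86775)
```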

Cause

Funnelback compares the number of documents in a newly built (offline) index to the number of documents in the current live index. If the new count is below the configured changeover_percent (default: 50%) of the live count, the update fails with this error, because a much smaller index can indicate that something went wrong during the update (e.g. the internet connection went down while crawling and many expected pages were not fetched).
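The comparison can be sketched as follows. This is an illustration of the arithmetic only, not Funnelback's actual implementation; the function name `changeover_allowed` and the handling of an empty live index are assumptions.

```python
def changeover_allowed(new_count, live_count, changeover_percent=50.0):
    """Return (allowed, percent) for a sketch of the document count check."""
    if live_count == 0:
        # Assumption: with no live documents there is nothing to compare against.
        return True, 100.0
    percent = 100.0 * new_count / live_count
    # The update fails when the new count is below the threshold.
    return percent >= changeover_percent, percent


# Figures taken from the sample error message.
allowed, percent = changeover_allowed(5597, 86775)
print(allowed, round(percent, 2))  # False 6.45
```

With the default threshold of 50%, the new index in the sample message would have needed at least 43388 documents to pass.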

Resolution

Once-off expected reduction

If the reduction in size is expected (e.g. due to a website being redesigned with significantly fewer pages) then an update can be run that overrides this check:

  1. Confirm that the offline view of the data source has the expected number of documents (check the document counts in the Step-Index.log for the offline view).

  2. To make the offline index live and skip the document count check, select advanced update from the data source management screen, and then select the option to swap the live and offline views.

  3. The update will start and complete very quickly, as swapping the views does not involve gathering or indexing any content.

Data sources that have highly variable document counts

If the error is caused by a data source that has highly variable document counts, adjust the threshold used for the swap views check (the default is 50%) by changing the changeover_percent data source configuration setting.
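When choosing a new threshold, it can help to look at how far the document count has dropped between past updates that you consider healthy. The following is a minimal sketch, assuming you have recorded those counts yourself; the `history` list is hypothetical data and `smallest_ratio_percent` is an illustrative helper, not a Funnelback feature.

```python
def smallest_ratio_percent(counts):
    """Smallest new/previous ratio (as a percentage) across consecutive updates."""
    ratios = [
        100.0 * new / old
        for old, new in zip(counts, counts[1:])
        if old > 0
    ]
    return min(ratios) if ratios else None


# Hypothetical document counts from five consecutive healthy updates.
history = [1000, 800, 1200, 600, 900]
print(smallest_ratio_percent(history))  # 50.0
```

A changeover_percent value somewhat below the smallest ratio seen would let updates like these pass while still catching a much larger drop.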

Missing seed or section pages

If the crawler was unable to fetch a seed page or an important section-level page then this could also cause large numbers of documents to not be included in the index.

For example, if a home page in the seed list was offline when the crawler attempted to fetch it then the crawler won’t gather any content from that site, which could mean a large number of documents are missing for the current crawl. This is one of the primary reasons why there is a changeover check performed as part of the update.

Check the crawler's url_errors.log for page not found errors, paying particular attention to important top-level pages. If a URL that you expect to be in the index appears there, it is a likely cause of the failure. Check this URL further, first with your web browser and, if it loads correctly there, with the DEBUG API. Often you can just start another update and the index will update successfully because the URL is no longer offline.
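The check for seed pages in the error log can be partly automated. The following is a minimal sketch, assuming only that each log line contains the failing URL as a whitespace-separated token; the exact format of url_errors.log is not described here, and the sample lines and the `seed_errors` helper are hypothetical.

```python
def seed_errors(log_lines, seed_urls):
    """Return {seed_url: [log lines]} for seed URLs that appear in the log."""
    seeds = set(seed_urls)
    hits = {}
    for line in log_lines:
        # Compare whole tokens so a seed does not match longer URLs under it.
        for token in line.split():
            if token in seeds:
                hits.setdefault(token, []).append(line.strip())
    return hits


# Hypothetical log lines; the real log format may differ.
example_log = [
    "E https://example.com/ [404 Not Found]\n",
    "E https://example.com/old-page [404 Not Found]\n",
]
print(seed_errors(example_log, ["https://example.com/"]))
```

To run it against a real log, pass an open file object as `log_lines`, since iterating a file yields one line at a time.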

Some network device is blocking the crawler

Occasionally a network device or service blocks the Funnelback crawler, resulting in a much smaller number of documents being stored. For example, this can be caused by denial of service prevention measures that don’t whitelist the Funnelback crawler. See: Crawl runs fine for a while then everything times out.