Document kill - remove (kill) URLs/documents from the index

Document killing does not apply to push data sources. To remove documents from a push data source, use the Push API instead.

It is not uncommon to find items in the search results that are not useful to users.

It is generally a good idea to prevent such URLs from being gathered in the first place, either by using robots.txt or robots metadata, or by setting include/exclude rules on your data source.
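
For example, if you control the site being crawled, a robots.txt rule can stop matching URLs from being gathered at all. The paths below are placeholders for your own site structure:

    # robots.txt (example only - the paths are placeholders)
    User-agent: *
    Disallow: /a-z-listing/
    Disallow: /sitemap/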

However, in some cases it is not possible to do this, and you need to add a step to remove these URLs after the index has been built. The kill configuration enables you to do this.

When to use document kill

Use document kill:

  • (for web data sources) if you don’t have the ability to control robots.txt or apply robots noindex tags to your content, and you need to gather the pages in order to reach other parts of your site.

  • (for other data sources) if you don’t have the ability to exclude the documents from the gatherer, either via gatherer configuration or by controlling what documents the gatherer can access (e.g. via permissions, or a more specific SQL query).

Don’t use document kill:

  • To remove documents from a push data source. The kill configuration has no effect on push data sources. For push data sources it is best to ensure the document is not pushed in the first place. If you need to remove a document, use the Push API to remove it (a rough sketch follows below).
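
As a rough sketch, removing a document from a push data source via the Push API looks something like the call below. The server address, port, credentials and endpoint path are assumptions and may differ for your Funnelback version, so treat this as illustrative and check the Push API documentation:

    # Hypothetical example only: remove one document from a push data source.
    # SERVER, COLLECTION, the credentials and the document key are placeholders; the endpoint path is an assumption.
    # The key value may need to be URL-encoded.
    curl -X DELETE -u admin \
      "https://SERVER:8443/push-api/v2/collections/COLLECTION/documents?key=https://example.com/page-to-remove.html"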

Example use case

Killing the home page is a common use for kill configuration: you don’t want the home page to appear in the search results, but you need to crawl it to reach the rest of your site.

Kill configuration solves this problem by including the home page in the crawl (so that sub-pages can be crawled and indexed) but removing the home page itself from the index afterwards.

Examples of other items that are commonly removed:

  • Home pages

  • Site maps

  • A-Z listings

  • 'Index' listing pages

Removing items from the index is as simple as listing the URLs in a configuration file. After the index is built a process runs that will remove any items that are listed in the kill configuration.

Using robots meta tags is often a better way to control this because it will apply to all web crawlers that index your website. Add a <meta name="robots" content="noindex,follow" /> meta tag to tell a robot to follow the links on the page but not index it.
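
For example, a page that should be crawled for its links but kept out of the index might include the tag in its head section (a minimal sketch):

    <head>
      <title>Example page</title>
      <!-- Follow the links on this page, but do not index the page itself -->
      <meta name="robots" content="noindex,follow" />
    </head>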

Killing URLs based on a match to the URL

Funnelback provides two configuration files that enable you to remove URLs from the search index, based on a match to the URL:

  • kill_exact.cfg: URLs exactly matching those listed in this file will be removed from the index. The match is based on the indexUrl (as seen in the data model).

  • kill_partial.cfg: URLs that start with any of the values listed in this file will be removed from the index. Again, the match is based on the indexUrl (as seen in the data model).
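
For example (illustrative URLs only; the format of both files is one URL per line), a kill_exact.cfg that removes the home page and a site map page might contain:

    https://www.example.com/
    https://www.example.com/sitemap.html

while a kill_partial.cfg that removes every document under a listing path might contain:

    https://www.example.com/a-z-listing/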

Killing the set of URLs returned by a search query

Funnelback provides a configuration file that enables you to remove URLs from the search index if they are returned by a search query.

The configuration file accepts one or more queries.

This is specified using:

  • query-kill.cfg: URLs returned by the queries listed in this file will be removed from the search index.
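
For example, assuming the format is one query per line (the queries below are illustrative only), a query-kill.cfg might contain:

    sitemap
    a-z listing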

Checking the status of what has been killed during the update

Document kill is applied as part of the indexing phase and messages for this are logged to the Step-ExactMatchKill.log, Step-PartialMatchKill.log and Step-KillDocumentsByQuery.log, found in your data source logs.

These log files all produce a before/after comparison of documents, allowing you to see how many documents were killed by the step.

    Running command: /opt/funnelback/bin/padre-fl
    With arguments: /opt/funnelback/data/training~ds-simpsons/one/idx_reindex/index /opt/funnelback/data/training~ds-simpsons/live/tmp/query.kill-docs.q._removeDuplicated_20230825 -exactmatch -kill
    Command will not read from STDIN
    Environment: {TEMP=/tmp/1692922586290-0, EXECUTABLE_DIR=/opt/funnelback/bin, LD_LIBRARY_PATH=/opt/funnelback/bin, TMP=/tmp/1692922586290-0, TMPDIR=/tmp/1692922586290-0}
    Log output: /opt/funnelback/data/training~ds-simpsons/live/log/Step-KillDocumentsByQuery.log
####################################################################################################

Making backup: /bin/cp /opt/funnelback/data/training~ds-simpsons/one/idx_reindex/index.dt /opt/funnelback/data/training~ds-simpsons/one/idx_reindex/index.dt_bak

Showing summary before changes, if any
{
 "total_documents": 911,
 "expired_documents": 0,
 "killed_documents": 0, (1)
 "duplicate_documents": 0,
 "noindex_documents": 0,
 "filtered_binary_documents": 0,
 "documents_without_an_early_binding_security_lock": 911,
 "documents_with_paid_ads": 0,
 "unfiltered_binary_documents": 0,
 "documents_matching_admin_specified_regex": 0,
 "noarchive_documents": 0,
 "nosnippet_documents": 0
}
URL Patterns: 80 found and sorted.
Document URLs sorted: 911
Performing specified operation (bittz = 0, bitop = 4)...
   num_docs = 911.  num_pats = 80  (2)
Showing summary after changes if any:
{
 "total_documents": 911,
 "expired_documents": 0,
 "killed_documents": 80, (3)
 "duplicate_documents": 0,
 "noindex_documents": 0,
 "filtered_binary_documents": 0,
 "documents_without_an_early_binding_security_lock": 911,
 "documents_with_paid_ads": 0,
 "unfiltered_binary_documents": 0,
 "documents_matching_admin_specified_regex": 0,
 "noarchive_documents": 0,
 "nosnippet_documents": 0
}
Command finished with exit code: 0
1 The first killed_documents count indicates the number of killed documents before the kill step was applied. (The three kill steps are applied individually).
2 The num_pats value indicates the number of kill patterns that were applied when the kill was run.
3 The second killed_documents count indicates the number of killed documents after the kill step was applied. This does not need to match num_pats. If a kill by URL pattern is being applied, each pattern can match multiple documents. If a kill by exact matching URL or a kill by query is being applied, each pattern supplied to the kill is an individual URL; however, some of those URLs may have already been killed in a previous step, or may not match anything in the index. For kill by query you might also find that some URLs are duplicated between queries (and can only be killed once).

In addition, when killing by query you will get an additional log file, Step-KillDocumentsByQuery._kill-url-list.log, which lists the set of URLs that were killed by each of the queries. This is useful when debugging, for example when you are trying to determine why a document is missing from the results (or why it is still appearing when you expected it to have been removed).

Tutorials

Tutorial: Remove URLs from an index

In this exercise we will remove some specific pages from a search index.

  1. Run a search for shakespeare against the book finder results page. Observe that the home page result (The Complete Works of William Shakespeare, URL: https://docs.squiz.net/training-resources/shakespeare/) is returned in the search results.

  2. Open the search dashboard and switch to the shakespeare data source.

  3. Click the manage data source configuration files item from the settings panel, then create a new kill_exact.cfg. This configuration file removes URLs that exactly match what is listed within the file.

  4. The editor window for the new kill_exact.cfg file will load. The format of this file is one URL per line; if you don’t include a protocol then http is assumed. Add the following to kill_exact.cfg, then save the file:

    https://docs.squiz.net/training-resources/shakespeare/
  5. As we are making a change that affects the makeup of the index we will need to rebuild the index. Rebuild the index by running an advanced update to re-index the live view.

  6. Repeat the search for shakespeare and observe that the Complete Works of William Shakespeare result no longer appears in the results.

  7. Run a search for hamlet and observe that a number of results are returned, corresponding to different sections of the play. We will now define a kill pattern to remove all of the hamlet items. Observe that all the items we want to kill start with a common URL base.

  8. Return to the search dashboard and create a new kill_partial.cfg.

  9. Add the following to the file and then save it:

    https://docs.squiz.net/training-resources/shakespeare/hamlet/
  10. Rebuild the index then repeat the search for hamlet. Observe that all the results relating to chapters of Hamlet are now purged from the search results.

You can check how many documents were killed from the index by viewing the Step-ExactMatchKill.log and Step-PartialMatchKill.log log files in the log viewer on the shakespeare data source.