Implementer training - removing items from the index

It is not uncommon to find items being returned in the search results that are not useful to end users.

Removing these improves the findability of other items in the index and provides a better overall search experience for end users.

There are a few different techniques that can be used to remove unwanted items from the index.

Prevent access to the item

Prevent the item from being gathered by blocking Funnelback's access to it. This needs to be controlled in the data source itself, and the available methods depend on the type of data source. The advantage of this technique is that it can apply beyond Funnelback.

For example:

  • For web data sources, use robots.txt and robots meta tags to disallow access for Funnelback or other crawlers, or to instruct a crawler to follow links but not index a page (or vice versa). See the example after this list.

  • Change document permissions so that Funnelback is not allowed to access the document (e.g. for a file system data source, only grant read permissions for Funnelback’s crawl user to those documents that you wish to be included, and hence discovered, in the search).
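
For example, a robots.txt file along the following lines blocks all crawlers (including Funnelback) from gathering anything under the listed paths. The paths are hypothetical; adjust them to suit your site:

    User-agent: *
    Disallow: /archive/
    Disallow: /staff-only/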

Exclude the item

The simplest method of excluding an item from within Funnelback is to adjust the gatherer so that the item is not gathered.

The exact method of doing this varies from gatherer to gatherer. For example:

  • Web data sources: use include and exclude patterns and the crawler.reject_files setting to prevent unwanted URL patterns and file extensions from being gathered (see the sketch after this list).

  • Database data sources: adjust the SQL query to ensure the unwanted items are not returned by the query.

  • File system data sources: use the exclude patterns to prevent unwanted files from being gathered.
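
As a hedged sketch for a web data source, the relevant settings might look like the following. The values are hypothetical, and the exact option names and syntax should be confirmed against the documentation for your Funnelback version:

    include_patterns=https://example.com/
    exclude_patterns=/calendar/,/print/,/login
    crawler.reject_files=exe,zip,iso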

The use of exclude patterns in a web data source needs to be carefully considered, to assess whether it will prevent Funnelback from crawling content that should be in the index. For example, excluding a home page in a web crawl will prevent Funnelback from crawling any pages linked to by the home page (unless they are linked from somewhere else), as Funnelback needs to crawl a page to extract the links to sub-pages. In this case you should make use of robots meta tags (with noindex,follow values) or kill the URLs at index time (see below).
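
For example, a robots meta tag such as the following in the home page's head tells crawlers to follow the links on the page without adding the page itself to the index:

    <meta name="robots" content="noindex, follow">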

Killing URLs / URL patterns

It is also possible to remove items from the search index after the index is created.

This can be used to solve the home page problem mentioned above - including the home page in the crawl (so that sub-pages can be crawled and indexed) but removing the actual home page afterwards.

Examples of items that are commonly removed:

  • Home pages

  • Site maps

  • A-Z listings

  • 'Index' listing pages

Removing items from the index is as simple as listing the URLs in a configuration file. After the index is built, a process runs that removes any items listed in the kill configuration.

For normal data source types, there are two configuration files that control URL removal:

  • kill_exact.cfg: URLs exactly matching those listed in this file will be removed from the index. The match is based on the indexUrl (as seen in the data model).

  • kill_partial.cfg: URLs that start with any of the entries listed in this file will be removed from the index. Again, the match is based on the indexUrl (as seen in the data model).
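
To illustrate the difference with a hypothetical site: the following entry in kill_exact.cfg removes only the news landing page, whereas the same entry in kill_partial.cfg also removes every URL beneath it (such as https://example.com/news/2024/story.html):

    https://example.com/news/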

For push indexes, URLs can be removed using the push API.
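
As a hedged sketch only (the endpoint path, API version and authentication mechanism vary, so check the push API documentation for your Funnelback version), removing a single document from a push index might look like:

    curl -X DELETE -u admin \
      "https://funnelback-server/push-api/v2/collections/example-push/documents?key=https%3A%2F%2Fexample.com%2Fold-page.html"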

Tutorial: Remove URLs from an index

In this exercise we will remove some specific pages from a search index.

  1. Run a search for shakespeare against the book finder results page. Observe that the home page result (The Complete Works of William Shakespeare, URL: https://docs.squiz.net/training-resources/shakespeare/) is returned in the search results.

  2. Open the search dashboard and switch to the shakespeare data source.

  3. Click the manage data source configuration files item from the settings panel, then create a new kill_exact.cfg. This configuration file removes URLs that exactly match what is listed within the file.

  4. The editor window will load for creating a new kill_exact.cfg file. The format of this file is one URL per line. If you don’t include a protocol then http is assumed. Add the following to kill_exact.cfg then save the file:

    https://docs.squiz.net/training-resources/shakespeare/
  5. As we are making a change that affects the makeup of the index, we will need to rebuild the index. Do this by running an advanced update to re-index the live view.

  6. Repeat the search for shakespeare and observe that the Complete Works of William Shakespeare result no longer appears in the results.

  7. Run a search for hamlet and observe that a number of results are returned, corresponding to different sections of the play. We will now define a kill pattern to remove all of the hamlet items. Observe that all the items we want to kill start with a common URL base.

  8. Return to the search dashboard and create a new kill_partial.cfg.

  9. Add the following to the file and then save it:

    https://docs.squiz.net/training-resources/shakespeare/hamlet/
  10. Rebuild the index then repeat the search for hamlet. Observe that all the results relating to chapters of Hamlet are now purged from the search results.

You can view information on the number of documents killed from the index by viewing the Step-ExactMatchKill.log and Step-PartialMatchKill.log log files, from the log viewer on the shakespeare data source.

Extended exercise: removing items from the index

Configuration files that are managed via the edit configuration files button can also be edited via WebDAV. This currently includes spelling, kill, server-alias, cookies, hook scripts, meta-names, custom_gather, workflow and reporting-blacklist configuration files.

  1. Delete the two kill config files you created in the last exercise, then rebuild the index. Observe that the items you killed are returned to the results listings.

  2. In your favourite text editor create a kill_exact.cfg and kill_partial.cfg containing the URLs you used in the previous exercise. Save these somewhere easy to access (such as your documents folder or on the desktop).

  3. Using your WebDAV editor, connect to the Funnelback server as you did in the previous WebDAV exercise.

  4. Change to the default~ds-shakespeare data source and then browse to the conf folder.

  5. Upload the two kill configuration files you’ve saved locally by dragging them into the conf folder in your WebDAV client.

  6. Once the files are uploaded, return to the search dashboard and open the configuration file listing screen that is displayed when you click the manage data source configuration files item. Observe that the files you just uploaded are displayed.

  7. Rebuild the index and the URLs should once again be killed from the index.

Review questions: removing items from the index

  1. What are the advantages of using robots.txt or robots meta tags to control access to website content?

  2. Why can’t you kill a home page (like http://example.com/) from the index by adding it to the kill_partial.cfg?