Spelling suggestions

Introduction

Funnelback’s spelling suggestion system has the following features:

  1. It is capable of making suggestions even if all words are correctly spelled.

  2. It is not based on standard language-based dictionaries and can make suggestions in non-English and multi-lingual environments.

  3. It takes advantage of query context.

  4. It will not make suggestions which do not match the collection being searched.

  5. It mines suggestions from the document text, document metadata, and external annotations, in order to work in the initial absence of search logs.

  6. It learns and improves as users interact with the system.

This document gives an overview of how the system operates and how it can be configured.

Indexing

Each time a collection is successfully indexed, the annie-a program is run to collect and index annotations such as anchor text, tags and click-associated queries. The build_spelling_index program then runs and builds the index files needed by the spelling suggester.

The build_spelling_index program harvests suggestions from annotations, title metadata and the corpus lexicon (list of words in all documents). Suggestions in the suggestion index files include a weight which reflects how well the suggestion relates to the collection being searched.

Query processing

When a query returns fewer full matches than a threshold number, the suggester is run and may make a suggestion. The threshold increases with the number of documents in the collection.
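The trigger logic can be pictured with a small sketch. The growth function below is purely an assumption for illustration: the document states only that the threshold increases with collection size, not how. The names full_match_threshold, should_run_suggester and base are hypothetical and are not part of Funnelback.

```python
def full_match_threshold(num_documents: int, base: int = 1) -> int:
    """Hypothetical threshold that grows with collection size.

    Assumption: one extra required match per order of magnitude of
    documents. The real growth function is not documented here.
    """
    digits = len(str(max(num_documents, 1)))
    return base + digits - 1


def should_run_suggester(full_matches: int, num_documents: int) -> bool:
    """Run the suggester only when the query has too few full matches."""
    return full_matches < full_match_threshold(num_documents)


# A query with no full matches triggers the suggester; a query with many
# full matches does not.
print(should_run_suggester(0, 1_000))
print(should_run_suggester(50, 1_000))
```

The point of the sketch is only the shape of the decision: a larger collection tolerates more full matches before the suggester is skipped.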

When considering candidate suggestions, the suggester calculates the similarity between the candidate and the original query. It first calculates an "edit distance" between the two, i.e. how many characters would have to be inserted, deleted or replaced to turn one into the other. It then calculates a similarity that takes into account both the edit distance and the length of the original query.
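The following is a minimal sketch of this calculation, not Funnelback's actual implementation. The edit distance is the standard Levenshtein distance. The similarity formula (one minus the distance divided by the query length) is an assumption chosen for illustration, since the document does not give the exact formula. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or replacements needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # replace ch_a with ch_b, or keep it
            ))
        previous = current
    return previous[-1]


def similarity(query: str, candidate: str) -> float:
    """Assumed normalisation: 1 - distance / len(query), clamped to [0, 1]."""
    if not query:
        return 0.0
    distance = edit_distance(query, candidate)
    return max(0.0, 1.0 - distance / len(query))


print(edit_distance("funelback", "funnelback"))  # one inserted character
print(similarity("funelback", "funnelback"))
```

Dividing by the query length means that a single-character error counts for more in a short query than in a long one, which is one plausible reading of "taking into account the length of the original query".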

The score for the candidate is a weighted combination of similarity-to-the-query and the weight computed during indexing. A less similar suggestion may be made if it has a higher weight.
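As an illustration of how a less similar candidate can win, the sketch below blends the two values linearly. The blending parameter alpha, the function name candidate_score and the example numbers are all hypothetical. The document says only that the score is a weighted combination, not which weights are used.

```python
def candidate_score(similarity: float, index_weight: float,
                    alpha: float = 0.5) -> float:
    """Linear blend of similarity-to-the-query and the indexing-time weight.

    alpha is a hypothetical mixing parameter, not a Funnelback setting.
    """
    return alpha * similarity + (1.0 - alpha) * index_weight


# Illustrative values only: (similarity to query, weight from indexing).
candidates = {
    "funnelback": (0.80, 0.90),  # less similar, but strongly weighted
    "funnelbank": (0.90, 0.10),  # more similar, but weakly weighted
}

best = max(candidates, key=lambda word: candidate_score(*candidates[word]))
print(best)
```

With these numbers the less similar candidate is chosen, because its higher indexing weight outweighs the small difference in similarity.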

Configuring the spelling index

The following collection.cfg parameters configure how the spelling index is built. They are passed to the build_spelling_index program:

Configuration option Description

spelling.suggestion_lexicon_weight

Specifies the weighting given to suggestions from the lexicon (the list of words from indexed documents) relative to other sources (e.g. annotations).

spelling.suggestion_sources

Specifies the sources of information used to generate spelling suggestions.

spelling.suggestion_threshold

Threshold which controls how suggestions are made.

spelling_enabled

Whether to enable spell checking in the search interface (true or false).

Additional configuration options can be used to blacklist and whitelist terms in the spelling index.

Configuration option Description

Spelling Blacklist

Specifies words that should never appear as spelling suggestions. The list is read when the spelling index is built.

Spelling Whitelist

Specifies words that should always appear as spelling suggestions. The list is read when the spelling index is built.
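Conceptually, the two lists act as a filter and a supplement applied while the index is built. The sketch below shows that idea only. The function name, the data shapes and the default weight given to whitelisted words are assumptions, and the sketch assumes that no word appears on both lists.

```python
def apply_word_lists(candidates: dict[str, float],
                     blacklist: set[str],
                     whitelist: set[str],
                     whitelist_weight: float = 1.0) -> dict[str, float]:
    """Return a copy of candidates with the word lists applied.

    candidates maps each suggestion to its weight. whitelist_weight is a
    hypothetical default for whitelisted words that were not harvested.
    """
    # Blacklisted words are dropped from the harvested suggestions.
    result = {word: weight for word, weight in candidates.items()
              if word not in blacklist}
    # Whitelisted words are added if they were not harvested already.
    for word in whitelist:
        result.setdefault(word, whitelist_weight)
    return result


harvested = {"funnelback": 0.9, "funelback": 0.2}
index = apply_word_lists(harvested,
                         blacklist={"funelback"},
                         whitelist={"squiz"})
print(sorted(index))
```

Because both lists are read at build time, a change to either one would only take effect after the spelling index is rebuilt.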

Configuring spelling suggestions at query time

The behaviour of the spelling suggestions system can be controlled at query time by specifying query processor options. This can be done by adding these parameters to the collection.cfg file or specifying them as CGI parameters.

For more information, see the section on spelling-related options in the documentation for the query_processor_options parameter.

Automatically applying spelling suggestions

Spelling suggestions can be automatically applied to the query by configuring query blending.

Auto-completion

The spelling suggestion index is also used as the source for simple auto-completion.