Funnelback Ranking Algorithms



More effective search delivers considerable value to searchers, by finding information and services which would otherwise remain hidden or by saving time. An ideal search engine would find all useful information and services within the domain being searched, return them to the searcher in order of decreasing value to the searcher (taking into account overlap with what has already been presented earlier in the list) and present each result in a way which facilitates assessment of its value.

This document describes how out-of-the-box Funnelback ranks documents, explains how to modify settings to meet particular needs and links to documentation on monitoring and measuring effectiveness of search at your site.

Funnelback, like other leading search engines, is based on statistical algorithms and "understands" neither the query nor the content of documents. Accordingly, evaluation of the quality of search should be based on a large number (50+) of queries.

Initial ranking

If you are not yet familiar with Funnelback's query language, please refer to the user help pages on the query language.

Funnelback assigns each document a utility score derived from the factors listed below and records how many of the constraints implicit in the query (in simple queries, each query word counts as a constraint) it satisfied. Results are then sorted into 'tiers' by decreasing number of constraints matched. Results within these 'tiers' are by default listed in decreasing order by score but may be modified to be ranked on numerous other bases (e.g. alphabetical order, date, etc...)

Document content

The particular relevance formula used is a slightly modified form of the Okapi BM25 function developed by Steve Robertson and Steve Walker of City University, London. BM25 embodies the following heuristics:

By default, document content includes text plus title and subject, keywords or description metadata. In document content scoring, each word occurrence counts equally, whether in the body of the document or in the title or in metadata, and the document frequency is a count of all documents which contain at least one occurrence in any of these places.

Document metadata

The Okapi BM25 formula is also used for scoring metadata. Metadata explicitly comes into play when a query term includes a metadata field specification but also implicitly when a simple query term is processed. For example, if the query were library, a document content score would be calculated as described in the previous item, but it would be augmented by the scores for occurrences of the query word in the title of the document (i.e. it would be as if the query term t:library was added to the query). Note that this process may result in substantial score boosts to documents containing library in their title due to lower document frequency. (i.e. there may be thousands of documents containing library but probably only a few contain the word in their title) It is possible to configure Funnelback to specify a different set of implicit metadata fields (see below.) The implicit metadata terms included in each query by default are:

Onsite and Offsite inlink count

Anchor text provides strong query-dependent evidence that some pages are more important than others which also match the query. Sometimes it is useful to take into account query-independent evidence. This might be the case if the site used poor anchor text (e.g. "click here") or if the query did not use the same language as the anchor text (e.g. the query were books rather than library.) Funnelback counts up the number of incoming links and interprets them as votes that this page is useful. The count distinguishes links from the same site and those from outside and weights them differently.

URL length

Generally, important pages (particularly site home pages) have shorter URLs. Funnelback derives a score based on URL length - the shorter the URL the higher the score.

Query independent evidence (QIE)

Evidence can be used that ranks a site's importance regardless of the query that has been issued. This is useful for manually boosting the rankings of important sites or pages. Careless use of this can significantly harm result quality though. More information is available in a separate document on query independent evidence.


Recent information may be more valuable than old. For example, people may prefer to read the 2003 annual report than the one for 2002. Furthermore, link-based measures such as anchor text and in-link counts are biased against new pages because it takes time for web authors to notice new pages and link to them.

Document Matching

Funnelback supports two document matching approaches:

The bigger picture

As of version 12.0 there are more than 70 tunable parameters in the Funnelback ranking algorithm. Out-of-the-box, more than a dozen of these have non-zero values.

We strongly recommend the use of our tuning system, made accessible and easy-to-use via our administrator interface to achieve the best combination of settings for your data and your users. Manual configuration of the system is possible but often achieves poor results because a human tends to focus on too few test queries and too few of the ranking parameters. A massive change in the weight for a parameter may fix a ranking issue for one query but at the same time harm the ranking of a hundred others. In contrast, if our automated tuner is given a representative set of 100 query-best-answers combinations, it can systematically explore the space of possible parameter settings and optimise performance over the whole workload of the search installation.

Resorting the result list

Once the initial ranking has been derived as described above, results may be resorted if requested by either a query processor option (-sort) or a URL parameter. This resorting can be performed in ascending or descending order by:

In each of these cases the score of the document is used to break ties.

Sorting may be performed on top ranking documents (i.e. Those documents that are in the top 'tier') or performed for every 'tier' of results, using the -sortall query processor option. (Note that resorting will not move documents between 'tiers')

When sorting non-English strings in titles, URLs or metadata, you will need to specify the collation order which should apply. This is done by specifying the query processor option -lang= or the corresponding CGI parameter. For example -lang=jp or -lang=jp_JP to achieve sorting appropriate for Japanese. Collation orders and other language specifics are usually defined by the host operating system as 'locales'. To achieve correct sorting for a language the system administrator for the Funnelback servers may need to install a language pack for the operating system.

Collation rules for different languages are quite different. For example, in Czech (locale cs_CZ) accented versions of letters are grouped after the unaccented letter, and the 'ch' combination comes between h and i in the collation order.

Result set diversification

Same site suppression Same site suppression can be used (with the -SSS query processor option) where explicit resorting is not specified. In cases where a number of sites contribute results for a query, it is undesirable that the result ranking be flooded with results from a single source. Funnelback addresses this problem by down weighting additional results from the same site. Using the default settings the second result from a site receives 70% of its raw score, the third receives 58%, the fourth 50% and so on (inverse square root).

By default, same site suppression is enabled for web collections, using a setting of -SSS2. In this case sites are not distinguished below the second directory level. e.g. a.b.c/d/ would be considered a separate site from a.b.c but a.b.c/d/e/ is considered part of a.b.c/d/. Same site suppression can be disabled using a setting of (-SSS0).

The same site suppression penalty can be tuned using the -SameSiteSuppressionExponent query processor option, which at it's default setting of 0.5 creates the inverse square root relation. Changing this to 0.33 for example would change the penalty function to be an inverse cube root relation. The following diagram shows how successive documents from the same site will have their scores downweighted using several different SameSiteSupressionExponents as an example


Other forms of diversification Same site suppression was the first diversification tool provided by Funnelback and was introduced in around 2002. Since then a number of additional mechanisms have been introduced to assist in avoiding undesirable homogeneity in results returned. For example, documents can be pushed down the ranking

See the documentation of query processor options for details.

Customising Funnelback Ranking

Funnelback's default settings will work well for most organisations, but result quality can usually be improved by tuning Funnelback to suit your organisations particular information characteristics. Funnelback default settings have arisen out of optimisation experiments, using hundreds of queries and thousands of combinations of settings, using a gradient-based optimisation framework. As noted above, unless you know what you are doing, changing defaults on the basis of intuition is likely to make things worse!

Measuring result quality

Funnelback provides an interface for creating sets of test queries and expected answers, and can produce quality measurements for your site. The measurement system is an essential component of automated tuning.

Please be aware that the selection of test queries is crucial to producing good results, and simply selecting the most popular queries for a site introduces a bias which may substantially harm the result quality of less frequent queries if the set is used as the basis for tuning.

Changing scoring mode using -sco query processor option

Further Reading

Relevance ranking is a very complex subject and is the focus of ongoing research both by CSIRO and in other research laboratories around the world. If you would like to learn more about it than can be covered in this brief document, please check out:

See Also