Funnelback logo

Documentation

CATEGORY

Funnelback Ranking Algorithms

Introduction

More effective search delivers considerable value to searchers, by finding information and services which would otherwise remain hidden or by saving time. An ideal search engine would find all useful information and services within the domain being searched, return them to the searcher in order of decreasing value to the searcher (taking into account overlap with what has already been presented earlier in the list) and present each result in a way which facilitates assessment of its value.

This document describes how out-of-the-box Funnelback ranks documents, explains how to modify settings to meet particular needs and links to documentation on monitoring and measuring effectiveness of search at your site.

Funnelback, like other leading search engines, is based on statistical algorithms and "understands" neither the query nor the content of documents. Accordingly, evaluation of the quality of search should be based on a large number (50+) of queries.

Initial ranking

If you are not yet familiar with Funnelback's query language, please refer to the user help pages on the query language.

Funnelback assigns each document a utility score derived from the factors listed below and records how many of the constraints implicit in the query (in simple queries, each query word counts as a constraint) it satisfied. Results are then sorted into 'tiers' by decreasing number of constraints matched. Results within these 'tiers' are by default listed in decreasing order by score but may be modified to be ranked on numerous other bases (e.g. alphabetical order, date, etc...)

Document content

The particular relevance formula used is a slightly modified form of the Okapi BM25 function developed by Steve Robertson and Steve Walker of City University, London. BM25 embodies the following heuristics:

  • The more occurrences of a query word in a document the better.
  • Shorter documents are preferred to long ones.
  • Query words which occur in fewer documents (lower document frequency) are more discriminating and receive higher weight.

By default, document content includes text plus title and subject, keywords or description metadata. In document content scoring, each word occurrence counts equally, whether in the body of the document or in the title or in metadata, and the document frequency is a count of all documents which contain at least one occurrence in any of these places.

Document metadata

The Okapi BM25 formula is also used for scoring metadata. Metadata explicitly comes into play when a query term includes a metadata field specification but also implicitly when a simple query term is processed. For example, if the query were library, a document content score would be calculated as described in the previous item, but it would be augmented by the scores for occurrences of the query word in the title of the document (i.e. it would be as if the query term t:library was added to the query). Note that this process may result in substantial score boosts to documents containing library in their title due to lower document frequency. (i.e. there may be thousands of documents containing library but probably only a few contain the word in their title) It is possible to configure Funnelback to specify a different set of implicit metadata fields (see below.) The implicit metadata terms included in each query by default are:

  • Title metadata (metadata class t): Words that occur in the 'title' section of each document
  • Incoming anchor text (metadata class k): Funnelback collects together all the anchor text (the words you click on in your browser in order to follow a link) and associates them with the target document as k: metadata. In a web site, anchor text can provide a very strong signal that a certain page is the best answer to a particular query. For example, thousands of pages may contain the word library but probably only a very small percentage contain the word in their k: metadata. Of the pages which do, the k: metadata for the main library home page may contain hundreds more occurrences than any other page. Anchor text scoring uses a different function which removes the length heuristic and increases the weight given to documents with very large numbers of occurrences of query words in the incoming anchor text.
  • Click data (metadata class K): Funnelback can be set up to take into account the number of times users have clicked on a particular resource when using Funnelback search. Every click counts as an occurrence of the search terms used in the k: metadata class. Using click data can improve search results by leveraging the users ability to discern what are and aren't important results. This means that results that are clicked on more will be boosted in the rankings.

Onsite and Offsite inlink count

Anchor text provides strong query-dependent evidence that some pages are more important than others which also match the query. Sometimes it is useful to take into account query-independent evidence. This might be the case if the site used poor anchor text (e.g. "click here") or if the query did not use the same language as the anchor text (e.g. the query were books rather than library.) Funnelback counts up the number of incoming links and interprets them as votes that this page is useful. The count distinguishes links from the same site and those from outside and weights them differently.

URL length

Generally, important pages (particularly site home pages) have shorter URLs. Funnelback derives a score based on URL length - the shorter the URL the higher the score.

Query independent evidence (QIE)

Evidence can be used that ranks a site's importance regardless of the query that has been issued. This is useful for manually boosting the rankings of important sites or pages. Careless use of this can significantly harm result quality though. More information is available in a separate document on query independent evidence.

Recency

Recent information may be more valuable than old. For example, people may prefer to read the 2003 annual report than the one for 2002. Furthermore, link-based measures such as anchor text and in-link counts are biased against new pages because it takes time for web authors to notice new pages and link to them.

Resorting and Same Site Suppression

Once the initial ranking has been derived as described above, results may be resorted if requested by either a query processor option (-sort) or a CGI parameter. This resorting can be performed in ascending or descending order by:

  • Date of the document (Note not all documents have date information included in them so use this option with care)
  • Title of the document
  • Size of the document (In bytes)
  • URL of the document
  • Collection that the document has come from
  • By a specified metadata class.

In each of these cases the score of the document is used to break ties.

Sorting may be performed on top ranking documents (i.e. Those documents that are in the top 'tier') or performed for every 'tier' of results, using the -sortall query processor option. (Note that resorting will not move documents between 'tiers')

Same site suppression can be used (with the -SSS query processor option) where explicit resorting is not specified. In cases where a number of sites contribute results for a query, it is undesirable that the result ranking be flooded with results from a single source. Funnelback addresses this problem by down weighting additional results from the same site. Using the default settings the second result from a site receives 70% of its raw score, the third receives 58%, the fourth 50% and so on (inverse square root).

By default, same site suppression is enabled for web collections, using a setting of -SSS2. In this case sites are not distinguished below the second directory level. e.g. a.b.c/d/ would be considered a separate site from a.b.c but a.b.c/d/e/ is considered part of a.b.c/d/. Same site suppression can be disabled using a setting of (-SSS0).

The same site suppression penalty can be tuned using the -SameSiteSuppressionExponent query processor option, which at it's default setting of 0.5 creates the inverse square root relation. Changing this to 0.33 for example would change the penalty function to be an inverse cube root relation. The following diagram shows how successive documents from the same site will have their scores downweighted using several different SameSiteSupressionExponents as an example

SSSexponent.PNG

Customising Funnelback Ranking

Funnelback's default settings will work well for most organisations, but result quality can usually be improved by tuning Funnelback to suit your organisations particular information characteristics. Funnelback default settings have arisen out of optimisation experiments, using hundreds of queries and thousands of combinations of settings, using a gradient-based optimisation framework. Unless you know what you are doing, changing defaults on the basis of intuition is likely to make things worse! If you do decide to change things, please remember that in most web sites, anchor text, in-links and URL structure provide the best evidence of which are the best answers among the hundreds of matching documents. In general, the more web pages are indexed the stronger the link and anchor evidence in favour of key sites. By contrast, the more web pages are indexed the more likely that query words will be found in content, title or metadata of pages which are not key sites.

Measuring result quality

Funnelback provides an interface for creating sets of test queries and expected answers, and can produce quality measurements for your site. Please be aware that the selection of test queries is crucial to producing good results, and simply selecting the most popular queries for a site introduces a bias which may substantially harm the result quality of less frequent queries if the set is used as the basis for tuning.

Please see the test.data editing document for details on creating and editing test data.

Changing scoring mode using -sco query processor option

  • The default setting for the Funnelback scoring mode is scoring mode 2 (-sco2) and it results in the behaviour described above.
  • You can augment this mode to include a URL metadata term as well in your searches by using the -uv query processor option.
  • You can also turn off all the implicit metadata and anchor text scoring by selecting scoring mode 0 (-sco0).
  • Scoring mode 6 (-sco6) removes the bias in favour of shorter documents.
  • Scoring mode 7 (-sco7) allows you to specify the metadata terms that will be implicitly included in each query. -sco7kKt for example would have very similar effect to -sco2 but -sco7nopq would include source, coverage, publisher, and function instead of anchor text, title and URL.

Changing COOLER scoring

The elements that make up the acronym "COOLER" are shown in the table below:

Element CGI Parameter Default
Content (implicitly includes metadata classes) cool0 0.41
Onsite in-links cool1 0
Offsite in-links cool2 0.14
Length of URL cool3 0.45
External evidence (query independent evidence) cool4 0
Recency cool5 0
urltype: URL attractiveness (Homepages favoured. Copyright pages and URLS with lots of punctuation deprecated.) cool6 0
mainhosts: bias in favour of principal servers (web search only). cool20 0
Comp_wt: component collection weighting. (meta collections only). cool21 0

You can change the relative weightings for these factors by specifying -cooler followed by six integers as a query processor option. For example:

-cooler 10 0 0 80 10 0 

would give a small weight to content, no weight to in-links , very heavy weight to URL length, a small weight to query independent evidence and no weight to recency. The integer values are normalized to the range 0-1, with the largest value specified receiving a value of 1.

Cooler values can be set individually using -cool<n>, e.g. -cool5=0.5 changes the recency weigth to 0.5. The same effect is achieved by the CGI parameter assignment '&cool5=50'. Note that the CGI values are interpreted as percentages.

Before a query is processed, the raw weights for the cool values are internally normalised to sum to 1.0.

Cooler scoring is enabled by default but can be turned off using the query processor option -nocool.

Additional Ranking Options

A number of other query processor options may be used to alter the way Funnelback ranks results. For example -wmeta k option allows you to increase or decrease the weight of anchor text relative to other content. -wmeta k 0.5 is the default - you can either increase or decrease it.

Some of these optional parameters are listed in the following table:

Ranking Factor Query Processor Option CGI Parameter Default
Title -wmeta t <weight> wmeta_t 1
Description -wmeta c <weight> wmeta_c 1
Subject -wmeta s <weight> wmeta_s 1
Anchortext -wmeta k <weight> wmeta_k 0.5
Click data -wmeta K <weight> wmeta_K 0.5

Further Reading

Relevance ranking is a very complex subject and is the focus of ongoing research both by CSIRO and in other research laboratories around the world. If you would like to learn more about it than can be covered in this brief document, please check out:

See Also

top ⇑