Customising Funnelback: configuring ranking
Funnelback's ranking algorithm determines which results are retrieved from the index and how they are ordered by relevance.
The ranking of results is a complex problem, influenced by a multitude of document attributes. It's not just about how many times a word appears within a document's content.
- Ranking options are a subset of the query processor options which also control other aspects of query time behaviour (such as display settings).
- Ranking options are applied at query time. This means that different services and profiles can apply different ranking settings to the same index. Ranking options can also be changed via CGI parameters at the time the query is submitted.
Tuning is a process that can be used to determine which attributes of a document are indicative of relevance and adjust the ranking algorithm to match these attributes.
Tuning requires the specification of a set of queries and best answers that are used as a training data set to optimise the ranking algorithm.
See: automated tuning
Setting ranking indicators
Funnelback has an extensive set of ranking parameters that influence how the ranking algorithm operates.
This allows for customisation of the influence provided by 73 different ranking indicators.
Note: Where possible, use automated tuning to set ranking influences. Manually altering influences can fix one specific problem at the expense of ranking for the rest of the content.
The main ranking indicators are:
- Content: This is controlled by the cool.0 parameter and is used to indicate the influence provided by the document's content score.
- On-site links: This is controlled by the cool.1 parameter and is used to indicate the influence provided by the links within the site. This considers the number and text of incoming links to the document from other pages within the same site.
- Off-site links: This is controlled by the cool.2 parameter and is used to indicate the influence provided by the links outside the site. This considers the number and text of incoming links to the document from external sites in the index.
- Length of URL: This is controlled by the cool.3 parameter and is used to indicate the influence provided by the length of the document's URL. Shorter URLs generally indicate a more important page.
- External evidence: This is controlled by the cool.4 parameter and is used to indicate the influence provided via external evidence (see query independent evidence below).
- Recency: This is controlled by the cool.5 parameter and is used to indicate the influence provided by the age of the document. Newer documents are generally more important than older documents.
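The mapping between the main indicators and their cool.N parameters can be captured in a small lookup. The following Python sketch is illustrative only: the INDICATORS dictionary and the build_options helper are hypothetical names, the weights are arbitrary example values rather than recommendations, and the output uses the -cool.N=value form in which query processor options are written.

```python
# Hypothetical lookup: main ranking indicators and their cool.N parameters.
INDICATORS = {
    "content": "cool.0",
    "onsite_links": "cool.1",
    "offsite_links": "cool.2",
    "url_length": "cool.3",
    "external_evidence": "cool.4",
    "recency": "cool.5",
}


def build_options(weights):
    """Return an option string such as '-cool.1=0.7 -cool.5=0.3'."""
    parts = []
    for name, weight in weights.items():
        if name not in INDICATORS:
            raise KeyError(f"unknown ranking indicator: {name}")
        parts.append(f"-{INDICATORS[name]}={weight}")
    return " ".join(parts)


# Example weights only; automated tuning is the recommended way to choose values.
print(build_options({"onsite_links": 0.7, "recency": 0.3}))
```

Keeping the names in one place makes a manually edited option string easier to review, although it does not replace automated tuning.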
Applying ranking options
Ranking options are applied in one of three ways:
- Set as a default for the collection by adding the ranking option to the query_processor_options parameter in the collection's configuration. This can be done via the 'Edit Collection Configuration' screens on the admin home page.
- Set as a default for the profile by adding the ranking option to the list of options defined in the profile's padre_opts.cfg. This can be done by editing the padre_opts.cfg file within the relevant profile folder directly, or by editing padre_opts.cfg for the relevant profile from the file manager screen in the administration interface for the collection.
- Set at query time by adding the ranking option as a CGI parameter. This is a good method for testing but should be avoided in production unless the ranking factor needs to be dynamically set for each query, or set by a search form control such as a slider.
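For the query-time method, a test URL can be assembled programmatically. The sketch below is a minimal illustration under several assumptions: the host, the /s/search.html path, the collection and query parameter names, and the collection name are placeholders, and it assumes the CGI parameter name is the ranking option name without its leading dash. Check these against your own installation before relying on them.

```python
from urllib.parse import urlencode

# Assumed placeholder endpoint; substitute the search URL of your own installation.
BASE_URL = "https://search.example.com/s/search.html"


def search_url(query, collection, ranking=None):
    """Build a search URL, appending any ranking options as CGI parameters."""
    params = {"collection": collection, "query": query}
    params.update(ranking or {})  # e.g. {"cool.5": 0.3}
    return f"{BASE_URL}?{urlencode(params)}"


# Hypothetical collection name and an arbitrary example weight for recency.
print(search_url("enrolment dates", "example-collection", {"cool.5": 0.3}))
```

Comparing the results for the same query with and without the extra parameter is a quick way to see the effect of a single ranking option before committing it to a collection or profile default.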
Many ranking options can be set simultaneously, with the ranking algorithm automatically normalising all of the supplied ranking factors. E.g.
query_processor_options=-stem=2 -cool.1=0.7 -cool.5=0.3 -cool.21=0.24
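When reviewing an option string like the one above, it can help to split it into individual settings. This Python sketch assumes every option is a whitespace-separated -name=value token, as in the example; parse_options is a hypothetical helper, not part of Funnelback.

```python
def parse_options(option_string):
    """Split '-name=value' tokens into a {name: value} dictionary of strings."""
    options = {}
    for token in option_string.split():
        name, _, value = token.lstrip("-").partition("=")
        options[name] = value
    return options


opts = parse_options("-stem=2 -cool.1=0.7 -cool.5=0.3 -cool.21=0.24")

# Keep only the ranking influences (cool.N) and convert them to numbers.
cool = {name: float(value) for name, value in opts.items() if name.startswith("cool.")}
print(cool)
```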
Automated tuning is the recommended way of setting these ranking parameters as it uses an optimisation process to determine the optimal set of factors. Manual tuning can produce a poorer overall result, because improving one particular search may negatively affect many other searches.
Meta collection component weighting
When different collections are combined into a meta collection it is often beneficial to weight the collections differently. This can be for a number of reasons, the main ones being:
- Some collections are simply more important than others. E.g. a university's main website is likely to be more important than a department's website.
- Some collection types naturally rank better than others. E.g. web collections generally rank better than other collection types as there is a significant amount of additional ranking information that can be inferred from attributes such as the number of incoming links, the text used in these links and page titles. XML and database collections generally have few attributes beyond the record content that can be used to assist with ranking.
Meta collection component weighting is controlled using the
By default, Funnelback tracks which results are clicked on by a user for any query that is run.
This information can be utilised by Funnelback to improve ranking over time by learning from this recorded user behaviour.
Result diversification and collapsing
There are a number of ranking options that are designed to increase the diversity of the result set. These options can be used to reduce the likelihood of result sets being flooded by results from the same website, collection etc.
Result collapsing can also be used to group together consecutive similar results.
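The idea of grouping consecutive similar results can be illustrated conceptually. The following Python sketch is not Funnelback's implementation: the collapse function, the result dictionaries, and the choice of the title as the similarity key are all assumptions made for illustration.

```python
from itertools import groupby


def collapse(results, key):
    """Group consecutive results that share a key; the first in each run represents it."""
    collapsed = []
    for _, run in groupby(results, key=key):
        run = list(run)
        collapsed.append({"result": run[0], "collapsed_count": len(run) - 1})
    return collapsed


# Hypothetical ranked result list.
results = [
    {"url": "https://a.example.com/report-v1", "title": "Annual report"},
    {"url": "https://a.example.com/report-v2", "title": "Annual report"},
    {"url": "https://b.example.com/contact", "title": "Contact us"},
]

for item in collapse(results, key=lambda r: r["title"]):
    print(item["result"]["url"], item["collapsed_count"])
```

Because only consecutive results are grouped, two similar results separated by a different one remain separate entries in this sketch.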
It is often desirable to upweight (or downweight) a search result when search keywords appear in specified metadata fields. Funnelback provides ranking options to set the individual metadata fields to consider and the relative weightings to apply.
Query independent evidence
Query independent evidence (QIE) allows certain pages or groups of pages within a website (based on a regular expression match to the document's URL) to be upweighted or downweighted without any consideration of the query being run.
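Conceptually, QIE assigns a weight to a document based only on its URL. The Python sketch below illustrates that idea and nothing more: the rule list, the weights, the default value, and the qie_weight function are hypothetical and do not represent Funnelback's configuration format or scoring.

```python
import re

# Hypothetical rules: (URL pattern, weight). The weights are arbitrary example values.
QIE_RULES = [
    (re.compile(r"/news/archive/"), 0.1),  # downweight archived news
    (re.compile(r"/courses/"), 0.9),       # upweight course pages
]
DEFAULT_WEIGHT = 0.5  # assumed neutral value for URLs that match no rule


def qie_weight(url):
    """Return the weight of the first rule whose pattern matches the URL."""
    for pattern, weight in QIE_RULES:
        if pattern.search(url):
            return weight
    return DEFAULT_WEIGHT


print(qie_weight("https://www.example.edu/courses/physics"))
print(qie_weight("https://www.example.edu/about"))
```

The weight depends only on the URL, so it is the same for every query; how strongly it affects the final ranking is governed by the external evidence influence (cool.4) described earlier.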
Funnelback's SEO auditor tool can be used to investigate ranking for specific queries and URLs, and provides advice on how to improve the ranking of the document.
See: SEO auditor