Cooler ranking parameter reference


This article provides additional detail on Funnelback’s set of ranking options (or cooler parameters).


Every query submitted to Funnelback imposes a number of constraints on the search request.

Constraints are what you are asking from Funnelback: your search terms (i.e. words), any metadata restrictions (which includes facets), URL scope restrictions, etc. Funnelback offers a number of ways to formulate a question and retrieve data. Each individual thing you ask is a constraint that Funnelback will try and match.

Funnelback will partial match by default, so whilst it endeavours to return things that match all of your constraints, it will happily give you some that match only some of them (to encourage some results to return). Partial matching refers to results that match some of the constraints (don’t confuse this with partially matching some of the query words). You can turn this off by setting a query processor option of fmo=true, which means only fully matching results are returned.

Some examples of constraints:

  • Hello world: this is two words, so two constraints.

  • "Hello world": the quotes make this a phrase, so one constraint.

  • "Hello world" -planet: this is two constraints, the phrase and the negative 'planet' (i.e. you do NOT want that word to appear)
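A rough sketch of those counting rules (illustrative only, not Funnelback’s actual query parser):

```python
import re

def count_constraints(query):
    """Phrases in quotes count once each; every remaining bare word
    (including negated words prefixed with -) counts once.
    Items prefixed with | (orsand) do not count as constraints."""
    phrases = re.findall(r'"[^"]*"', query)
    rest = re.sub(r'"[^"]*"', ' ', query)
    words = [w for w in rest.split() if not w.startswith('|')]
    return len(phrases) + len(words)

print(count_constraints('Hello world'))            # 2
print(count_constraints('"Hello world"'))          # 1
print(count_constraints('"Hello world" -planet'))  # 2
print(count_constraints('hello |world'))           # 1
```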

Sand and orsand parameters (items prefixed with a | in the query language) do not count as constraints for the purposes of partial matching in Funnelback.

Cooler rankings within Funnelback are the different signals (factors/dimensions are other words for it) that make up the rank or score of a document for any given search constraints.

Some of these signals are based on the constraints you provide, others are based on the properties of the documents themselves. E.g. how many words in the document, how long is the URL, etc…​

Query processor options (QPOs) are the different configuration parameters you can pass to Funnelback’s query processor (padre). Cooler ranking options are a subset of the query processor options.

Cooler ranking options

The following is a tour of the more commonly used cooler ranking options. See: Padre cooler ranking options for a full list of options.

cool.0 - Content weight

How well does your search match the content of the document? This is a composite score of many different elements that take your search terms and metadata constraints and run them through some fancy algorithms. Don’t turn this one off; it matters.

cool.1 - On-site links

Links from within the same domain as this document where the anchor text (words that make up the hyperlink text) contains your constraints.

cool.2 - Off-site links

Links from other domains where the anchor text contains your constraints. In a normal world this would be trusted more than on-site, since these pages aren’t generally written by the same people as the source document.

cool.3 - URL Length

The length of the URL. Shorter URLs receive a higher ranking. This accounts for 45% of the score in the default algorithm.

cool.4 - QIE

Query independent evidence (QIE) that enables an admin to up-weight or down-weight specific groups of content based on URL patterns. This ranking option controls the amount of influence given by QIE.

cool.5 - Date proximity

How close is the document’s date to the current date (or a reference date set via a query processor option if you need a specific date)? You need to have dates on your documents for this to work at all. The further the reference date is from the document’s date (either in the past or future), the lower the score.
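An illustrative sketch of a date-proximity signal. The decay shape here (1 / (1 + days)) is an assumption for illustration, not Funnelback’s actual formula; the key point is that distance in either direction reduces the score:

```python
from datetime import date

def date_proximity(doc_date, reference):
    # Past or future both count; score decays with distance from the reference.
    days_apart = abs((reference - doc_date).days)
    return 1.0 / (1.0 + days_apart)

ref = date(2024, 1, 1)
print(date_proximity(date(2024, 1, 1), ref))            # 1.0 (same day)
print(round(date_proximity(date(2023, 12, 2), ref), 3)) # 0.032 (30 days away)
```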

cool.6 - URL attractiveness

Homepages are up-weighted. Copyright pages and URLs with lots of punctuation are down-weighted.

URL type - uses the following heuristics:

  • Server default pages are really good, e.g. index.html or just /, except that the homepage of the principal server for an organisation should not be up-weighted: the organisation’s own homepage is not a good answer to return in a search of that organisation’s content, but other servers’ default pages are.

  • Default pages for a directory are good, but not as good as server default pages.

  • Filenames whose root matches privacy, copyright, disclaimer, feedback, legals, sitemap etc. are discouraged.

The punctuation score is derived from the number of punctuation characters in the URL, e.g. slashes, dots, question marks, ampersands etc.

cool.7 - Annotation weight (ANNIE)

ANNIE (annotation indexing engine) was a relatively late addition to the core code. It processes the anchor data into a separate index and can offer improvements. It needs to be enabled in the indexer options first; it is not used by default, so in 99% of cases ignore it.

cool.8 - Domain weight

No longer in use - superseded by the host ranking features (cool.23 to cool.28).

cool.9 - Proximity to origin

How close the document’s geospatial coordinates are to the origin (closer is better).

cool.10 - Non-binariness

Is the document a binary file (pdf, docx, etc…​)? If yes, it gets a 0. If no (i.e. it’s a textual file) it gets 1.

cool.11 - Adverts (no_ads)

Checks for the presence of google_ad_client in the body of an HTML document. If yes, it gets a fat 0. If no, it gets a 1. Why? In the generic web crawl world, once upon a time the presence of paid Google ads was an indicator of a bad-actor site. This is questionable now. Do not use.

cool.12 - Implicit phrase matching (imp_phrase)

Boosts documents where your input appears as a phrase in the content. If you provide multiple term constraints (i.e. words), then any document where those words appear right next to one another gets a bonus score. E.g. you search for Funnelback search: a document containing Funnelback search engine gets a boost, while another document containing 'Funnelback is a search engine' gets 0.
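A toy version of that check (illustrative only, not Funnelback’s code): reward 1 if all the query words appear consecutively and in order, else 0.

```python
def implicit_phrase(query_words, doc_text):
    doc_words = doc_text.lower().split()
    q = [w.lower() for w in query_words]
    n = len(q)
    # Slide a window of the query's length across the document.
    for i in range(len(doc_words) - n + 1):
        if doc_words[i:i + n] == q:
            return 1
    return 0

print(implicit_phrase(["Funnelback", "search"], "Funnelback search engine"))       # 1
print(implicit_phrase(["Funnelback", "search"], "Funnelback is a search engine"))  # 0
```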

cool.13 - Consistency of evidence

Looks for consistency between content and annotation evidence. Checks cool.0 and cool.7, if both are non-zero you get a 1, otherwise a 0. Since cool.7 is not on by default, this does nothing unless you turn it on.

cool.14 - Logarithm of annotation weight

Use this instead of cool.7 if cool.7 gives some extreme weighting differences. This is similar to cool.7 but the values for all documents are converted to their logarithm. Why logarithm? Because doing so removes skewing from the data, i.e. if you had some extremely large or small cool.7 scores, this removes the impact of those outliers.
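To see why the log helps, here is a tiny sketch with made-up annotation scores (not real Funnelback values): one outlier gives a 200x spread in the raw scores, but under 9x after the log, so it no longer dominates the signal.

```python
import math

raw = [2.0, 3.0, 5.0, 400.0]             # one extreme outlier
logged = [math.log(s) for s in raw]      # cool.14-style transform

print(max(raw) / min(raw))                   # 200.0
print(round(max(logged) / min(logged), 2))   # 8.64
```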

cool.15 - Absolute normalised logarithm annotation weight

This un-skews the data even more than cool.14, so even less variability from outliers. Use it if even cool.14 is giving crazy weight differences. As above, but rather than running log(absolute_annie), it first normalises the scores (converting them to a range of 0.0 to 1.0), so log(norm(annie)).

cool.16 - Annotation rank

Ignore this as it doesn’t do anything.

cool.17 - Field-weighted Okapi score

Ignore this as it doesn’t do anything.

cool.18 - Absolute normalised Okapi

This is the normalised score of BM25, a well-known algorithm for calculating how relevant a document is against some textual search constraints. We calculate it, and this option normalises it to minimise the impact of outliers. TL;DR: a type of text relevance score that can be useful for content weighting issues.
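For the curious, here is a minimal single-field BM25 sketch using the standard textbook formula with the usual k1 and b defaults (not Funnelback’s internal implementation):

```python
import math

def bm25(query_terms, doc, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d.split()) for d in corpus) / N
    words = doc.lower().split()
    dl = len(words)
    score = 0.0
    for term in query_terms:
        t = term.lower()
        df = sum(1 for d in corpus if t in d.lower().split())
        if df == 0:
            continue  # term appears nowhere; contributes nothing
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = words.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

corpus = ["funnelback is a search engine",
          "search ranking options",
          "a page about gardening"]
print(bm25(["search", "engine"], corpus[0], corpus) >
      bm25(["search", "engine"], corpus[2], corpus))  # True
```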

cool.19 - Field-weighted Okapi rank

Ignore this as it doesn’t do anything.

cool.20 - Main hosts bias

Pretty simple: if the domain of this document is the 'www' domain you get a big 1. If not, you get 0.

cool.21 - Meta component weight

The influence provided by the meta component weightings. The relative component weightings are set in the index.sdinfo file.

cool.22 - Document number in the crawl

Actually works in any collection type, not just web. Gives you a score based on your document number in the index. Literally first come, first served: the first document gets 1.0, the last gets 0.0.
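A sketch of that first-come-first-served mapping (the linear interpolation is an assumption for illustration):

```python
def doc_number_score(doc_number, total_docs):
    """Map document number (counting from 0) linearly onto 1.0 (first)
    down to 0.0 (last)."""
    if total_docs < 2:
        return 1.0
    return 1.0 - doc_number / (total_docs - 1)

print(doc_number_score(0, 100))   # 1.0 (first document indexed)
print(doc_number_score(99, 100))  # 0.0 (last document indexed)
```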

The next few cooler options are host based scores which means they are based on properties of the document domain. All documents in the same domain obviously get the same boost. Sub-domains are considered different domains!

cool.23 - Host incoming link score

This boosts based on how many links in the overall index point at the document’s domain. More links = more boost.

cool.24 - Host click score

How many result clicks have documents inside this domain received? The more clicks, the more boost. A measure of the popularity of the domain over time; can be useful. Requires click tracking to be active.

cool.25 - Host linking hosts score

The count of how many other domains have at least 1 link to this domain. The higher the count, the more boost.

cool.26 - Host linked host score

How many domains does this domain have at least 1 link to? The higher the count, the more boost.

cool.27 - Host rank in crawl

When was this domain encountered in the crawl? Only on web collections. First come, first served. Naturally has a lot to do with your start URLs.

cool.28 - Domain shallowness

The more sub-domains a host has, the lower the boost, i.e. more dots in your domain name means less score, so a top-level host gets more boost than a deeply nested sub-domain. Shallowness means how close the host is to the top of its domain.

This is the end of the host-specific scores.

cool.29 - Document matches regex

Quite neat! You set a URL regex via the indexer option -doc_feature_regex=<regex> and if a document matches it gets a big 1, otherwise 0. The regex can match things like any URL with /news/ in it for instance.
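A sketch of the matching logic (illustrative only; the real option is set at index time via -doc_feature_regex, and the URL and pattern here are just examples):

```python
import re

def doc_feature_regex(url, pattern):
    """1 if the URL matches the configured pattern, else 0."""
    return 1 if re.search(pattern, url) else 0

print(doc_feature_regex("https://example.com/news/today", r"/news/"))  # 1
print(doc_feature_regex("https://example.com/about", r"/news/"))       # 0
```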

cool.30 - Document does NOT match regex

The opposite of cool.29, equally handy.

The next set of cooler options, from cool.31 all the way up to cool.66, are actually only 9 options that each repeat 4 times. The 4 variants are:

  • Without any modifier: the score normalised to a scale of 0 to 1. This normalisation reduces the impact of outliers.

  • The _neg modifier simply means the negative or inverse: what was the lowest number is now the highest, and vice versa. Turns the first variant on its head, basically.

  • The _abs modifier means absolute. Rather than normalising the scale, it is the literal count for this option. That means big outliers can have big impacts.

  • The _abs_neg modifier means absolute negative: the _abs variant turned on its head.
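A toy sketch of the four variants applied to some made-up per-document counts. The exact arithmetic (particularly how the negative variants are inverted) is an assumption for illustration:

```python
def variants(counts):
    lo, hi = min(counts), max(counts)
    norm = [(c - lo) / (hi - lo) for c in counts]  # plain: normalised to 0..1
    neg = [1.0 - n for n in norm]                  # _neg: normalised, inverted
    abs_ = list(counts)                            # _abs: the literal counts
    abs_neg = [-c for c in counts]                 # _abs_neg: literal, inverted
    return norm, neg, abs_, abs_neg

norm, neg, abs_, abs_neg = variants([2, 5, 11])
print(norm)  # [0.0, 0.333..., 1.0]
print(neg)   # [1.0, 0.666..., 0.0]
```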

Instead of going from 31 to 66, I’ll define the 9 parameters instead, covering the 9 signals that span cool.31 to cool.66. These can be manually set and be quite useful, especially the title words. The variants exist to be used based on your requirements; the normalised variants have less dramatic impacts than the absolute ones.

  • Title words: Count of the number of words in the title of the document.

  • Content words: Count of words indexed for this document. By default, longer documents get more reward (with the _neg variants it’s shorter docs).

  • Compression factor: Compressibility of the document. I hear the 'huh?': the content is quite literally compressed inside the index and we calculate how much we were able to compress it. By default, documents that are less compressible have fewer repeating segments, so are rewarded more.

  • Document entropy: More 'huh?'. This one is slightly more complex and uses a formula by Bendersky & Croft. In layman’s terms, it calculates how predictable the text is based on how often words reoccur. The default rewards unpredictable text as it is likely more 'unique' content.

  • Stopword fraction: Stop words are words like 'the', 'in', 'not' in English: extremely frequent words that add no real value from a retrieval perspective. By default, documents with fewer stop words are rewarded.

  • Stop word cover: How many different stop words does the document use? By default, the fewer distinct stop words the document has, the better.

  • Average term length: How long are the words in the document on average? Documents with longer words get more reward by default.

  • Distinct words: How many unique words does the document contain? The more unique words, the more reward by default.

  • Maximum term frequency: For each document, calculates the most occurring words and then takes the highest of those. The lower that count is, the more reward by default.
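As a rough illustration of the document entropy signal above, one plausible reading is Shannon entropy over the document’s word distribution (the exact Bendersky & Croft formula may differ): varied, unpredictable text scores higher than repetitive text.

```python
import math
from collections import Counter

def word_entropy(text):
    """Shannon entropy (bits) of the word frequency distribution."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(round(word_entropy("the cat sat on the mat"), 2))  # varied words, higher
print(round(word_entropy("buy now buy now buy now"), 2)) # repetitive, lower
```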

cool.67 - lexical span

One of my favourites. In function it is related to cool.12 (implicit phrase). Where cool.12 is a binary 1-or-0 reward (the words either appear as a phrase or do not), this is a more granular reward system. Like cool.12, it only takes effect if you enter multiple words. Lexical span measures how close together the words are in the text. As an example, you search for 'Funnelback search'. If the words appear as a phrase, the lexical span is 0. If the words appear as 'Funnelback is a search engine', the lexical span for 'Funnelback search' is 2, because there are 2 words between the words you are looking for. Can be quite useful!

The following 4 cooler options come in pairs, are all manually set, and allow you to up- or down-weight gscopes.
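Stepping back to lexical span for a moment, a toy implementation might look like this (illustrative only): find the tightest window containing all the query words in order, and report how many extra words sit inside it.

```python
def lexical_span(query_words, doc_text):
    doc = doc_text.lower().split()
    q = [w.lower() for w in query_words]
    best = None
    for i in range(len(doc)):
        if doc[i] != q[0]:
            continue
        # Greedily match the remaining query words in order from here.
        k, last = 1, i
        for m in range(i + 1, len(doc)):
            if k < len(q) and doc[m] == q[k]:
                k, last = k + 1, m
        if k == len(q):
            span = (last - i + 1) - len(q)  # extra words inside the window
            best = span if best is None else min(best, span)
    return best  # None if the words never all appear in order

print(lexical_span(["funnelback", "search"], "funnelback search engine"))       # 0
print(lexical_span(["funnelback", "search"], "funnelback is a search engine"))  # 2
```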

cool.68 - Document matches cgscope1

If cgscope1=<your_gscope> is set, any document matching this is rewarded.

cool.69 - Document matches cgscope2

Same as above but allows you to use cgscope2 in your request (i.e. you can use 2 at once).

cool.70 - Document does not match cgscope1

The opposite of cool.68.

cool.71 - Document does not match cgscope2

The opposite of cool.69.

cool.72 - Raw ANNIE

The raw (absolute) ANNIE score for a document, see cool.7 for details. Has a more dramatic impact than cool.7.

That concludes all the cooler options that currently exist with Funnelback.

The default algorithm uses -cool.0=41 -cool.2=14 -cool.3=45. That’s it, so use tuning and tweak where it makes sense.
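As a sketch of how those weights might combine: assume each cooler signal is pre-normalised to 0..1 and the final score is a weighted sum. The weighted-sum form and the example signal values are assumptions for illustration; only the 41/14/45 weights come from the text above.

```python
WEIGHTS = {"cool.0": 41, "cool.2": 14, "cool.3": 45}  # the default algorithm

def final_score(signals):
    """Weighted sum of normalised signals; missing signals count as 0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

# A doc with a strong content match, few off-site links, and a short URL:
print(round(final_score({"cool.0": 0.9, "cool.2": 0.1, "cool.3": 0.8}), 2))  # 74.3
```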