Cooler ranking parameter reference
Background
This article provides additional detail on Funnelback’s set of ranking options (or cooler parameters).
Constraints
Every query submitted to Funnelback imposes a number of constraints on the search request.
Constraints are what you are asking from Funnelback: your search terms (i.e. words), any metadata restrictions (which includes facets), URL scope restrictions, etc. Funnelback offers a number of ways to formulate a question and retrieve data. Each individual thing you ask is a constraint that Funnelback will try and match.
Funnelback will partial match by default, so whilst it endeavours to return results that match all of your constraints, it will happily give you some that match only some of them (to encourage some results to return). Partial matching refers to results that match some of the constraints (don't confuse this with partially matching some of the query words). You can turn this off by setting the query processor option fmo=true, which means only fully matching results are returned (a toy sketch after the examples below illustrates the idea).
Some examples of constraints:
- Hello world: this is two words, so two constraints.
- "Hello world": the quotes make this a phrase, so one constraint.
- "Hello world" -planet: this is two constraints, the phrase and the negative 'planet' (i.e. you do NOT want that word to appear).
Sand and orsand parameters (items prefixed with a | in the query language) do not count as constraints for the purposes of partial matching in Funnelback.
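To make the constraint counting and partial-matching ideas above concrete, here is a toy Python sketch. It is purely illustrative of the logic described above, not how padre actually parses the query language, and the function names and result structure are invented for the example.

import re

def count_constraints(query):
    """Toy constraint counter: quoted phrases, negations and plain words
    each count as one constraint. Not padre's real query parser."""
    phrases = re.findall(r'"[^"]+"', query)          # "Hello world" -> 1 constraint
    rest = re.sub(r'"[^"]+"', ' ', query)
    terms = [t for t in rest.split() if not t.startswith('|')]  # | terms don't count
    return len(phrases) + len(terms)                 # -planet counts like any other term

def filter_results(results, total_constraints, fmo=False):
    """With fmo=True keep only fully matching results; otherwise keep partial matches too."""
    if fmo:
        return [r for r in results if r["matched"] == total_constraints]
    return [r for r in results if r["matched"] > 0]

docs = [{"url": "a.html", "matched": 2}, {"url": "b.html", "matched": 1}]
print(count_constraints('"Hello world" -planet'))    # 2
print(filter_results(docs, 2, fmo=True))             # only a.html survives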
Cooler rankings within Funnelback are the different signals (factors or dimensions are other words for it) that make up the rank or score of a document for any given set of search constraints.
Some of these signals are based on the constraints you provide; others are based on properties of the documents themselves, e.g. how many words are in the document, how long the URL is, etc.
Query processor options (QPOs) are the different configuration parameters you can pass to Funnelback’s query processor (padre). Cooler ranking options are a subset of the query processor options.
Cooler ranking options
The following is a tour of the more commonly used cooler ranking options. See: Padre cooler ranking options for a full list of options.
cool.0 - Content weight
How well does your search match the content of the document? This is a composite score of many different elements that take your search terms and metadata constraints and run fancy algorithms. Don't turn this one off, it matters.
cool.1 - On-site links
Links from within the same domain as this document where the anchor text (words that make up the hyperlink text) contains your constraints.
cool.2 - Off-site links
Links from other domains where the anchor text contains your constraints. In a normal world this would be trusted more than on-site, since these pages aren't generally written by the same people as the source document.
cool.3 - URL length
The length of the URL. Shorter URLs receive a higher ranking. This accounts for 45% of the score in the default algorithm.
cool.4 - QIE
Query independent evidence (QIE) enables an admin to up-weight or down-weight specific groups of content based on URL patterns. This ranking option controls the amount of influence given by QIE.
cool.5 - Date proximity
How close is the document's date to the current date (or to a reference date supplied via a query processor option if you need a specific date)? You need to have dates on your documents for this to work at all. The further the reference date is from the document's date (either in the past or future), the lower the score.
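As a rough illustration of the idea, the sketch below scores documents by how far their date sits from a reference date. The linear decay and the one-year window are invented for the example; they are not padre's actual date proximity curve.

from datetime import date

def date_proximity_score(doc_date, reference=None, window_days=365):
    """Toy date proximity: 1.0 when the dates match, decaying linearly to 0.0
    once the document is more than window_days away (past or future).
    The decay shape and window are assumptions, not padre's real formula."""
    reference = reference or date.today()
    distance = abs((doc_date - reference).days)
    return max(0.0, 1.0 - distance / window_days)

print(date_proximity_score(date(2024, 1, 1), reference=date(2024, 3, 1)))  # ~0.84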
cool.6 - URL attractiveness
Homepages are up-weighted. Copyright pages and URLs with lots of punctuation are down-weighted.
URL type - uses the following heuristics:
- Server default pages are really good, e.g. index.html or just /, except that the homepage of the principal server for an organisation should not be up-weighted. e.g. www.example.com/ is not a good answer to return in a search of the example org, but site.example.com/ is.
- Default pages for a directory (e.g. example.com/somepath/default.asp) are good, but not as good as server default pages.
- Filenames whose root matches privacy, copyright, disclaimer, feedback, legals, sitemap etc. should be discouraged.
The punctuation score is derived from the number of punctuation characters in the URL, e.g. slashes, dots, question marks, ampersands etc. A toy sketch of these heuristics follows.
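Here is a minimal Python sketch of the heuristics just described. The numeric weights, the principal_host parameter and the exact scoring are invented for illustration; only the general shape (default pages good, boilerplate filenames bad, punctuation penalised) comes from the description above.

import re
from urllib.parse import urlparse

DISCOURAGED = {"privacy", "copyright", "disclaimer", "feedback", "legals", "sitemap"}

def url_attractiveness(url, principal_host="www.example.com"):
    """Toy version of the cool.6 heuristics; the numeric weights are made up."""
    parsed = urlparse(url)
    path = parsed.path
    score = 0.5
    filename = path.rstrip("/").rsplit("/", 1)[-1]
    root = filename.split(".")[0].lower()
    if path in ("", "/") or root == "index":
        # Server default pages are good, unless it is the principal homepage.
        score += 0.1 if parsed.netloc == principal_host and path in ("", "/") else 0.4
    elif root == "default":
        score += 0.3                     # directory default pages: good, but less so
    if root in DISCOURAGED:
        score -= 0.4                     # privacy/copyright/sitemap pages discouraged
    punctuation = len(re.findall(r"[/.?&=_-]", url))
    score -= 0.02 * punctuation          # more punctuation, lower score
    return max(0.0, min(1.0, score))

print(url_attractiveness("https://site.example.com/"))            # high
print(url_attractiveness("https://www.example.com/legals.html"))  # low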
cool.7 - Annotation weight (ANNIE)
ANNIE, or the annotation indexing engine, was a relatively late addition to the core code. It processes the anchor data into a separate index and can offer improvements. It needs to be enabled in the indexer options first; it is not used by default, so in 99% of cases ignore it.
cool.8 - Domain weight
No longer in use - superseded by the host ranking features (cool.23 to cool.28).
cool.9 - Proximity to origin
How close the document's geospatial coordinates are to the origin (closer is better).
cool.10 - Non-binariness
Is the document a binary file (pdf, docx, etc.)? If yes, it gets a 0. If no (i.e. it's a textual file) it gets a 1.
cool.11 - Adverts (no_ads)
Checks for the presence of google_ad_client in the body of an HTML document. If yes, it gets a fat 0. If no, it gets a 1. Why? In the generic web crawl world, once upon a time the presence of paid Google ads was a negative indicator of a bad actor site. This is questionable now. Do not use.
cool.12 - Implicit phrase matching (imp_phrase)
Boosts documents where your input appears as a phrase in the content. If you provide multiple term constraints (i.e. words) then any document where those words appear right next to one another gets a bonus score. E.g. you search for Funnelback search: a document containing Funnelback search engine gets a boost, while another document containing 'Funnelback is a search engine' gets 0.
cool.13 - Consistency of evidence
Looks for consistency between content and annotation evidence. Checks cool.0 and cool.7; if both are non-zero you get a 1, otherwise a 0. Since cool.7 is not on by default, this does nothing unless you turn it on.
cool.14 - Logarithm of annotation weight
Use this instead of cool.7 if cool.7 gives some extreme weighting differences. This is similar to cool.7 but the values for all documents are converted to their logarithm. Why logarithm? Because doing so removes skewing from the data, i.e. if you had some extremely large or small cool.7 scores, this removes the impact of those outliers.
cool.15 - Absolute normalised logarithm annotation weight
This un-skews the data even more than cool.14, so there is even less variability from outliers. Use it if even cool.14 is giving crazy weight differences. As above, but rather than running log(absolute_annie) it first normalises the scores (converting them to a range of 0.0 to 1.0), so log(norm(annie)).
cool.16 - Annotation rank
Ignore this as it doesn't do anything.
cool.17 - Field-weighted Okapi score
Ignore this as it doesn't do anything.
cool.18 - Absolute normalised Okapi
This is the normalised score of BM25, which is a well-known algorithm for calculating how relevant a document is against some textual search constraints. We calculate it, and this option normalises it to minimise the impact of outliers. If you really want to know more about BM25 see: https://en.wikipedia.org/wiki/Okapi_BM25. TL;DR a type of text relevance score, can be useful for content weighting issues.
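For reference, here is a minimal Python implementation of the textbook BM25 formula as given on the Wikipedia page linked above. It is not padre's field-weighted variant, and the k1 and b values are the commonly used defaults rather than anything Funnelback-specific.

import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Textbook Okapi BM25; not padre's field-weighted or normalised variant."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        n = doc_freq.get(term, 0)                     # documents containing the term
        if n == 0:
            continue
        idf = math.log((num_docs - n + 0.5) / (n + 0.5) + 1)
        tf = doc_terms.count(term)                    # term frequency in this document
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score

doc = "funnelback is a search engine".split()
print(bm25_score(["funnelback", "search"], doc, {"funnelback": 3, "search": 40}, 100, 20.0))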
cool.19 - Field-weighted Okapi rank
Ignore this as it doesn't do anything.
cool.20 - Main hosts bias
Pretty simple: if the domain of this document is the 'www' domain you get a big 1. If not, you get 0.
cool.21 - Meta component weight
The influence provided by the meta component weightings. The relative component weightings are set in the index.sdinfo file.
cool.22 - Document number in the crawl
Actually works in any collection type, not just web. Gives you a score based on your document number in the index. Literally first come, first served. The first document gets 1.0, the last gets 0.0.
The next few cooler options are host-based scores, which means they are based on properties of the document's domain. All documents in the same domain obviously get the same boost. Sub-domains are considered different domains!
cool.23 - Host incoming link score
This boosts based on how many links in the overall index point at this domain. More links = more boost.
cool.24 - Host click score
How many result clicks have documents inside this domain received? The more clicks, the more boost. A measure of the popularity of the domain over time, which can be useful. Requires click tracking to be active.
cool.25 - Host linking hosts score
The count of how many other domains have at least 1 link to this domain. The higher the count, the more boost.
cool.26 - Host linked host score
How many domains does this domain have at least 1 link to? The higher the count, the more boost.
cool.27 - Host rank in crawl
When was this domain first encountered in the crawl? Only applies to web collections. First come, first served. Naturally this has a lot to do with your start URLs.
cool.28 - Domain shallowness
The more sub-domain levels a host has, the lower the boost, i.e. more dots in your domain name means less score. So domain.example.com gets more boost than subdomain.domain.example.com. Shallowness means how close the host is to the top of the domain.
This is the end of the host-specific scores.
cool.29 - Document matches regex
Quite neat! You set a URL regex via the indexer option -doc_feature_regex=<regex> and if a document matches it gets a big 1, otherwise 0. The regex can match things like any URL with /news/ in it, for instance.
cool.30 - Document does NOT match regex
The opposite of cool.29, equally handy.
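As an illustration of the on/off nature of these two options, the snippet below shows the kind of 1-or-0 feature they produce. The scoring function is invented for the example; in Funnelback itself the regex is supplied via the -doc_feature_regex indexer option described above.

import re

def regex_features(url, pattern=r"/news/"):
    """Toy illustration of cool.29 / cool.30: 1 if the URL matches the regex, 0 if not,
    and the inverse for the 'does NOT match' variant."""
    matches = 1 if re.search(pattern, url) else 0
    return {"cool.29": matches, "cool.30": 1 - matches}

print(regex_features("https://example.com/news/2024/story.html"))  # {'cool.29': 1, 'cool.30': 0}
print(regex_features("https://example.com/about.html"))            # {'cool.29': 0, 'cool.30': 1}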
The next set of cooler options, from cool.31 all the way up to cool.66, is actually only 9 options that each repeat 4 times. The 4 variants (sketched in code after the list) are:
- Without any modifier, it is the score normalised to a scale of 0 to 1. This normalisation reduces the impact of outliers.
- The _neg modifier simply means the negative or inverse. So what was the lowest number is now the highest, and vice versa. It turns the first variant on its head, basically.
- The _abs modifier means absolute. Rather than normalising the scale, it is the literal count for this option. That means big outliers can have big impacts.
- The _abs_neg modifier means absolute negative. So it is the _abs variant turned on its head.
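Here is a small Python sketch of how the 4 variants relate to each other for a single raw signal, such as a per-document word count. The min-max normalisation is an assumption used for illustration; the text above only says the unmodified variant is scaled to 0-1 to reduce the impact of outliers.

def variant_scores(raw_counts):
    """Toy illustration of the 4 variants for one signal (e.g. title word counts).
    Min-max scaling is assumed here purely for illustration."""
    lo, hi = min(raw_counts), max(raw_counts)
    span = (hi - lo) or 1
    norm = [(c - lo) / span for c in raw_counts]
    return {
        "plain":    norm,                      # normalised 0..1
        "_neg":     [1 - s for s in norm],     # inverse of the normalised score
        "_abs":     raw_counts,                # the literal counts, outliers included
        "_abs_neg": [-c for c in raw_counts],  # the literal counts turned on their head
    }

print(variant_scores([3, 8, 12]))   # e.g. title word counts for three documents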
Instead of going from 31 to 66, I'll define the 9 parameters that cover the 9 signals from cool.31 to cool.66. These can be manually set and can be quite useful, especially the title words. The variants exist to be used based on your requirements; the normalised variants have less dramatic impacts than the absolute ones. A toy sketch of a few of these signals follows the list.
- Title words: Count of the number of words in the title of the document.
- Content words: Count of words indexed for this document. By default, longer documents get more reward (if you use the _neg variants it's shorter docs).
- Compression factor: Compressibility of the document. I hear the 'huh?'; the content is quite literally compressed inside the index and we calculate how much we were able to compress it. By default, documents that are less compressible have fewer repeating segments, so are rewarded more.
- Document entropy: More huh? This one is slightly more complex and uses a formula by some gents called Bendersky & Croft. In layman's terms, it calculates how predictable the text is based on how often words reoccur. The default rewards unpredictable text as it is likely more 'unique' content.
- Stopword fraction: Stop words are words like 'the', 'in', 'not' in English - extremely frequent words that add no real value from a retrieval perspective. By default, documents with fewer stop words are rewarded.
- Stop word cover: How many different stop words does the document use? By default, the fewer distinct stop words the document has, the better.
- Average term length: How long are the words in the document on average? Documents with longer words get more reward by default.
- Distinct words: How many unique words does the document contain? The more unique words, the more reward by default.
- Maximum term frequency: For each document, take the count of its most frequently occurring word. The lower that count, the more reward by default.
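To make a few of these signals concrete, here is a toy Python sketch that computes some of them for a piece of text. The word-entropy calculation is the standard Shannon entropy over term frequencies, used as a stand-in for the Bendersky & Croft measure, and the tiny stop word list is illustrative only; none of this is padre's actual code.

import math
from collections import Counter

STOP_WORDS = {"the", "in", "not", "is", "a", "of", "and"}   # illustrative subset only

def document_signals(text):
    """Toy versions of a few of the cool.31-66 signals; not padre's actual formulas."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "content_words": total,
        "distinct_words": len(counts),
        "stopword_fraction": sum(counts[w] for w in counts if w in STOP_WORDS) / total,
        "stopword_cover": len(STOP_WORDS & counts.keys()),
        "average_term_length": sum(len(w) for w in words) / total,
        "max_term_frequency": max(counts.values()),
        "word_entropy": entropy,   # Shannon entropy as a stand-in for Bendersky & Croft
    }

print(document_signals("Funnelback is a search engine and the search is fast"))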
cool.67 - Lexical span
One of my favourites. In function it is related to cool.12 'implicit phrase'. Where cool.12 is a binary 1 or 0 reward (the words either appear as a phrase or do not), this is a more granular reward system. Similarly to cool.12, it only takes effect if you enter multiple words. Lexical span measures 'how close together are the words in the text?'. As an example, you search for 'Funnelback search'. If the words appear as a phrase, the lexical span is 0. If the words appear as 'Funnelback is a search engine', the lexical span for 'Funnelback search' is 2 because there are 2 words between the words you are looking for. Can be quite useful! A small sketch follows.
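Below is a minimal Python sketch of the idea: find the tightest window in the document that contains all the query words and count the extra words inside it, so a span of 0 corresponds to the implicit phrase case. This illustrates the concept as described above, not padre's implementation.

def lexical_span(query_words, doc_words):
    """Smallest number of extra words inside a window containing every query word,
    or None if the document doesn't contain them all. Toy illustration of cool.67."""
    query = [w.lower() for w in query_words]
    doc = [w.lower() for w in doc_words]
    best = None
    for start in range(len(doc)):
        needed = set(query)
        for end in range(start, len(doc)):
            needed.discard(doc[end])
            if not needed:
                span = (end - start + 1) - len(query)   # extra words inside the window
                best = span if best is None else min(best, span)
                break
    return best

print(lexical_span(["Funnelback", "search"], "Funnelback search engine".split()))        # 0
print(lexical_span(["Funnelback", "search"], "Funnelback is a search engine".split()))   # 2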
The following 4 cooler options come in pairs, are all manually set and allow you to up-weight or down-weight gscopes.
cool.68 - Document matches cgscope1
If cgscope1=<your_gscope> is set, any document matching this gscope is rewarded.
cool.69 - Document matches cgscope2
Same as above but allows you to use cgscope2 in your request (i.e. you can use 2 at once).
cool.70 - Document does not match cgscope1
The opposite of cool.68.
cool.71 - Document does not match cgscope2
The opposite of cool.69.
cool.72 - Raw ANNIE
The raw (absolute) ANNIE score for a document, see cool.7 for details. Has a more dramatic impact than cool.7.
That concludes all the cooler options that currently exist within Funnelback.
The default algorithm uses -cool.0=41 -cool.2=14 -cool.3=45. That's it, so use tuning and tweak where it makes sense. A sketch of how such weights could combine follows.
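To illustrate how a handful of weighted signals might combine into a final rank, here is a toy sketch. The assumption that the final score is a weighted sum of per-signal scores in the 0-1 range is made purely for illustration; it is not a description of padre's actual scoring code.

# Toy illustration only: assumes each cooler signal yields a 0..1 score per document
# and that the final score is their weighted sum using the default-style weights.
DEFAULT_WEIGHTS = {"cool.0": 41, "cool.2": 14, "cool.3": 45}

def combined_score(signal_scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-signal scores; the combination rule is an assumption."""
    return sum(weights.get(name, 0) * score for name, score in signal_scores.items())

doc_a = {"cool.0": 0.9, "cool.2": 0.1, "cool.3": 0.6}   # strong content match, longer URL
doc_b = {"cool.0": 0.5, "cool.2": 0.8, "cool.3": 0.9}   # weaker content, short URL, many off-site links

print(combined_score(doc_a))   # 65.3
print(combined_score(doc_b))   # 72.2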