Padre cooler ranking options

Every query submitted to Funnelback imposes a number of constraints on the search request.

Constraints are what you are asking from Funnelback: your search terms (i.e. words), any metadata restrictions (which includes facets), URL scope restrictions, etc. Funnelback offers a number of ways to formulate a question and retrieve data. Each individual thing you ask is a constraint that Funnelback will try and match.

Funnelback will partial match by default: whilst it endeavours to return results that match all of your constraints, it will also return results that match only some of them (so that some results are returned). Partial matching refers to results that match some of the constraints (don’t confuse this with partially matching some of the query words). You can turn this off by setting the query processor option fmo=true, which means only fully matching results are returned.
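The difference between partial and full matching can be sketched as below. This is only an illustration, assuming each result records how many of the query's constraints it satisfied. The Result type and filter_results function are hypothetical and are not part of Funnelback; only the fmo option name comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    constraints_matched: int  # how many query constraints this document satisfied


def filter_results(results, total_constraints, fmo=False):
    """Keep partially matching results, or only full matches when fmo is True."""
    if fmo:
        # Fully matching only: every constraint must be satisfied.
        return [r for r in results if r.constraints_matched == total_constraints]
    # Partial matching: at least one constraint must be satisfied.
    return [r for r in results if r.constraints_matched > 0]
```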

Some examples of constraints:

Hello world: this is two words, so two constraints.

"Hello world": the quotes make this a phrase, so one constraint.

"Hello world" -planet: this is two constraints, the phrase and the negative 'planet' (i.e. you do NOT want that word to appear)

sand and orsand parameters (items prefixed with a | in the query language) do not count as constraints for the purposes of partial matching in Funnelback.
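The counting rules in these examples can be sketched with a small tokeniser. This is a simplified illustration rather than Funnelback's actual query parser: count_constraints is a hypothetical helper, and it assumes that phrases are wrapped in double quotes and that anything starting with | should be skipped.

```python
import re

# An optionally signed quoted phrase, or a run of non-space characters.
TOKEN = re.compile(r'[-+]?"[^"]*"|\S+')


def count_constraints(query):
    """Count one constraint per word or quoted phrase, skipping | items."""
    tokens = TOKEN.findall(query)
    return sum(1 for token in tokens if not token.startswith("|"))
```

Under these assumptions, Hello world counts as two constraints and the quoted phrase "Hello world" as one.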

Cooler rankings within Funnelback are the different signals (factors/dimensions are other words for it) that make up the rank or score of a document for any given search constraints.

Some of these signals are based on the constraints you provide; others are based on the properties of the documents themselves (e.g. how many words are in the document, or how long the URL is).

Query processor options (QPOs) are the different configuration parameters you can pass to Funnelback’s query processor (padre). Cooler ranking options are a subset of the query processor options.

Cooler ranking options

This page describes the cooler ranking options which define the main influences that are applied to the Funnelback ranking algorithm.

These options can either be set in query processor options (collection.cfg) or using CGI parameters (e.g. …​&cool.2=12&cool.3=34…​).
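As a rough sketch, both forms can be generated from the same mapping of option number to weight. The helper names cooler_cgi_params and cooler_qpo are made up for this illustration; only the cool.N naming and the two output formats come from this page.

```python
from urllib.parse import urlencode


def cooler_cgi_params(weights):
    """Format {option number: weight} as CGI parameters."""
    return urlencode({f"cool.{n}": w for n, w in sorted(weights.items())})


def cooler_qpo(weights):
    """Format {option number: weight} as query processor options."""
    return " ".join(f"-cool.{n}={w}" for n, w in sorted(weights.items()))
```

The CGI form is appended to the search URL, while the query processor form is the value that would go into the collection.cfg setting.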

The set of cooler options from cool.23 to cool.28 replaces the cool.8 option. These are host-based scores, which means they are based on properties of the document’s domain. All documents in the same domain get the same boost. Sub-domains are considered different domains!

The set of cooler options from cool.31 to cool.66 represent nine content-related ranking options that can be applied in four different ways (indicated by a modifier on the option name):

  • Without any modifiers means it is the score normalised to a scale of 0 to 1. This normalisation reduces the impact of outliers.

  • The _neg modifier simply means the negative or inverse. So what was the lowest number is now the highest, and vice versa. It turns the unmodified (normalised) score on its head.

  • The _abs modifier means absolute. Rather than normalising the scale, it is the literal count for this option. That means big outliers can have big impacts.

  • The _abs_neg modifier means absolute negative. So it is the _abs count turned on its head.
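The four variants can be illustrated with a toy calculation. This sketch makes several assumptions that this page does not confirm: that normalisation is a simple min-max rescale, that the normalised inverse is 1 minus the score, and that the absolute inverse is plain negation. The function name four_variants is hypothetical.

```python
def four_variants(raw_values):
    """Return the four assumed variants of one content feature."""
    lo, hi = min(raw_values), max(raw_values)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal
    norm = [(v - lo) / span for v in raw_values]  # assumed min-max rescale
    return {
        "normalised": norm,
        "_neg": [1 - n for n in norm],          # assumed inversion
        "_abs": list(raw_values),               # literal counts
        "_abs_neg": [-v for v in raw_values],   # assumed inversion
    }
```

With raw counts of 10, 20 and 1000, the normalised variants stay within 0 to 1, while the absolute variants keep the full size of the outlier.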

The default algorithm uses -cool.0=41 -cool.2=14 -cool.3=45 with the remainder set to 0.
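To see how these weights relate to each other, the sketch below treats the final score as a weighted average of per-signal scores. That linear combination is an assumption made for illustration, and weight_shares and combined_score are hypothetical helpers, not part of padre.

```python
DEFAULT_WEIGHTS = {0: 41, 2: 14, 3: 45}  # cool.0, cool.2 and cool.3 from the text


def weight_shares(weights):
    """Express each weight as a fraction of the total weight."""
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def combined_score(weights, signals):
    """Weighted average of per-signal scores (each assumed to lie in 0..1)."""
    total = sum(weights.values())
    return sum(w * signals.get(n, 0.0) for n, w in weights.items()) / total
```

Because the default weights add up to 100, each weight can be read directly as a percentage of the overall score.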

cool.0 - Content weight

How well does your search match the content of the document? This is a composite score of many different elements that take your search terms and metadata constraints and run fancy algorithms. Don’t turn this one off, it matters.

cool.1 - On-site link weight

Links from within the same domain as this document where the anchor text (words that make up the hyperlink text) contains your constraints.

cool.2 - Off-site link weight

Links from other domains where the anchor text contains your constraints. In a normal world this would be trusted more than on-site, since these pages aren’t generally written by the same people as the source document.

cool.3 - URL length weight

The length of the URL. Shorter URLs receive a higher ranking. This accounts for 45% of the score in the default algorithm.

cool.4 - External evidence (QIE) weight

Query independent evidence (QIE) that enables an admin to up-weight or down-weight specific groups of content based on URL patterns. This ranking option controls the amount of influence given by QIE.

cool.5 - Date proximity weight

How close is the document’s date to the current date (or to a reference date set via a query processor option, if you need a specific date)? You need to have dates on your documents for this to work at all. The further the document’s date is from the reference date (either in the past or the future), the lower the score.

cool.6 - URL attractiveness

Homepages are up-weighted. Copyright pages and URLs with lots of punctuation are down-weighted.

URL type - uses the following heuristics:

  • Server default pages (e.g. index.html or just /) are really good, except that the homepage of the principal server for an organisation should not be up-weighted: www.example.com/ is not a good answer to return in a search of the example org, but site.example.com/ is.

  • Default pages for a directory (e.g. example.com/somepath/default.asp) are good but not as good as server default pages

  • Filenames whose root matches privacy, copyright, disclaimer, feedback, legals, sitemap etc. should be discouraged.

The punctuation score is derived from the number of punctuation characters in the URL, e.g. slashes, dots, question marks, ampersands etc.
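The punctuation count can be sketched as follows. The exact character set Funnelback uses is not given here, so the PUNCTUATION set below only contains the four examples named above and should be read as an assumption; punctuation_count is a hypothetical helper.

```python
PUNCTUATION = set("/.?&")  # assumed set: the text lists these as examples only


def punctuation_count(url):
    """Count the punctuation characters in a URL."""
    return sum(1 for ch in url if ch in PUNCTUATION)
```

A URL with a query string, such as example.com/a/b?x=1&y=2, counts more punctuation than example.com/ and would therefore be less attractive under this heuristic.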

cool.7 - Annotation weight (ANNIE)

ANNIE, or annotation indexing engine, was a relatively late addition to the core code. It processes the anchor data into a separate index and can offer improvements. It must first be enabled in the indexer options. It is not used by default, so in 99% of cases you can ignore it.

cool.8 - Domain weight

No longer in use - superseded by the host ranking features (cool.23 to cool.28).

cool.9 - Proximity to origin

How close the document’s geospatial coordinates are to the origin (closer is better).

cool.10 - Non-binariness

Is the document a binary file (pdf, docx, etc…​)? If yes, it gets a 0. If no (i.e. it’s a textual file) it gets 1.

cool.11 - Advertisements

Checks for the presence of google_ad_client in the body of an HTML document. If yes, it gets a fat 0. If no, it gets a 1. Why? In the generic web crawl world, once upon a time the presence of paid Google ads was an indicator of a bad actor site. This is questionable now. Do not use.

cool.12 - Implicit phrase matching

Boosts results where your input appears as a phrase in the content. If you provide multiple term constraints (i.e. words) then any document where those words appear right next to one another gets a bonus score. E.g. you search for 'Funnelback search': a document containing 'Funnelback search engine' gets a boost, while another document containing 'Funnelback is a search engine' gets 0.
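A minimal sketch of this binary signal, assuming simple whitespace tokenisation and case-insensitive comparison (the real indexer's tokenisation will differ). The function name implicit_phrase_score is hypothetical.

```python
def implicit_phrase_score(query, text):
    """Return 1 if the query words appear adjacent and in order, else 0."""
    query_words = query.lower().split()
    words = text.lower().split()
    if len(query_words) < 2:
        return 0  # the signal only applies to multi-word queries
    for i in range(len(words) - len(query_words) + 1):
        if words[i:i + len(query_words)] == query_words:
            return 1
    return 0
```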

cool.13 - Consistency of evidence

Looks for consistency between content and annotation evidence. Checks cool.0 and cool.7, if both are non-zero you get a 1, otherwise a 0. Since cool.7 is not on by default, this does nothing unless you turn it on.

cool.14 - Logarithm of annotation weight

Use this instead of cool.7 if cool.7 gives some extreme weighting differences. This is similar to cool.7 but the values for all documents are converted to their logarithm. Why logarithm? Because doing so removes skewing from the data. i.e. if you had some extremely large or small cool.7 scores, this removes the impact of those outliers.
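The effect of taking logarithms can be seen in a small sketch. It uses log(1 + score) so that a zero score stays at zero; that choice, and the log_scores name, are assumptions for this illustration rather than the documented behaviour.

```python
import math


def log_scores(scores):
    """Apply log(1 + score) to each score to compress large values."""
    return [math.log1p(score) for score in scores]
```

For raw scores of 1, 10 and 10000, the largest score is 10000 times the smallest, but after the transform the largest value is only around 13 times the smallest.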

cool.15 - Absolute-normalised logarithm of annotation weight

This un-skews the data even more than cool.14, so there is even less variability from outliers. Use it if even cool.14 is giving crazy weight differences. As above, but rather than running log(absolute_annie) it first normalises the scores (converts them to a range of 0.0 to 1.0), so it computes log(norm(annie)).

cool.16 - Annotation rank

Annotation rank = (k - rank) / k, where k = 2 x highest rank requested. If rank > k, rank is set to k.
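The formula translates directly into code. The function and parameter names here are chosen for illustration; the arithmetic follows the formula as stated.

```python
def annotation_rank_score(rank, highest_rank_requested):
    """Compute (k - rank) / k with k = 2 x highest rank requested."""
    k = 2 * highest_rank_requested
    rank = min(rank, k)  # if rank > k, rank = k
    return (k - rank) / k
```

With a highest requested rank of 10, k is 20, so a document at rank 1 scores 0.95 and any document at rank 20 or beyond scores 0.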

cool.17 - Field-weighted Okapi score

Field-weighted Okapi score.

cool.18 - Absolute-normalised Okapi score

This is the normalised score of 'BM-25', which is a well-known algorithm for calculating how relevant a document is against some textual search constraints. Funnelback calculates the score, and this option normalises it to minimise the impact of outliers. If you really want to know more about BM-25 see: https://en.wikipedia.org/wiki/Okapi_BM25. TL;DR: BM-25 is a type of text relevance score that can be useful for content weighting issues.

cool.19 - Field-weighted Okapi rank

Field-weighted Okapi rank.

cool.20 - Main hosts bias

Up-weights the main (www) domain. If the domain of this document is the 'www' domain you get a big 1. If not, you get 0.

cool.21 - Data source component weighting

The influence provided by the data source component weightings. The relative component weightings are set in the search package configuration, using the meta.component.[component].weight configuration options.

cool.22 - Document number in the index

Gives a score based on your document number in the index (the order in which the documents were discovered/indexed). Literally first come, first serve. First gets 1.0, last gets 0.0.
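One way to picture this is a straight line from 1.0 for the first document down to 0.0 for the last. The linear shape in between is an assumption (only the two end points are stated above), and document_number_score is a hypothetical helper that takes a zero-based position.

```python
def document_number_score(doc_index, total_docs):
    """Score a zero-based index position: first gets 1.0, last gets 0.0."""
    if total_docs < 2:
        return 1.0  # a single document is both first and last
    return 1.0 - doc_index / (total_docs - 1)
```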

cool.23 - Host incoming link score

This boosts based on how many links in the overall index point at the domain. More links = more boost.

cool.24 - Host click score

How many result clicks have documents inside this domain received? The more clicks, the more boost. This is a measure of the popularity of the domain over time, which can be useful. Requires click tracking to be active.

cool.25 - Host linking hosts score

The count of how many other domains have at least 1 link to this domain. The higher the count, the more boost.

cool.26 - Host linked host score

How many other domains does this domain have at least 1 link to? The higher the count, the more boost.

cool.27 - Host rank in crawl

When was this domain encountered in the crawl? (Web data sources only.) First come, first served. Naturally this has a lot to do with your start URLs.

cool.28 - Domain shallowness

The more sub-domain levels a host has, the lower the boost is, i.e. more .'s in your domain name means less score. So domain.example.com gets more boost than subdomain.domain.example.com. Shallowness means how close the host is to the top of its domain.

cool.29 - Document URL matches regex pattern

Up-weight documents that match a URL regex pattern, set via the indexer option -doc_feature_regex=<regex>. If a document matches it gets a big 1, otherwise 0. The regex can match things like any URL with /news/ in it, for instance.
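The 1-or-0 behaviour can be sketched with Python's re module. Funnelback's own regex dialect and matching rules may differ, so treat the use of re.search (match anywhere in the URL) as an assumption; url_regex_score is a hypothetical helper.

```python
import re


def url_regex_score(url, pattern):
    """Return 1 if the pattern is found anywhere in the URL, else 0."""
    return 1 if re.search(pattern, url) else 0
```

The opposite signal is then simply 1 - url_regex_score(url, pattern).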

cool.30 - Document URL does not match regex pattern

The opposite of cool.29.

cool.31 - Normalized title words

Normalized count of the number of words in the title of the document. The more words, the higher the score.

cool.32 - Normalized content words

Normalized count of words indexed for this document. Longer documents get more reward.

cool.33 - Normalized compressibility of document text

Normalized compressibility score of the document. By default, documents that are less compressible have fewer repeating segments, so are rewarded more.
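Compressibility can be approximated with any general-purpose compressor. The sketch below uses zlib and reports compressed size divided by original size, so a lower ratio means more compressible. Which compressor and formula Funnelback actually uses is not stated here, and compression_ratio is a hypothetical helper.

```python
import zlib


def compression_ratio(text):
    """Compressed size / original size (lower means more compressible)."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0  # nothing to compress
    return len(zlib.compress(raw)) / len(raw)
```

Text that repeats the same segment many times yields a much lower ratio than a short sentence of varied words.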

cool.34 - Normalized document entropy

Calculates a normalized score on how predictable the text is based on how often words reoccur. Rewards unpredictable text as it is likely to contain more 'unique' content. Calculation is based on an algorithm by Bendersky & Croft.

cool.35 - Normalized stop word fraction

Stop words are words like 'the', 'in', 'not' in English. Extremely frequent words that add no real value from a retrieval perspective. Documents with fewer stop words are rewarded.

cool.36 - Normalized stop word cover

How many different stop words does the document use? Documents with fewer different stop words are rewarded.

cool.37 - Normalized average term length

How long are the words in the document on average? Documents with longer words are rewarded.

cool.38 - Normalized distinct words

How many unique words does the document contain? Documents with more unique words are rewarded.

cool.39 - Normalized maximum term frequency

For each document, counts how often each word occurs and then takes the highest of those counts. The lower that count is, the more reward.
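Several of the content features described above (cool.32 and cool.35 to cool.39) can be computed from a simple word list. This is a rough sketch: the tiny STOP_WORDS set, the whitespace tokenisation and the reading of stop word cover as a count of distinct stop words are all assumptions, and content_features is a hypothetical helper.

```python
from collections import Counter

STOP_WORDS = {"the", "in", "not", "a", "is", "of", "and"}  # tiny illustrative list


def content_features(text):
    """Compute raw (absolute) content features for one document."""
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    stops = [w for w in words if w in STOP_WORDS]
    return {
        "content_words": len(words),
        "stop_word_fraction": len(stops) / len(words),
        "stop_word_cover": len(set(stops)),  # distinct stop words used
        "average_term_length": sum(len(w) for w in words) / len(words),
        "distinct_words": len(counts),
        "maximum_term_frequency": max(counts.values()),
    }
```

These are raw values; the normalised options would rescale them to a 0 to 1 range first.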

cool.40 - Negative normalized title words

Normalized count of the number of words in the title of the document. The fewer words, the higher the score.

cool.41 - Negative normalized content words

Normalized count of words indexed for this document. Shorter documents get more reward.

cool.42 - Negative normalized compressibility of document text

Normalized compressibility score of the document. By default, documents that are more compressible have more repeating segments, so are rewarded more.

cool.43 - Negative normalized document entropy

Calculates a normalized score on how predictable the text is based on how often words reoccur. Penalizes unpredictable text, even though it is likely to contain more 'unique' content. Calculation is based on an algorithm by Bendersky & Croft.

cool.44 - Negative normalized stop word fraction

Stop words are words like 'the', 'in', 'not' in English. Extremely frequent words that add no real value from a retrieval perspective. Documents with more stop words are rewarded.

cool.45 - Negative normalized stop word cover

How many different stop words does the document use? Documents with more different stop words are rewarded.

cool.46 - Negative normalized average term length

How long are the words in the document on average? Documents with shorter words are rewarded.

cool.47 - Negative normalized distinct words

How many unique words does the document contain? Documents with fewer unique words are rewarded.

cool.48 - Negative normalized maximum term frequency

For each document, counts how often each word occurs and then takes the highest of those counts. The higher that count is, the more reward.

cool.49 - Absolute title words

Absolute count of the number of words in the title of the document. The more words, the higher the score.

cool.50 - Absolute content words

Absolute count of words indexed for this document. Longer documents get more reward.

cool.51 - Absolute compressibility of document text

Absolute compressibility score of the document. By default, documents that are less compressible have fewer repeating segments, so are rewarded more.

cool.52 - Absolute document entropy

Calculates an absolute score on how predictable the text is based on how often words reoccur. Rewards unpredictable text as it is likely to contain more 'unique' content. Calculation is based on an algorithm by Bendersky & Croft.

cool.53 - Absolute stop word fraction

Stop words are words like 'the', 'in', 'not' in English. Extremely frequent words that add no real value from a retrieval perspective. Documents with fewer stop words are rewarded.

cool.54 - Absolute stop word cover

How many different stop words does the document use? Documents with fewer different stop words are rewarded.

cool.55 - Absolute average term length

How long are the words in the document on average? Documents with longer words are rewarded.

cool.56 - Absolute distinct words

How many unique words does the document contain? Documents with more unique words are rewarded.

cool.57 - Absolute maximum term frequency

For each document, counts how often each word occurs and then takes the highest of those counts. The lower that count is, the more reward.

cool.58 - Negative absolute title words

Absolute count of the number of words in the title of the document. The fewer words, the higher the score.

cool.59 - Negative absolute content words

Absolute count of words indexed for this document. Shorter documents get more reward.

cool.60 - Negative absolute compressibility of document text

Absolute compressibility score of the document. By default, documents that are more compressible have more repeating segments, so are rewarded more.

cool.61 - Negative absolute document entropy

Calculates an absolute score on how predictable the text is based on how often words reoccur. Penalizes unpredictable text, even though it is likely to contain more 'unique' content. Calculation is based on an algorithm by Bendersky & Croft.

cool.62 - Negative absolute stop word fraction

Stop words are words like 'the', 'in', 'not' in English. Extremely frequent words that add no real value from a retrieval perspective. Documents with more stop words are rewarded.

cool.63 - Negative absolute stop word cover

How many different stop words does the document use? Documents with more different stop words are rewarded.

cool.64 - Negative absolute average term length

How long are the words in the document on average? Documents with shorter words are rewarded.

cool.65 - Negative absolute distinct words

How many unique words does the document contain? Documents with fewer unique words are rewarded.

cool.66 - Negative absolute maximum term frequency

For each document, counts how often each word occurs and then takes the highest of those counts. The higher that count is, the more reward.

cool.67 - Lexical span

Related to the implicit phrase matching weighting (cool.12). Where cool.12 is a binary 1 or 0 reward (the words appear either as a phrase or do not) this is a more granular reward system. Similar to cool.12, this only takes effect if you enter multiple words. Lexical span measures 'how close together are the words in the text?'. As an example, you search for 'Funnelback search'. If the words are in a phrase, the lexical span is 0. If the words appear as 'Funnelback is a search engine' the lexical span for 'Funnelback search' is 2 because there are 2 words between what you are looking for.
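The lexical span in this example can be reproduced with a small sketch that finds the shortest run of words containing every query term and counts the extra words inside it. It assumes whitespace tokenisation and ignores word order, which may not match what padre does; lexical_span is a hypothetical helper.

```python
def lexical_span(query, text):
    """Extra words in the shortest window holding all query terms.

    Returns None when the text does not contain every query term.
    """
    terms = set(query.lower().split())
    words = text.lower().split()
    best = None
    for start in range(len(words)):
        if words[start] not in terms:
            continue  # a shortest window always starts on a query term
        seen = set()
        for end in range(start, len(words)):
            if words[end] in terms:
                seen.add(words[end])
            if seen == terms:
                extra = (end - start + 1) - len(terms)
                if best is None or extra < best:
                    best = extra
                break
    return best
```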

cool.68 - Document matches cgscope1

Up-weight documents that match the gscope pattern defined by cgscope1.

cool.69 - Document matches cgscope2

Up-weight documents that match the gscope pattern defined by cgscope2.

cool.70 - Document does not match cgscope1

Up-weight documents that do not match the gscope pattern defined by cgscope1.

cool.71 - Document does not match cgscope2

Up-weight documents that do not match the gscope pattern defined by cgscope2.

cool.72 - Raw ANNIE

The raw (absolute) ANNIE score for a document, linearly scaled to 0..1. Has a more dramatic impact than cool.7. See cool.7 for more details.

Values

Values are unbounded, but typical weights range from 0-100.

Example

To set the query processor to ignore URL length, but give a high weight to phrase matches implied by the query:

query_processor_options=-cool.3=0 -cool.12=100