Padre cooler ranking options
Every query submitted to Funnelback imposes a number of constraints on the search request.
Constraints are what you are asking from Funnelback: your search terms (i.e. words), any metadata restrictions (which includes facets), URL scope restrictions, etc. Funnelback offers a number of ways to formulate a question and retrieve data. Each individual thing you ask is a constraint that Funnelback will try and match.
Funnelback partially matches by default, so whilst it endeavours to return results that match all of your constraints, it will happily return some that match only some of them (to encourage some results to return). Partial matching refers to results that match some of the constraints (don't confuse this with partially matching some of the query words). You can turn this off by setting the query processor option fmo=true, which means only fully matching results are returned.
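As a sketch, full matching could be switched on in collection.cfg using the same query processor options format as the example at the end of this page (append the flag to any options you already have set):

query_processor_options=-fmo=true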
Some examples of constraints:
- Hello world: this is two words, so two constraints.
- "Hello world": the quotes make this a phrase, so one constraint.
- "Hello world" -planet: this is two constraints, the phrase and the negative 'planet' (i.e. you do NOT want that word to appear).

sand and orsand parameters (items prefixed with a | in the query language) do not count as constraints for the purposes of partial matching in Funnelback.
Cooler rankings within Funnelback are the different signals (also called factors or dimensions) that make up the rank, or score, of a document for any given set of search constraints.
Some of these signals are based on the constraints you provide; others are based on properties of the documents themselves, e.g. how many words the document contains, or how long its URL is.
Query processor options (QPOs) are the different configuration parameters you can pass to Funnelback’s query processor (padre). Cooler ranking options are a subset of the query processor options.
Cooler ranking options
This page describes the cooler ranking options which define the main influences that are applied to the Funnelback ranking algorithm.
These options can either be set in query processor options (collection.cfg) or using CGI parameters (e.g. …&cool.2=12&cool.3=34…).
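As a sketch of the collection.cfg form, the same weights would be set by appending them to the query processor options (the values here are the illustrative ones from the CGI example above, not defaults):

query_processor_options=-cool.2=12 -cool.3=34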
The set of cooler options is listed below. The default algorithm uses a weighted combination of several of these options.
- cool.0 - Content weight: How well does your search match the content of the document? This is a composite score of many different elements that take your search terms and metadata constraints and run fancy algorithms. Don't turn this one off, it matters.
- cool.1 - On-site link weight: Links from within the same domain as this document where the anchor text (the words that make up the hyperlink) contains your constraints.
- cool.2 - Off-site link weight: Links from other domains where the anchor text contains your constraints. In a normal world this would be trusted more than on-site links, since these pages aren't generally written by the same people as the source document.
- cool.3 - URL length weight: The length of the URL. Shorter URLs receive a higher ranking. This accounts for 45% of the score in the default algorithm.
- cool.4 - External evidence (QIE) weight: Query independent evidence (QIE) enables an admin to up-weight or down-weight specific groups of content based on URL patterns. This ranking option controls the amount of influence given by QIE.
- cool.5 - Date proximity weight: How close is the document's date to the current date (or to a reference date set via a query processor option if you need a specific date)? You need to have dates on your documents for this to work at all. The further the document's date is from the reference date (either in the past or the future), the lower the score.
- cool.6 - URL attractiveness: Homepages are up-weighted; copyright pages and URLs with lots of punctuation are down-weighted. The punctuation score is derived from the number of punctuation characters in the URL, e.g. slashes, dots, question marks, ampersands etc. URL type uses the following heuristics:
  - Server default pages (e.g. index.html or just /) are really good, except that the homepage of the principal server for an organisation should not be up-weighted. E.g. www.example.com/ is not a good answer to return in a search of the example org, but site.example.com/ is.
  - Default pages for a directory (e.g. example.com/somepath/default.asp) are good, but not as good as server default pages.
  - Filenames whose root matches privacy, copyright, disclaimer, feedback, legals, sitemap etc. should be discouraged.
- cool.7 - Annotation weight (ANNIE): ANNIE, the annotation indexing engine, was a relatively late addition to the core code. It processes the anchor data into a separate index and can offer improvements. It needs to be enabled in the indexer options first; it is not used by default, so in 99% of cases ignore it.
- cool.8 - Domain weight: No longer in use - superseded by the host ranking features (cool.23 to cool.28).
- cool.9 - Proximity to origin: How close the document's geospatial coordinates are to the origin (closer is better).
- cool.10 - Non-binariness: Is the document a binary file (pdf, docx, etc.)? If yes, it gets a 0. If no (i.e. it's a textual file), it gets a 1.
- cool.11 - Advertisements: Checks for the presence of google_ad_client in the body of an HTML document. If yes, it gets a fat 0. If no, it gets a 1. Why? In the generic web crawl world, once upon a time the presence of paid Google ads was an indicator of a bad actor site. This is questionable now. Do not use.
- cool.12 - Implicit phrase matching: Boosts results where your input appears as a phrase in the content. If you provide multiple term constraints (i.e. words), then any document where those words appear right next to one another gets a bonus score. E.g. for the search 'Funnelback search', a document containing 'Funnelback search engine' gets a boost; another document containing 'Funnelback is a search engine' gets 0.
- cool.13 - Consistency of evidence: Looks for consistency between content and annotation evidence. Checks cool.0 and cool.7; if both are non-zero you get a 1, otherwise a 0. Since cool.7 is not on by default, this does nothing unless you turn it on.
- cool.14 - Logarithm of annotation weight: Use this instead of cool.7 if cool.7 gives some extreme weighting differences. This is similar to cool.7, but the values for all documents are converted to their logarithm. Why logarithm? Because doing so removes skewing from the data, i.e. if you had some extremely large or small cool.7 scores, this removes the impact of those outliers.
- cool.15 - Absolute-normalised logarithm of annotation weight: This un-skews the data even more than cool.14, so there is even less variability from outliers. Use it if even cool.14 is giving crazy weight differences. As above, but rather than running log(absolute_annie) it first normalises the scores (converting them to a range of 0.0 to 1.0), so log(norm(annie)).
- cool.16 - Annotation rank: annotation rank = (k - rank) / k, where k = 2 × the highest rank requested; if rank > k, rank is capped at k. E.g. if 10 results are requested then k = 20, so a document at rank 1 scores (20 - 1) / 20 = 0.95.
- cool.17 - Field-weighted Okapi score: Field-weighted Okapi score.
- cool.18 - Absolute-normalised Okapi score: The normalised score of BM25, a well-known algorithm for calculating how relevant a document is against some textual search constraints. Funnelback calculates it, and this option normalises it to minimise the impact of outliers. If you really want to know more about BM25 see: https://en.wikipedia.org/wiki/Okapi_BM25. TL;DR BM25 is a type of text relevance score that can be useful for content weighting issues.
- cool.19 - Field-weighted Okapi rank: Field-weighted Okapi rank.
- cool.20 - Main hosts bias: Up-weights the main (www) domain. If the domain of this document is the 'www' domain, you get a big 1. If not, you get 0.
- cool.21 - Data source component weighting: The influence provided by the data source component weightings. The relative component weightings are set in the search package configuration, using the meta.component.[component].weight configuration options.
- cool.22 - Document number in the index: Gives a score based on your document number in the index (the order in which the documents were discovered/indexed). Literally first come, first served. The first gets 1.0, the last gets 0.0.
- cool.23 - Host incoming link score: Boosts based on how many links in the overall index point at this document's domain. More links = more boost.
- cool.24 - Host click score: How many result clicks have documents within this domain received? The more clicks, the more boost. A measure of the popularity of the domain over time; can be useful. Requires click tracking to be active.
- cool.25 - Host linking hosts score: A count of how many other domains have at least 1 link to this domain. The higher the count, the more boost.
- cool.26 - Host linked host score: How many other domains does this domain have at least 1 link to? The higher the count, the more boost.
- cool.27 - Host rank in crawl: When was this domain encountered in the crawl (web data sources only)? First come, first served. Naturally this has a lot to do with your start URLs.
- cool.28 - Domain shallowness: The more sub-domains a host has, the lower the boost, i.e. more dots in your domain name means less score. So domain.example.com gets more boost than subdomain.domain.example.com. Shallowness means how close the host is to the top of the domain hierarchy.
- cool.29 - Document URL matches regex pattern: Up-weights documents that match a URL regex pattern, set via the indexer option -doc_feature_regex=<regex>. If a document matches it gets a big 1, otherwise 0. The regex can match things like any URL with /news/ in it, for instance (see the sketch after this list).
- cool.30 - Document URL does not match regex pattern: The opposite of cool.29.
- cool.31 - Normalized title words: Normalized count of the number of words in the title of the document. The more words, the higher the score.
- cool.32 - Normalized content words: Normalized count of words indexed for this document. Longer documents get more reward.
- cool.33 - Normalized compressibility of document text: Normalized compressibility score of the document. By default, documents that are less compressible have fewer repeating segments, so are rewarded more.
- cool.34 - Normalized document entropy: Calculates a normalized score of how predictable the text is based on how often words reoccur. Rewards unpredictable text, as it is likely to contain more 'unique' content. The calculation is based on an algorithm by Bendersky & Croft.
- cool.35 - Normalized stop word fraction: Stop words are words like 'the', 'in', 'not' in English: extremely frequent words that add no real value from a retrieval perspective. Documents with fewer stop words are rewarded.
- cool.36 - Normalized stop word cover: How many different stop words does the document use? Documents with fewer different stop words are rewarded.
- cool.37 - Normalized average term length: How long are the words in the document on average? Documents with longer words are rewarded.
- cool.38 - Normalized distinct words: How many unique words does the document contain? Documents with more unique words are rewarded.
- cool.39 - Normalized maximum term frequency: For each document, finds the count of the most frequently occurring word. The lower that count is, the more reward.
- cool.40 - Negative normalized title words: Normalized count of the number of words in the title of the document. The fewer words, the higher the score.
- cool.41 - Negative normalized content words: Normalized count of words indexed for this document. Shorter documents get more reward.
- cool.42 - Negative normalized compressibility of document text: Normalized compressibility score of the document. By default, documents that are more compressible have more repeating segments, so are rewarded more.
- cool.43 - Negative normalized document entropy: Calculates a normalized score of how predictable the text is based on how often words reoccur. Penalizes unpredictable text, as it likely contains more 'unique' content. The calculation is based on an algorithm by Bendersky & Croft.
- cool.44 - Negative normalized stop word fraction: Stop words are words like 'the', 'in', 'not' in English: extremely frequent words that add no real value from a retrieval perspective. Documents with more stop words are rewarded.
- cool.45 - Negative normalized stop word cover: How many different stop words does the document use? Documents with more different stop words are rewarded.
- cool.46 - Negative normalized average term length: How long are the words in the document on average? Documents with shorter words are rewarded.
- cool.47 - Negative normalized distinct words: How many unique words does the document contain? Documents with fewer unique words are rewarded.
- cool.48 - Negative normalized maximum term frequency: For each document, finds the count of the most frequently occurring word. The higher that count is, the more reward.
- cool.49 - Absolute title words: Absolute count of the number of words in the title of the document. The more words, the higher the score.
- cool.50 - Absolute content words: Absolute count of words indexed for this document. Longer documents get more reward.
- cool.51 - Absolute compressibility of document text: Absolute compressibility score of the document. By default, documents that are less compressible have fewer repeating segments, so are rewarded more.
- cool.52 - Absolute document entropy: Calculates an absolute score of how predictable the text is based on how often words reoccur. Rewards unpredictable text, as it is likely to contain more 'unique' content. The calculation is based on an algorithm by Bendersky & Croft.
- cool.53 - Absolute stop word fraction: Stop words are words like 'the', 'in', 'not' in English: extremely frequent words that add no real value from a retrieval perspective. Documents with fewer stop words are rewarded.
- cool.54 - Absolute stop word cover: How many different stop words does the document use? Documents with fewer different stop words are rewarded.
- cool.55 - Absolute average term length: How long are the words in the document on average? Documents with longer words are rewarded.
- cool.56 - Absolute distinct words: How many unique words does the document contain? Documents with more unique words are rewarded.
- cool.57 - Absolute maximum term frequency: For each document, finds the count of the most frequently occurring word. The lower that count is, the more reward.
- cool.58 - Negative absolute title words: Absolute count of the number of words in the title of the document. The fewer words, the higher the score.
- cool.59 - Negative absolute content words: Absolute count of words indexed for this document. Shorter documents get more reward.
- cool.60 - Negative absolute compressibility of document text: Absolute compressibility score of the document. By default, documents that are more compressible have more repeating segments, so are rewarded more.
- cool.61 - Negative absolute document entropy: Calculates an absolute score of how predictable the text is based on how often words reoccur. Penalizes unpredictable text, as it likely contains more 'unique' content. The calculation is based on an algorithm by Bendersky & Croft.
- cool.62 - Negative absolute stop word fraction: Stop words are words like 'the', 'in', 'not' in English: extremely frequent words that add no real value from a retrieval perspective. Documents with more stop words are rewarded.
- cool.63 - Negative absolute stop word cover: How many different stop words does the document use? Documents with more different stop words are rewarded.
- cool.64 - Negative absolute average term length: How long are the words in the document on average? Documents with shorter words are rewarded.
- cool.65 - Negative absolute distinct words: How many unique words does the document contain? Documents with fewer unique words are rewarded.
- cool.66 - Negative absolute maximum term frequency: For each document, finds the count of the most frequently occurring word. The higher that count is, the more reward.
- cool.67 - Lexical span: Related to the implicit phrase matching weighting (cool.12). Where cool.12 is a binary 1 or 0 reward (the words either appear as a phrase or do not), this is a more granular reward system. Similar to cool.12, it only takes effect if you enter multiple words. Lexical span measures how close together the words are in the text. As an example, you search for 'Funnelback search'. If the words are in a phrase, the lexical span is 0. If the words appear as 'Funnelback is a search engine', the lexical span for 'Funnelback search' is 2, because there are 2 words between the terms you are looking for.
- cool.68 - Document matches cgscope1: Up-weights documents that match the gscope pattern defined by cgscope1.
- cool.69 - Document matches cgscope2: Up-weights documents that match the gscope pattern defined by cgscope2.
- cool.70 - Document does not match cgscope1: Up-weights documents that do not match the gscope pattern defined by cgscope1.
- cool.71 - Document does not match cgscope2: Up-weights documents that do not match the gscope pattern defined by cgscope2.
- cool.72 - Raw ANNIE: The raw (absolute) ANNIE score for a document, linearly scaled to 0..1. Has a more dramatic impact than cool.7. See cool.7 for more details.
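As referenced in the cool.29 entry above, a sketch of pairing the indexer option with a cooler weight is shown below. The /news/ pattern and the weight of 50 are illustrative values, not defaults, and each option would be appended to any flags those settings already hold:

indexer_options=-doc_feature_regex=/news/
query_processor_options=-cool.29=50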
Example
To set the query processor to ignore URL length, but give a high weight to phrase matches implied by the query:
query_processor_options=-cool.3=0 -cool.12=100
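The same pair of options could instead be applied to a single request via CGI parameters, as noted earlier. A hypothetical example URL (the host and collection name are placeholders):

https://search.example.com/s/search.html?collection=example&query=hello&cool.3=0&cool.12=100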