Adding limited wildcard support to DAAT mode
Background
Funnelback does not generally permit wildcards in search queries because they carry a large performance cost.
Funnelback has two modes of processing a query:
- Document at a time (DAAT): processes matches by examining each document and deciding if the document is relevant for search results. This is the default mode of processing queries.
- Term at a time (TAAT): processes matches by examining each term then checking each document for relevance. This was the default mode of processing queries until Funnelback 10.
Document at a time is much more efficient at processing queries, especially over large datasets, but it does not support any form of wildcard. Term at a time mode supported wildcards (via a truncation operator); this slowed the search down dramatically but did allow wildcard matching.
Sometimes it is desirable to provide wildcard support, especially if a database-style query is being performed. This article shows how to configure a collection to provide limited wildcard support when using Funnelback’s document at a time mode.
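The difference between the two modes can be illustrated with a toy example. The sketch below is not Funnelback's implementation: the index data, the function names taat and daat, and the simple "all terms must match" rule are assumptions made purely for illustration.

```groovy
// Toy inverted index: term -> sorted list of document ids (made-up sample data).
def index = [
    dan  : [1, 3, 4],
    smith: [2, 3, 4, 7]
]

// Term at a time: process one posting list completely, then the next,
// keeping a per-document count of how many query terms matched.
def taat(Map index, List terms) {
    def hits = [:].withDefault { 0 }
    terms.each { t -> (index[t] ?: []).each { doc -> hits[doc] += 1 } }
    hits.findAll { it.value == terms.size() }.keySet().sort()
}

// Document at a time: walk all posting lists in step and decide on each
// candidate document before moving on to the next one.
def daat(Map index, List terms) {
    if (!terms) return []
    def lists = terms.collect { index[it] ?: [] }
    def pos = lists.collect { 0 }
    def out = []
    while ((0..<lists.size()).every { pos[it] < lists[it].size() }) {
        def current = (0..<lists.size()).collect { lists[it][pos[it]] }
        def max = current.max()
        if (current.every { it == max }) {
            // every list points at the same document: it matches all terms
            out << max
            pos = pos.collect { it + 1 }
        } else {
            // advance every list that is behind the largest current document id
            pos = (0..<lists.size()).collect { current[it] < max ? pos[it] + 1 : pos[it] }
        }
    }
    out
}

println taat(index, ['dan', 'smith'])
println daat(index, ['dan', 'smith'])
```

Both functions return the same documents for this data. The toy index also hints at why a wildcard is awkward in either mode: a term such as dan* does not correspond to a single posting list, so it first has to be turned into a set of concrete terms.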
Process
The following process can be used to add limited truncation support, allowing you to use the asterisk operator on the right-hand side (only) of words in a query. The code uses the auto-completion service to fetch terms for each starred item. By default it requests 5 terms per starred item, and a configuration option adjusts how many auto-completions are requested.
This method provides an efficient way of supporting wildcards, but the expanded query won’t return all of the matching results that a full expansion of the wildcard would. There is a fine balance between performance and completeness of the result set, and the goal of a web search engine such as Funnelback has always been to return a set of relevant results. This set isn’t necessarily complete, as shortcuts are taken to determine relevant results quickly.
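The expansion described above can be sketched in isolation, without any Funnelback classes. In this sketch the suggest closure is a hypothetical stand-in for the auto-completion service, its vocabulary is made-up sample data, and expandQuery is an illustrative name rather than part of any Funnelback API.

```groovy
// Hypothetical stand-in for the auto-completion service: returns up to `limit`
// known terms starting with `prefix`. The vocabulary is invented sample data.
def suggest = { String prefix, int limit ->
    def vocabulary = ['daniel', 'danielle', 'danish', 'dance', 'danger', 'dante', 'smith']
    vocabulary.findAll { it.startsWith(prefix.toLowerCase()) }.take(limit)
}

// Splits the query, replaces each right-truncated term (e.g. "Dan*") with a
// bracketed OR group of quoted suggestions, and keeps the other terms as they are.
def expandQuery(String query, int limit, Closure suggest) {
    def kept = []
    def orGroups = []
    query.tokenize(' ').each { term ->
        if (term ==~ /\w+\*/) {
            def stem = term[0..-2]                 // drop the trailing asterisk
            def suggestions = suggest(stem, limit)
            if (suggestions) {
                orGroups << '[' + suggestions.collect { '"' + it + '"' }.join(' ') + ']'
            }
        } else {
            kept << term
        }
    }
    [query: kept.join(' '), systemQuery: orGroups.join(' ')]
}

def result = expandQuery('Dan* Smith', 3, suggest)
println result.query
println result.systemQuery
```

With a limit of 3, only the first three of the six sample terms that start with "dan" make it into the OR group. This is the completeness trade-off: documents that only contain one of the other three terms would not be matched by the starred term.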
Step 1. Add asterisk support and expand the query
Create a hook_pre_process.groovy script for the collection with the following contents:
// Add partial query support to padre.
// Reads the query CGI parameter and creates a disjunctive query based on the suggestions
// returned from padre-qs, injected into the system query parameter.

// Imports required for access to the padre suggest service
import java.io.File;
import java.util.List;
import com.funnelback.common.config.DefaultValues;
import com.funnelback.dataapi.connector.padre.PadreConnector;
import com.funnelback.dataapi.connector.padre.suggest.Suggestion;
import com.funnelback.dataapi.connector.padre.suggest.Suggestion.ActionType;
import com.funnelback.dataapi.connector.padre.suggest.Suggestion.DisplayType;
import com.funnelback.common.Environment;

def logger = org.apache.logging.log4j.LogManager.getLogger("com.funnelback.MyHookScript")
def q = transaction.question

if (q.currentProfileConfig.get("partial_query_enabled") == "true") {
    // Convert a partial query into a set of query terms.
    // Maximum number of query terms to expand a partial query to; read from the
    // partial_query_expansion_index configuration parameter (default 5).
    // e.g. partial_query=com might expand to query=[commerce commercial common computing]
    def partial_query_expansion_index = 5
    if ((q.currentProfileConfig.get("partial_query_expansion_index") != null) && (q.currentProfileConfig.get("partial_query_expansion_index").isInteger())) {
        partial_query_expansion_index = Integer.parseInt(q.currentProfileConfig.get("partial_query_expansion_index"))
    }

    if (q.query != null) {
        // Explode the query and expand each item that ends with a *
        def terms = q.query.tokenize(" ");
        terms.each {
            def term = it
            if (term ==~ /\w+\*$/) {
                // Remove the starred term from the query. The -= operator assigns a new
                // list to terms, so the list being iterated over is not modified.
                terms -= term
                def termclean = term.replaceAll(~/\*$/,"")

                // Read $SEARCH_HOME
                def sH = Environment.getValidSearchHome().getCanonicalPath();
                File searchHome = new File(sH)
                File indexStem = new File(q.collection.currentProfileConfig.get("collection_root") + File.separator + "live" + File.separator + "idx","index")

                // NOTE: CONSTRUCTOR HAS CHANGED post v15.16 and requires 3 parameters
                List<Suggestion> suggestions = new PadreConnector(searchHome,indexStem,q.collection.id)
                    .suggest(termclean)
                    .suggestionCount(partial_query_expansion_index)
                    .fetch();

                // Use this for v15.0-15.14
                /* List<Suggestion> suggestions = new PadreConnector(searchHome,indexStem)
                    .suggest(termclean)
                    .suggestionCount(partial_query_expansion_index)
                    .fetch(); */

                // Use this instead for v14.2 and earlier
                /* List<Suggestion> suggestions = new PadreConnector(indexStem)
                    .suggest(termclean)
                    .suggestionCount(partial_query_expansion_index)
                    .fetch(); */

                // Build the expanded query from the list of suggestions,
                // quoting each suggestion so it is treated as a phrase
                def expanded_query = ''
                suggestions.each {
                    expanded_query += '"'+it.key+'" '
                }

                // Add the expanded set of query terms, ORed together, to the system query (s parameter)
                if (expanded_query != "") {
                    if (q.rawInputParameters["s"] == null) {
                        q.rawInputParameters["s"] = ["["+expanded_query+"]"]
                    }
                    else {
                        q.rawInputParameters["s"][0] += " ["+expanded_query+"]"
                    }
                }
            }
        }
        // Reconstruct the query from the remaining (unstarred) terms
        q.query = terms.join(" ");
    }
}
Step 2. Enable wildcard support and configure the level of expansion
Add the following to the search page configuration:
partial_query_enabled=true
# optionally add the following line to indicate a maximum number of terms to expand a starred term to. The default is 5
# e.g. expand each starred term with 3 expansions
partial_query_expansion_index=3
Queries such as Dan* Smith should now be accepted. The expanded queries can be seen by viewing the JSON or XML output and looking at the query, queryAsProcessed, queryRaw, querySystemRaw and queryCleaned values from the response packet. Expansions are injected into the querySystemRaw element.
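When checking many queries it can be convenient to pull these values out of the JSON output with a small script. The sketch below makes several assumptions: that the fields sit under response.resultPacket in the JSON, that queryFields is a helper name invented here, and that the sample document is hand-written for illustration rather than captured from a real search.

```groovy
import groovy.json.JsonOutput
import groovy.json.JsonSlurper

// Hand-written sample standing in for the search JSON output (values are invented).
def sampleJson = JsonOutput.toJson([
    response: [
        resultPacket: [
            queryRaw      : 'Smith',
            querySystemRaw: '["daniel" "danielle" "danish"]'
        ]
    ]
])

// Returns the query-related fields named in this article; fields that are
// missing from the document come back as null.
def queryFields(String json) {
    def packet = new JsonSlurper().parseText(json)?.response?.resultPacket ?: [:]
    ['queryAsProcessed', 'queryRaw', 'querySystemRaw', 'queryCleaned'].collectEntries {
        [(it): packet[it]]
    }
}

queryFields(sampleJson).each { name, value -> println "${name}: ${value}" }
```

If querySystemRaw comes back empty for a starred query, the likely causes are that partial_query_enabled is not set to true or that the auto-completion service returned no suggestions for the truncated term.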