Adding limited wildcard support to DAAT mode

Background

Funnelback does not generally permit wildcards in search queries because they incur a large performance cost.

Funnelback has two modes of processing a query:

  • Document at a time (DAAT): processes matches by examining each document and deciding if the document is relevant for search results. This is the default mode of processing queries.

  • Term at a time (TAAT): processes matches by examining each term, then checking each document for relevance. This was the default mode of processing queries until Funnelback 10.

Document at a time is much more efficient at processing queries, especially over large datasets, but it does not support any form of wildcard. Term at a time mode supported wildcards via a truncation operator, which allowed wildcard matching but slowed searches down dramatically.
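
To make the difference between the two traversal orders concrete, here is a toy sketch in Groovy. The tiny in-memory index, the scoreTaat and scoreDaat names and the one-point-per-matching-term scoring are all illustrative assumptions; this is not how Funnelback's query processor is implemented, and a real DAAT engine walks the posting lists with cursors rather than testing membership.

```groovy
// Toy inverted index: term -> ids of the documents containing it (made-up data)
def index = [
    funnelback: [1, 2, 4],
    search    : [2, 3, 4],
    wildcard  : [4]
]

// Term at a time: walk each term's posting list in turn, accumulating a score per document
def scoreTaat(Map<String, List<Integer>> index, List<String> terms) {
    def scores = new TreeMap<Integer, Integer>()
    terms.each { term ->
        (index[term] ?: []).each { docId ->
            scores[docId] = (scores[docId] ?: 0) + 1
        }
    }
    return scores
}

// Document at a time: visit each candidate document once and score it against every term
def scoreDaat(Map<String, List<Integer>> index, List<String> terms) {
    def scores = new TreeMap<Integer, Integer>()
    def candidates = terms.collectMany { index[it] ?: [] }.unique().sort()
    candidates.each { docId ->
        scores[docId] = terms.count { term -> (index[term] ?: []).contains(docId) } as int
    }
    return scores
}

println scoreTaat(index, ['funnelback', 'search'])
println scoreDaat(index, ['funnelback', 'search'])
```

Both functions arrive at the same scores; only the order in which the index is traversed differs.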

Sometimes it is desirable to provide wildcard support - especially if a database-style query is being performed. This article shows how to configure a collection to provide limited wildcard support when using Funnelback’s document at a time mode.

Process

The following process can be used to add limited truncation support, allowing you to use the asterisk operator on the right-hand side (only) of words in a query. The code uses the auto-completion service to fetch up to five terms for each starred term by default. A configuration option adjusts the number of auto-completions requested from the service for each starred term.
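
As a rough illustration of what "right-hand side only" means, the following Groovy sketch separates a query into plain terms and right-truncated prefixes. The splitQuery name and the returned map layout are hypothetical and exist only for this example.

```groovy
// Split a query into plain terms and right-truncated prefixes.
// Only a trailing asterisk on a word counts as a wildcard; anything else is left alone.
def splitQuery(String query) {
    def plain = []
    def prefixes = []
    query.tokenize(' ').each { term ->
        if (term ==~ /\w+\*/) {
            prefixes << term[0..-2]   // drop the trailing asterisk
        } else {
            plain << term
        }
    }
    return [plain: plain, prefixes: prefixes]
}

def parts = splitQuery('Dan* Smith *son')
println parts.plain
println parts.prefixes
```

In this sketch a leading asterisk (as in *son) is not recognised as a wildcard, so the term is passed through as ordinary text.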

This method supports wildcards efficiently, but the expanded query won’t return every matching result that a full expansion of the wildcard would. There is a trade-off between performance and completeness of the result set: the goal of a web search engine such as Funnelback has always been to return a set of relevant results, and that set isn’t necessarily complete because shortcuts are taken to find relevant results quickly.
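
The trade-off can be seen with a small Groovy sketch. The vocabulary list and the expand helper below are invented for illustration; in practice the candidate terms come from the auto-completion service rather than a hard-coded list.

```groovy
// Hypothetical vocabulary, assumed to be ordered by suggestion weight (highest first)
def vocabulary = ['commerce', 'commercial', 'common', 'computing', 'community', 'company', 'compare']

// Return at most 'limit' vocabulary entries that start with the prefix
def expand(List<String> vocabulary, String prefix, int limit) {
    vocabulary.findAll { it.startsWith(prefix) }.take(limit)
}

def full = expand(vocabulary, 'com', Integer.MAX_VALUE)  // what a full wildcard expansion would cover
def limited = expand(vocabulary, 'com', 5)               // what a limit of five covers
def missed = full - limited                              // terms the limited expansion never searches for

println "limited expansion: ${limited}"
println "not searched: ${missed}"
```

Documents that contain only the terms in the missed list would not be returned for com*, which is the incompleteness described above.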

Step 1. Add asterisk support and expand the query

Create a hook_pre_process.groovy script for the collection with the following contents:

// Add partial query support to padre
// reads the query CGI parameter and creates a disjunctive query based on the suggestions returned from padre-qs, injected into the system query parameter

// imports required for access to the padre suggest service
import java.io.File;
import java.util.List;
import com.funnelback.common.config.DefaultValues;
import com.funnelback.dataapi.connector.padre.PadreConnector;
import com.funnelback.dataapi.connector.padre.suggest.Suggestion;
import com.funnelback.dataapi.connector.padre.suggest.Suggestion.ActionType;
import com.funnelback.dataapi.connector.padre.suggest.Suggestion.DisplayType;
import com.funnelback.common.Environment;

def logger = org.apache.logging.log4j.LogManager.getLogger("com.funnelback.MyHookScript")

def q = transaction.question
if (q.currentProfileConfig.get("partial_query_enabled") == "true") {
    // Convert a partial query into a set of query terms
    // Maximum number of query terms to expand partial query to - read from configuration partial_query_expansion_index parameter.
    // eg. partial_query=com might expand to query=[commerce commercial common computing]
    def partial_query_expansion_index = 5
    if ((q.currentProfileConfig.get("partial_query_expansion_index") != null) && (q.currentProfileConfig.get("partial_query_expansion_index").isInteger())) {
        partial_query_expansion_index = Integer.parseInt(q.currentProfileConfig.get("partial_query_expansion_index"))
    }
    if (q.query != null) {
        // explode the query and expand each item that ends with a *
        def terms = q.query.tokenize(" ");
        terms.each {
            def term = it
            if (term ==~ /\w+\*$/) {
                //remove term from q.query
                terms -= term
                def termclean = term.replaceAll(~/\*$/,"")
                // Read $SEARCH_HOME
                def sH = Environment.getValidSearchHome().getCanonicalPath();
                File searchHome = new File(sH)
                File indexStem = new File(q.collection.currentProfileConfig.get("collection_root") + File.separator + "live" + File.separator + "idx","index")

                // NOTE: CONSTRUCTOR HAS CHANGED post v15.16 and requires 3 parameters
                List<Suggestion> suggestions = new PadreConnector(searchHome,indexStem,q.collection.id)
                  .suggest(termclean)
                  .suggestionCount(partial_query_expansion_index)
                  .fetch();

                // Use this for v15.0-15.14
                /* List<Suggestion> suggestions = new PadreConnector(searchHome,indexStem)
                  .suggest(termclean)
                  .suggestionCount(partial_query_expansion_index)
                  .fetch(); */

                // Use this instead for v14.2 and earlier
                /* List<Suggestion> suggestions = new PadreConnector(indexStem)
                  .suggest(termclean)
                  .suggestionCount(partial_query_expansion_index)
                  .fetch(); */

                // build the expanded query from the list of suggestions
                def expanded_query = ''
                suggestions.each {
                    expanded_query += '"'+it.key+'" '
                }
                // set the query to the expanded set of query terms ORed together
                if (expanded_query != "") {
                    if (q.rawInputParameters["s"] == null) {
                        q.rawInputParameters["s"] = ["["+expanded_query+"]"]
                    }
                    else {
                        q.rawInputParameters["s"][0] += " ["+expanded_query+"]"
                    }
                }
            }
        }
        // reconstruct query.
        q.query = terms.join(" ");
    }
}
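
To experiment with the string handling outside Funnelback, the following stand-alone Groovy sketch mirrors what the hook does to the query and the system query. The expandQuery function and the stub suggester are hypothetical: the stub stands in for the padre suggestion lookup, the names it returns are made up, and the bracketed group is built without the trailing space that the hook leaves inside the brackets.

```groovy
// Stand-alone mirror of the hook's string handling; the suggestion lookup is passed in as a closure
def expandQuery(String query, Closure<List<String>> suggest) {
    def kept = []
    def systemQuery = ''
    query.tokenize(' ').each { term ->
        if (term ==~ /\w+\*/) {
            // starred terms are removed from the user query and looked up by prefix
            def suggestions = suggest(term[0..-2])
            if (suggestions) {
                // quote each suggestion and wrap the group in [ ] so the alternatives are ORed
                def group = '[' + suggestions.collect { '"' + it + '"' }.join(' ') + ']'
                systemQuery = systemQuery ? systemQuery + ' ' + group : group
            }
        } else {
            kept << term
        }
    }
    return [query: kept.join(' '), s: systemQuery]
}

// Stub in place of the suggestion service; the suggestions are invented for the example
def stub = { String prefix -> prefix == 'Dan' ? ['dan', 'daniel', 'danielle'] : [] }
def result = expandQuery('Dan* Smith', stub)
println "query = ${result.query}"
println "s     = ${result.s}"
```

As in the hook, a starred term that yields no suggestions is dropped from the query and nothing is added to the system query for it.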

Step 2. Enable wildcard support and configure the level of expansion

Add the following to the search page configuration:

partial_query_enabled=true
# optionally add the following line to indicate a maximum number of terms to expand a starred term to.  The default is 5
# e.g. expand each starred term with 3 expansions
partial_query_expansion_index=3
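
The fallback to the default of 5 can be sketched in isolation as follows. The resolveExpansionLimit helper is a hypothetical name used only here; it takes the raw configuration string rather than reading it from Funnelback, so it can be run on its own.

```groovy
// Resolve the expansion limit from a raw configuration value.
// Missing or non-integer values fall back to the default of 5.
def resolveExpansionLimit(String rawValue, int defaultLimit = 5) {
    def value = rawValue?.trim()
    return (value && value.isInteger()) ? value.toInteger() : defaultLimit
}

println resolveExpansionLimit('3')
println resolveExpansionLimit(null)
println resolveExpansionLimit('many')
```

This sketch does not reject zero or negative numbers; a stricter version might treat those as invalid as well.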

Queries such as Dan* Smith should now be accepted. To see the expanded queries, view the JSON or XML output and inspect the query, queryAsProcessed, queryRaw, querySystemRaw and queryCleaned values in the response packet. Expansions are injected into the querySystemRaw element.