Adding limited wildcard support to DAAT mode

Background

Funnelback does not generally permit wildcards in search queries because they carry a large performance cost.

Funnelback has two modes of processing a query:

Document at a time (DAAT): processes matches by examining each document and deciding if the document is relevant for search results. This is the default mode of processing queries.
Term at a time (TAAT): processes matches by examining each term, then checking each document for relevance. This was the default mode of processing queries until Funnelback 10.

Document at a time mode is much more efficient at processing queries, especially over large datasets. However, it does not support any form of wildcard. Term at a time mode supported wildcards via a truncation operator, which slowed the search down dramatically but did allow wildcard matching.

Sometimes it is desirable to provide wildcard support - especially if a database-style query is being performed. This article shows how to configure a collection to provide limited wildcard support when using Funnelback’s document at a time mode.

Process

The following process adds limited truncation support, allowing the asterisk operator to be used on the right-hand side only of words in a query. By default, the code requests up to 5 terms from the auto-completion service for each starred term. A configuration option adjusts the number of auto-completions requested for each starred term.
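The expansion idea can be sketched in isolation before looking at the full hook script. The example below is a minimal illustration, not the hook itself: the `suggest` closure is a hypothetical stand-in for the auto-completion service backed by a made-up word list, and `expandQuery` is an illustrative name. It splits a query into plain terms and bracketed groups of quoted completions, in the same shape that the hook script injects into the system query.

```groovy
// Hypothetical stand-in for the auto-completion service: returns up to
// `limit` completions for a prefix from a fixed, made-up word list.
def suggest = { String prefix, int limit ->
    ['daniel', 'danny', 'dana', 'dance', 'danger', 'danish', 'smith']
        .findAll { it.startsWith(prefix.toLowerCase()) }
        .take(limit)
}

// Splits a query into plain terms and expansions of right-truncated terms.
def expandQuery = { String query, int limit ->
    def plain = []
    def groups = []
    query.tokenize(' ').each { term ->
        if (term ==~ /\w+\*/) {
            // strip the trailing asterisk and look up completions
            def completions = suggest(term[0..-2], limit)
            if (completions) {
                // quoted completions inside [ ] form an ORed group
                groups << '[' + completions.collect { '"' + it + '"' }.join(' ') + ']'
            }
        } else {
            plain << term
        }
    }
    [query: plain.join(' '), systemQuery: groups.join(' ')]
}

def result = expandQuery('Dan* Smith', 5)
println result.query        // Smith
println result.systemQuery  // ["daniel" "danny" "dana" "dance" "danger"]
```

As in the hook script, a starred term is removed from the user query whether or not any completions are found for it.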

This method provides an efficient way of supporting wildcards. However, the expanded query won’t return every result that a full expansion of the wildcard would match. There is a balance between performance and completeness of the result set, and the goal of a web search engine such as Funnelback has always been to return a set of relevant results. This set isn’t necessarily complete, as shortcuts are taken to determine relevant results quickly.
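To make the trade-off concrete, the sketch below uses a made-up list of index terms that start with the prefix dan and a limit of 5. The term list is illustrative only; real suggestions depend on the collection's auto-completion data.

```groovy
// Made-up list of every index term that starts with "dan" (illustrative data)
def matchingTerms = ['daniel', 'danny', 'dana', 'dance', 'danger', 'danish', 'dante']

// Number of completions requested for each starred term
def limit = 5

// Terms that end up in the expanded query
def searched = matchingTerms.take(limit)
// Terms that a full wildcard expansion would also have covered
def skipped = matchingTerms.drop(limit)

println searched  // [daniel, danny, dana, dance, danger]
println skipped   // [danish, dante]
```

Documents that contain only the skipped terms are not returned for dan*, which is the cost of capping the expansion.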

Step 1. Add asterisk support and expand the query

Create a hook_pre_process.groovy script for the collection with the following contents:

// Add partial query support to padre
// reads the query CGI parameter and creates a disjunctive query based on the suggestions returned from padre-qs, injected into the system query parameter

// imports required for access to the padre suggest service
import java.io.File;
import java.util.List;
import com.funnelback.common.config.DefaultValues;
import com.funnelback.dataapi.connector.padre.PadreConnector;
import com.funnelback.dataapi.connector.padre.suggest.Suggestion;
import com.funnelback.dataapi.connector.padre.suggest.Suggestion.ActionType;
import com.funnelback.dataapi.connector.padre.suggest.Suggestion.DisplayType;
import com.funnelback.common.Environment;

def logger = org.apache.logging.log4j.LogManager.getLogger("com.funnelback.MyHookScript")

def q = transaction.question
if (q.currentProfileConfig.get(["partial_query_enabled"]) == "true") {
    // Convert a partial query into a set of query terms
    // Maximum number of query terms to expand partial query to - read from configuration partial_query_expansion_index parameter.
    // eg. partial_query=com might expand to query=[commerce commercial common computing]
    def partial_query_expansion_index = 5
    if ((q.currentProfileConfig.get(["partial_query_expansion_index"]) != null) && (q.currentProfileConfig.get(["partial_query_expansion_index"]).isInteger())) {
      partial_query_expansion_index = Integer.parseInt(q.currentProfileConfig.get(["partial_query_expansion_index"]))
    }
    if (q.query != null) {
        // explode the query and expand each item that ends with a *
        def terms = q.query.tokenize(" ");
        terms.each {
            def term = it
            if (term ==~ /\w+\*$/) {
                //remove term from q.query
                terms -= term
                def termclean = term.replaceAll(~/\*$/,"")
                // Read $SEARCH_HOME
                def sH = Environment.getValidSearchHome().getCanonicalPath();
                File searchHome = new File(sH)
                File indexStem = new File(q.collection.currentProfileConfig.get(["collection_root"]) + File.separator + "live" + File.separator + "idx","index")

                // NOTE: CONSTRUCTOR HAS CHANGED post v15.16 and requires 3 parameters
                List<Suggestion> suggestions = new PadreConnector(searchHome,indexStem,q.collection.id)
                  .suggest(termclean)
                  .suggestionCount(partial_query_expansion_index)
                  .fetch();

                // Use this for v15.0-15.14
                /* List<Suggestion> suggestions = new PadreConnector(searchHome,indexStem)
                  .suggest(termclean)
                  .suggestionCount(partial_query_expansion_index)
                  .fetch(); */

                // Use this instead for v14.2 and earlier
                /* List<Suggestion> suggestions = new PadreConnector(indexStem)
                  .suggest(termclean)
                  .suggestionCount(partial_query_expansion_index)
                  .fetch(); */

                // build the expanded query from the list of suggestions
                def expanded_query = ''
                suggestions.each {
                    expanded_query += '"'+it.key+'" '
                }
                // set the query to the expanded set of query terms ORed together
                if (expanded_query != "") {
                    if (q.rawInputParameters["s"] == null) {
                        q.rawInputParameters["s"] = ["["+expanded_query+"]"]
                    }
                    else {
                        q.rawInputParameters["s"][0] += " ["+expanded_query+"]"
                    }
                }
            }
        }
        // reconstruct query.
        q.query = terms.join(" ");
    }
}

Step 2. Enable wildcard support and configure the level of expansion

Add the following to the search page configuration:

partial_query_enabled=true
# optionally add the following line to indicate a maximum number of terms to expand a starred term to.  The default is 5
# e.g. expand each starred term with 3 expansions
partial_query_expansion_index=3
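The hook script falls back to the default of 5 when partial_query_expansion_index is absent or not an integer. A minimal sketch of that fallback is shown below, using a plain map in place of the search page configuration. The helper name `expansionLimit` is hypothetical and is not part of the hook script.

```groovy
// Hypothetical helper: returns the configured expansion limit, or the
// default when the setting is missing or is not a valid integer.
def expansionLimit = { Map<String, String> config, int defaultLimit = 5 ->
    def raw = config['partial_query_expansion_index']
    (raw != null && raw.isInteger()) ? raw.toInteger() : defaultLimit
}

println expansionLimit([partial_query_expansion_index: '3'])     // 3
println expansionLimit([partial_query_expansion_index: 'lots'])  // 5
println expansionLimit([:])                                      // 5
```

A value such as "lots" is ignored rather than rejected, so a typo in the setting silently restores the default of 5.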

Queries such as Dan* Smith should now be accepted. To see the expanded queries, view the JSON or XML output and check the query, queryAsProcessed, queryRaw, querySystemRaw and queryCleaned values in the response packet. Expansions are injected into the querySystemRaw element.
