Remove stop words from a string in Groovy

Background

This article shows how stop words can be removed from a string in Groovy. This can be implemented as part of a Groovy filter or from a hook script that runs at query time.

Removing stop words is useful in several situations, for example when you want to derive a set of keywords from a title in order to generate additional keyword metadata or auto-completion triggers.

Details

The following Groovy code loads the stop words list and replaces each stop word in the source text with a comma. The result can then be split into a series of keywords. If you only want to remove the stop words from the text, replace them with a space instead of a comma.

This code can be used as a filter or within hook scripts.

// Import required for Java regex patterns
import java.util.regex.Pattern

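// Load the stop words list (one word per line), skipping blank lines:
// an empty entry would otherwise match at every word boundary
def stop_file = "/opt/funnelback/share/lang/en_stopwords"
def stopwords = new File(stop_file).readLines().collect { it.trim() }.findAll { it }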

// Text where stop words will be removed
def exampleText = "An example for stop word removal"

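// Cycle through the list of stop words and replace each whole-word, case-insensitive occurrence in exampleText with a comma
stopwords.each { stop ->
  // Pattern.quote ensures any regex metacharacters in a stop word are matched literally
  Pattern p = Pattern.compile("(?i)\\b" + Pattern.quote(stop) + "\\b")
  exampleText = exampleText.replaceAll(p, ",")
}

// Collapse runs of commas, and the whitespace around them, into a single comma
exampleText = exampleText.replaceAll(/\s*(?:,\s*)+/, ",")
// Strip any leading or trailing comma
exampleText = exampleText.replaceAll('^,|,$', "")
// Assuming "an" and "for" are the only words in exampleText that appear in the stop words list,
// exampleText is now "example,stop word removal"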
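
The two uses described above (splitting the result into keywords, or simply stripping the stop words) can be sketched as follows. This is a minimal illustration only: the replaceStopWords helper and the short in-line stop word list are hypothetical and stand in for the stop words file, so the snippet can be tried without a Funnelback installation.

```groovy
import java.util.regex.Pattern

// Hypothetical helper (not part of Funnelback): replaces every whole-word,
// case-insensitive occurrence of a stop word in text with the given separator.
def replaceStopWords(String text, List<String> stopwords, String separator) {
  def result = text
  stopwords.findAll { it?.trim() }.each { stop ->
    // Pattern.quote guards against regex metacharacters in a stop word
    def p = Pattern.compile("(?i)\\b" + Pattern.quote(stop.trim()) + "\\b")
    result = result.replaceAll(p, separator)
  }
  return result
}

// Illustrative stop words only; a real list would be read from the stop words file
def sampleStopwords = ["an", "for", "the"]
def title = "An example for stop word removal"

// Keyword variant: replace with commas, split, trim each piece, drop empty pieces
def commaSeparated = replaceStopWords(title, sampleStopwords, ",")
def keywords = commaSeparated.split(",").collect { it.trim() }.findAll { it }

// Plain-text variant: replace with spaces, then normalise the whitespace
def spaced = replaceStopWords(title, sampleStopwords, " ")
def cleaned = spaced.replaceAll(/\s+/, " ").trim()

println keywords   // [example, stop word removal]
println cleaned    // example stop word removal
```

Splitting and then discarding empty pieces avoids having to collapse or strip commas first, which may be simpler when the keywords are the only output needed.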