Remove stop words from a string in Groovy
Background
This article shows how stop words can be removed from a string in Groovy. This can be implemented as part of a Groovy filter or from a hook script that runs at query time.
Removing stop words can be useful in a number of situations. For example, you may wish to derive a set of keywords from a title to generate additional keyword metadata or auto-completion triggers.
Details
The following Groovy code loads the stop words list and replaces each stop word in the source text with a comma. The result can then be split on commas and used as a series of keywords. If you just want to remove the stop words from the text, replace them with a space instead of a comma.
This code can be used as a filter or within hook scripts.
// Import required for Java regex patterns
import java.util.regex.Pattern
// Load the stop words list
def stop_file = "/opt/funnelback/share/lang/en_stopwords";
def stopwords = new File(stop_file).readLines();
// Text where stop words will be removed
def exampleText = "An example for stop word removal"
// Cycle through the list of stop words and remove these from the exampleText, replacing each occurrence with a comma
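stopwords.each { stop ->
    // Skip blank lines; an empty pattern would match at every word boundary
    if (!stop.trim()) return
    // Pattern.quote makes the stop word match literally rather than as a regex
    Pattern p = Pattern.compile("(?i)\\b" + Pattern.quote(stop.trim()) + "\\b")
    exampleText = exampleText.replaceAll(p, ",")
}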
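// Collapse runs of commas and the whitespace around them into a single comma
exampleText = exampleText.replaceAll(/\s*(,\s*)+/, ",")
// Strip any leading or trailing comma left by a stop word at the start or end
exampleText = exampleText.replaceAll('^,|,$', '')
// If "an" and "for" are the only stop words matched, exampleText is now "example,stop word removal"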
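The split into keywords described above could be wrapped in a small helper, sketched below. The function name extractKeywords and the in-memory sampleStopwords list are illustrative assumptions rather than part of any Funnelback API; in a real filter or hook script the list would be read from the stop words file as shown earlier. This sketch combines all stop words into a single pattern instead of looping over them one at a time.

```groovy
import java.util.regex.Pattern

// Hypothetical helper: splits text into keyword phrases at each stop word
List<String> extractKeywords(String text, Collection<String> stopwords) {
    // Drop blank entries and quote each stop word so it is matched literally
    def quoted = stopwords.collect { it.trim() }.findAll { it }.collect { Pattern.quote(it) }
    if (!quoted) {
        return [text.trim()]
    }
    // One case-insensitive pattern matching any stop word as a whole word
    def pattern = Pattern.compile("(?i)\\b(?:" + quoted.join("|") + ")\\b")
    // Replace stop words with commas, then split and discard empty fragments
    def replaced = pattern.matcher(text).replaceAll(",")
    return replaced.split(",").collect { it.trim() }.findAll { it }
}

// Illustrative stop list (an assumption, not the contents of en_stopwords)
def sampleStopwords = ["an", "for", "the", "of"]

def keywords = extractKeywords("An example for stop word removal", sampleStopwords)
// keywords is ["example", "stop word removal"]

// Joining with a space gives the text with the stop words simply removed
def cleaned = keywords.join(" ")
// cleaned is "example stop word removal"
```

Compiling a single pattern avoids rescanning the text once per stop word, which may matter for long stop lists at query time.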