Built-in filters - Run workflow filter rules (WorkflowFilter)

This feature is not available in the Squiz DXP.
This feature is deprecated and will be removed in a future version. Please update any existing implementations to use supported features.

The scripted workflow filter lets you define conditions and actions that are executed during content filtering.

Configuring the scripted workflow filter

Enabling

  1. Edit the filter.classes parameter in your collection configuration and append the following string to the end of its value: com.funnelback.common.filter.WorkflowFilter

    Example

     filter.classes=TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:com.funnelback.common.filter.WorkflowFilter
  2. Create a workflow.cfg file using the Configuration file manager. This file will contain the conditions and actions you wish to define.

Configuring scripted workflow rules

The workflow.cfg file contains Groovy code made up of a series of if statements, each of which performs a specified action when its condition is met.

Syntax

The [CONDITION] and [ACTION] values in the syntax examples below should be replaced with valid conditions and actions (listed in sections below).

The syntax for each workflow command is as follows:

if ([CONDITION]) {
    [ACTION]
}

Statements can be nested:

if ([CONDITION1]) {
    if ([CONDITION2]) {
        [ACTION]
    }
}

Conditions can be combined using the .and() and .or() methods:

if (([CONDITION1]).and([CONDITION2])) {
    [ACTION1]
}
if (([CONDITION3]).or([CONDITION4])) {
    [ACTION2]
}
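As a concrete illustration of the combined form, the fragment below fills in the placeholders with functions from the Conditions and Actions sections. It is only a sketch: the URL patterns, the content pattern, and the doc.type tag name are hypothetical and are not taken from any real collection.

```groovy
// workflow.cfg fragment (illustrative only; patterns and tag names are made up)

// Both conditions must hold: the URL contains /news/ AND the content
// mentions "press release" (case-insensitive because of the (?i) flag).
if ((urlContains("/news/")).and(contentContains("(?i)press release"))) {
    insertMetaTag("doc.type", "press-release");
}

// Either condition is enough: the URL ends with .tmp OR contains /drafts/.
if ((urlEndsWith("\\.tmp")).or(urlContains("/drafts/"))) {
    insertMetaTag("robots", "noindex");
}
```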

Variables can be defined using the def keyword:

def pubs = urlContains("publications");

if (pubs == true) {
    [ACTION]
}
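Storing condition results in variables can make a longer rule easier to read. The fragment below is a sketch that assumes condition functions return plain boolean values, as the comparison with true above suggests. The variable names, patterns, and tag values are hypothetical.

```groovy
// workflow.cfg fragment (illustrative only)

// Evaluate each condition once and give the result a descriptive name.
def isPdf = urlEndsWith("\\.pdf");
def isPublication = urlContains("publications");

// Reuse the stored results in a combined condition.
if ((isPdf).and(isPublication)) {
    insertMetaTag("doc.type", "publication-pdf");
}
```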

Conditions

Function Description

urlContains(regex)

Returns true if the URL contains the given regular expression, false otherwise.

urlDoesNotContain(regex)

Returns true if the URL does not contain the given regular expression, false otherwise.

urlStartsWith(regex)

Returns true if URL starts with the given regular expression, false otherwise.

urlDoesNotStartWith(regex)

Returns true if URL does not start with the given regular expression, false otherwise.

urlEndsWith(regex)

Returns true if URL ends with the given regular expression, false otherwise.

urlDoesNotEndWith(regex)

Returns true if URL does not end with the given regular expression, false otherwise.

contentContains(regex)

Returns true if content contains the given regular expression, false otherwise.

contentDoesNotContain(regex)

Returns true if content does not contain the given regular expression, false otherwise.

contentStartsWith(regex)

Returns true if content starts with the given regular expression, false otherwise.

contentDoesNotStartWith(regex)

Returns true if content does not start with the given regular expression, false otherwise.

contentEndsWith(regex)

Returns true if content ends with the given regular expression, false otherwise.

contentDoesNotEndWith(regex)

Returns true if content does not end with the given regular expression, false otherwise.
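Because these functions take regular expressions rather than literal strings, characters such as . need escaping when a literal match is intended. The standalone Groovy sketch below is not the WorkflowFilter implementation. It uses hypothetical closures (contains, startsWith, endsWith) built on java.util.regex to illustrate one plausible reading of the contains, starts with, and ends with semantics. The example URL is also made up.

```groovy
import java.util.regex.Pattern

// Hypothetical stand-ins for illustration; the real filter may behave differently.
def contains   = { String text, String regex -> Pattern.compile(regex).matcher(text).find() }
def startsWith = { String text, String regex -> Pattern.compile(regex).matcher(text).lookingAt() }
def endsWith   = { String text, String regex -> Pattern.compile("(?:" + regex + ")\\z").matcher(text).find() }

def url = "https://example.com/publications/report.pdf"

assert contains(url, "publications")       // match anywhere in the text
assert startsWith(url, "https?://example") // match anchored at the start
assert endsWith(url, "\\.pdf")             // match anchored at the end, literal dot
assert !endsWith(url, "\\.PDF")            // regular expressions are case-sensitive by default
assert endsWith(url, "(?i)\\.PDF")         // (?i) switches to case-insensitive matching
```

The "does not" variants can be read as the logical negation of the corresponding closure.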

Actions

Function Description

replaceContent(regex, replacement)

Modifies the document content by looking for all matches for the given regular expression and replacing them with the given replacement text.

getMatchingContent(regex)

Returns the first matching section of the document content that matches the given regular expression.

insertMetaTag(name, content)

Inserts a meta tag with the given name and content values into the document.
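To make the descriptions of getMatchingContent and replaceContent more tangible, the standalone Groovy sketch below reproduces their described behaviour with standard java.util.regex calls on a made-up content string. It is an approximation under stated assumptions, not the filter's code. In particular, it assumes that the "first matching section" is the whole matched text rather than a capture group.

```groovy
import java.util.regex.Matcher

// Made-up document content for illustration.
def content = "<p>original middle text</p>"
def regex = /original(.*?)text/

// Assumed equivalent of getMatchingContent: the first whole match, or "" if none.
def m = content =~ regex
def matched = m.find() ? m.group() : ""
assert matched == "original middle text"

// Assumed equivalent of replaceContent: replace every match with the given text.
// quoteReplacement keeps any $ or \ in the replacement from being treated specially.
def replacement = "replaced text: middle was [" + matched + "]"
def updated = content.replaceAll(regex, Matcher.quoteReplacement(replacement))
assert updated == "<p>replaced text: middle was [original middle text]</p>"
```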

Debugging

Error messages will be printed to the gatherer or filter logs for the collection.

Examples

This section gives some examples of the script language that might be put in the workflow.cfg file.

if ((contentContains("(?i)ovum")).or(contentContains("Gartner"))) {
    if (urlContains("analyst-reviews")) {
        insertMetaTag("robots", "noindex");
    }
}

In the example above the content must contain either Ovum or Gartner and the URL must contain analyst-reviews. The (?i) syntax means to use a case-insensitive match. If these conditions are met then a robots noindex meta tag will be inserted into the content, meaning that the document will not be indexed.

// Example of extraction of content for re-insertion
if ((urlContains("funnelback")).and(urlDoesNotStartWith("test")).and(contentContains("\\w+")).and(urlEndsWith(".pdf"))) {
    def matched = getMatchingContent("original(.*?)text");
    replaceContent "original(.*?)text", "replaced text: middle was [" + matched + "]"
}

In this second example we are extracting content for re-insertion. The def keyword is used to define a variable in the scripting language we use (Groovy).

// Example of title replacement
if ((urlContains("amazon")).or(urlDoesNotStartWith("test"))) {
    replaceContent "<title>(.*?)</title>", "<title>New Title</title>"
}

Here we are inserting a new title into the content using the replaceContent action, which takes a regular expression to match with and then some replacement text.

// Example of extracting content and inserting into metadata
if (urlEndsWith(".pdf")) {
    def matched = getMatchingContent("middle(.*?)content");

    if (matched != "") {
       insertMetaTag("my_meta_data", matched);
    }
}

In this last example we extract some matching content and insert it as meta data. It will be inserted into the "…​" section of the document if it has one, or after the opening tag otherwise.