Scripted Workflow

Introduction

Scripted workflow allows you to define conditions and actions which can be executed during content filtering (both inline filtering and post-gather filtering). This document describes how to use this system.

Setting Up the System

The first step is to edit the filter.classes parameter in your collection.cfg file and append the following class name to the end of the filter chain, separated from the preceding entry by a colon:

com.funnelback.common.filter.WorkflowFilter

So you might have something like:

filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:DocumentFixerFilterProvider:com.funnelback.common.filter.WorkflowFilter

The second step is to go to the file manager and create a "workflow.cfg" file. This file will contain the conditions and actions you wish to define.

Workflow Script Examples

This section gives some examples of the script language that might be put in the workflow.cfg file.

if ((contentContains("(?i)ovum")).or(contentContains("Gartner"))) {
    if (urlContains("analyst-reviews")) {
        insertMetaTag("robots", "noindex");
    }
}

In the example above the content must contain either "Ovum" or "Gartner", and the URL must contain "analyst-reviews". The (?i) prefix makes the "ovum" pattern case-insensitive; the "Gartner" pattern has no such prefix, so it is matched case-sensitively. If these conditions are met, a robots "noindex" meta tag is inserted into the content, which means the document will not be indexed.
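The nested conditions above can also be expressed as a single chained condition, using the same .or() and .and() style that appears in the later examples. The following is a minimal sketch of an equivalent rule, not a required form; because the calls are chained left to right, it reads as (ovum OR Gartner) AND analyst-reviews.

```groovy
// Sketch: same rule as the nested example, written as one chained condition.
// Evaluated left to right: (ovum OR Gartner) AND the URL contains analyst-reviews.
if ((contentContains("(?i)ovum")).or(contentContains("Gartner")).and(urlContains("analyst-reviews"))) {
    insertMetaTag("robots", "noindex");
}
```

The nested form may be easier to read when the inner condition guards only some of the actions; the chained form keeps a simple rule to a single test.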

// Example of extraction of content for re-insertion
if ((urlContains("funnelback")).and(urlDoesNotStartWith("test")).and(contentContains("\\w+")).and(urlEndsWith("\\.pdf"))) {
    def matched = getMatchingContent("original(.*?)text");
    replaceContent "original(.*?)text", "replaced text: middle was [" + matched + "]"
}

In this second example the text matched by the pattern is stored in a variable and then re-inserted as part of the replacement text. The "def" keyword defines a variable in Groovy, the scripting language used for workflow scripts.

// Example of title replacement
if ((urlContains("amazon")).or(urlDoesNotStartWith("test"))) {
    replaceContent "<title>(.*?)</title>", "<title>New Title</title>"
}

Here we replace the document title using the "replaceContent" action, which takes a regular expression to match and the text that replaces each match. The rule applies to any URL that contains "amazon" or that does not start with "test".
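Real pages sometimes write the title tag in upper case or spread the title over several lines, in which case the pattern above would not match. The sketch below assumes the filter uses Java regular-expression syntax (as the (?i) prefix in the first example suggests), where (?s) lets "." match line breaks; this is an assumption and is worth verifying against your own content before relying on it.

```groovy
// Sketch, assuming Java regex flags are honoured by the filter:
// (?i) ignores case, (?s) lets "." match across line breaks.
if (urlContains("amazon")) {
    replaceContent("(?is)<title>(.*?)</title>", "<title>New Title</title>");
}
```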

// Example of extracting content and inserting into metadata
if (urlEndsWith("\\.pdf")) {
    def matched = getMatchingContent("middle(.*?)content");

    // Groovy treats both null and the empty string as false, so this skips documents with no match
    if (matched) {
        insertMetaTag("my_meta_data", matched);
    }
}

In this last example we extract some matching content and insert it as metadata. The meta tag is inserted into the "<head>...</head>" section of the document if it has one, or after the opening <html> tag otherwise.

Workflow Script Definition

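This section describes the workflow script language, that is, the functions available for defining conditions and actions. Every regex argument is a regular expression, so characters such as "." must be escaped to match literally.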

Function                            Description
urlContains(regex)                  Returns true if the URL contains a match for the given regular expression, false otherwise.
urlDoesNotContain(regex)            Returns true if the URL does not contain a match for the given regular expression, false otherwise.
urlStartsWith(regex)                Returns true if the URL starts with a match for the given regular expression, false otherwise.
urlDoesNotStartWith(regex)          Returns true if the URL does not start with a match for the given regular expression, false otherwise.
urlEndsWith(regex)                  Returns true if the URL ends with a match for the given regular expression, false otherwise.
urlDoesNotEndWith(regex)            Returns true if the URL does not end with a match for the given regular expression, false otherwise.
contentContains(regex)              Returns true if the content contains a match for the given regular expression, false otherwise.
contentDoesNotContain(regex)        Returns true if the content does not contain a match for the given regular expression, false otherwise.
contentStartsWith(regex)            Returns true if the content starts with a match for the given regular expression, false otherwise.
contentDoesNotStartWith(regex)      Returns true if the content does not start with a match for the given regular expression, false otherwise.
contentEndsWith(regex)              Returns true if the content ends with a match for the given regular expression, false otherwise.
contentDoesNotEndWith(regex)        Returns true if the content does not end with a match for the given regular expression, false otherwise.
replaceContent(regex, replacement)  Modifies the document content by replacing every match of the given regular expression with the given replacement text.
getMatchingContent(regex)           Returns the first section of the document content that matches the given regular expression.
insertMetaTag(name, content)        Inserts a meta tag with the given name and content values into the document.
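
As an illustration of combining several of the functions listed above, the following sketch tags non-PDF pages under a particular site prefix. The host name "intranet.example.com", the word "confidential" and the meta tag name "section" are hypothetical placeholders chosen for this example, not values defined by the system.

```groovy
// Sketch: tag pages under a hypothetical intranet host, skipping PDFs and
// any page whose content mentions "confidential" (case-insensitive).
// Dots are escaped because every argument is treated as a regular expression.
if ((urlStartsWith("https?://intranet\\.example\\.com/")).and(urlDoesNotEndWith("\\.pdf")).and(contentDoesNotContain("(?i)confidential"))) {
    insertMetaTag("section", "intranet");
}
```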

Debugging

Error messages are written to the filter.log file, or to the crawler/gather log files when inline filtering is used.