Writing custom HTML document (jsoup) filters

This guide covers HTML document (Jsoup) filters. See Writing generic document filters for the equivalent guide for generic documents, or Transform or analyze content before it is indexed if you are unsure which type of filter you need.

Jsoup filters are special document filters that can be used to transform and manipulate HTML documents based on their DOM structure. Jsoup filters can be chained together to perform a series of modifications to the HTML document.

Before you start

Familiarize yourself with the plugin filter framework.

In particular:

  • Understand how HTML document (jsoup) filters function.

  • Familiarize yourself with Jsoup for extracting and modifying data.

  • Ensure you are familiar with the Java programming language.

A series of filter examples is presented below to illustrate the implementation of various filter types.

Plugin scopes

The plugin scope for a plugin that implements jsoup filtering must include the runs on datasource scope.

Maven archetype template options

An HTML document (jsoup) filter plugin template can be generated using the Maven archetype template:

  • When using the interactive mode of Maven archetype template:

    • type true for Define value for property 'jsoup-filtering'

    • type true for Define value for property 'runs-on-datasource'

  • When using the non-interactive mode of Maven archetype template:

    • set the flag -Djsoup-filtering=true

    • set the flag -Druns-on-datasource=true

If you create a filter that extracts and adds metadata, you will probably also want to enable indexing templates to store the metadata mappings.

Interface methods

Each jsoup filter implemented by a plugin must implement the following interface:

IJSoupFilter

Processes an HTML document as a JSoup object. This provides a way of manipulating the HTML document using operations on the DOM.

Detailed filter framework documentation is included in the Javadoc documentation.

Filter logic

The filter logic for a Jsoup filter is implemented within the processDocument(FilterContext c) method.

The FilterContext provides a number of methods for accessing Funnelback configuration and document metadata, and also provides access to the document object.

The setup() method is called when the filter is initialized. Use it for setup tasks such as loading configuration, so that this work is not repeated for every document that is filtered.
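The split between one-off setup and per-document processing can be illustrated without the Funnelback classes. In the minimal sketch below, SketchFilter, CountingFilter, LifecycleSketch and the Map-based configuration are hypothetical stand-ins, not the plugin framework's types. It only demonstrates that work placed in setup runs a single time, however many documents pass through the filter.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a filter interface; NOT the Funnelback IJSoupFilter.
interface SketchFilter {
    void setup(Map<String, String> config);   // called when the filter is initialized
    String processDocument(String html);      // called for each document
}

class CountingFilter implements SketchFilter {
    int setupCalls = 0;
    int documentsProcessed = 0;
    private String marker;

    @Override
    public void setup(Map<String, String> config) {
        setupCalls++;
        // Load configuration here so it is not re-read for every document.
        marker = config.getOrDefault("marker", "<!-- filtered -->");
    }

    @Override
    public String processDocument(String html) {
        documentsProcessed++;
        // Append the marker that was loaded during setup.
        return html + marker;
    }
}

class LifecycleSketch {
    // Runs setup a single time, then processes every document with the same instance.
    static CountingFilter run(List<String> documents) {
        CountingFilter filter = new CountingFilter();
        filter.setup(Map.of("marker", "<!-- seen -->"));
        for (String html : documents) {
            filter.processDocument(html);
        }
        return filter;
    }
}
```

Calling LifecycleSketch.run with three documents leaves setupCalls at 1 and documentsProcessed at 3, which is the reason for keeping configuration loading out of processDocument.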

Accessing the document object

The document content is contained within a jsoup document object that can be accessed from the FilterContext passed into the processDocument method.

Jsoup provides numerous methods for interacting with and manipulating the DOM. For example you can use CSS-style selectors to target specific elements within the HTML DOM, and extract text or HTML from the element. Jsoup also provides many methods for manipulating the DOM, allowing operations such as the insertion and removal of elements.

import org.jsoup.nodes.Document;

@Override
public void processDocument(FilterContext context) {
    // get the document as a Jsoup object
    Document doc = context.getDocument();

    // Retrieve the text of the document's <title> element
    String title = doc.select("title").text();
}

Reading data source configuration keys

Data source configuration keys (including plugin configuration) can be read from the Jsoup filter using methods available in the SetupContext. Configuration should be read in the filter’s setup() method.

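    // Requires: import java.util.Optional;
    private String color;

    public void setup(SetupContext setup) {
        // Read the plugin's "color" key at setup time, falling back to "blue" when it is not set
        this.color = Optional.ofNullable(setup.getConfigSetting(PluginUtils.KEY_PREFIX + ".color"))
            .orElse("blue");
    }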

For jsoup filters the relevant methods are:

getConfigSetting

Returns the value of the specified configuration key.

getConfigKeysWithPrefix

Returns a list of matching configuration keys that begin with a specified prefix.

getConfigKeysMatchingPattern

Returns a list of matching configuration keys that match a specified regex pattern.

See the Jsoup filter example - read data source configuration options for more information on reading configuration.
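The default-value and prefix-lookup patterns can be tried outside a crawl with a plain Map standing in for the data source configuration. This is a sketch under assumptions: ConfigSketch, settingOrDefault, keysWithPrefix and the plugin.example.* key names are hypothetical, and in a real filter the lookups go through the SetupContext methods listed above.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical helper; a Map stands in for the data source configuration.
class ConfigSketch {
    // Returns the value stored for key, or fallback when the key is absent.
    static String settingOrDefault(Map<String, String> config, String key, String fallback) {
        return Optional.ofNullable(config.get(key)).orElse(fallback);
    }

    // Returns, in sorted order, every configuration key that begins with prefix.
    static List<String> keysWithPrefix(Map<String, String> config, String prefix) {
        return config.keySet().stream()
                .filter(k -> k.startsWith(prefix))
                .sorted()
                .collect(Collectors.toList());
    }
}
```

Resolving defaults in one place, at setup time, means processDocument can rely on every setting having a usable value.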

Reading configuration from a file

The SetupContext provides methods for reading in configuration in a similar manner to a generic document filter. See: Filter example - read a configuration file.

For jsoup filters the relevant methods are:

pluginConfigurationFile

Reads a configuration file for the currently running plugin as a UTF-8 String.

pluginConfigurationFileAsBytes

Reads a configuration file for the currently running plugin as raw bytes.

See the Javadoc documentation for further information: SetupContext
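Once the file content has been read, it still has to be parsed. The sketch below assumes a hypothetical key=value file format and uses only the Java standard library. ConfigFileSketch and its methods are illustrative names, and the sample content stands in for whatever the SetupContext methods return. It also shows that content held as bytes parses to the same result provided it is decoded as UTF-8.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper for parsing configuration file content that is already in memory.
class ConfigFileSketch {
    // Parses key=value lines (java.util.Properties syntax) from content held as a String.
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            // A StringReader performs no real I/O, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Decodes raw bytes as UTF-8, then parses them the same way.
    static Properties parse(byte[] content) {
        return parse(new String(content, StandardCharsets.UTF_8));
    }
}
```

Parsing the file in setup and keeping the resulting Properties in a field avoids re-reading the file for each document.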

Logging

Log messages for filtering will appear in the gatherer’s filter logs.

Determining the filter class to add to the jsoup filter chain

The filter class name to add to your filter chain is determined by concatenating the filter’s package and class names.

e.g.

JsoupExampleFilter.java
package com.example.pluginexamples; (1)

public class JsoupExampleFilter implements IJSoupFilter { (2)

    @Override
    public void processDocument(FilterContext filterContext) {
        //Jsoup filter logic
    }
}
1 The package name is com.example.pluginexamples
2 The public class name is JsoupExampleFilter. Note that the name of the Java file must match the public class name: JsoupExampleFilter.java.

The filter is added to the data source’s filter chain by adding com.example.pluginexamples.JsoupExampleFilter to the filter.jsoup.classes configuration key for the data source.

HTML document (jsoup) filter examples

  • Extract and add metadata: A simple example showing how to extract some information from the jsoup document object and add it as an additional metadata field on the document.

  • Reading configuration options: Shows how to define, read and use configuration keys, specified in the data source configuration.