Writing custom HTML document (jsoup) filters
This guide covers HTML document (jsoup) filters. See Writing generic document filters for an equivalent guide for generic documents, or Transform or analyze content before it is indexed if you are unsure about the different types of filters.
Jsoup filters are special document filters that can be used to transform and manipulate HTML documents based on their DOM structure. Jsoup filters can be chained together to perform a series of modifications to the HTML document.
Before you start
Familiarize yourself with the plugin filter framework.
In particular:
-
Understand how HTML document (jsoup) filters function.
-
Familiarize yourself with Jsoup for extracting and modifying data.
-
Ensure you are familiar with the Java programming language.
A series of filter examples is presented below to illustrate the implementation of various filter types.
Plugin scopes
The plugin scope for a plugin that implements jsoup filtering must include the runs on datasource scope.
Maven archetype template options
An HTML document (jsoup) filter plugin template can be generated using the Maven archetype template:
-
When using the interactive mode of the Maven archetype template:
-
type true for Define value for property 'jsoup-filtering'
-
type true for Define value for property 'runs-on-datasource'
-
When using the non-interactive mode of the Maven archetype template:
-
set the flag -Djsoup-filtering=true
-
set the flag -Druns-on-datasource=true
If you create a filter that extracts and adds metadata, it is likely you will want to enable indexing templates to store metadata mappings.
Interface methods
Each jsoup filter implemented by a plugin must implement the following interface:
IJSoupFilter
-
Processes an HTML document as a JSoup object. This provides a way of manipulating the HTML document using operations on the DOM.
Detailed filter framework documentation is included in the Javadoc documentation.
Filter logic
The filter logic for a jsoup filter is implemented within the processDocument(FilterContext c) method.
The FilterContext provides a number of methods for accessing Funnelback configuration and document metadata, and also provides access to the document object.
The setup method is called when the filter is initialized. It should be used for setup-related tasks such as loading configuration, so that they are not repeated for every document that is filtered.
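The split between one-time setup and per-document processing can be sketched with plain Java stand-ins. This is an illustration only: `LifecycleSketch`, `processTitle`, the `strip` key and the use of a `Map` in place of the framework's context objects are all hypothetical, chosen so the example runs without the plugin framework.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for a filter: not the plugin framework's interface.
public class LifecycleSketch {
    private Pattern stripPattern; // built once in setup, reused for every document
    private int setupCalls = 0;

    // Stand-in for the setup method: read configuration and do the costly work once.
    public void setup(Map<String, String> config) {
        setupCalls++;
        // Assumed key "strip"; the default removes a trailing " | Site name" suffix.
        stripPattern = Pattern.compile(config.getOrDefault("strip", "\\s*\\|.*$"));
    }

    // Stand-in for per-document processing: only reuses what setup prepared.
    public String processTitle(String title) {
        return stripPattern.matcher(title).replaceFirst("");
    }

    public int getSetupCalls() {
        return setupCalls;
    }
}
```

The regular expression is compiled a single time, however many titles are processed afterwards, which is the same reason configuration loading belongs in the setup method.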
Accessing the document object
The document content is contained within a jsoup document object, which can be accessed from the FilterContext that is passed into the processDocument method.
Jsoup provides numerous methods for interacting with and manipulating the DOM. For example you can use CSS-style selectors to target specific elements within the HTML DOM, and extract text or HTML from the element. Jsoup also provides many methods for manipulating the DOM, allowing operations such as the insertion and removal of elements.
import org.jsoup.nodes.Document;
@Override
public void processDocument(FilterContext context) {
// get the document as a Jsoup object
Document doc = context.getDocument();
// Retrieve the text of the document's <title> element
String title = doc.select("title").text();
}
Reading data source configuration keys
Data source configuration keys (including plugin configuration) can be read from the jsoup filter using methods available in the SetupContext. Configuration should be read in the filter’s setup() method.
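import java.util.Optional;
private String color;
public void setup(SetupContext setup) {
// Read the color setting once at initialization, falling back to "blue" if the key is not set
this.color = Optional.ofNullable(setup.getConfigSetting(PluginUtils.KEY_PREFIX + ".color")).orElse("blue");
}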
For jsoup filters the relevant methods are:
getConfigSetting
-
Returns the value of the specified configuration key.
getConfigKeysWithPrefix
-
Returns a list of matching configuration keys that begin with a specified prefix.
getConfigKeysMatchingPattern
-
Returns a list of matching configuration keys that match a specified regex pattern.
See the example "Jsoup filter example - read data source configuration options" for more information on reading configuration.
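The three kinds of lookup described above can be pictured with a plain `Map` standing in for the data source configuration. This is a sketch under stated assumptions: `ConfigLookupSketch`, its method names and the `plugin.example.*` keys are hypothetical and are not the framework's API; in a real filter the equivalent calls are made on the SetupContext.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical stand-in for configuration access: not the SetupContext API.
public class ConfigLookupSketch {
    private final Map<String, String> config;

    public ConfigLookupSketch(Map<String, String> config) {
        this.config = new TreeMap<>(config); // sorted copy, so results have a stable order
    }

    // Single key lookup with a fallback for keys that are not set.
    public String settingOrDefault(String key, String fallback) {
        return Optional.ofNullable(config.get(key)).orElse(fallback);
    }

    // All keys that begin with the given prefix.
    public List<String> keysWithPrefix(String prefix) {
        return config.keySet().stream()
                .filter(k -> k.startsWith(prefix))
                .collect(Collectors.toList());
    }

    // All keys that match the given regular expression in full.
    public List<String> keysMatching(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return config.keySet().stream()
                .filter(k -> pattern.matcher(k).matches())
                .collect(Collectors.toList());
    }
}
```

Prefix and pattern lookups are useful when a plugin accepts a family of numbered or named keys, since the filter can discover the keys first and then read each value individually.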
Reading configuration from a file
The SetupContext provides methods for reading in configuration in a similar manner to a generic document filter. See: Filter example - read a configuration file.
For jsoup filters the relevant methods are:
pluginConfigurationFile
-
Reads a configuration file for the currently running plugin as a UTF-8 String.
pluginConfigurationFileAsBytes
-
Reads a configuration file for the currently running plugin as a byte array.
See the Javadoc documentation for further information: SetupContext
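Once the file content has been read, the filter still has to interpret it. The following is a minimal sketch, assuming a hypothetical `key=value` file format with `#` comments; `ConfigFileSketch` and both of its methods are illustrative helpers rather than part of the plugin framework, and the input would come from the SetupContext methods listed above.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helpers for interpreting a plugin configuration file.
public class ConfigFileSketch {

    // Decode raw file bytes as UTF-8 text.
    public static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Parse "key=value" lines, skipping blank lines, comments and malformed lines.
    public static Map<String, String> parseLines(String content) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : content.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // no key before '=' or no '=' at all
            }
            result.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
        }
        return result;
    }
}
```

Doing this parsing in the setup method and keeping the resulting map in a field means the file is interpreted once rather than for every document.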
Determining the filter class to add to the jsoup filter chain
The filter class name to add to your filter chain is determined by concatenating the filter’s package and class names.
e.g.
JsoupExampleFilter.java
package com.example.pluginexamples; (1)
public class JsoupExampleFilter implements IJSoupFilter { (2)
@Override
public void processDocument(FilterContext filterContext) {
//Jsoup filter logic
}
}
(1) The package name is com.example.pluginexamples.
(2) The public class name is JsoupExampleFilter. Note that the name of the Java file must match the public class name: JsoupExampleFilter.java.
This is added to the data source’s filter chain by adding com.example.pluginexamples.JsoupExampleFilter to the filter.jsoup.classes configuration option for the data source.
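The concatenation rule can be sketched in a few lines of Java. `FilterChainSketch` and its methods are hypothetical helpers, `AnotherFilter` in the usage note is an invented class name, and joining several class names with commas is an assumption about the format of filter.jsoup.classes that should be checked against the documentation for that configuration option.

```java
import java.util.List;

// Hypothetical helpers illustrating how the filter class name is formed.
public class FilterChainSketch {

    // Package name + "." + public class name gives the name used in the filter chain.
    public static String qualifiedName(String packageName, String className) {
        return packageName + "." + className;
    }

    // Assumption: multiple filters are listed as a comma-separated value.
    public static String chainValue(List<String> qualifiedNames) {
        return String.join(",", qualifiedNames);
    }
}
```

For example, `qualifiedName("com.example.pluginexamples", "JsoupExampleFilter")` yields `com.example.pluginexamples.JsoupExampleFilter`, and passing that together with a second name such as `com.example.pluginexamples.AnotherFilter` to `chainValue` produces the two names separated by a comma. For a top-level class, the same fully qualified string is what `Class.getName()` returns.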
HTML document (jsoup) filter examples
-
Extract and add metadata: A simple example showing how to extract some information from the jsoup document object and add it as an additional metadata field on the document.
-
Reading configuration options: Shows how to define, read and use configuration keys, specified in the data source configuration.
See also
-
filter.jsoup.classes
configuration option