Writing generic document filters
This guide covers generic document filters. See Writing custom HTML document (jsoup) filters for the equivalent guide for HTML documents, or Transform or analyze content before it is indexed if you are unsure which type of filter you need.
A document filter can be developed by writing a plugin that implements one of the plugin framework’s filtering interfaces.
A plugin can implement one or more document filters which enable documents to be manipulated before they are indexed by Funnelback.
For developers who are familiar with Funnelback’s deprecated custom Groovy filters, the process of implementing a plugin filter is almost identical. However, the filter must be written in pure Java rather than in Groovy.
Plugin scopes
The plugin scope for a plugin that implements filtering must include the runs on datasource scope.
Maven archetype template options
A document filter plugin template can be generated using the Maven archetype template.
- When using the interactive mode of the Maven archetype template:
  - type true for Define value for property 'filtering'
  - type true for Define value for property 'runs-on-datasource'
- When using the non-interactive mode of the Maven archetype template:
  - set the flag -Dfiltering=true
  - set the flag -Druns-on-datasource=true
If you create a filter that extracts and adds metadata, it is likely you will want to enable indexing templates to store metadata mappings.
Interface methods
Each filter implemented by a plugin must implement one of Funnelback’s three filter interfaces: StringDocumentFilter, BytesDocumentFilter or Filter. The interface to implement depends on whether the filter requires access to the document’s content, and whether that content should be interpreted as text or binary data.
- StringDocumentFilter: Processes a text document as a UTF-8 encoded text string. Use it for filtering string (text/non-binary) content, e.g. plain text, XML, HTML or JSON. See Manipulating string documents for an example.
- BytesDocumentFilter: Processes a binary document as a raw byte stream. Use it for filtering binary content. See Converting raw byte (binary) documents to strings for an example.
- Filter: Processes a document with no access to its content. Use it for filters that do not read or edit the document content but should run on all documents. It can read and write the document metadata object. See Removing a document for an example.
Providing configuration for a filter
Custom filters can be configured via configuration keys set in the data source configuration, or via custom configuration files which are saved with the data source configuration.
The plugin framework provides built-in support for configuration keys and configuration files. This information can be used to configure the filter so that it is tailored for the data source on which it is enabled.
Controlling pre-filter checks
For document filters that implement the StringDocumentFilter or BytesDocumentFilter interface, a pre-filter check is used as a pre-condition to determine if the filter should run. This is commonly used to ensure the filter is only run on documents of a specific type, though any custom logic can be implemented here.
Restricting a filter to a document type is commonly achieved by checking the document’s MIME type. Three built-in functions are available to assist with checking for HTML, XML or JSON documents.
- document.getDocumentType().isHTML() returns true if the document is an HTML document.
- document.getDocumentType().isJSON() returns true if the document is a JSON document.
- document.getDocumentType().isXML() returns true if the document is an XML document.
You are not limited to these functions and can implement whatever conditional logic is appropriate for the filter. However, if you wish to limit your filter to these types you should use these built-in conditions so that consistent behavior applies across filters.
Example: Filter all documents
Returning ATTEMPT_FILTER from the pre-filter check ensures the filter runs for all documents.
public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
    return PreFilterCheck.ATTEMPT_FILTER;
}
Example: Only run a generic document filter on XML documents
The following pre-filter check ensures that the filter applies only to XML documents.
public PreFilterCheck canFilter(NoContentDocument document, FilterContext context) {
    // Only run this filter on XML documents
    if (document.getDocumentType().isXML()) { // (1)
        return PreFilterCheck.ATTEMPT_FILTER; // (2)
    }
    return PreFilterCheck.SKIP_FILTER; // (3)
}
(1) If the document type is XML.
(2) Filter the document.
(3) Skip the filter: don’t run this filter on the document and pass it on to the next filter in the chain.
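Because the pre-filter check can hold any conditional logic, the built-in type checks can also be combined. The following is a minimal, self-contained sketch that attempts the filter for XML or JSON documents and skips everything else. It does not use the plugin framework: PreFilterCheckStandIn and DocumentTypeStandIn are hypothetical stand-ins, and the MIME type strings they match on are assumptions made for illustration. In a real filter the equivalent checks would be made on document.getDocumentType() inside canFilter().

```java
import java.util.Locale;

// Hypothetical stand-in for the framework's pre-filter check result.
// Only the two values named in this guide are modelled.
enum PreFilterCheckStandIn { ATTEMPT_FILTER, SKIP_FILTER }

// Hypothetical stand-in for a document's type information.
// The MIME type strings below are assumptions, not the framework's rules.
final class DocumentTypeStandIn {
    private final String mimeType;

    DocumentTypeStandIn(String mimeType) {
        this.mimeType = mimeType.toLowerCase(Locale.ROOT);
    }

    boolean isHTML() { return mimeType.equals("text/html"); }
    boolean isJSON() { return mimeType.equals("application/json"); }
    boolean isXML()  { return mimeType.equals("text/xml") || mimeType.equals("application/xml"); }
}

final class PreFilterCheckSketch {

    // Attempt the filter for XML or JSON documents, skip everything else.
    static PreFilterCheckStandIn canFilter(DocumentTypeStandIn type) {
        if (type.isXML() || type.isJSON()) {
            return PreFilterCheckStandIn.ATTEMPT_FILTER;
        }
        return PreFilterCheckStandIn.SKIP_FILTER;
    }

    public static void main(String[] args) {
        System.out.println(canFilter(new DocumentTypeStandIn("application/json"))); // prints ATTEMPT_FILTER
        System.out.println(canFilter(new DocumentTypeStandIn("text/html")));        // prints SKIP_FILTER
    }
}
```

Keeping the condition in one small method makes it easy to see which document types a filter will touch before any filtering work is done.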
Metadata handling within generic document filters
A very common use case of custom filters involves the extraction or modification of metadata associated with the document.
The main filter chain provides a metadata multimap that is passed between filters in the chain.
This metadata multimap holds metadata that is added via metadata.put() calls in any of the generic document filters and enables downstream filters to directly access metadata added by a previous filter.
The metadata multimap does not include any metadata that is embedded within the document, such as HTML <meta> tags, or metadata added via HTML document (jsoup) filters. At the conclusion of the filter chain, the additional metadata is written as WARC header information when the filtered document is stored on disk.
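The way metadata flows along the chain can be pictured with a small simulation. This is only a sketch using JDK collections, not the plugin framework: MetadataStepStandIn is a hypothetical stand-in for a filter, a Map of lists stands in for the multimap, and the has-author key is invented for illustration. Each step receives a copy of the metadata added so far, and the second step can see what the first one added.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for one filter in the chain: it receives a copy of
// the metadata added so far and returns the metadata to pass downstream.
interface MetadataStepStandIn extends UnaryOperator<Map<String, List<String>>> {}

final class MetadataChainSketch {

    // Deep copy, so a step never mutates the map held by the chain.
    static Map<String, List<String>> copyOf(Map<String, List<String>> metadata) {
        Map<String, List<String>> copy = new LinkedHashMap<>();
        metadata.forEach((key, values) -> copy.put(key, new ArrayList<>(values)));
        return copy;
    }

    // Runs each step in order, handing the previous step's result to the next.
    static Map<String, List<String>> runChain(List<MetadataStepStandIn> steps) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (MetadataStepStandIn step : steps) {
            metadata = step.apply(copyOf(metadata));
        }
        return metadata;
    }

    // Two example steps: the first adds an author, the second reads it.
    static List<MetadataStepStandIn> exampleSteps() {
        MetadataStepStandIn addAuthor = metadata -> {
            metadata.computeIfAbsent("author", k -> new ArrayList<>()).add("John Smith");
            return metadata;
        };
        MetadataStepStandIn flagAuthored = metadata -> {
            boolean hasAuthor = metadata.containsKey("author");
            metadata.computeIfAbsent("has-author", k -> new ArrayList<>())
                    .add(String.valueOf(hasAuthor));
            return metadata;
        };
        return List.of(addAuthor, flagAuthored);
    }

    public static void main(String[] args) {
        // prints {author=[John Smith], has-author=[true]}
        System.out.println(runChain(exampleSteps()));
    }
}
```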
Initialize the metadata object
In order to access metadata within a filter, the filter code needs to request a copy of the metadata object.
The getCopyOfMetadata() method can be used within a filter to populate a metadata object with any metadata that has been added to the document by previous filters.
ListMultimap<String, String> metadata = document.getCopyOfMetadata();
Adding metadata to the metadata object
Once you have a metadata object, you can add metadata to it using the put() method.
e.g. add an author metadata value of John Smith:
metadata.put("author", "John Smith");
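Because the metadata object is a list multimap, a key can hold more than one value. Assuming the ListMultimap used here behaves like Guava's, each put() call appends a value rather than replacing the existing one. The sketch below illustrates that behaviour with JDK collections only, so it runs without the plugin framework or Guava on the classpath. MultiValueMetadataSketch and the second author value are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MultiValueMetadataSketch {

    // JDK stand-in for a list multimap: each key maps to a list of values.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Like a list multimap's put(): appends to the key's list, never overwrites.
    void put(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Returns the values for a key, or an empty list if the key is absent.
    List<String> get(String key) {
        return values.getOrDefault(key, List.of());
    }

    public static void main(String[] args) {
        MultiValueMetadataSketch metadata = new MultiValueMetadataSketch();
        metadata.put("author", "John Smith");
        metadata.put("author", "Jane Doe"); // hypothetical second author
        System.out.println(metadata.get("author")); // prints [John Smith, Jane Doe]
    }
}
```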
Automatically setting up metadata mappings
When developing a filter plugin you can also implement the plugin indexing interface which provides methods for automatically configuring metadata mappings.
Use this if your filter is adding metadata that you need to map as part of the setup of your data source.
Determining the filter class to add to the filter chain
The filter class name to add to your filter chain is determined by concatenating the filter’s package and class names.
e.g.
package com.example.pluginexamples; // (1)
public class ExampleFilter implements StringDocumentFilter { // (2)
    @Override
    public FilterResult filterAsStringDocument(StringDocument document, FilterContext filterContext) {
        // Filter logic
    }
}
(1) The package name is com.example.pluginexamples.
(2) The public class name is ExampleFilter.
This filter is added to the data source’s filter chain by adding com.example.pluginexamples.ExampleFilter to the filter.classes configuration option for the data source.
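The concatenation rule is the same one Java uses for fully qualified class names. As a small illustration, the sketch below builds the name from its two parts and compares the form with what Class.getName() reports for a JDK class. FilterClassNameSketch and its qualifiedName helper are hypothetical and are not part of the plugin framework.

```java
final class FilterClassNameSketch {

    // Joins a package name and a class name with a dot.
    // A class in the default (empty) package has no prefix.
    static String qualifiedName(String packageName, String className) {
        return packageName.isEmpty() ? className : packageName + "." + className;
    }

    public static void main(String[] args) {
        // prints com.example.pluginexamples.ExampleFilter
        System.out.println(qualifiedName("com.example.pluginexamples", "ExampleFilter"));

        // The JDK reports the same package-dot-class form for its own classes.
        // prints java.util.ArrayList
        System.out.println(java.util.ArrayList.class.getName());
    }
}
```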
Generic document filter examples
- Manipulating string (non-binary) documents: A simple way of filtering a document as a string.
- Filters which read collection configuration options: Demonstrates a filter which reads options from the data source configuration.
- Filters which read a custom configuration file: Shows how to read from a custom configuration file for jsoup and general document filters.
- Adding metadata based on document content: Demonstrates adding metadata values to the document based on the document content.
- Adding metadata to all documents: Demonstrates adding metadata to a document regardless of the document content.
- Accessing the filtered metadata: Shows how to iterate over the multimap containing metadata added via filters.
- Modifying document URLs: Demonstrates modifying the document URL for any document.
- Filtering a document into multiple documents: Demonstrates splitting a single input document into multiple documents.
- Filtering a HTML document into multiple documents: Demonstrates splitting a single HTML document into multiple HTML documents.
- Removing a document: Demonstrates removing a document using the filters, typically resulting in that document not being available in the search index.
- Altering the document type: Demonstrates fixing the document type based on the content of the document.
- Converting raw byte (binary) documents to Strings: This might be done when converting from a binary format such as pdf to a text format such as HTML or XML.
- Manipulating raw byte (binary) documents: Demonstrates how to base64 encode a binary document.
See also
- filter.classes configuration option