Best practices - 2.3 filtering

Background

This article outlines best practices to follow when implementing filters.

Naming conventions for Groovy filters

Groovy filter files location

Groovy filters are created inside the collection’s @groovy folder within a folder structure that matches the defined package name: e.g. for a filter with package com.organisation.package:

$SEARCH_HOME/conf/<collection>/@groovy/com/organisation/package/MyFilter.groovy

The folder structure (and corresponding package name) should be client specific if the filter is client specific, or use com.funnelback.production if the filter is reusable. E.g.

$SEARCH_HOME/conf/<collection>/@groovy/com/funnelback/production/DecodeHtml.groovy
$SEARCH_HOME/conf/<collection>/@groovy/au/gov/health/CleanTitles.groovy
$SEARCH_HOME/conf/<collection>/@groovy/edu/harvard/LookupCourseCode.groovy

File naming and directories

The name of the file should use CamelCase: it begins with an uppercase letter, and every subsequent word also begins with an uppercase letter, e.g. ExtractCountryMetadata.groovy.

The name of the package can contain only lowercase letters and numbers (no dashes or other symbols).

The name of the package should match the script location, and the name of the class should match the name of the file. Failure to follow this rule might cause class loading issues.

e.g. $SEARCH_HOME/conf/<collection>/@groovy/edu/harvard/ExtractCourses.groovy

package edu.harvard;

class ExtractCourses {
  // ...
}

Groovy filter dependencies

Dependencies should be fetched from within the Groovy script itself using Grape @Grab / @Grapes annotations.
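
For example, a filter that uses JSoup can grab the dependency at the top of the script. This is a minimal sketch; the Maven coordinates are real, but the version shown is illustrative and should be pinned to whichever release you need:

// Grab the JSoup dependency when the script is compiled.
// The version is illustrative; pin the release you actually need.
@Grab(group='org.jsoup', module='jsoup', version='1.11.3')
import org.jsoup.Jsoup

// The grabbed library is then available to the rest of the script.
def title = Jsoup.parse('<html><head><title>Hello</title></head></html>').title()
assert title == 'Hello'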

Parsing and transforming HTML in custom filters

Parsing and transformation of HTML should only be done as a last resort, as small changes to the markup structure can break the processing code. Always try to make the required change in the source content whenever possible, because any transformation is liable to break if the source HTML is updated.

If you do need to parse HTML, it is strongly recommended to use a proper HTML parser such as JSoup. Using regular expressions to parse HTML is not recommended because:

  • It is not reliable, as it depends strongly on the exact HTML syntax.

  • For example, single quotes vs. double quotes around HTML attributes: <img src='…'> vs. <img src="…">. Writing a regex that accounts for both syntaxes is difficult.

  • Similarly, accounting for attribute order is difficult, e.g. <img src=… alt=…> vs. <img alt=… src=…>.

  • The resulting regular expressions end up very complex and hard to maintain.

  • Complex regular expressions on large HTML pages might time out or take a very long time to process, slowing down (and sometimes blocking) the crawl.

JSoup is a better approach as it doesn't rely on the specific HTML syntax. Instead, it builds a tree representation of the HTML, and nodes can be selected using selectors that apply to the structure of the document rather than its syntax.
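
To illustrate the difference, the sketch below (hypothetical markup; JSoup grabbed as shown earlier) selects images by structure, so the quote styles and attribute orderings that trip up regexes are irrelevant:

@Grab(group='org.jsoup', module='jsoup', version='1.11.3')
import org.jsoup.Jsoup

// Hypothetical markup mixing the syntaxes that break regex approaches:
// single vs. double quotes, and differing attribute order.
def html = '''<html><body>
  <img src='a.png' alt='first'>
  <img alt="second" src="b.png">
</body></html>'''

def doc = Jsoup.parse(html)

// Select by structure: every <img> that has a src attribute.
doc.select('img[src]').each { img ->
  println "${img.attr('src')} -> ${img.attr('alt')}"
}

// Transformations operate on the tree, then serialise back to HTML.
doc.select('img[alt=second]').each { it.attr('alt', 'Second image') }
def transformed = doc.outerHtml()

Because the selection works on the parsed tree, the same code keeps working if the source markup switches quote styles or reorders attributes.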

XML can be processed in a similar manner using the XSoup library, which evaluates XPath expressions against a JSoup document tree. See the KGMetadata or SplitXml filters on GitHub for examples of how this library can be used: https://github.com/funnelback/groovy-filters/tree/master/crawl%20filters
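
A minimal sketch of the XSoup approach, assuming illustrative XML content and version numbers:

@Grapes([
    @Grab(group='org.jsoup', module='jsoup', version='1.11.3'),
    @Grab(group='us.codecraft', module='xsoup', version='0.3.1')
])
import org.jsoup.Jsoup
import org.jsoup.parser.Parser
import us.codecraft.xsoup.Xsoup

// Illustrative XML record.
def xml = '''<courses>
  <course code="CS50"><title>Intro</title></course>
  <course code="CS61"><title>Systems</title></course>
</courses>'''

// Parse with the XML parser to avoid HTML-specific normalisation.
def doc = Jsoup.parse(xml, '', Parser.xmlParser())

// XSoup evaluates XPath expressions against the JSoup tree.
def codes = Xsoup.compile('//course/@code').evaluate(doc).list()
assert codes == ['CS50', 'CS61']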