Best practices - 2.3 filtering
Naming conventions for Groovy filters
Groovy filter file location
Groovy filters are created inside the collection’s @groovy folder, within a folder structure that matches the defined package name. E.g. for a filter with the package com.organisation.package:
$SEARCH_HOME/conf/<collection>/@groovy/com/organisation/package/MyFilter.groovy
The folder structure (and corresponding package name) should be client specific (if the filter is client specific) or use com.funnelback.production if it’s a reusable filter. E.g.
$SEARCH_HOME/conf/<collection>/@groovy/com/funnelback/production/DecodeHtml.groovy
$SEARCH_HOME/conf/<collection>/@groovy/au/gov/health/CleanTitles.groovy
$SEARCH_HOME/conf/<collection>/@groovy/edu/harvard/LookupCourseCode.groovy
File naming and directories
The name of the file should use CamelCase: it begins with an uppercase letter, and every new word thereafter also begins with an uppercase letter, e.g. ExtractCountryMetadata.groovy.
The name of the package can contain only lowercase letters and numbers (no dashes or other symbols).
The name of the package should match the script location, and the name of the class should match the name of the file. Failure to follow this rule might cause class loading issues.
e.g. $SEARCH_HOME/conf/<collection>/@groovy/edu/harvard/ExtractCourses.groovy
package edu.harvard;
class ExtractCourses {
    // ...
}
Parsing and transforming HTML in custom filters
Parsing and transformation of HTML should only be done as a last resort: any transformation is likely to break if the source HTML is updated, as even small changes to the markup structure can break the processing code. Whenever possible, make the required change in the source content instead.
If you do need to do it, it’s strongly recommended to use a proper HTML parser such as JSoup when manipulating HTML content. Using regular expressions to parse HTML is not recommended because:
- It is not reliable, as it strongly depends on the actual HTML syntax.
- For example, single quotes vs. double quotes around HTML attributes: <img src='…'> vs. <img src="…">. Writing a regex that accounts for both syntaxes is difficult.
- Similarly, accounting for attribute order is difficult, e.g. <img src=… alt=…> vs. <img alt=… src=…>.
- The resulting regular expressions end up very complex and hard to maintain.
- Complex regular expressions on large HTML pages might time out or take a very long time to process, slowing down (and sometimes blocking) the crawl.
JSoup is a better approach as it doesn’t rely on the specific HTML syntax. Instead it builds a tree representation of the HTML and nodes can be selected using selectors that apply to the structure of the HTML, not its syntax.
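As an illustration, the following sketch (not tied to any particular filter API; the filterHtml helper and the alt-text rule are purely hypothetical) shows how JSoup selects elements by their structure, so the same code works regardless of quote style or attribute order:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical helper, for illustration only: add alt text to images that lack it.
String filterHtml(String html) {
    Document doc = Jsoup.parse(html)
    // Matches every <img> with a src attribute, whether the source markup
    // used single quotes, double quotes, or a different attribute order.
    doc.select("img[src]").each { img ->
        if (!img.hasAttr("alt")) {
            img.attr("alt", "Image: " + img.attr("src"))
        }
    }
    // Serialise the (possibly modified) document back to HTML.
    return doc.outerHtml()
}

// Both quoting styles produce the same normalised output.
assert filterHtml("<img src='/logo.png'>") == filterHtml('<img src="/logo.png">')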
XML can be processed in a similar manner using the XSoup library. See the KGMetadata or SplitXml filters on GitHub for examples of how this library can be used: https://github.com/funnelback/groovy-filters/tree/master/crawl%20filters
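A minimal sketch of the same idea for XML, assuming the XSoup library (us.codecraft.xsoup) is available on the classpath; the sample XML and XPath expression are invented for illustration:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.parser.Parser
import us.codecraft.xsoup.Xsoup

String xml = '<courses><course code="CS50"/><course code="ES181"/></courses>'
// Use JSoup's XML parser so the markup is not normalised as HTML.
Document doc = Jsoup.parse(xml, "", Parser.xmlParser())
// Evaluate an XPath expression against the parsed tree.
List<String> codes = Xsoup.compile("//course/@code").evaluate(doc).list()
assert codes == ["CS50", "ES181"]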