Post-gather filtering
Background
This document outlines how to apply post-gather filtering to an existing warc file. This differs from normal inline filtering, which occurs as the documents are being gathered.
Usage
$ /opt/funnelback/linbin/java/bin/java -classpath "/opt/funnelback/lib/java/all/*" com.funnelback.common.filter.util.FilterWarcFile
usage: com.funnelback.common.filter.util.FilterWarcFile
-collection <arg> Collection to use to read filters specific config
-filter <arg> Filter classes to use (Same syntax as in
collection.cfg
-in <arg> Input WARC file stem
-out <arg> Output WARC File stem
-v Enable verbose output
-w If set, the output WARC file will be overwritten if
it exists
Caveats
- Post-gather filtering of binary documents is not supported prior to Funnelback 15.14.0.10.
- Post-gather filtering allows a warc file to be filtered multiple times. Many filters are designed to run only once, so any filters that run post-gather may need modification to account for this scenario, especially if the warc file may be re-filtered multiple times. For example, a filter that modifies the structure of a document may break the document if the same operation is run multiple times in succession, and a filter that injects metadata may inject duplicate metadata if run multiple times.
- Some filters have prerequisites that may not be met if some filtering happens at gather time and some filtering happens post-gather. For example:
  - If the Tika filters are run at gather time (to convert PDFs and other binary documents to text), it is not possible to check these documents for WCAG compliance using the accessibility auditor filters at a later date, because the accessibility auditor filters require the raw binary stream rather than the extracted text.
  - If a filter appends crawl-time information to the source document (such as HTTP response headers or calculated response times), that information may not be available when the filter runs post-gather.
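The duplicate-metadata caveat can be illustrated with a small guard. The following is a minimal sketch in plain shell, not a Funnelback filter: the function name inject_meta_once and the layout of the meta tag are assumptions made for illustration only. It adds a tag only when one with the same name is not already present, so running it a second time leaves the document unchanged.

```shell
#!/bin/sh
# Hypothetical helper, not part of Funnelback: reads an HTML document from
# stdin and adds a <meta> tag, but only if a tag with that name is absent.
inject_meta_once() {
    name=$1
    content=$2
    html=$(cat)
    if printf '%s\n' "$html" | grep -qF "<meta name=\"$name\""; then
        # Already injected by an earlier run: return the document unchanged.
        printf '%s\n' "$html"
    else
        # First run: insert the tag immediately after the opening <head> tag.
        # Sketch only: values containing "|" or "&" would need escaping for sed.
        printf '%s\n' "$html" | sed "s|<head>|<head><meta name=\"$name\" content=\"$content\">|"
    fi
}
```

A filter written without such a guard would add a second copy of the tag each time the warc file is re-filtered.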
Process
- Switch to the collection containing the warc file that you wish to filter. If you have a standalone warc file, create a new collection first and place the warc file into the collection’s live (or offline) data folder (e.g. $SEARCH_HOME/data/COLLECTION_NAME/live/data/). Although it’s possible to filter a warc file at any location, some filters rely on the collection name (and associated collection.cfg keys) to run correctly.
- Ensure the collection.cfg has appropriate configuration defined for any filters. The filter.classes value can be ignored, as it is set when the post-gather filter command is run. Other collection.cfg based filter options should be set, including filter.jsoup.classes and any options that are read by the filters that run (such as filter.jsoup.undesirable_text-source).
- Run the post-gather filter command:
$ ${SEARCH_HOME}/linbin/java/bin/java ${JAVA_OPTS} -classpath "${SEARCH_HOME}/lib/java/all/*:${SEARCH_HOME}/conf/${COLLECTION_NAME}/@groovy/*" com.funnelback.common.filter.util.FilterWarcFile -collection ${COLLECTION_NAME} -filter ${FILTER_CHAIN} -in ${SEARCH_HOME}/data/${COLLECTION_NAME}/${CURRENT_VIEW}/data/funnelback-web-crawl -out ${SEARCH_HOME}/data/${COLLECTION_NAME}/${CURRENT_VIEW}/data/funnelback-web-crawl_filtered -v
where:
- ${SEARCH_HOME} is replaced with the $SEARCH_HOME value (Funnelback install root).
- ${JAVA_OPTS} is replaced with any Java options required for the filter process. Options that you may wish to set include:
  - -Xmx to set the maximum heap size (e.g. -Xmx500m)
  - -Dlog4j.configurationFile=/opt/funnelback/conf/log4j2.xml.default to use the default logging configuration.
- ${COLLECTION_NAME} is replaced with the name of the collection containing the configuration and warc files.
- ${CURRENT_VIEW} is replaced with the view containing the warc file (live or offline).
- ${FILTER_CHAIN} is replaced with the filter chain (filter.classes value).
- -classpath sets the Java classpath. The value defined in the command above sets the same paths as used when inline filtering runs (the main Funnelback Java libs, plus whatever is in the collection’s @groovy folder).
- -in sets the input warc file stem (the path without the .warc extension).
- -out sets the output warc file stem (the path without the .warc extension).
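The command and its placeholders can be wrapped in a small script so the substitutions are made in one place. This is a sketch under stated assumptions: the function names warc_stem and post_gather_filter are hypothetical, /opt/funnelback is assumed as the default install root, and the input file is assumed to be the stem plus .warc, as in the example on this page.

```shell
#!/bin/sh
# Assumed default install root; override by exporting SEARCH_HOME.
SEARCH_HOME=${SEARCH_HOME:-/opt/funnelback}

# Print the WARC file stem for a collection ($1) and view ($2: live or offline).
warc_stem() {
    printf '%s/data/%s/%s/data/funnelback-web-crawl\n' "$SEARCH_HOME" "$1" "$2"
}

# Run FilterWarcFile for a collection, reading <stem>.warc and writing
# <stem>_filtered.warc. Arguments: collection name, view, filter chain.
post_gather_filter() {
    collection=$1
    view=$2
    filter_chain=$3
    stem=$(warc_stem "$collection" "$view")

    # Fail early if the input WARC file is missing.
    if [ ! -f "$stem.warc" ]; then
        echo "Input WARC file not found: $stem.warc" >&2
        return 1
    fi

    # JAVA_OPTS is left unquoted on purpose so that several options split into words.
    "$SEARCH_HOME/linbin/java/bin/java" ${JAVA_OPTS:-} \
        -classpath "$SEARCH_HOME/lib/java/all/*:$SEARCH_HOME/conf/$collection/@groovy/*" \
        com.funnelback.common.filter.util.FilterWarcFile \
        -collection "$collection" \
        -filter "$filter_chain" \
        -in "$stem" \
        -out "${stem}_filtered" \
        -v
}
```

With the functions loaded, a call such as post_gather_filter my-collection live "TikaFilterProvider" would run the same command as shown above, with JAVA_OPTS taken from the environment if it is set.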
Example
Filter the warc file (/opt/funnelback/data/my-collection/live/data/funnelback-web-crawl.warc) using the filter chain (TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider) and write the filtered warc file to /opt/funnelback/data/my-collection/live/data/funnelback-web-crawl_filtered.warc.
$ /opt/funnelback/linbin/java/bin/java -classpath "/opt/funnelback/lib/java/all/*:/opt/funnelback/conf/my-collection/@groovy/*" com.funnelback.common.filter.util.FilterWarcFile -collection my-collection -filter "TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider" -in /opt/funnelback/data/my-collection/live/data/funnelback-web-crawl -out /opt/funnelback/data/my-collection/live/data/funnelback-web-crawl_filtered -v
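After the command finishes, it can be worth confirming that the filtered file was actually written. The following is a minimal sketch, assuming the output file is the -out stem plus .warc as in the example above; the helper name filtered_warc_ok is hypothetical and not part of Funnelback.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the filtered WARC file for the given
# output stem ($1) exists and is not empty.
filtered_warc_ok() {
    out_stem=$1
    [ -s "${out_stem}.warc" ]
}

# Example use (paths taken from the example above):
#   filtered_warc_ok /opt/funnelback/data/my-collection/live/data/funnelback-web-crawl_filtered \
#       || echo "Filtered WARC file is missing or empty" >&2
```

This only checks that a non-empty file exists; it does not validate the WARC records themselves.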