Post-gather filtering

Background

This document outlines how to apply post-gather filtering to an existing warc file. This differs from normal inline filtering, which occurs as the documents are being gathered.

Usage

$ /opt/funnelback/linbin/java/bin/java -classpath "/opt/funnelback/lib/java/all/*" com.funnelback.common.filter.util.FilterWarcFile

usage: com.funnelback.common.filter.util.FilterWarcFile
 -collection <arg>   Collection to use to read filters specific config
 -filter <arg>       Filter classes to use (Same syntax as in
                     collection.cfg
 -in <arg>           Input WARC file stem
 -out <arg>          Output WARC File stem
 -v                  Enable verbose output
 -w                  If set, the output WARC file will be overwritten if
                     it exists

Caveats

  • Post-gather filtering of binary documents is not supported prior to Funnelback 15.14.0.10.

  • Post-gather filtering allows a warc file to be filtered multiple times. Many filters are designed to run only once, so any filter that runs post-gather may need modification to account for this, especially if the warc file may be re-filtered several times. For example, a filter that modifies the structure of a document may break the document if the same operation is run multiple times in succession, and a filter that injects metadata may inject duplicate metadata if run more than once.

  • Some filters have pre-requisites that may not be met if some filtering happens at gather time and some filtering happens post-gather. For example:

    • If the Tika filters are run at gather time (to convert PDFs and other binary documents to text) it is not possible to check these for WCAG compliance using the accessibility auditor filters at a later date (because the accessibility auditor filters require the raw binary stream rather than the extracted text).

    • If a filter appends crawl-time information to the source document (such as HTTP response headers or calculated response times), that information may not be available when the filter runs post-gather.
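
The duplicate-metadata problem described above is usually avoided by making the filter idempotent: check for the marker before adding it. The sketch below shows that check in plain shell rather than as a real Funnelback filter; the function name inject_meta and the meta tag it writes are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Illustration only: this is NOT a Funnelback filter. It shows the
# "check before you inject" pattern that keeps a step safe to re-run.

# inject_meta NAME CONTENT (hypothetical helper)
# Reads HTML on stdin and writes it to stdout, adding a meta tag after the
# first <head> unless a meta tag with that name is already present.
inject_meta() (
  name=$1
  content=$2
  html=$(cat)
  tag="<meta name=\"${name}\" content=\"${content}\">"
  case $html in
    *"<meta name=\"${name}\""*)
      # Already injected by an earlier run: pass the document through unchanged.
      printf '%s\n' "$html" ;;
    *"<head>"*)
      # First run: split around the first <head> and insert the tag after it.
      before=${html%%"<head>"*}
      after=${html#*"<head>"}
      printf '%s\n' "${before}<head>${tag}${after}" ;;
    *)
      # No <head> element found: leave the document alone.
      printf '%s\n' "$html" ;;
  esac
)

# Running the step twice still yields a single meta tag.
printf '%s' '<html><head></head></html>' | inject_meta source crawl | inject_meta source crawl
```

A filter that restructures a document can apply the same idea by testing for the restructured form before making any change.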

Process

  1. Switch to the collection containing the warc file that you wish to filter. If you have a standalone warc file, create a new collection first and place the warc file into the collection’s live (or offline) data folder (e.g. $SEARCH_HOME/data/COLLECTION_NAME/live/data/). Although it’s possible to filter a warc file at any location, some filters rely on the collection name (and associated collection.cfg keys) to run correctly.

  2. Ensure the collection.cfg has appropriate configuration defined for any filters. The filter.classes setting can be ignored because the filter chain is supplied when the post-gather filter command is run. Other collection.cfg based filter options should be set, including filter.jsoup.classes and any options that are read by the filters that run (such as filter.jsoup.undesirable_text-source).

  3. Run the post-gather filter command:

    $ ${SEARCH_HOME}/linbin/java/bin/java ${JAVA_OPTS} -classpath "${SEARCH_HOME}/lib/java/all/*:${SEARCH_HOME}/conf/${COLLECTION_NAME}/@groovy/*" com.funnelback.common.filter.util.FilterWarcFile -collection ${COLLECTION_NAME} -filter ${FILTER_CHAIN} -in ${SEARCH_HOME}/data/${COLLECTION_NAME}/${CURRENT_VIEW}/data/funnelback-web-crawl -out ${SEARCH_HOME}/data/${COLLECTION_NAME}/${CURRENT_VIEW}/data/funnelback-web-crawl_filtered -v

    where:

    • ${SEARCH_HOME} is replaced with the $SEARCH_HOME value (Funnelback install root)

    • ${JAVA_OPTS} is replaced with any Java options required for the filter process. Options that you may wish to set include:

      • -Xmx to set the maximum heap memory (e.g. -Xmx500m)

      • -Dlog4j.configurationFile=/opt/funnelback/conf/log4j2.xml.default - to use the default logging configuration.

    • ${COLLECTION_NAME} is replaced with the collection name containing the configuration and warc files.

    • ${FILTER_CHAIN} is replaced with the filter chain (filter.classes value)

    • -classpath sets the Java classpath. The value defined in the example above sets the same paths as used when inline filtering runs (the main Funnelback Java libs, plus whatever is in the collection’s @groovy folder).

    • -in sets the input warc file stem (the file path without the .warc extension)

    • -out sets the output warc file stem (the file path without the .warc extension)
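
The command in step 3 can be wrapped in a small shell function so that the paths are assembled in one place. This is a minimal sketch and not part of Funnelback: post_gather_filter and the DRY_RUN variable are made-up names, and the view defaulting to live is an assumption.

```shell
#!/bin/sh
# Sketch of a wrapper around the post-gather filter command from step 3.
# post_gather_filter and DRY_RUN are illustrative names, not Funnelback features.

SEARCH_HOME="${SEARCH_HOME:-/opt/funnelback}"

# post_gather_filter COLLECTION FILTER_CHAIN [VIEW]
# VIEW defaults to "live" (an assumption made for this sketch).
post_gather_filter() (
  collection=$1
  filter_chain=$2
  view=${3:-live}
  data_dir="${SEARCH_HOME}/data/${collection}/${view}/data"

  # JAVA_OPTS is left unquoted on purpose so that several options split into
  # separate arguments. The classpath is quoted so that the shell passes the
  # "*" wildcards to Java untouched instead of expanding them itself.
  # When DRY_RUN is set, the command is printed with echo instead of being run.
  ${DRY_RUN:+echo} "${SEARCH_HOME}/linbin/java/bin/java" ${JAVA_OPTS:-} \
    -classpath "${SEARCH_HOME}/lib/java/all/*:${SEARCH_HOME}/conf/${collection}/@groovy/*" \
    com.funnelback.common.filter.util.FilterWarcFile \
    -collection "${collection}" \
    -filter "${filter_chain}" \
    -in "${data_dir}/funnelback-web-crawl" \
    -out "${data_dir}/funnelback-web-crawl_filtered" \
    -v
)

# Print the assembled command for review without running it.
DRY_RUN=1 post_gather_filter my-collection "JSoupProcessingFilterProvider" live
```

Once the printed command looks right, calling the function with DRY_RUN unset runs the filter itself.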

Example

Filter the warc file (/opt/funnelback/data/my-collection/live/data/funnelback-web-crawl.warc) using the filter chain (TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider) and write the filtered warc file to /opt/funnelback/data/my-collection/live/data/funnelback-web-crawl_filtered.warc.

$ /opt/funnelback/linbin/java/bin/java -classpath "/opt/funnelback/lib/java/all/*:/opt/funnelback/conf/my-collection/@groovy/*" com.funnelback.common.filter.util.FilterWarcFile -collection my-collection -filter "TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider" -in /opt/funnelback/data/my-collection/live/data/funnelback-web-crawl -out /opt/funnelback/data/my-collection/live/data/funnelback-web-crawl_filtered -v
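
Before running the example, it can help to confirm that the input file exists and that an earlier output file will not be clobbered unexpectedly. The following sketch assumes, as in the example above, that .warc is appended to both stems. preflight is a hypothetical helper, and how FilterWarcFile itself behaves when the output exists and -w is not given is not covered here.

```shell
#!/bin/sh
# preflight IN_STEM OUT_STEM (hypothetical helper, not part of Funnelback)
# Returns 0 when the input file exists and no output file is in the way,
# 1 when the input is missing, 2 when the output already exists.
preflight() {
  if [ ! -f "$1.warc" ]; then
    echo "input file $1.warc not found" >&2
    return 1
  fi
  if [ -e "$2.warc" ]; then
    echo "output file $2.warc already exists; remove it or pass -w" >&2
    return 2
  fi
  return 0
}

# Example call using the paths from the example above:
#   data=/opt/funnelback/data/my-collection/live/data
#   preflight "$data/funnelback-web-crawl" "$data/funnelback-web-crawl_filtered"
```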