Validating and concatenating external metadata

Background

External metadata is a useful way of adding metadata to documents based on the URL.

Unfortunately, the Funnelback indexer assumes that external metadata is valid, and indexing fails if it detects an invalid line in the external metadata file.

This article shows how to configure and use the external metadata validator as part of the update workflow.

Usage

Usage instructions for the Mediator validation command can be viewed by running the following from the command line:

$ $SEARCH_HOME/bin/mediator.pl --list-tasks

The metadata validator has two modes of operation:

  • Concatenation mode: validate external metadata files that are stored under each profile and concatenate these into a single (collection-level) external metadata file.

  • Validation mode: validate a specified external metadata file.

Additional options

  • errorThreshhold: sets an error threshold, the percentage of lines that may contain errors before validation fails.

    Lines with errors are removed from external_metadata.cfg (when running in concatenation mode), and errors are logged to $SEARCH_HOME/data/$COLLECTION/log/external_metadata.cfg-validation.log. The default value of errorThreshhold is zero, so if errorThreshhold is not set, any error found during validation causes it to fail.
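
The threshold arithmetic can be illustrated with a minimal sketch. The threshold_verdict function below is a hypothetical helper, not part of Funnelback, and it assumes that validation fails only when the percentage of invalid lines is strictly greater than the threshold. The validator's exact comparison is not documented here.

```shell
# Hypothetical helper, not part of Funnelback. Prints "pass" when the
# percentage of invalid lines is within the threshold, otherwise "fail".
threshold_verdict() {
    errors=$1
    total=$2
    threshold=${3:-0}   # zero when omitted, matching the documented default
    # errors / total * 100 > threshold, rearranged to stay in integer arithmetic
    if [ $((errors * 100)) -gt $((threshold * total)) ]; then
        echo "fail"
    else
        echo "pass"
    fi
}

threshold_verdict 3 200 5   # 1.5% invalid with 5% allowed: prints "pass"
threshold_verdict 1 200     # one invalid line at the default of zero: prints "fail"
```

With the default of zero, a single invalid line is enough to fail, which matches the behaviour described above.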

Validation mode

The following command can be run on the command line to validate an external metadata file:

$ $SEARCH_HOME/bin/mediator.pl ValidateExternalMetadata collection=$COLLECTION_NAME mode=check
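
After a check, the validation log listed under Additional options can be inspected for details of any rejected lines. The following is a minimal sketch: validation_log_path is a hypothetical helper, and the fallback values for SEARCH_HOME and COLLECTION_NAME are assumptions for illustration only.

```shell
# Assumed fallbacks for illustration only; set these to match your installation.
SEARCH_HOME=${SEARCH_HOME:-/opt/funnelback}
COLLECTION_NAME=${COLLECTION_NAME:-example-collection}

# Hypothetical helper: builds the validation log path from the search home
# directory and the collection name.
validation_log_path() {
    printf '%s/data/%s/log/external_metadata.cfg-validation.log\n' "$1" "$2"
}

log=$(validation_log_path "$SEARCH_HOME" "$COLLECTION_NAME")
if [ -f "$log" ]; then
    tail -n 20 "$log"    # show the most recent validation messages
else
    echo "No validation log found at $log"
fi
```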

Concatenation mode

The following command can be run on the command line to validate profile-level external metadata files and concatenate them into a collection-level external metadata file:

$ $SEARCH_HOME/bin/mediator.pl ValidateExternalMetadata collection=$COLLECTION_NAME mode=concatenate

When used in concatenation mode, the validator is normally run from workflow by adding the above command as a workflow step in a phase that occurs before indexing (e.g. pre-gather, post-gather or pre-index).

For example:

post_gather_command=$SEARCH_HOME/bin/mediator.pl ValidateExternalMetadata collection=$COLLECTION_NAME mode=concatenate