Built-in filters - Metadata normalizer filter (MetadataNormaliser)

The metadata normalizer filter cleans, transforms and normalizes metadata values. Normalization is particularly useful for faceted navigation, allowing similar categories to be merged into a single category.

The filter can be applied to either HTML or XML documents:

  • For HTML documents, it processes HTML meta tags (<meta name="key" content="value"> or <meta property="key" content="value">) and tests the name/property and content attribute values against regular expressions. The value of the meta tag's content attribute is replaced when a matching rule runs.

  • For XML documents, it works with either an XML tag's content or an attribute value and tests it against the supplied regular expressions. The tag content or attribute value is replaced when a matching rule runs.

The metadata normalizer does not work with metadata that is held in the filter metadata object. This includes most metadata that is generated by a document filter. However, metadata that is generated by Jsoup filters does not use the metadata object and is written directly back into the source document, so the metadata normalizer should work for this metadata.

Usage

Add the MetadataNormaliser filter to the filter chain:

filter.classes=OTHER-FILTERS:MetadataNormaliser:OTHER-FILTERS

where OTHER-FILTERS are other filters within the existing filter chain.

The MetadataNormaliser filter should be placed at an appropriate position in the filter chain (which applies the filters from left to right). In most circumstances it should be located towards the end of the filter chain, and it must be placed before any other filter that needs to access the modifications made by the MetadataNormaliser filter.

Configuration

Metadata normalizer rules

The metadata normalizer filter is configured using separate files that specify the transformation rules. These files are added to the data source's configuration files and are named md_normaliser.RULE.mapping, where RULE is a unique identifier.

Each rule file contains a pattern that is compared against metadata fields and a list of rules to apply to transform its values. The file format is as follows:

FIELD_PATTERN_OR_PATH (1)
# some comment (2)
XML_ATTRIBUTE_NAME (3)
TRANSFORMATION_RULES (4)
1 FIELD_PATTERN_OR_PATH must be the first line of the configuration file. For HTML documents it contains a regular expression that is matched against <META> tag name attributes. For XML documents it is a path used to select the XML field.
2 Lines (other than the first line) starting with # are considered comments.
3 XML_ATTRIBUTE_NAME applies to XML documents only and names the attribute whose value is transformed.
4 The TRANSFORMATION_RULES consist of one or more regular expression/replacement rules that are used to transform the selected value (for HTML, the meta field's content attribute value).

Each rule file must be added to the metadata normalizer filter’s queue by setting the filter.md_normaliser.keys configuration option on the data source that runs the filter:

filter.md_normaliser.keys=RULE_1,RULE_2

where RULE_N indicates the matching rule md_normaliser.RULE_N.mapping.

Defining the metadata field pattern or XML path

HTML documents

The metadata field pattern is the first non-comment line within the rule configuration file and is a regular expression that is matched against the metadata field name.

  • The regular expression must be defined using Java regular expression syntax.

  • The match is case-insensitive.

If the regular expression matches the metadata field name then the rule will be executed.

  • To update a single meta tag, provide its name, e.g. author or publisher.

  • To update multiple meta tags within a single mapping, use the regular expression OR operator (|) to provide multiple meta tag names, e.g. author|publisher

  • To update multiple meta tags within a single mapping that are prefixed with different namespaces, e.g. dc.author, og:author and author, use (\w+[:.])?author
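
The effect of a field pattern can be checked outside the filter before it is deployed. The following is a minimal sketch in plain Java, not the filter's own code. It assumes that the pattern must match the whole meta tag name and that case is ignored; the class name FieldPatternSketch, the method name selects and the namespaced pattern (\w+[:.])?author are illustrative choices, not part of the product.

```java
import java.util.regex.Pattern;

/** Illustrative only: tests a field pattern against meta tag names. */
class FieldPatternSketch {

    // Assumption: the pattern must match the whole name, ignoring case.
    static boolean selects(String fieldPattern, String metaName) {
        return Pattern.compile(fieldPattern, Pattern.CASE_INSENSITIVE)
                      .matcher(metaName)
                      .matches();
    }

    public static void main(String[] args) {
        // Java string literal; in a rule file this would be written (\w+[:.])?author
        String namespaced = "(\\w+[:.])?author";
        for (String name : new String[] {"author", "DC.Author", "og:author", "publisher"}) {
            System.out.println(name + " -> " + selects(namespaced, name));
        }
    }
}
```

Under these assumptions the namespaced pattern selects author, DC.Author and og:author but not publisher, while author|publisher selects only the two unprefixed names.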

XML documents

The field path is the first non-comment line within the rule configuration file and uses a CSS-like selector syntax to find matching XML tag fields.

  • To update the content of a specific XML tag field, provide the path using parent > child syntax, e.g. files > file > title

  • To update the content of a specific XML tag field that has an id attribute, provide the path using parent > child[attribute] syntax, e.g. files > file > title[id]

  • To update the content of a specific XML tag field with a namespace i.e. <dc:title>, provide the path following parent > namespace|child syntax e.g. files > file > dc|title

Defining the XML attribute name

This optional setting specifies an XML attribute to normalize. If this setting is not used then the normalization rule is applied to the XML tag content rather than the content of an attribute.

This must be used in conjunction with the FIELD_PATTERN_OR_PATH, which determines the XML tag to apply the rule to.

The attribute must be defined using the following syntax: xml:attr:NAME_OF_XML_ATTRIBUTE.

  • Applies only when processing XML documents.

  • Multiple patterns may be defined; however, only the first match will be processed (the rest will be ignored).

Defining the transformation rules

The transformation rules are regular expression/transformation pairs that are applied to the content attribute value of the selected metadata field.

One or more transformation rules can be defined.

Each rule follows a REGEX=REPLACEMENT format where:

  • REGEX: A regular expression that is compared to the content attribute value of the metadata field. May include capture groups.

  • REPLACEMENT: The replacement expression for the content attribute value. The replacement expression can include substitutions based on capture groups, e.g. (.*)@domain\.com=$1 may be used to reduce an email address to its username.

  • Regex used in transformation rules is case-sensitive by default. To make an expression case insensitive, prefix it with (?i).

  • Transformation rules are executed in the defined order.
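
To reason about how a list of rules interacts, it can help to run the rules against sample values outside the filter. The following is a minimal sketch in plain Java and is not the filter's implementation. It assumes that each rule line is split at the first = and that a rule fires only when REGEX matches the whole value (the empty-value rule =Media in the examples below suggests this, but it is not confirmed here). The class name TransformationRulesSketch and the method name normalize are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: applies REGEX=REPLACEMENT lines to a value, in order. */
class TransformationRulesSketch {

    /** Applies each rule line in turn; later rules see the output of earlier ones. */
    static String normalize(List<String> ruleLines, String value) {
        for (String line : ruleLines) {
            int eq = line.indexOf('='); // assumption: split at the first '='
            if (eq < 0) {
                throw new IllegalArgumentException("Not a REGEX=REPLACEMENT line: " + line);
            }
            Pattern regex = Pattern.compile(line.substring(0, eq));
            String replacement = line.substring(eq + 1);

            Matcher m = regex.matcher(value);
            if (m.matches()) { // assumption: the whole value must match
                StringBuffer out = new StringBuffer();
                m.appendReplacement(out, replacement); // expands $1, $2, ...
                value = out.toString();
            }
        }
        return value;
    }

    public static void main(String[] args) {
        List<String> rules = new ArrayList<>();
        rules.add("(.*)@domain\\.com=$1");   // keep the username only
        rules.add("(?i)jsmith=John Smith");  // then map the username to a display name
        System.out.println(normalize(rules, "JSmith@domain.com"));
    }
}
```

Because the rules run in the order given, the second rule in main operates on the username produced by the first. Reversing the two lines would leave the second rule with nothing to match.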

Examples

Set a default value for a metadata field with empty content

This example sets a default value for a specific metadata field if the field’s content is empty.

Update <meta name="category"> fields where the value is empty to have a default value of Media.

  1. Add the MetadataNormaliser filter into the chain by setting filter.classes in the data source configuration.

  2. Create a md_normaliser.category.mapping file with the following rules:

    category (1)
    =Media (2)
    1 The metadata field pattern only matches a metadata field with name=category.
    2 The transformation rule matches an empty content attribute value (left hand side of the =) and sets a static value Media as the replacement value.
  3. Set the following data source configuration option to ensure that the metadata normalizer filter executes the category rule.

    filter.md_normaliser.keys=category

Strip unwanted titles from a name

This example strips titles from a field containing names.

  1. Add the MetadataNormaliser filter into the chain by setting filter.classes in the data source configuration.

  2. Create a md_normaliser.author.mapping file containing the rules shown below for the relevant document type (HTML or XML).

  3. Set the following data source configuration option to ensure that the metadata normalizer filter executes the author rule.

    filter.md_normaliser.keys=author

HTML documents

Update <meta name="author"> fields to remove titles, such as Mr and Mrs from names contained within the content attribute value.

author (1)
(?i)\b(?:mr|mrs|ms|miss)\b\.?\s*(.*)=$1 (2)
1 The metadata field pattern only matches a metadata field with name=author.
2 Extracts all the text following mr, mrs, ms or miss as a word and replaces the content attribute value with this text. e.g. a value of Mr John Smith will be replaced with John Smith. The text is matched case insensitively.

XML documents

Update <author><name>…​</name></author> fields to remove titles, such as Mr and Mrs from names contained within the XML tag field text.

author > name (1)
(?i)\b(?:mr|mrs|ms|miss)\b\.?\s*(.*)=$1 (2)
1 The field path finds each XML tag following <author><name> structure.
2 Extracts all the text following mr, mrs, ms or miss as a word and replaces the XML tag's text with this text. e.g. a value of Mr John Smith will be replaced with John Smith. The text is matched case insensitively.
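
When writing title-matching rules like these, note that in Java regular expressions square brackets define a character class: [mr|mrs|ms|miss]+ matches any run of the characters m, r, s, i and |, not the listed words. A grouped alternation such as (?:mr|mrs|ms|miss) matches only the whole titles. The following minimal sketch compares the two forms; the class name TitlePatternSketch and the helper strip are hypothetical, and it assumes a rule replaces the value only when the whole value matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: character class versus grouped alternation for matching titles. */
class TitlePatternSketch {

    // Character class: matches any run of the characters m, r, s, i or '|'.
    static final Pattern CHAR_CLASS =
            Pattern.compile("(?i)\\b[mr|mrs|ms|miss]+\\b(.*)");

    // Grouped alternation: matches only the words mr, mrs, ms or miss,
    // then an optional full stop and any whitespace before the name.
    static final Pattern ALTERNATION =
            Pattern.compile("(?i)\\b(?:mr|mrs|ms|miss)\\b\\.?\\s*(.*)");

    /** Returns capture group 1 when the whole value matches, else the value unchanged. */
    static String strip(Pattern rule, String value) {
        Matcher m = rule.matcher(value);
        return m.matches() ? m.group(1) : value;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Mr John Smith", "Mrs. Jane Doe", "Sim Lee"}) {
            System.out.println("[" + strip(CHAR_CLASS, name) + "] ["
                    + strip(ALTERNATION, name) + "]");
        }
    }
}
```

With the character class, a name such as Sim Lee loses its first word because S, i and m all belong to the class, and Mr John Smith keeps a leading space. The alternation form leaves Sim Lee untouched and returns John Smith without the leading space.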

Make the value of a metadata field consistent

This example identifies several variant metadata field values and normalizes them so that they are set to a consistent, preferred value.

Update <meta name="author"> fields to make the variants jsmith, jack smith, j. smith and johnny smith consistent with a preferred value of John Smith. Only the johnny smith rule is matched case insensitively.

  1. Add the MetadataNormaliser filter into the chain by setting filter.classes in the data source configuration.

  2. Create a md_normaliser.author.mapping file with the following rules:

    author
    jsmith=John Smith
    jack smith=John Smith
    j\. smith=John Smith
    (?i)johnny smith=John Smith
  3. Set the following data source configuration option to ensure that the metadata normalizer filter executes the author rule set.

    filter.md_normaliser.keys=author

Clean an XML id attribute

This example cleans up item id attributes for XML documents.

  1. Add the MetadataNormaliser filter into the chain by setting filter.classes in the data source configuration.

  2. Create a md_normaliser.id.mapping file with the following rules:

    items > item[id] (1)
    xml:attr:id (2)
    prefix-(.*)=$1 (3)
    1 The field path finds each XML tag following <items><item id="…​"> structure where <item> has attribute id.
    2 Configures the rule to transform the XML attribute id.
    3 Removes prefix- from id, e.g. a value of prefix-123 will be normalized to 123.
  3. Set the following data source configuration option to ensure that the metadata normalizer filter executes the id rule set.

    filter.md_normaliser.keys=id
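
The intended effect of the id rule can be previewed on a sample document. The following is a minimal sketch using the JDK's DOM parser, which is not necessarily what the filter uses internally. It assumes the rule fires only when the whole attribute value matches prefix-(.*); the class name XmlIdRuleSketch and the method name normalizedIds are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: mimics the rule "items > item[id]" / "xml:attr:id" / "prefix-(.*)=$1". */
class XmlIdRuleSketch {

    static final Pattern RULE = Pattern.compile("prefix-(.*)");

    /** Rewrites matching id attributes in place and returns the resulting ids in document order. */
    static List<String> normalizedIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> ids = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                Node parent = item.getParentNode();
                // "items > item[id]": a direct child of <items> that has an id attribute.
                boolean selected = parent != null
                        && "items".equals(parent.getNodeName())
                        && item.hasAttribute("id");
                if (!selected) {
                    continue;
                }
                Matcher m = RULE.matcher(item.getAttribute("id"));
                if (m.matches()) { // assumption: the whole value must match
                    item.setAttribute("id", m.group(1));
                }
                ids.add(item.getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not process the sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<items><item id=\"prefix-123\"/><item id=\"456\"/><item/></items>";
        System.out.println(normalizedIds(xml));
    }
}
```

In this sketch an <item> without an id attribute is skipped, and an id that does not start with prefix- is left as it is.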