Built-in filters - Update metadata values (MetadataNormaliser)
The metadata normalizer filter cleans, transforms and normalizes metadata values. Normalization is particularly useful for faceted navigation, allowing similar categories to be merged into a single category.
The filter can be applied to either HTML or XML documents:
- For HTML documents, it processes HTML meta tags (<meta name="key" content="value"> or <meta property="key" content="value">) and tests the name/property and content attribute values against regular expressions. The value of the meta tag's content attribute is replaced when a matching rule runs.
- For XML documents, it works with either an XML tag or an attribute value and tests it against the supplied regular expressions. The tag content or attribute value is replaced when a matching rule runs.
The metadata normalizer does not work with metadata that is contained within the filter metadata object. This includes most metadata that is generated by a document filter. However, metadata that is generated via Jsoup filters does not use the metadata object and is written directly back into the source document, so the metadata normalizer should work for this metadata.
Usage
Add the MetadataNormaliser filter to the filter chain:
filter.classes=OTHER-FILTERS:MetadataNormaliser:OTHER-FILTERS
where OTHER-FILTERS are other filters within the existing filter chain.
The MetadataNormaliser filter should be placed at an appropriate position in the filter chain (which applies the filters from left to right). In most circumstances it should be located towards the end of the filter chain, and it must be placed before any other filter that needs to access the modifications made by the MetadataNormaliser filter.
Configuration
Metadata normalizer rules
The metadata normalizer filter is configured using separate files that specify the transformation rules. The files, which can be added to the data source configuration files, are named md_normaliser.RULE.mapping, where RULE is a unique identifier.
Each rule file contains a pattern that is compared against metadata fields and a list of rules to apply to transform its values. The file format is as follows:
FIELD_PATTERN_OR_PATH (1)
# some comment (2)
XML_ATTRIBUTE_NAME (3)
TRANSFORMATION_RULES (4)
1 | FIELD_PATTERN_OR_PATH must be the first line of the configuration file. For HTML documents it contains a regular expression that is matched against <META> tag name attributes. For XML documents it is a path used to select the XML field. |
2 | Lines (other than the first line) starting with # are treated as comments. |
3 | XML_ATTRIBUTE_NAME (XML documents only) names the attribute whose content is transformed. |
4 | TRANSFORMATION_RULES consist of one or more regular expression/replacement rules that are used to transform the metadata field's content attribute value. |
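The file format above can be illustrated with a small parser. This is a minimal Python sketch and not the filter's own implementation: the function name parse_rule_file is hypothetical, and the sketch assumes that the first non-blank line is the field pattern or path, that an attribute line uses the xml:attr: prefix, and that each rule is split at its first = character.

```python
def parse_rule_file(text):
    """Split rule file text into (field pattern, XML attribute or None, rules)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Assumption: the first non-blank line is FIELD_PATTERN_OR_PATH.
    field_pattern = lines[0]
    xml_attribute = None
    rules = []
    for line in lines[1:]:
        if line.startswith("#"):
            continue  # comment line
        if line.startswith("xml:attr:"):
            xml_attribute = line[len("xml:attr:"):]
            continue
        # Assumption: split at the first '='; a regex containing '=' is
        # not handled by this sketch.
        regex, _, replacement = line.partition("=")
        rules.append((regex, replacement))
    return field_pattern, xml_attribute, rules
```

A rule whose left-hand side is empty, such as =Media, parses to an empty regular expression paired with the replacement Media.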
Each rule file must be added to the metadata normalizer filter’s queue by setting the filter.md_normaliser.keys configuration option on the data source that runs the filter:
filter.md_normaliser.keys=RULE_1,RULE_2
where RULE_N identifies the rule file md_normaliser.RULE_N.mapping.
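The relationship between the keys option and the rule file names can be sketched as follows. This is illustrative Python only: rule_file_names is a hypothetical helper, and trimming whitespace around each key is an assumption rather than documented behavior.

```python
def rule_file_names(keys_value):
    """Map a filter.md_normaliser.keys value to the rule file names it refers to."""
    # Assumption: keys are comma separated and surrounding whitespace is ignored.
    keys = [key.strip() for key in keys_value.split(",")]
    return [f"md_normaliser.{key}.mapping" for key in keys if key]


print(rule_file_names("RULE_1,RULE_2"))
# ['md_normaliser.RULE_1.mapping', 'md_normaliser.RULE_2.mapping']
```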
Defining the metadata field pattern or XML path
HTML documents
The metadata field pattern is the first non-comment line within the rule configuration file and is a regular expression that is matched against the metadata field name.
If the regular expression matches the metadata field name then the rule will be executed.
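To show how a field pattern selects meta tags, here is a minimal Python sketch that uses the standard library HTML parser. It is not the filter's implementation: MetaCollector and matching_fields are hypothetical names, and the sketch assumes that the pattern has to match the whole name (or property) value.

```python
import re
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name or property, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        if key is not None:
            self.fields.append((key, attributes.get("content") or ""))


def matching_fields(html, field_pattern):
    """Return the meta fields whose name matches the field pattern."""
    collector = MetaCollector()
    collector.feed(html)
    # Assumption: the pattern must match the entire field name.
    return [(key, value) for key, value in collector.fields
            if re.fullmatch(field_pattern, key)]
```

Under that assumption, a pattern of author selects <meta name="author"> but not <meta name="co-author">.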
XML documents
The field path is the first non-comment line within the rule configuration file and uses a CSS-like selector syntax to find matching XML tag fields.
The field match uses a standard Jsoup element selector: doc.select("FIELD_PATTERN_OR_PATH")
Defining the XML attribute name
This optional setting specifies an XML attribute to normalize. If this setting is not used then the normalization rule is applied to the XML tag content rather than the content of an attribute.
This must be used in conjunction with the FIELD_PATTERN_OR_PATH, which determines the XML tag to apply the rule to.
The attribute must be defined using the following syntax: xml:attr:NAME_OF_XML_ATTRIBUTE.
Defining the transformation rules
The transformation rules are regular expression/transformation pairs that are applied to the content attribute value of the selected metadata field.
One or more transformation rules can be defined.
Each rule follows a REGEX=REPLACEMENT format where:
- REGEX: A regular expression that is compared to the content attribute value of the metadata field. May include capture groups. The regular expression must fully match the field content.
- REPLACEMENT: The replacement expression for the content attribute value. It can include substitutions based on capture groups, e.g. (.*)@domain.com=$1 may be used to capture a username.
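The behavior of a single REGEX=REPLACEMENT rule can be approximated as below. This is a Python sketch, whereas the filter itself is not Python and its regular expression dialect may differ in details. The helper name apply_rule is hypothetical, and expanding $1-style references by hand is an assumption made to mirror the syntax shown above.

```python
import re


def apply_rule(regex, replacement, content):
    """Return the transformed content, or None when the rule does not apply."""
    # The rule only applies when REGEX matches the whole value.
    match = re.fullmatch(regex, content)
    if match is None:
        return None
    # Expand $1, $2, ... in REPLACEMENT with the captured groups.
    return re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))) or "",
        replacement,
    )


print(apply_rule(r"(.*)@domain.com", "$1", "jsmith@domain.com"))  # jsmith
```

A value such as jsmith@other.org does not fully match the expression, so the sketch returns None and the value would be left unchanged.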
Examples
Set a default value for a metadata field with empty content
This example sets a default value for a specific metadata field if the field’s content is empty.
Update <meta name="category"> fields where the value is empty to have a default value of Media.
- Add the MetadataNormaliser filter into the chain by setting filter.classes in the data source configuration.
- Create a md_normaliser.category.mapping file with the following rules:
  category (1)
  =Media (2)
  1 The metadata field pattern only matches a metadata field with name=category.
  2 The transformation rule matches an empty content attribute value (left hand side of the =) and sets a static value, Media, as the replacement value.
- Set the following data source configuration option to ensure that the metadata normalizer filter executes the category rule:
  filter.md_normaliser.keys=category
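The empty left-hand side in =Media is an empty regular expression, which can only fully match an empty value. The following Python sketch illustrates that reasoning. The function name normalise_category is hypothetical, and full-match semantics are assumed from the rule description in this document.

```python
import re

RULE_REGEX = ""              # left-hand side of "=Media"
RULE_REPLACEMENT = "Media"   # right-hand side of "=Media"


def normalise_category(content):
    """Return the default value for empty content, otherwise the content unchanged."""
    # An empty pattern fully matches only the empty string.
    if re.fullmatch(RULE_REGEX, content) is not None:
        return RULE_REPLACEMENT
    return content


print(normalise_category(""))      # Media
print(normalise_category("News"))  # News
```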
Strip unwanted titles from a name
This example strips titles from a field containing names.
- Add the MetadataNormaliser filter into the chain by setting filter.classes in the data source configuration.
- Create a md_normaliser.author.mapping file with the rules shown below for HTML or XML documents.
- Set the following data source configuration option to ensure that the metadata normalizer filter executes the author rule:
  filter.md_normaliser.keys=author
HTML documents
Update <meta name="author"> fields to remove titles, such as Mr and Mrs, from names contained within the content attribute value.
author (1)
(?i)\b(?:mr|mrs|ms|miss)\b\s*(.*)=$1 (2)
1 | The metadata field pattern only matches a metadata field with name=author. |
2 | Extracts all the text following mr, mrs, ms or miss as a word and replaces the content attribute value with this text, e.g. a value of Mr John Smith will be replaced with John Smith. The text is matched case insensitively. |
XML documents
Update <author><name>…</name></author> fields to remove titles, such as Mr and Mrs, from names contained within the XML tag field text.
author > name (1)
(?i)\b(?:mr|mrs|ms|miss)\b\s*(.*)=$1 (2)
1 | The field path finds each XML tag following the <author><name> structure. |
2 | Extracts all the text following mr, mrs, ms or miss as a word and replaces the XML tag text with this text, e.g. a value of Mr John Smith will be replaced with John Smith. The text is matched case insensitively. |
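When writing a title-stripping rule, note that a character class such as [mr|mrs|ms|miss] matches any run of those characters, whereas a group such as (?:mr|mrs|ms|miss) matches only the listed titles. The Python sketch below compares the two forms. The names GROUP_RULE, CLASS_RULE and strip_title are hypothetical, and Python's re module is used here only as an approximation of the filter's regular expression engine.

```python
import re

# Alternation inside a non-capturing group: matches only whole titles.
GROUP_RULE = r"(?i)\b(?:mr|mrs|ms|miss)\b\s*(.*)"
# Character class: matches any run of the characters m, r, s, i and |.
CLASS_RULE = r"(?i)\b[mr|mrs|ms|miss]+\b\s*(.*)"


def strip_title(regex, value):
    """Return the first capture group when the rule fully matches, else the value."""
    match = re.fullmatch(regex, value)
    return match.group(1) if match else value


print(strip_title(GROUP_RULE, "Mr John Smith"))  # John Smith
print(strip_title(GROUP_RULE, "Sim Jones"))      # Sim Jones
print(strip_title(CLASS_RULE, "Sim Jones"))      # Jones
```

The last call shows the character class treating the first name Sim as if it were a title, which is why the group form is the safer choice.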
Make the value of a metadata field consistent
This example identifies several variant metadata field values and normalizes them so that they are set to a consistent, preferred value.
Update <meta name="author"> fields to make variants such as jsmith, jack smith, j. smith and johnny smith consistent with a preferred value of John Smith.
- Add the MetadataNormaliser filter into the chain by setting filter.classes in the data source configuration.
- Create a md_normaliser.author.mapping file with the following rules:
  author
  jsmith=John Smith
  jack smith=John Smith
  j\. smith=John Smith
  (?i)johnny smith=John Smith
- Set the following data source configuration option to ensure that the metadata normalizer filter executes the author rule set:
  filter.md_normaliser.keys=author
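The effect of this rule set on a few sample values can be sketched in Python as follows. The function name normalise_author is hypothetical, and the sketch assumes that the first rule that fully matches the value is the one applied. The order in which the filter evaluates rules is not stated here.

```python
import re

# (REGEX, REPLACEMENT) pairs taken from the rule file above.
RULES = [
    (r"jsmith", "John Smith"),
    (r"jack smith", "John Smith"),
    (r"j\. smith", "John Smith"),
    (r"(?i)johnny smith", "John Smith"),
]


def normalise_author(content):
    """Return the preferred value when a rule fully matches, else the content."""
    for regex, replacement in RULES:
        # Assumption: the first fully matching rule wins.
        if re.fullmatch(regex, content) is not None:
            return replacement
    return content


print(normalise_author("j. smith"))      # John Smith
print(normalise_author("Johnny SMITH"))  # John Smith
print(normalise_author("Jack Smith"))    # Jack Smith
```

Only the last rule carries the (?i) flag, so Jack Smith with capital letters is left unchanged while Johnny SMITH is normalized.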
Clean an XML id attribute
This example cleans up item id attributes for XML documents.
- Add the MetadataNormaliser filter into the chain by setting filter.classes in the data source configuration.
- Create a md_normaliser.id.mapping file with the following rules:
  items > item[id] (1)
  xml:attr:id (2)
  prefix-(.*)=$1 (3)
  1 The field path finds each XML tag following the <items><item id="…"> structure where <item> has an id attribute.
  2 Configures the rule to transform an XML attribute, id.
  3 Removes prefix- from the id, e.g. a value of prefix-123 will be normalized to 123.
- Set the following data source configuration option to ensure that the metadata normalizer filter executes the id rule set:
  filter.md_normaliser.keys=id
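The outcome of this rule can be approximated with Python's standard library XML parser. This is an illustrative sketch only: normalised_ids is a hypothetical name, and because ElementTree has no CSS selector support, the XPath ./item[@id] stands in for items > item[id] under the assumption that <items> is the document root.

```python
import re
import xml.etree.ElementTree as ET


def normalised_ids(xml_text):
    """Strip the 'prefix-' part from item id attributes and return the ids."""
    root = ET.fromstring(xml_text)
    # Stand-in for the selector "items > item[id]" when <items> is the root.
    items = root.findall("./item[@id]")
    for item in items:
        match = re.fullmatch(r"prefix-(.*)", item.get("id"))
        if match:
            item.set("id", match.group(1))
    return [item.get("id") for item in items]


sample = '<items><item id="prefix-123"/><item id="456"/><item/></items>'
print(normalised_ids(sample))  # ['123', '456']
```

The <item> without an id attribute is not selected, and the id that lacks the prefix does not fully match the rule, so it is left as it is.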
Inject a default value into an empty XML field
This example shows how to inject a default value into any XML field that is empty. You may need to do this if you have groups of nested XML fields that you need to keep together for display purposes, ensuring your listMetadata arrays have the same number of elements for these nested fields. This is important if you need to iterate over a set of grouped metadata fields and use a common array index to keep related metadata grouped.
- Add the MetadataNormaliser filter into the chain by setting filter.classes in the data source configuration.
- Create a md_normaliser.emptyfields.mapping file with the following rules:
  :not(:has(*)) (1)
  ^\s*$=NULL (2)
  1 The field path finds each XML tag that doesn’t have any child elements.
  2 The rule looks for a field containing no content, and replaces this with the value NULL.
- Set the following data source configuration option to ensure that the metadata normalizer filter executes the emptyfields rule set:
  filter.md_normaliser.keys=emptyfields
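To show why this keeps grouped fields aligned, the Python sketch below fills empty leaf elements with NULL. It is an approximation using the standard library rather than the filter itself: fill_empty_leaves is a hypothetical name, the sample XML is invented for illustration, and the check for child elements stands in for the :not(:has(*)) selector.

```python
import re
import xml.etree.ElementTree as ET


def fill_empty_leaves(xml_text, default="NULL"):
    """Set the text of every empty element without children to a default value."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        is_leaf = len(element) == 0  # no child elements
        if is_leaf and re.fullmatch(r"^\s*$", element.text or ""):
            element.text = default
    return root


# Hypothetical sample data: one person has no phone number.
sample = (
    "<people>"
    "<person><name>Ann</name><phone>555-0100</phone></person>"
    "<person><name>Bob</name><phone></phone></person>"
    "</people>"
)
root = fill_empty_leaves(sample)
print([person.findtext("phone") for person in root.findall("person")])
# ['555-0100', 'NULL']
```

With the placeholder in place, the list of phone values has one entry per person, so a shared index still pairs each name with the right phone value.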
Replace new line characters with <br> tags in XML field content
This example shows how to update XML field content that contains paragraphs of text, to replace any line breaks with HTML <br> tags.
- Add the MetadataNormaliser filter into the chain by setting filter.classes in the data source configuration.
- Create a md_normaliser.nltobr.mapping file with the following rules:
  DESC,LONGDESC (1)
  # Replace line breaks with an HTML <br> tag
  (?ms)([^\n]+).*?=$1<br> (2)
  1 The field path updates <DESC> and <LONGDESC> elements.
  2 The rule looks for new lines in field content, and replaces each new line with a <br> tag.
- Set the following data source configuration option to ensure that the metadata normalizer filter executes the nltobr rule set:
  filter.md_normaliser.keys=nltobr