Built-in filters - Extract metadata from HTML or XML documents (XmlMetadataScraper, MetadataScraper)

The metadata scraper filters are used to extract content from HTML and XML documents and inject it as metadata for the document.

There are two filters that scrape metadata:

  • MetadataScraper: This is a Jsoup filter that extracts metadata from HTML documents.

  • XmlMetadataScraper: This is a document filter that extracts metadata from XML documents.

Both filters are configured in the same manner and use the same configuration file.

Enabling the metadata scraper

If you need to scrape metadata from both HTML and XML documents then add both filters to your filter chains.

Extracting metadata from HTML documents

To enable the HTML metadata scraper filter, add MetadataScraper to the filter.jsoup.classes list, where <jsoup_filters> stands for the other filters in the Jsoup filter chain.

filter.jsoup.classes=<jsoup_filters>,MetadataScraper,<jsoup_filters>

Extracting metadata from XML documents

To enable the XML metadata scraper filter, add XmlMetadataScraper to the filter.classes list, where <filters> stands for the other filters in the filter chain.

filter.classes=<filters>:XmlMetadataScraper:<filters>

Configuration

The filters are configured via a separate file metadata_scraper.json which can be added using the data source configuration files.

This file is in JSON format and contains a list of rules. Each rule is applied to a document only if the document’s URL matches the rule’s regular expression:

[{
  "urlRegex": "http://example\\.org/",
  "metadataName": "author",
  "elementSelector": "div.author-name",
  "applyIfNoMatch": false,
  "extractionType": "text",
  "description": "Get author from DIV"
}, {
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productSku",
  "elementSelector": "div.product p.sku",
  "applyIfNoMatch": false,
  "extractionType": "attr",
  "attributeName": "data-sku"
}, {
  "urlRegex": "http://check\\.content\\.length/",
  "metadataName": "X-FUNNELBACK-TITLE-LENGTH",
  "elementSelector": "title",
  "extractionType": "text",
  "comparator": "LENGTH_GT_CHARS",
  "description": "Identifies if the document contains any title values that exceed 55 characters in length.",
  "name": "Title length",
  "checkType": "ELEMENT_LENGTH",
  "length": 55,
  "extractValue": true
}, {
  "urlRegex": "http://check\\.content\\.existence/",
  "metadataName": "X-FUNNELBACK-CHECK-H1",
  "elementSelector": "h1",
  "extractionType": "text",
  "name":"H1 count",
  "checkType":"ELEMENT_EXISTENCE",
  "extractValue":true,
  "description":"Detects the presence of a h1, also produces a count of h1s detected within the page."
}, {
  "urlRegex": "http://check\\.content/",
  "metadataName": "X-FUNNELBACK-CHECK-DATE",
  "elementSelector": "meta[name=dcterms.date]",
  "extractionType": "attr",
  "attributeName": "content",
  "name":"DC Date format",
  "checkType":"ELEMENT_CONTENT",
  "comparator":"MATCHES",
  "compareText":"\\d{4}-\\d{2}-\\d{2}",
  "extractValue":true,
  "description":"Identifies DC Date fields that contain dates matching YYYY-MM-DD."
}]
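
Because a single missing comma or unquoted key makes the whole file invalid JSON, it can help to check the file before adding it to the data source. The following is a minimal sketch in Python and is not part of the product. The helper name load_rules is hypothetical, and it only checks what this section states: the file is a JSON list of rule objects.

```python
import json


def load_rules(text):
    """Parse metadata_scraper.json content and return the list of rules.

    Raises ValueError if the text is not valid JSON or is not a list of
    objects (json.JSONDecodeError is a subclass of ValueError).
    """
    rules = json.loads(text)
    if not isinstance(rules, list):
        raise ValueError("top-level value must be a JSON list of rules")
    for index, rule in enumerate(rules):
        if not isinstance(rule, dict):
            raise ValueError(f"rule {index} is not a JSON object")
    return rules


# A raw string keeps the double backslash exactly as it appears in the file.
good = r'[{"urlRegex": "http://example\\.org/", "metadataName": "author"}]'
print(load_rules(good)[0]["metadataName"])

# An unquoted key is not valid JSON and is rejected by the parser.
bad = '[{checkType: "ELEMENT_LENGTH"}]'
try:
    load_rules(bad)
except ValueError as error:
    print("invalid rules file:", error)
```

To check a real file, read its content and pass the text to load_rules.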

Each rule is defined with the following attributes:

urlRegex

Regular expression to specify which documents the rule applies to. The URL of the document will be matched against this regular expression and the rule will be applied only if there’s a match.

Because this is a regular expression, special characters such as . must be escaped with a backslash (\.); without this escaping, . would mean "any character" in the regular expression syntax. In addition, because the file is JSON, the backslash itself must be escaped with a second backslash, so a literal dot is written as \\. in the configuration file.
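
The two levels of escaping can be seen in a short Python sketch. It is only an illustration: Python's json and re modules stand in for the filter's own parser and regular expression engine, the URLs are made up, and re.search (match anywhere in the URL) is an assumption about how the pattern is applied.

```python
import json
import re

# The rule as written in metadata_scraper.json (raw string keeps both backslashes).
rule = json.loads(r'{"urlRegex": "http://example\\.org/"}')

escaped = rule["urlRegex"]         # JSON decoding leaves a single backslash
unescaped = "http://example.org/"  # the same pattern with the dot left unescaped

print(escaped)  # http://example\.org/

for url in ["http://example.org/page", "http://exampleXorg/page"]:
    print(url,
          "escaped:", bool(re.search(escaped, url)),
          "unescaped:", bool(re.search(unescaped, url)))
```

The escaped pattern matches only the first URL, while the unescaped pattern also matches the second one because . accepts any character.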

metadataName

This is the name of the resulting metadata that will get injected in the document. For example if this is set to author, the following will be injected in the document:

<meta name="author" content="...">

If the rule yields multiple values, they will be injected separately:

<meta name="author" content="Shakespeare">
<meta name="author" content="Yeats">

elementSelector

HTML documents

For HTML documents, this is a CSS selector to select the HTML element from which to extract the content of the metadata to inject. For example with the following HTML fragment:

<div class="info">
  <div class="author-name">William Shakespeare</div>
</div>

With the selector div.author-name, the inner <div> is selected for extraction.

XML documents

For XML documents, this is a selector that identifies the XML element to extract. For example, to extract the FirstName field from the following XML document, use the selector documents > document > FirstName.

<?xml version="1.0" encoding="UTF-8"?>
<documents>
    <document>
        <id>Employee1</id>
        <LastName>Fuller</LastName>
        <FirstName>Andrew</FirstName>
        <Title>Vice President, Sales</Title>
    </document>
</documents>

applyIfNoMatch

This boolean controls when the rule is applied: when the selector matches (false, the default) or when it doesn’t match (true).

This is useful for injecting metadata into documents that don’t match a specific selector. For example:

{
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productCategory",
  "elementSelector": "p.category",
  "applyIfNoMatch": true,
  "processMode": "constant",
  "value": "Default category"
}

With this rule, if a document doesn’t contain a <p> tag with the category class, a productCategory metadata will be injected with the content Default category.

When applyIfNoMatch is set to true, the rule only runs when the elementSelector does not match. In the example above, if the document does contain a <p> tag with the category class, this rule does not set the productCategory metadata at all. For this use case, pair the rule with another one that extracts the category.

Setting "processMode": "constant" is also important here. Without it, the default processMode of regex will be applied and this won’t match any content.

extractionType

HTML documents

This indicates how the content should be extracted from the selected element. Possible values are:

  • text: The textual content of the matching element will be extracted. If the element contains HTML, all the tags are stripped.

  • html (inner content): The raw HTML (or XML) content of the matching element will be extracted.

  • attr: The value of an attribute of the matching element will be extracted. In this mode, attributeName must be provided.

For example, with the following HTML fragment:

<div class="product" data-sku="1234">
  <h1>Product title</h1>
  <p>Product description</p>
</div>

And the selector div.product:

  • text extracts the content Product title Product description

  • html extracts the content <h1>Product title</h1> <p>Product description</p>

  • attr, with attributeName set to data-sku, extracts 1234

XML documents

For the following XML fragment:

<?xml version="1.0" encoding="UTF-8"?>
<documents>
    <document>
        <id>Employee3</id>
        <LastName>Leverling</LastName>
        <FirstName>Janet</FirstName>
        <Title description="sales">Sales Representative</Title>
        <foo>
            <bar>More content</bar>
        </foo>
    </document>
</documents>

  • The selector documents > document > FirstName with extractionType text extracts the content Janet.

  • The selector documents > document > foo with extractionType html extracts the array of metadata values ["More content", "</bar>", "<bar>"].

  • The selector documents > document > Title with extractionType attr and attributeName description extracts the value sales.
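
The text and attr cases can be reproduced outside the filter with a small Python sketch. This is only an illustration: it uses the standard library's ElementTree and its path syntax (document/FirstName) rather than the filter's selectors, and it does not model the html case.

```python
import xml.etree.ElementTree as ET

xml_text = """<documents>
    <document>
        <id>Employee3</id>
        <LastName>Leverling</LastName>
        <FirstName>Janet</FirstName>
        <Title description="sales">Sales Representative</Title>
        <foo>
            <bar>More content</bar>
        </foo>
    </document>
</documents>"""

# The root element is <documents>, so paths start at its <document> child.
root = ET.fromstring(xml_text)

first_name = root.find("document/FirstName")
title = root.find("document/Title")

print(first_name.text)           # text extraction: the element's textual content
print(title.get("description"))  # attr extraction: the named attribute's value
```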

attributeName

This specifies the name of the attribute to extract the value from when extractionType is set to attr. See the examples in the extractionType section for details.

processMode and value

This indicates how the extracted content is processed. Possible values are:

  • regex: Applies a regular expression to the extracted content. The regular expression must contain a capture group (using ()), and each match is injected as a separate metadata value. The first regex capture group, which returns the full capture, is ignored by default; set the includeFirstCaptureGroup option to return it.

  • constant: Returns a hard-coded string.

This setting works in conjunction with value which indicates either the regular expression to apply, or the hard coded value to use.

This setting is optional. If it’s not set, the complete extracted content is retained as-is.

For example with the HTML fragment:

<div class="info">
  <div class="author-name">William Shakespeare</div>
</div>

And the rule:

{
  "urlRegex": "http://example\\.org/",
  "metadataName": "author",
  "elementSelector": "div.author-name",
  "applyIfNoMatch": false,
  "extractionType": "text",
  "processMode": "regex",
  "value": "(\\S+)"
}

This rule results in two metadata values being injected, author=William and author=Shakespeare, because the regular expression (\\S+) yields two matches.

If processMode were set to constant and value to Yeats, a single metadata value author=Yeats would be injected instead.
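
The behavior described above can be sketched as a toy model in Python. The function name process is hypothetical, Python's re module stands in for the filter's regular expression engine (which may differ in details), and the includeFirstCaptureGroup option is not modelled.

```python
import re


def process(extracted, process_mode, value):
    """Toy model of processMode; returns the list of metadata values."""
    if process_mode == "constant":
        # The extracted content is ignored and the hard-coded value is used.
        return [value]
    if process_mode == "regex":
        # One metadata value per match of the parenthesised group.
        return [match.group(1) for match in re.finditer(value, extracted)]
    raise ValueError(f"unknown processMode: {process_mode}")


extracted = "William Shakespeare"  # text extracted from div.author-name

# r"(\S+)" is the rule's value once the JSON escaping has been removed.
print(process(extracted, "regex", r"(\S+)"))   # ['William', 'Shakespeare']
print(process(extracted, "constant", "Yeats"))  # ['Yeats']
```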

description

This attribute is used to add a description to the metadata scraper rule. It is optional and is not used when applying the rule.

name

This field contains a (human-friendly) name to assign to the metadata scraper rule. It is optional and is not used when applying the rule.

checkType

Indicates a content checking operation to execute. The following values are supported:

  • ELEMENT_EXISTENCE: Checks whether elements matching the configured elementSelector exist.

  • ELEMENT_LENGTH: Checks the length of the extracted value. When checking length, you must also configure the comparator and length values (see below).

  • ELEMENT_CONTENT: Checks the content of the extracted value. When checking content, you must also configure the comparator and compareText values (see below).

  • DETECT_MIXED_CONTENT: Detects mixed content.

  • DETECT_RSS: Detects an RSS feed.

  • DEFAULT: The default behavior, which extracts the value based on the rule’s configuration.

The ELEMENT_* rules must be used in conjunction with an elementSelector field.

extractValue

Indicates if the value of the compareField should be extracted.

Possible values:

  • true: extract the value of compareField and write this to the metadataName metadata field.

comparator

Indicates the comparator to use for ELEMENT_LENGTH and ELEMENT_CONTENT content checking rules.

The following values are supported when the checkType is ELEMENT_LENGTH:

  • LENGTH_EQ_CHARS: To check if the number of characters of extracted value equals the configured number.

  • LENGTH_GT_CHARS: To check if the number of characters of extracted value exceeds the configured number.

  • LENGTH_LT_CHARS: To check if the number of characters of extracted value is lower than the configured number.

  • LENGTH_EQ_WORDS: To check if the number of words of extracted value equals the configured number.

  • LENGTH_GT_WORDS: To check if the number of words of extracted value exceeds the configured number.

  • LENGTH_LT_WORDS: To check if the number of words of extracted value is lower than the configured number.
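
The difference between the character and word comparators can be sketched in Python. This is a toy model, not the filter's implementation: the function name check_length is hypothetical, and treating a word as a whitespace-separated token is an assumption.

```python
import operator

OPERATORS = {"EQ": operator.eq, "GT": operator.gt, "LT": operator.lt}


def check_length(extracted, comparator, length):
    """Toy model of the LENGTH_* comparators.

    Assumes words are whitespace-separated tokens; the filter's own
    definition of a word may differ.
    """
    _, op_name, unit = comparator.split("_")  # e.g. LENGTH_GT_CHARS
    measured = len(extracted) if unit == "CHARS" else len(extracted.split())
    return OPERATORS[op_name](measured, length)


title = "Extract metadata from HTML or XML documents"  # illustrative value

print(check_length(title, "LENGTH_GT_CHARS", 55))
print(check_length(title, "LENGTH_EQ_WORDS", 7))
```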

The following values are supported when the checkType is ELEMENT_CONTENT:

  • STARTS_WITH: To check if the extracted value starts with the configured text.

  • NOT_STARTS_WITH: To check if the extracted value does not start with the configured text.

  • ENDS_WITH: To check if the extracted value ends with the configured text.

  • NOT_ENDS_WITH: To check if the extracted value does not end with the configured text.

  • EQUALS: To check if the extracted value is the same as the configured text.

  • NOT_EQUALS: To check if the extracted value is not the same as the configured text.

  • CONTAINS: To check if the extracted value contains the configured text.

  • NOT_CONTAINS: To check if the extracted value does not contain the configured text.

  • MATCHES: To check if the extracted value matches the configured regex or string.

  • NOT_MATCHES: To check if the extracted value does not match the configured regex or string.

  • FULLY_MATCHES: To check if the extracted value fully matches the configured regex or string.

  • NOT_FULLY_MATCHES: To check if the extracted value does not fully match the configured regex or string.
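
The content comparators can be sketched in Python as follows. This is a toy model under stated assumptions: the function name check_content is hypothetical, MATCHES is assumed to succeed when the pattern is found anywhere in the value, FULLY_MATCHES is assumed to require the whole value to match, and the sample value is made up.

```python
import re


def check_content(extracted, comparator, compare_text):
    """Toy model of the ELEMENT_CONTENT comparators (assumed semantics)."""
    negate = comparator.startswith("NOT_")
    name = comparator[len("NOT_"):] if negate else comparator
    checks = {
        "STARTS_WITH": lambda: extracted.startswith(compare_text),
        "ENDS_WITH": lambda: extracted.endswith(compare_text),
        "EQUALS": lambda: extracted == compare_text,
        "CONTAINS": lambda: compare_text in extracted,
        # Assumption: MATCHES finds the pattern anywhere in the value.
        "MATCHES": lambda: re.search(compare_text, extracted) is not None,
        # Assumption: FULLY_MATCHES requires the entire value to match.
        "FULLY_MATCHES": lambda: re.fullmatch(compare_text, extracted) is not None,
    }
    result = checks[name]()
    return not result if negate else result


date_pattern = r"\d{4}-\d{2}-\d{2}"  # compareText after JSON decoding
value = "Published 2024-05-01"       # illustrative extracted value

print(check_content(value, "MATCHES", date_pattern))
print(check_content(value, "FULLY_MATCHES", date_pattern))
```

Under these assumptions the sample value passes MATCHES but fails FULLY_MATCHES, because the date is only part of the extracted text.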

compareText

Defines text to compare with the content when the check type is ELEMENT_CONTENT. The compareText is either a string or regex and is compared using the selected comparator.

length

Defines the length value to apply against the extracted value when the check type is ELEMENT_LENGTH. The length value is compared using the selected comparator.

includeFirstCaptureGroup

When using the regex process mode, the first capture group is ignored by default because it returns the full capture. Setting this option to true includes the first capture group, which provides compatibility with older metadata scraper configurations that included it.

If you are using built-in XML splitting (configured using the XML processing settings), any element selectors you supply when configuring this filter must use the original (un-split) XML structure. This is because the built-in XML splitter runs at index time, so this filter is applied to the un-split XML document.

If you are using a filter to split your XML file (ensuring that the metadata scraper filter runs after your XML split filter) then you must use element selectors that reflect the structure of the split XML file.