Metadata scraper filter


The metadata scraper filter is used to extract content out of HTML documents and inject it as metadata for the document.


To enable the metadata scraper filter, add DocumentMetadataScraper to the filter.jsoup.classes list alongside the existing entries (written here as <default_jsoup_filters>, the default value of that setting).



The filter is configured via a separate file, metadata_scraper.json, which must reside in the collection configuration folder ($SEARCH_HOME/conf/$COLLECTION/metadata_scraper.json). This file is in JSON format and contains a list of rules to apply to documents, depending on whether their URL matches a regular expression:

[{
  "urlRegex": "http://example\\.org/",
  "metadataName": "author",
  "elementSelector": "",
  "applyIfNoMatch": false,
  "extractionType": "text",
  "description": "Get author from DIV"
}, {
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productSku",
  "elementSelector": "div.product p.sku",
  "applyIfNoMatch": false,
  "extractionType": "attr",
  "attributeName": "data-sku"
}]

Each rule is defined with the following attributes:


urlRegex

This is a regular expression that specifies which documents the rule applies to. The URL of the document is matched against this regular expression, and the rule is applied only if there is a match.

Because this is a regular expression, special characters such as . must be escaped with \. In addition, backslashes must themselves be escaped with \ in JSON, so an escaped dot is written with a double backslash: \\. (as in example\\.org). Without this escaping, . would mean "any character" in regular expression syntax.
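
The two levels of escaping can be checked with a short sketch. It uses Python's json and re modules purely as a stand-in for the filter's own JSON parser and regular expression engine, and the use of a "search anywhere in the URL" match is an assumption, not something this page specifies.

```python
import json
import re

# A rule as it would appear in metadata_scraper.json (raw JSON text).
rule_json = r'{"urlRegex": "http://example\\.org/"}'

# JSON decoding turns the double backslash into a single one,
# so the regular expression engine receives an escaped dot.
url_regex = json.loads(rule_json)["urlRegex"]
print(url_regex)

escaped = re.compile(url_regex)
unescaped = re.compile("http://example.org/")  # "." matches any character here

# A URL where the dot has been replaced by another character.
lookalike = "http://exampleXorg/"
matches_escaped = escaped.search(lookalike) is not None
matches_unescaped = unescaped.search(lookalike) is not None
print(matches_escaped, matches_unescaped)
```

The escaped pattern rejects the lookalike URL, while the unescaped one accepts it, which is why the double backslash matters in the configuration file.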


metadataName

This is the name of the metadata that will be injected into the document. For example, if this is set to author, the following will be injected into the document:

<meta name="author" content="...">

If the rule yields multiple values, they will be injected separately:

<meta name="author" content="Shakespeare">
<meta name="author" content="Yeats">
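
As an illustration only, the one-tag-per-value behaviour described above could be sketched as follows. The function name render_meta_tags is hypothetical, and the filter's actual injection code is not shown on this page.

```python
from html import escape

def render_meta_tags(name, values):
    """Build one <meta> tag per extracted value (hypothetical helper)."""
    return [
        f'<meta name="{escape(name)}" content="{escape(value)}">'
        for value in values
    ]

tags = render_meta_tags("author", ["Shakespeare", "Yeats"])
for tag in tags:
    print(tag)
```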


elementSelector

This is a CSS selector that selects the HTML element from which to extract the metadata content. For example, with the following HTML fragment:

<div class="info">
  <div class="author-name">William Shakespeare</div>
</div>

With a selector that targets the author-name class (for example div.author-name), the inner <div> would be selected for extraction.

If you are using this filter to process the following XML document, the element selector to extract the FirstName field would be documents > document > FirstName.

<?xml version="1.0" encoding="UTF-8"?>
<documents>
    <document>
        <FirstName>...</FirstName>
        <Title>Vice President, Sales</Title>
    </document>
</documents>


applyIfNoMatch

This is a boolean that indicates whether the rule is applied when the selector matches (false, the default) or when it does not match (true).

This is useful for injecting metadata into documents that don’t match a specific selector. For example:

{
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productCategory",
  "elementSelector": "p.category",
  "applyIfNoMatch": true,
  "processMode": "constant",
  "value": "Default category"
}

With this rule, if a document doesn’t contain a <p> tag with the category class, a productCategory metadata will be injected with the content Default category.

When applyIfNoMatch is set to true, the rule runs only when the elementSelector does not match. In the example above, if the document did contain a <p> tag with the category class, this rule would not set the productCategory metadata at all. For such a use case, it is recommended to pair this rule with another one that extracts the category.

Setting "processMode": "constant" is also important here. Without it, the default processMode of regex will be applied and this won’t match any content.
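
The recommended pairing can be sketched as two rules on the same selector, one extracting the category and one supplying the fallback. The first rule below is an assumed companion built only from attributes documented on this page, and rule_fires is a hypothetical helper that models the applyIfNoMatch decision, not the filter's own code.

```python
import json

# Two rules for the same selector: extract when present, default when absent.
rules = json.loads(r"""
[{
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productCategory",
  "elementSelector": "p.category",
  "applyIfNoMatch": false,
  "extractionType": "text"
}, {
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productCategory",
  "elementSelector": "p.category",
  "applyIfNoMatch": true,
  "processMode": "constant",
  "value": "Default category"
}]
""")

def rule_fires(rule, selector_matched):
    # applyIfNoMatch defaults to false: the rule runs when the selector matches.
    return selector_matched != rule.get("applyIfNoMatch", False)

# Document with a <p class="category"> element: only the extracting rule runs.
with_category = [rule_fires(rule, True) for rule in rules]
# Document without one: only the constant fallback runs.
without_category = [rule_fires(rule, False) for rule in rules]
print(with_category, without_category)
```

Under this model exactly one of the two rules applies to any given product page, so every document ends up with a productCategory value.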


extractionType

This indicates how the content should be extracted from the selected element. Possible values are:

  • text: The textual content of the matching element will be extracted. If the element contains HTML, all the tags are stripped.

  • html: The raw HTML content of the matching element will be extracted.

  • attr: The value of an attribute of the matching element will be extracted. In this mode, attributeName must be provided.

For example, with the following HTML fragment:

<div class="product" data-sku="1234">
  <h1>Product title</h1>
  <p>Product description</p>
</div>

And the selector div.product:

  • text would extract the content Product title Product description

  • html would extract the content <h1>Product title</h1> <p>Product description</p>

  • attr, with attributeName: "data-sku", would extract 1234
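
The text and attr results above can be reproduced with a rough sketch based on Python's html.parser. This is only an approximation of what a CSS-selector engine does: the ProductExtractor class is hypothetical, it hard-codes the div.product selector, and it does not handle void elements such as <br>.

```python
from html.parser import HTMLParser

class ProductExtractor(HTMLParser):
    """Collects text and data-sku from <div class="product"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside the selected element
        self.text_parts = []  # text nodes found inside the selected element
        self.sku = None       # value of the data-sku attribute

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            classes = (attrs.get("class") or "").split()
            if tag == "div" and "product" in classes:
                self.depth = 1
                self.sku = attrs.get("data-sku")
        else:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits inside the selected element.
        if self.depth > 0 and data.strip():
            self.text_parts.append(data.strip())

fragment = """<div class="product" data-sku="1234">
  <h1>Product title</h1>
  <p>Product description</p>
</div>"""

parser = ProductExtractor()
parser.feed(fragment)
parser.close()

text_value = " ".join(parser.text_parts)  # corresponds to extractionType "text"
attr_value = parser.sku                   # corresponds to "attr" with data-sku
print(text_value)
print(attr_value)
```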

Similarly, with the following XML fragment:

<?xml version="1.0" encoding="UTF-8"?>
<documents>
    <document>
        <FirstName>Janet</FirstName>
        <Title description="sales">Sales Representative</Title>
        <foo>
            <bar>More content</bar>
        </foo>
    </document>
</documents>

  • Selector documents > document > FirstName with extractionType: text would extract the content Janet

  • Selector documents > document > foo with extractionType: html would extract the array of metadata ["More content", "</bar>", "<bar>"]

  • Selector documents > document > Title with extractionType: attr and attributeName: "description" would extract sales
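
The text and attr cases for XML can be approximated with Python's xml.etree.ElementTree. This is a sketch under assumptions: ElementTree paths stand in for the CSS child selectors used by the filter, and the sample document is a minimal structure consistent with the selectors above rather than a verbatim copy of any real file.

```python
import xml.etree.ElementTree as ET

# Assumed minimal document matching the selectors documents > document > ...
xml_text = """<documents>
    <document>
        <FirstName>Janet</FirstName>
        <Title description="sales">Sales Representative</Title>
    </document>
</documents>"""

root = ET.fromstring(xml_text)  # root is the <documents> element

# Rough equivalent of "documents > document > FirstName" with extractionType text.
first_names = [el.text for el in root.findall("./document/FirstName")]

# Rough equivalent of "documents > document > Title" with extractionType attr
# and attributeName "description".
title_descriptions = [
    el.get("description") for el in root.findall("./document/Title")
]

print(first_names)
print(title_descriptions)
```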


attributeName

This specifies the name of the attribute to extract the value from when extractionType is set to attr. See the example in the previous section for details.
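
Because attributeName is required whenever extractionType is attr, a small pre-deployment check of the configuration can catch a missing value. The sketch below is a hypothetical helper, not part of the product; check_rules and its message format are invented for illustration.

```python
import json

def check_rules(rules):
    """Return a list of human-readable problems found in the rules."""
    problems = []
    for index, rule in enumerate(rules):
        needs_attribute = rule.get("extractionType") == "attr"
        if needs_attribute and not rule.get("attributeName"):
            problems.append(
                f"rule {index}: extractionType 'attr' needs attributeName"
            )
    return problems

# The second rule deliberately omits attributeName.
rules = json.loads(r"""
[{
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productSku",
  "elementSelector": "div.product p.sku",
  "extractionType": "attr",
  "attributeName": "data-sku"
}, {
  "urlRegex": "http://example\\.org/products/",
  "metadataName": "productSku",
  "elementSelector": "div.product p.sku",
  "extractionType": "attr"
}]
""")

problems = check_rules(rules)
print(problems)
```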

processMode and value

This indicates how the extracted content is processed. Possible values are:

  • regex: Apply a regular expression to the extracted content. The regular expression must contain a capture group (using ()), and each match will be injected as a separate metadata value.

  • constant: Return a hard-coded string.

This setting works in conjunction with value which indicates either the regular expression to apply, or the hard coded value to use.

This setting is optional. If it’s not set, the complete extracted content is retained as-is.

For example, with the HTML fragment:

<div class="info">
  <div class="author-name">William Shakespeare</div>
</div>

And the rule:

{
  "urlRegex": "http://example\\.org/",
  "metadataName": "author",
  "elementSelector": "",
  "applyIfNoMatch": false,
  "extractionType": "text",
  "processMode": "regex",
  "value": "(\\S+)"
}

would result in two metadata values being injected, author=William and author=Shakespeare, because the regular expression (\\S+) yields two matches.

If processMode were set to constant and value to Yeats, a single metadata author=Yeats would have been injected.
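
The three outcomes described in this section (regex, constant, and no processMode) can be modelled with a short sketch. Python's re module is used as a stand-in for the filter's regular expression engine, and the process function is a hypothetical model of the documented behaviour rather than the actual implementation.

```python
import re

def process(extracted, process_mode=None, value=None):
    """Model of processMode: return the list of metadata values to inject."""
    if process_mode == "regex":
        # One value per match, taken from the first capture group.
        return [match.group(1) for match in re.finditer(value, extracted)]
    if process_mode == "constant":
        # The extracted content is ignored and the hard-coded value is used.
        return [value]
    # No processMode: the extracted content is kept as-is.
    return [extracted]

extracted = "William Shakespeare"
regex_values = process(extracted, "regex", r"(\S+)")
constant_values = process(extracted, "constant", "Yeats")
unprocessed_values = process(extracted)
print(regex_values, constant_values, unprocessed_values)
```

Note that in the JSON file the pattern is written as (\\S+) because of JSON escaping, whereas the raw Python string above contains the single-backslash form the regular expression engine actually sees.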


description

This attribute is used to add a comment to the rule. It is optional and is not used when applying the rule.


If you are using built-in XML splitting (configured using the XML processing settings), any element selectors you supply when configuring this filter must use the original (un-split) XML structure. This is because the built-in XML splitter runs at index time, so this filter is applied to the un-split XML document.

If you are using a filter to split your XML file (the metadata scraper filter runs after your XML split filter) then you must use element selectors that reflect the structure of the split XML file.

© 2015- Squiz Pty Ltd