Using content auditor to report on custom business rules

Use the metadata scraper’s content checking rules to analyze page content for custom attributes and display the information in the content auditor report.

The built-in metadata scraper filters (XmlMetadataScraper, MetadataScraper), which extract metadata from HTML or XML documents, can be used to define a set of content checking rules that generate additional metadata attached to your documents. This metadata can then be presented in the content auditor reports.

This allows basic rule-based content checking, covering content length and element existence, and makes it easy to check rules such as:

  • <title> tag value must end with | My Site

  • <title> tag value should match <h1> heading tag value

  • <title> tag value should match metatag DCTERMS.title value

  • <title> tag value should match the last item in breadcrumb

  • Current page in navigation should match <h1> heading tag value

  • Meta tag description value should match metatag DCTERMS.description value

  • Meta tag DCTERMS.identifier should not contain
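To make the first two rules concrete, here is a minimal Python sketch of the comparisons they describe. It is a standalone illustration using the standard library's html.parser, not the metadata scraper's implementation. The function name check_page and the default suffix argument are assumptions made for this example.

```python
from html.parser import HTMLParser


class TitleAndHeadingParser(HTMLParser):
    """Collects the text of the first <title> and the first <h1> in a page."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag whose text is currently being captured
        self.text = {"title": None, "h1": None}

    def handle_starttag(self, tag, attrs):
        # Only capture the first occurrence of each tag of interest.
        if tag in self.text and self.text[tag] is None:
            self._current = tag
            self.text[tag] = ""

    def handle_data(self, data):
        if self._current:
            self.text[self._current] += data

    def handle_endtag(self, tag):
        if tag == self._current:
            self.text[tag] = self.text[tag].strip()
            self._current = None


def check_page(html, site_suffix="| My Site"):
    """Evaluate two example rules against a page (hypothetical helper)."""
    parser = TitleAndHeadingParser()
    parser.feed(html)
    title = parser.text["title"] or ""
    h1 = parser.text["h1"] or ""
    return {
        "title_ends_with_suffix": title.endswith(site_suffix),
        "title_matches_h1": title == h1,
    }


page = (
    "<html><head><title>About us | My Site</title></head>"
    "<body><h1>About us</h1></body></html>"
)
print(check_page(page))
```

For this sample page the title passes the suffix rule but is not an exact match for the heading, which shows how the two rules report independently of each other.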

Basic steps

Step 1: Enable the metadata scraper filter

Enable the metadata scraper. The exact configuration will depend on whether or not you are writing rules to analyze HTML or XML documents.

Step 2: Define metadata scraper rules

Use the different metadata scraper rule types to construct a set of metadata scraper rules.

Step 3: Run a full update of your data source

A full update is required because you need to ensure all content is regathered and passed through the metadata scraper filter.

Example: Identify HTML documents that contain forms

This example shows how to configure the content auditor to report on HTML documents that contain forms, excluding the search box.
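Before changing any configuration, it can help to preview locally which pages would count as containing a form once the search box is set aside. The Python sketch below is independent of Funnelback and is not how the metadata scraper works internally. It assumes the search box form can be recognised by an id attribute; the id value search-form and the function name count_content_forms are hypothetical and should be adapted to your site's markup.

```python
from html.parser import HTMLParser


class FormCounter(HTMLParser):
    """Counts <form> elements, skipping the one identified as the search box."""

    def __init__(self, search_form_id="search-form"):
        super().__init__()
        # Assumption: the search box form carries this id on your pages.
        self.search_form_id = search_form_id
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form" and dict(attrs).get("id") != self.search_form_id:
            self.count += 1


def count_content_forms(html, search_form_id="search-form"):
    """Return the number of forms on the page other than the search box."""
    counter = FormCounter(search_form_id)
    counter.feed(html)
    return counter.count


page = (
    '<form id="search-form"><input name="query"></form>'
    '<form action="/contact"><input name="email"></form>'
)
print(count_content_forms(page))
```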

Step 1: Enable the metadata scraper

Because we are processing HTML documents, we need to set the following data source configuration keys:

  1. Ensure that the JsoupProcessingFilterProvider filter is included in your filter.classes string.

  2. Ensure that MetadataScraper is included in the filter.jsoup.classes.
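The two checks above can be sketched as a small Python helper that inspects configuration values you have copied out of the data source. This is an illustrative sketch only: the helper name missing_scraper_settings is hypothetical, the sample filter chain is made up for the example, and it assumes filter names are separated by commas or colons.

```python
import re

# Filter names taken from the two steps above, keyed by configuration key.
REQUIRED = {
    "filter.classes": "JsoupProcessingFilterProvider",
    "filter.jsoup.classes": "MetadataScraper",
}


def missing_scraper_settings(config):
    """Return a message for each required filter name absent from config."""
    problems = []
    for key, filter_name in REQUIRED.items():
        # Assumption: names are separated by commas or colons.
        # Whole names are compared (case-insensitively) so that
        # XmlMetadataScraper is not mistaken for MetadataScraper.
        names = [n.strip().lower() for n in re.split(r"[,:]", config.get(key, ""))]
        if filter_name.lower() not in names:
            problems.append(f"{key} is missing {filter_name}")
    return problems


# Illustrative values only; use the values from your own data source.
config = {
    "filter.classes": "TikaFilterProvider:JsoupProcessingFilterProvider",
    "filter.jsoup.classes": "XmlMetadataScraper",
}
print(missing_scraper_settings(config))
```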

Step 2: Update the collection configuration

Create a metadata scraper rule to detect the presence of a <form> tag. The corresponding metadata mapping is added in the next step.

  {
    "urlRegex": "http://check\\.content\\.existence//",
    "metadataName": "X-FUNNELBACK-CHECK-FORM",
    "elementSelector": "form",
    "extractionType": "text",
    "name": "Form count",
    "description": "Detects the presence of a form tag, also produces a count of forms detected within the page."
  }

Step 3: Update the data source metadata mappings

Update the metadata mappings for your data source to add a mapping for the metadata field produced by the custom rule.

Add a new mapping:

  • Class: hasform

  • Type: text

  • Search behavior: display only

  • Sources: X-FUNNELBACK-CHECK-FORM - Header or <meta> field

Step 4: Update the data source

Run a full update of the data source by selecting full update from the advanced update options for the data source. A full update is required because a new filter has been added to the data source configuration.

Step 5: Check the search results to ensure the custom metadata has been added

After the update completes, check that the custom metadata has been added correctly, and fix the metadata scraper rule as required.

There are two ways to check for the extra metadata:

  1. Add the custom metadata field (hasform) to the summary fields (-SF query processor option) on the results page where you are running the search. Then run a search and inspect the JSON or HTML output, looking for a hasform entry in the listMetadata element of each search result. The field should have a value of true or false, indicating whether the page includes a form other than the search box.

  2. View the HTML source of the cached version of a search result and check that the metadata field is present in the HTML (e.g. <meta name="X-FUNNELBACK-CHECK-FORM" … />).
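The first check can be partly automated once you have saved the JSON output of a search. The Python sketch below lists results that are missing the hasform field. It assumes the results sit under response.resultPacket.results and that each result has liveUrl and listMetadata entries; verify this against your own output. The helper name results_missing_field and the sample data are hypothetical.

```python
def results_missing_field(search_json, field="hasform"):
    """Return the URLs of results whose listMetadata lacks the given field."""
    # Assumption: results are located at response.resultPacket.results.
    results = search_json["response"]["resultPacket"]["results"]
    return [
        result.get("liveUrl")
        for result in results
        if not result.get("listMetadata", {}).get(field)
    ]


# Made-up sample shaped like the assumed search output.
sample = {
    "response": {
        "resultPacket": {
            "results": [
                {
                    "liveUrl": "https://example.com/contact",
                    "listMetadata": {"hasform": ["true"]},
                },
                {"liveUrl": "https://example.com/about", "listMetadata": {}},
            ]
        }
    }
}
print(results_missing_field(sample))
```

An empty list suggests the field is being returned for every result on that page of results; any URLs listed are worth checking against the scraper rule and the metadata mapping.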

Step 6: Update the content auditor configuration

Update the content auditor configuration so that it reports on the custom metadata field. The following can be added to the results page configuration and adds a Has web forms item to the overview and search results tabs in content auditor. The additional items should be present in the content auditor report immediately after saving the changes.

ui.modern.content-auditor.facet-metadata.hasform=Has web forms
ui.modern.content-auditor.display-metadata.hasform=Has web forms
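If you report on several custom metadata classes, the same pair of keys is repeated for each class. As a convenience, the following Python sketch generates both lines from a metadata class and a label, following the pattern of the two keys above. The helper name content_auditor_keys is hypothetical and not part of the product.

```python
def content_auditor_keys(metadata_class, label):
    """Build the facet and display configuration lines for one metadata class."""
    prefix = "ui.modern.content-auditor"
    return [
        f"{prefix}.facet-metadata.{metadata_class}={label}",
        f"{prefix}.display-metadata.{metadata_class}={label}",
    ]


for line in content_auditor_keys("hasform", "Has web forms"):
    print(line)
```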