Using content auditor to report on custom business rules
Use the metadata scraper’s content checking rules to analyze page content for custom attributes and display the information in the content auditor report.
The built-in metadata scraper filters (XmlMetadataScraper, MetadataScraper), which extract metadata from HTML or XML documents, can be used to define a set of content checking rules. These rules generate additional metadata that is attached to your documents and can then be presented in the content auditor reports.
This allows basic rule-based content checking, covering content length and element existence, for rules such as:
- <title> tag value must end with | My Site
- <title> tag value should match <h1> heading tag value
- <title> tag value should match meta tag DCTERMS.title value
- <title> tag value should match the last item in breadcrumb
- Current page in navigation should match <h1> heading tag value
- Meta tag description value should match meta tag DCTERMS.description value
- Meta tag DCTERMS.identifier should not contain www.example.com
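To make the intent of such rules concrete, the following is a minimal Python sketch of the first two checks in the list. It is only an illustration of what the rules test and is not how the metadata scraper is implemented or configured. The names TitleAndH1Parser and check_page are hypothetical, and stripping the site suffix before comparing the title with the heading is an assumption made for this sketch.

```python
from html.parser import HTMLParser


class TitleAndH1Parser(HTMLParser):
    """Collects the text of the first <title> and the first <h1> element."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag currently being captured, if any
        self.text = {"title": None, "h1": None}

    def handle_starttag(self, tag, attrs):
        # Only capture the first occurrence of each tag of interest.
        if tag in self.text and self.text[tag] is None:
            self._current = tag
            self.text[tag] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.text[self._current] += data

    def handle_endtag(self, tag):
        if tag == self._current:
            self.text[tag] = self.text[tag].strip()
            self._current = None


def check_page(html, required_suffix="| My Site"):
    """Hypothetical helper: evaluates two of the example rules for one page."""
    parser = TitleAndH1Parser()
    parser.feed(html)
    parser.close()
    title = parser.text["title"] or ""
    h1 = parser.text["h1"] or ""
    # Assumption: the site suffix is ignored when comparing title and heading.
    bare_title = title.removesuffix(required_suffix).strip()
    return {
        "title_ends_with_suffix": title.endswith(required_suffix),
        "title_matches_h1": bare_title == h1,
    }
```

For example, a page with the title "About us | My Site" and the heading "About us" passes both checks, while a page titled "Home" with the heading "Welcome" fails both.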
Basic steps
- Step 1: Enable the metadata scraper filter. The exact configuration depends on whether you are writing rules to analyze HTML or XML documents.
- Step 2: Define metadata scraper rules. Use the different metadata scraper rule types to construct a set of rules.
- Step 3: Run a full update of your data source. A full update is required to ensure that all content is regathered and passed through the metadata scraper filter.
Example: Identify HTML documents that contain forms
This example shows how to configure the content auditor to report on HTML documents that contain forms, excluding the search box.
Step 1: Enable the metadata scraper
Because we are processing HTML documents we need to set the following data source configuration keys.
- Ensure that the JsoupProcessingFilterProvider filter is included in your filter.classes string.
- Ensure that MetadataScraper is included in filter.jsoup.classes.
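As a quick sanity check, the two conditions above can be tested against an exported copy of the configuration. The Python sketch below assumes the configuration is available as plain key=value lines and that class names are separated by ":" or ","; both are assumptions, and the helper names parse_config and scraper_enabled are hypothetical. The comparison is case-insensitive so that differences in capitalization of the class names do not cause a false negative.

```python
import re


def parse_config(text):
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def scraper_enabled(config):
    """Return True if both filter keys include the classes named above."""

    def names(key):
        # Assumption: class names are separated by ':' or ','.
        parts = re.split(r"[:,]", config.get(key, ""))
        return {part.strip().lower() for part in parts if part.strip()}

    return (
        "jsoupprocessingfilterprovider" in names("filter.classes")
        and "metadatascraper" in names("filter.jsoup.classes")
    )
```

A usage example, where ExampleFilterA and ExampleJsoupFilter are placeholder names rather than real filters:

```python
sample = """
filter.classes=ExampleFilterA:JsoupProcessingFilterProvider
filter.jsoup.classes=ExampleJsoupFilter,MetadataScraper
"""
print(scraper_enabled(parse_config(sample)))
```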
Step 2: Update the collection configuration
Create a metadata scraper rule to detect the presence of a <form> tag and add corresponding metadata mappings.
[{
  "urlRegex": "http://check\\.content\\.existence//",
  "metadataName": "X-FUNNELBACK-CHECK-FORM",
  "elementSelector": "form",
  "extractionType": "text",
  "name": "Form count",
  "checkType": "ELEMENT_EXISTENCE",
  "extractValue": true,
  "description": "Detects the presence of a form tag, also produces a count of forms detected within the page."
}]
Step 3: Update the data source metadata mappings
Update the metadata mappings for your data source to add a mapping for the metadata field produced by the custom rule.
Add a new mapping:
- Class: hasform
- Type: text
- Search behavior: display only
- Sources: X-FUNNELBACK-CHECK-FORM - Header or <meta> field
Step 4: Update the data source
Run a full update of the data source by selecting full update from the advanced update options for the data source. A full update is required because a new filter has been added to the data source configuration.
Step 5: Check the search results to ensure the custom metadata has been added
After the update completes, check whether the custom metadata has been added correctly, and fix the metadata scraper rule as required.
There are two ways to check for the extra metadata:
- Add the custom metadata field (hasform) to the summary fields (-SF query processor option) on the results page where you are running the search. Then run a search and inspect the JSON or HTML output, checking the listMetadata element of each search result for a hasform element. This should include the custom metadata field with a value of true or false, indicating whether the page includes a form other than the search box.
- View the HTML source of the cached version of a search result and confirm that the metadata field is present in the HTML (e.g. <meta name="X-FUNNELBACK-CHECK-FORM" … />).
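Both checks can be scripted once you have the data saved locally. The Python sketch below is one possible approach, not a Funnelback tool: find_meta scans saved cached-page HTML for the <meta> tag, and hasform_values reads the hasform entry from the listMetadata element of a list of search results that has already been parsed from the JSON output. The helper names, the use of a content attribute on the <meta> tag, and the sample values are assumptions.

```python
from html.parser import HTMLParser


class MetaFinder(HTMLParser):
    """Collects the content attribute of every <meta> tag with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name.lower()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are also routed here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == self.name:
            # Assumption: the scraped value is carried in the content attribute.
            self.values.append(attributes.get("content"))


def find_meta(html, name="X-FUNNELBACK-CHECK-FORM"):
    """Return the content values of all matching <meta> tags in the HTML."""
    finder = MetaFinder(name)
    finder.feed(html)
    finder.close()
    return finder.values


def hasform_values(results, field="hasform"):
    """Return the field's listMetadata entry for each result, or None if absent."""
    return [result.get("listMetadata", {}).get(field) for result in results]
```

An empty list from find_meta, or None entries from hasform_values, suggests that the rule did not match or that the metadata mapping is not yet in effect for those documents.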
Step 6: Update the content auditor configuration
Update the content auditor configuration so that it reports on the custom metadata field. The following can be added to the results page configuration and adds a Has web forms item to the overview and search results tabs in content auditor. The additional items should be present in the content auditor report immediately after saving the changes.
ui.modern.content-auditor.facet-metadata.hasform=Has web forms
ui.modern.content-auditor.display-metadata.hasform=Has web forms