Using content auditor to report on custom business rules

Introduction

This article provides an example of how to configure Funnelback to analyse page content for custom rules in the content auditor report.

A generic filter (CheckContent) that extends content auditor’s reporting capabilities is also available from GitHub. It supports basic rule-based content checking, covering content length and element existence, which makes it easy to check rules such as:
  • <title> tag value must end with | My Site

  • <title> tag value should match <h1> heading tag value

  • <title> tag value should match metatag DCTERMS.title value

  • <title> tag value should match the last item in breadcrumb

  • Current page in navigation should match <h1> heading tag value

  • Meta tag description value should match metatag DCTERMS.description value

  • Meta tag DCTERMS.identifier should not contain www.example.com
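
As a rough illustration of what such rules boil down to, the following minimal Groovy sketch evaluates two of them (title suffix and title/heading match) on plain strings. It is not the CheckContent filter and does not use its configuration; the helper name checkTitleRules and the metadata names custom.titlesuffix and custom.titlematchesh1 are hypothetical. Inside a Jsoup filter the inputs would typically come from calls such as doc.title() and doc.select("h1").text().

```groovy
// Hypothetical helper, not part of Funnelback or the CheckContent filter.
// Evaluates two of the rules listed above and returns a map of
// (assumed) metadata field name -> "true" / "false".
Map<String, String> checkTitleRules(String title, String h1, String requiredSuffix) {
    def cleanTitle = (title ?: "").trim()
    def cleanH1 = (h1 ?: "").trim()
    return [
        // Rule: <title> value must end with the site suffix, e.g. "| My Site"
        "custom.titlesuffix"   : cleanTitle.endsWith(requiredSuffix).toString(),
        // Rule: <title> value should match the <h1> heading value
        "custom.titlematchesh1": (cleanTitle == cleanH1).toString()
    ]
}

// Example: the title carries the site suffix, so it cannot equal the heading
def result = checkTitleRules("Contact us | My Site", "Contact us", "| My Site")
assert result["custom.titlesuffix"] == "true"
assert result["custom.titlematchesh1"] == "false"
```

Each entry in the returned map could then be injected as a metadata field in the same way as the custom.hasform field in the example later in this article.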

The content auditing GitHub repository also includes some additional content auditing filters for:

  • Language detection (DetectLang filter)

  • Mixed content detection (DetectMixedContent filter)

  • Linked RSS feed detection (DetectRSS filter)

  • Embedded Twitter-style hash tags and user mentions (SocialTags filter)

Example: Identify HTML documents that contain forms

This example shows how to configure the content auditor to report on HTML documents that contain forms, excluding the search box.

Steps

Write a Jsoup filter that identifies documents that contain forms, ignoring the search box.

To achieve this, the Jsoup filter finds all forms within the HTML document and checks each form’s action attribute. If any form’s action is anything other than the search URL, a metadata field with the value true is injected into the page to indicate that the document contains a form.

The same metadata field is injected with the value false into HTML documents that contain no forms (or only the search box), so that content auditor can report on all HTML documents, both with and without embedded forms.

Step 1: Write a Jsoup filter to detect the forms

Create the following Jsoup filter in the collection’s @groovy folder, with the following filename: $SEARCH_HOME/conf/COLLECTION/@groovy/com/funnelback/COLLECTION/DetectForms.groovy. Note: replace COLLECTION in the file path above and also the package name below with the id of the collection you are working on.

package com.funnelback.COLLECTION

import com.funnelback.common.filter.jsoup.*

/**
 * Flags pages containing forms whose action is anything other than the search box:
 *
 *  action="http://search.mysite.com/"
 */
@groovy.util.logging.Log4j2
public class DetectForms implements IJSoupFilter {

    @Override
    void processDocument(FilterContext context) {
        def doc = context.getDocument()
        def url = doc.baseUri()

        // Define a variable that indicates if any forms have been detected for this HTML document.
        def hasform = "false"

        try {

            // Detect all <form> elements
            doc.select("form").each() {
                // Extract the form's action value
                def action = it.attr("action")
                // If the action is not the search box then mark the document as having a form
                if (action != "http://search.mysite.com/") {
                    hasform = "true"
                }
            }

            // Inject a custom metadata field that indicates if the HTML document contains a non search box form
            context.additionalMetadata.put("custom.hasform", hasform)
            if (hasform == "true") {
                log.debug("Form detected for '{}'", url)
            }

        } catch (e) {
            log.error("Error detecting forms in '{}'", url, e)
        }
    }
}
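
The comparison in the filter above is an exact string match on the raw action attribute, which has some consequences worth checking before relying on the report. The following standalone Groovy sketch mirrors only that decision logic on a list of action values; the helper name hasNonSearchForm is hypothetical and the sample actions are made up for illustration.

```groovy
// Hypothetical helper mirroring the comparison used in DetectForms:
// a document counts as "having a form" if any form action differs
// from the search action (exact string comparison).
String hasNonSearchForm(List<String> actions, String searchAction) {
    return actions.any { it != searchAction }.toString()
}

def search = "http://search.mysite.com/"

// No forms at all, or only the search box: reported as "false"
assert hasNonSearchForm([], search) == "false"
assert hasNonSearchForm([search], search) == "false"

// Search box plus another form: reported as "true"
assert hasNonSearchForm([search, "/contact/submit"], search) == "true"

// Exact matching: a search form written without the trailing slash,
// or a form with an empty action, is also reported as "true"
assert hasNonSearchForm(["http://search.mysite.com"], search) == "true"
assert hasNonSearchForm([""], search) == "true"
```

If the site writes the search form’s action in more than one way (for example as a relative URL), the comparison in the filter may need to be loosened, for instance by resolving the action with Jsoup’s absUrl("action") before comparing. Whether that is needed depends on the site’s markup.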

Step 2: Update the collection configuration

Add the Jsoup filter to the collection’s Jsoup filter chain in collection.cfg (note: replace COLLECTION in the class name with the collection’s ID):

filter.jsoup.classes=ContentGeneratorUrlDetection,FleschKincaidGradeLevel,UndesirableText,TitleDuplicates,com.funnelback.COLLECTION.DetectForms

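Add a metadata mapping to the collection so that the injected custom.hasform metadata field is indexed. This can be mapped as a non-content text type metadata field (v15.14 or later) or a type 0 metadata field (v15.12 and earlier). The metadata class name used for the mapping must match the name referenced in the content auditor configuration in Step 5 (hasform in this example).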

Step 3: Update the collection

Run a full update of the collection by selecting full update from the advanced update options for the collection. A full update is required because a new filter has been added to the collection configuration.

Step 4: Check the search results to ensure the custom metadata has been added

After the update completes, check that the custom metadata has been added correctly and fix the filter if required.

There are two ways to check for the extra metadata:

  1. Add the custom metadata field to the collection’s summary fields (-SF query processor option), then run a search and inspect the JSON or HTML output, noting the metaData element for each search result. This should include the custom metadata field with a value of true or false, indicating whether the page includes a form other than the search box.

  2. View the HTML source of the cached version of a search result and check that the metadata field is present in the HTML (e.g. <meta name="custom.hasform" content="true" />).
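
The first check can also be scripted. The following Groovy sketch parses a trimmed-down, hypothetical JSON response and lists the results that have forms and those where the field is missing. The sample data, the URLs and the exact response layout (response.resultPacket.results, each with liveUrl and metaData) are assumptions for illustration, so compare them with the JSON returned by your own search before relying on this.

```groovy
import groovy.json.JsonSlurper

// Hypothetical, trimmed-down search response used only for illustration.
// The layout is an assumption: verify it against your own JSON output.
def json = '''
{
  "response": {
    "resultPacket": {
      "results": [
        {"liveUrl": "http://www.example.com/contact", "metaData": {"hasform": "true"}},
        {"liveUrl": "http://www.example.com/about",   "metaData": {"hasform": "false"}},
        {"liveUrl": "http://www.example.com/old",     "metaData": {}}
      ]
    }
  }
}
'''

def results = new JsonSlurper().parseText(json).response.resultPacket.results

// Results where the filter flagged a form other than the search box
def withForms = results.findAll { it.metaData?.hasform == "true" }.collect { it.liveUrl }

// Results where the field is absent, suggesting the filter or the
// metadata mapping did not apply to that document
def missing = results.findAll { !it.metaData?.containsKey("hasform") }.collect { it.liveUrl }

println "Pages with forms: ${withForms}"
println "Pages missing the field: ${missing}"
```

Every HTML document should carry the field with a value of true or false, so any entries in the second list are worth investigating before moving on to the content auditor configuration.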

Step 5: Update the content auditor configuration

Update the content auditor configuration so that it reports on the custom metadata field. The following can be added to collection.cfg and adds a Has web forms item to the overview and search results tabs in content auditor. The additional items should be present in the content auditor report immediately after saving the changes.

ui.modern.content-auditor.facet-metadata.hasform=Has web forms
ui.modern.content-auditor.display-metadata.hasform=Has web forms