Using content auditor to report on custom business rules
Introduction
This article provides an example of how to configure Funnelback to analyse page content for custom rules in the content auditor report.
A generic filter (CheckContent) that extends content auditor’s reporting capabilities is also available from GitHub. It allows some basic rule-based content checking covering content length and element existence, making it easy to check rules such as:
- <title> tag value must end with | My Site
- <title> tag value should match <h1> heading tag value
- <title> tag value should match metatag DCTERMS.title value
- <title> tag value should match the last item in breadcrumb
- Current page in navigation should match <h1> heading tag value
- Meta tag description value should match metatag DCTERMS.description value
- Meta tag DCTERMS.identifier should not contain www.example.com
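To make the kind of rule listed above concrete, the following is a minimal sketch of two such checks written directly against Jsoup. It is not the CheckContent filter's implementation or configuration; the helper names titleEndsWith and titleMatchesH1, the sample HTML and the Jsoup version in the @Grab line are assumptions made for illustration only.

```groovy
@Grab('org.jsoup:jsoup:1.15.3')
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical helper: true if the <title> text ends with the required suffix.
boolean titleEndsWith(Document doc, String suffix) {
    doc.title().trim().endsWith(suffix)
}

// Hypothetical helper: true if the <title> text equals the text of the first <h1>.
boolean titleMatchesH1(Document doc) {
    def h1 = doc.selectFirst("h1")
    h1 != null && doc.title().trim() == h1.text().trim()
}

// Sample page (made up for this sketch).
def html = '<html><head><title>Contact us | My Site</title></head>' +
           '<body><h1>Contact us</h1></body></html>'
def doc = Jsoup.parse(html)

println "Title ends with suffix: ${titleEndsWith(doc, '| My Site')}"
println "Title matches h1: ${titleMatchesH1(doc)}"
```

In a real filter, the result of each check would presumably be written to a metadata field so that content auditor can report on it, in the same way as the forms example later in this article.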
The content auditing GitHub repository also includes some additional content auditing filters for:
- Language detection (DetectLang filter)
- Mixed content detection (DetectMixedContent filter)
- Linked RSS feed detection (DetectRSS filter)
- Embedded Twitter-style hashtags and user mentions (SocialTags filter)
Example: Identify HTML documents that contain forms
This example shows how to configure the content auditor to report on HTML documents that contain forms, excluding the search box.
Steps
Write a Jsoup filter that identifies documents that contain the forms, ignoring the search box.
To achieve this, the Jsoup filter finds every form in the HTML document and checks the form’s action attribute. If the action does not call the search, a metadata field with a value of true is injected into the page, indicating that the document contains a form.
The same metadata field is injected with a value of false into HTML documents that contain no forms, or only the search box, so that content auditor can report on all HTML documents, both with and without embedded forms.
Step 1: Write a Jsoup filter to detect the forms
Create the following Jsoup filter in the collection’s @groovy folder, with the following filename: $SEARCH_HOME/conf/COLLECTION/@groovy/com/funnelback/COLLECTION/DetectForms.groovy
Note: replace COLLECTION in the file path above, and in the package name below, with the ID of the collection you are working on.
package com.funnelback.COLLECTION

import com.funnelback.common.filter.jsoup.*

/**
 * Flags pages containing forms that don't have the following actions:
 *
 * action="http://search.mysite.com/"
 */
@groovy.util.logging.Log4j2
public class DetectForms implements IJSoupFilter {

    @Override
    void processDocument(FilterContext context) {
        def doc = context.getDocument()
        def url = doc.baseUri()

        // Define a variable that indicates if any forms have been detected for this HTML document.
        def hasform = "false"

        try {
            // Detect all <form> elements
            doc.select("form").each() {
                // Extract the form's action value
                def action = it.attr("action")
                // If the action is not the search box then mark the document as having a form
                if (action != "http://search.mysite.com/") {
                    hasform = "true"
                }
            }

            // Inject a custom metadata field that indicates if the HTML document contains a non search box form
            context.additionalMetadata.put("custom.hasform", hasform)

            if (hasform == "true") {
                log.debug("Form detected for '{}'", url)
            }
        } catch (e) {
            log.error("Error scraping metadata from '{}'", url, e)
        }
    }
}
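Before deploying the filter, it can help to try the detection rule against a few HTML snippets outside Funnelback. The following is a minimal sketch that reproduces only the form-detection logic as a standalone Groovy script. The function name detectForm, the sample snippets and the Jsoup version in the @Grab line are assumptions for illustration; the script does not use the Funnelback filter API.

```groovy
@Grab('org.jsoup:jsoup:1.15.3')
import org.jsoup.Jsoup

// Hypothetical standalone version of the rule used by DetectForms:
// any <form> whose action is not the search URL marks the page as having a form.
String detectForm(String html, String searchAction = "http://search.mysite.com/") {
    def doc = Jsoup.parse(html)
    def found = doc.select("form").any { it.attr("action") != searchAction }
    found ? "true" : "false"
}

// Sample inputs (made up for this sketch).
def samples = [
    "search box only": '<form action="http://search.mysite.com/"><input name="query"></form>',
    "contact form"   : '<form action="/contact/submit"><input name="email"></form>',
    "no forms"       : '<p>Just some text</p>'
]

samples.each { label, html ->
    println "${label}: custom.hasform=${detectForm(html)}"
}
```

Note that, as in the filter, the comparison is an exact string match, so a search form whose action is written differently (for example as a relative URL) would be counted as a non-search form.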
Step 2: Update the collection configuration
Add the JSoup filter to the collection’s JSoup filter chain (note: update the COLLECTION value in the class name to the collection’s ID):
filter.jsoup.classes=ContentGeneratorUrlDetection,FleschKincaidGradeLevel,UndesirableText,TitleDuplicates,com.funnelback.COLLECTION.DetectForms
Add a metadata mapping to the collection for the injected custom.hasform metadata field. This can be mapped as a non-content text type metadata field (v15.14 or later) or a type 0 metadata field (v15.12 and earlier). Map it to a metadata class named hasform, which is the name the content auditor settings in step 5 refer to.
Step 3: Update the collection
Run a full update of the collection by selecting full update from the advanced update options for the collection. A full update is required because a new filter has been added to the collection configuration.
Step 4: Check the search results to ensure the custom metadata has been added
After the update completes, check that the custom metadata has been added correctly and fix the filter as required.
There are two ways to check for the extra metadata:
- Add the custom metadata field to the collection’s summary fields (-SF query processor option), then run a search and inspect the JSON or HTML output, noting the metaData element for each search result. This should include the custom metadata field with a value of true or false indicating whether the page includes a non-search-box form.
- View the HTML source of the cached version of a search result and observe that the metadata field is present in the HTML (e.g. <meta name="custom.hasform" content="true" />).
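For the first check, the summary field could be enabled with a query processor option such as -SF=[hasform], assuming the field was mapped to a metadata class named hasform. The sketch below shows one way to scan the JSON output for the field. The sample JSON is made up and trimmed to the few fields used here (response.resultPacket.results, liveUrl and metaData), and the function name hasformByUrl is an assumption for illustration.

```groovy
import groovy.json.JsonSlurper

// Made-up sample, trimmed to the fields this sketch reads; a real response contains many more.
def json = '''
{"response": {"resultPacket": {"results": [
  {"liveUrl": "http://www.example.com/contact", "metaData": {"hasform": "true"}},
  {"liveUrl": "http://www.example.com/about",   "metaData": {"hasform": "false"}},
  {"liveUrl": "http://www.example.com/old",     "metaData": {}}
]}}}
'''

// Hypothetical helper: maps each result URL to its hasform value,
// or to "missing" when the field is absent from the result's metaData.
Map<String, String> hasformByUrl(String jsonText) {
    def results = new JsonSlurper().parseText(jsonText).response.resultPacket.results
    results.collectEntries { [(it.liveUrl): (it.metaData?.hasform ?: "missing")] }
}

hasformByUrl(json).each { url, value -> println "${url}: ${value}" }
```

Results reported as missing suggest that the page was not processed by the filter, or that the metadata mapping or summary field setting needs checking.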
Step 5: Update the content auditor configuration
Update the content auditor configuration so that it reports on the custom metadata field. The following can be added to collection.cfg and adds a Has web forms item to the overview and search results tabs in content auditor. The additional items should be present in the content auditor report immediately after saving the changes.
ui.modern.content-auditor.facet-metadata.hasform=Has web forms
ui.modern.content-auditor.display-metadata.hasform=Has web forms