Content Auditor is a tool which allows a search administrator to rapidly gain an understanding of the content of a site or set of sites, with a particular focus on metadata.
You can access the Content Auditor from the marketing dashboard, or directly at:
Content Auditor is available for every Funnelback results page from the marketing dashboard, and by default provides an overview of some common metadata.
Content Auditor provides a range of sub-reports arranged in tabs, and options to navigate through your content by keywords (i.e. any valid Funnelback search query), by filtering the report based on URL prefixes, and by filtering based on any metadata value shown within the Content Auditor interface.
The first tab within Content Auditor, as shown above, provides a range of 'recommendation' reports which are associated with common content best practices. These reports are as follows:
The reading grade report measures how easy documents are to read, based on the Flesch-Kincaid grade level measure, which relates roughly to the education grade level required to understand the document.
While this measure is a heuristic rather than an exact measurement, it may be useful in ensuring that website content is written at an appropriate level.
The range of 'green' grade levels can be configured with the ui.modern.content-auditor.reading-grade.lower-ok-limit and ui.modern.content-auditor.reading-grade.upper-ok-limit parameters.
See also: Customize the reading grade chart
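As a rough illustration of the measure behind the reading grade report, the sketch below applies the standard Flesch-Kincaid grade level formula (0.39 × words per sentence + 11.8 × syllables per word − 15.59) to a piece of text. The syllable counter is a crude vowel-group heuristic, the function names are hypothetical, and the limit values in the usage lines are made up for illustration. This is not necessarily how Funnelback computes the figure internally.

```python
import re


def count_syllables(word: str) -> int:
    """Crude syllable estimate: number of vowel groups, at least one (assumption)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Standard Flesch-Kincaid grade level formula applied to plain text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def is_in_ok_range(grade: float, lower_ok_limit: float, upper_ok_limit: float) -> bool:
    """True when the grade falls inside the 'green' range (inclusive bounds assumed)."""
    return lower_ok_limit <= grade <= upper_ok_limit


# Hypothetical limits, standing in for the lower-ok-limit and upper-ok-limit settings.
grade = flesch_kincaid_grade("The cat sat on the mat.")
print(round(grade, 2), is_in_ok_range(grade, lower_ok_limit=6, upper_ok_limit=10))
```

Very short, simple sentences can produce a grade below zero, which is one reason the measure is best treated as a heuristic.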
The missing metadata report identifies documents for which no metadata of a given type occurs. This may be helpful in enforcing content policies requiring certain types of metadata to be available in all documents within certain areas.
The duplicate titles report identifies documents for which the given title is also used by other documents. Duplicated titles can make websites and search result pages less useful, since they lack sufficient context for a user to understand what page is being shown.
For this report to be used with documents which are not originally HTML or filtered to HTML (such as XML records), a copy of the title metadata to be considered must be mapped to the
The date modified report presents a chart of when documents were last modified, based on metadata within the documents, and hence may be helpful in identifying documents which should be updated or reviewed.
The age a document may reach before it is marked in red can be configured with the ui.modern.content-auditor.date-modified.ok-age-years setting.
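The effect of an age threshold like this can be sketched as follows. The function names, the 365.25-day year and the strict "older than" comparison are assumptions made for illustration, and the threshold of 5 years is a made-up value rather than a documented default.

```python
from datetime import date


def age_in_years(last_modified: date, today: date) -> float:
    """Approximate document age in years, using an average year length (assumption)."""
    return (today - last_modified).days / 365.25


def is_stale(last_modified: date, today: date, ok_age_years: float) -> bool:
    """True when the document is older than the allowed age and would be flagged."""
    return age_in_years(last_modified, today) > ok_age_years


today = date(2024, 1, 1)
print(is_stale(date(2015, 1, 1), today, ok_age_years=5))  # old document
print(is_stale(date(2023, 6, 1), today, ok_age_years=5))  # recent document
```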
The response time report provides a chart of the time taken by Funnelback’s web crawler to load each document, which may help to identify documents, sections or entire sites where response time is in need of improvement.
In its default configuration, the undesirable text report provides information on documents which contain common misspellings, which allows such typos to be rapidly found and corrected.
This report may be configured through the filter.jsoup.undesirable_text-source.* collection configuration setting, which allows for organization-specific lists of undesirable terms, such as outdated product names, to be included within the set to be identified.
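To make the idea of an organization-specific term list concrete, the sketch below checks a piece of text against a list of undesirable terms. It is only an illustration of the concept: the function name, the case-insensitive whole-word matching and the sample terms are assumptions, and it does not reproduce the format of the Funnelback setting or the behaviour of the underlying filter.

```python
import re


def find_undesirable_terms(text: str, terms: list[str]) -> list[str]:
    """Return the terms that occur in the text as whole words, ignoring case."""
    found = []
    for term in terms:
        # re.escape keeps characters such as '.' in product names literal.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


# Hypothetical list mixing a common misspelling with an outdated product name.
terms = ["recieve", "WidgetPro 1.0", "teh"]
print(find_undesirable_terms("We recieve mail about WidgetPro 1.0.", terms))
```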
The duplicate content report shows documents for which the content (or, if configured, some metadata) is duplicated by other documents. Duplicated content makes a site more difficult to navigate, and may also be penalized as a ranking factor by some search engines.
The ui.modern.content-auditor.collapsing-signature configuration parameter can be used to configure exactly what parts of documents are considered for duplication.
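Conceptually, signature-based duplicate detection groups documents whose signatures match. The sketch below is one simple way to do this, assuming the signature is a SHA-256 hash of whitespace-normalized, lower-cased text. The function names and the choice of hash are illustrative assumptions and do not describe how Funnelback builds its collapsing signatures.

```python
import hashlib
from collections import defaultdict


def content_signature(text: str) -> str:
    """Hash of the text after lower-casing and collapsing whitespace (assumption)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate_sets(documents: dict[str, str]) -> list[list[str]]:
    """Group URLs by signature and return only groups with more than one member."""
    groups = defaultdict(list)
    for url, text in documents.items():
        groups[content_signature(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical documents keyed by URL.
documents = {
    "https://example.com/a": "Hello   World",
    "https://example.com/b": "hello world",
    "https://example.com/c": "Something else entirely",
}
print(find_duplicate_sets(documents))
```

Changing what goes into the signature (for example, hashing a title field instead of the body) changes which documents count as duplicates, which is the kind of decision the configuration parameter above controls.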
The overview tab, shown below, provides a snapshot of the top metadata within each configured facet of the collection, showing the four most common entries for each. Each category provides a link which can be used to 'drill down', allowing content audit reports to be created for chosen subsets of content. The example below shows a number of facets for a simple example collection.
From the overview page, you can navigate to the attributes tab which provides a complete list of metadata values found in each facet, and estimates of the count of matching documents. Again, clicking on one of the values in the list will restrict subsequent reports to documents containing that metadata value.
The third Content Auditor tab provides a list of currently matching search results, with easy links through to various Funnelback tools, as well as to CSV exports of the result list.
The final tab, shown in the example above with the number 15 beside it, shows any sets of duplicate content which were encountered within the collection, and allows this duplicate content to be shown as a result list.
Note also that the search box at the top of the Content Auditor interface allows auditing reports to be restricted based on any Funnelback query, in addition to the drill-down options.
Once a facet category has been selected, the constraints applied are displayed by Content Auditor as in the image below.
The small 'x' links to the right of each constraint allow that constraint to be cleared if needed.
Content Auditor can be configured in a number of ways to provide relevant reports for different data sets. Most customizations are made by setting results page configuration parameter keys.
The common customizations are:
The following is a full set of content auditor results page parameter keys:
Specify how results are determined to be duplicates within content auditor.
Specify how many results are examined in creating content auditor reports.
Specify which metadata classes should be displayed within the content auditor’s search results tab.
Specify which metadata classes should be displayed as facets.
Specify how many categories are displayed within each facet shown in content auditor.
Specify how many results are displayed in the content auditor search results tab.
Specify how many facet categories are displayed for each facet.
Specify which content auditor facets will be displayed in the content auditor panel of the marketing dashboard.
Further customizations can be implemented using search lifecycle plugins that target the Content Auditor search type.
The following is a full set of content auditor data source parameter keys:
Specify sources of undesirable text strings to detect and present within content auditor.
Define how deep into URLs Content Auditor users can navigate using facets.
Define how many years old a document may be before it is considered problematic.
Define how many results should be considered in detecting duplicates for Content Auditor.
Define the reading grade below which documents are considered problematic.
Define the reading grade above which documents are considered problematic.