Content auditor

Introduction

Content Auditor is a tool which allows a search administrator to rapidly gain an understanding of the content of a site or set of sites, with a particular focus on metadata.

Accessing Content Auditor

You can access the Content Auditor from the marketing dashboard, or directly at:

/s/content-auditor.html?collection=(COLLECTION_ID)
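As a minimal sketch, the direct address above can be assembled programmatically. The host name and collection ID used here are hypothetical placeholders, not values taken from this documentation.

```python
from urllib.parse import urlencode, urljoin


def content_auditor_url(base_url: str, collection_id: str) -> str:
    """Build the direct Content Auditor URL for a collection."""
    # The collection ID is URL-encoded so that unusual characters are safe.
    query = urlencode({"collection": collection_id})
    return urljoin(base_url, "/s/content-auditor.html") + "?" + query


# Hypothetical server and collection, for illustration only.
print(content_auditor_url("https://search.example.com", "example-collection"))
```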

Using Content Auditor

Content Auditor is available for every Funnelback results page from the marketing dashboard, and by default provides an overview of some common metadata.

Content Auditor provides a range of sub-reports arranged in tabs, and options to navigate through your content by keywords (i.e. any valid Funnelback search query), by filtering the report based on URL prefixes, and by filtering based on any metadata value shown within the Content Auditor interface.

Content-auditor-recommendations.png

The first tab within Content Auditor, as shown above, provides a range of 'recommendation' reports which are associated with common content best practices. These reports are as follows:

Reading grade

The reading grade report measures how easy documents are to read, based on the Flesch-Kincaid grade level measure, which relates roughly to the education grade level required to understand the document.

While this measure is a heuristic rather than an exact measurement, it may be useful in ensuring that website content is written at an appropriate level.

The range of 'green' grade levels can be configured with the ui.modern.content-auditor.reading-grade.lower-ok-limit and ui.modern.content-auditor.reading-grade.upper-ok-limit parameters.
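To illustrate how the grade and the 'green' range fit together, the following sketch applies the standard Flesch-Kincaid grade level formula to pre-computed word, sentence and syllable counts, then checks the result against a lower and upper limit. The limit values shown are hypothetical, and treating the limits as inclusive is an assumption; this is not Funnelback's own implementation.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade level from raw text counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def grade_is_ok(grade: float, lower_ok_limit: float, upper_ok_limit: float) -> bool:
    """True when the grade falls inside the 'green' range (assumed inclusive)."""
    return lower_ok_limit <= grade <= upper_ok_limit


# Hypothetical document: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)

# Hypothetical limits standing in for lower-ok-limit and upper-ok-limit.
print(round(grade, 2), grade_is_ok(grade, lower_ok_limit=6, upper_ok_limit=10))
```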

Missing metadata

The missing metadata report identifies documents for which no metadata of a given type occurs. This may be helpful in enforcing content policies requiring certain types of metadata to be available in all documents within certain areas.
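The idea behind this report can be sketched as a simple filter over documents. The document structure and the metadata class names below are hypothetical, chosen only to show the check; the report itself is produced by Funnelback from the index.

```python
def missing_metadata(documents, required_class):
    """Return the URLs of documents with no value for the given metadata class."""
    return [
        doc["url"]
        for doc in documents
        # A missing key, an empty string and an empty list all count as missing.
        if not doc.get("metadata", {}).get(required_class)
    ]


# Hypothetical documents and metadata classes.
docs = [
    {"url": "/about", "metadata": {"author": "J. Smith", "description": "About us"}},
    {"url": "/contact", "metadata": {"description": "Contact details"}},
    {"url": "/news", "metadata": {"author": ""}},
]

print(missing_metadata(docs, "author"))
```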

Duplicate titles

The duplicate titles report identifies documents for which the given title is also used by other documents. Duplicated titles can make websites and search result pages less useful, since they lack sufficient context for a user to understand what page is being shown.

For this report to be used with documents which are not originally HTML or filtered to HTML (such as XML records), a copy of the title metadata to be considered must be mapped to the FunDuplicateTitle metadata class.
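Conceptually, duplicate title detection amounts to grouping documents by title and keeping the groups with more than one member. The sketch below assumes titles are compared after trimming whitespace and ignoring case; that normalization is an assumption for illustration and may differ from how Funnelback compares titles.

```python
from collections import defaultdict


def duplicate_titles(documents):
    """Map each repeated title to the list of URLs that share it."""
    by_title = defaultdict(list)
    for url, title in documents:
        # Assumed normalization: trim whitespace and ignore case.
        by_title[title.strip().lower()].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}


# Hypothetical (url, title) pairs.
pages = [
    ("/products/a", "Product overview"),
    ("/products/b", "Product Overview "),
    ("/contact", "Contact us"),
]

print(duplicate_titles(pages))
```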

Date modified

The date modified report presents a chart of when documents were last modified, based on metadata within the documents, and hence may be helpful in identifying documents which should be updated or reviewed.

The allowable document age before it is marked in red can be configured with the ui.modern.content-auditor.date-modified.ok-age-years setting.
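The age check can be pictured as a comparison against a cutoff date. In this sketch the threshold of three years and the sample dates are hypothetical, and the exact boundary behaviour (a document modified exactly on the cutoff counts as acceptable) is an assumption.

```python
from datetime import date


def is_stale(last_modified: date, today: date, ok_age_years: int) -> bool:
    """True when a document is older than the allowed age in years."""
    try:
        cutoff = today.replace(year=today.year - ok_age_years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to the 28th.
        cutoff = today.replace(year=today.year - ok_age_years, day=28)
    return last_modified < cutoff


# Hypothetical threshold standing in for ok-age-years.
print(is_stale(date(2019, 1, 1), today=date(2024, 6, 1), ok_age_years=3))
print(is_stale(date(2023, 1, 1), today=date(2024, 6, 1), ok_age_years=3))
```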

Response time

The response time report provides a chart of the time taken by Funnelback’s web crawler to load each document, which may help to identify documents, sections or entire sites where response time is in need of improvement.

Undesirable text

In its default configuration, the undesirable text report provides information on documents which contain common misspellings, which allows such typos to be rapidly found and corrected.

This report may be configured through the filter.jsoup.undesirable_text-source.* collection configuration setting, which allows for organization-specific lists of undesirable terms, such as outdated product names, to be included within the set to be identified.
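As a rough illustration of what an undesirable text check does, the following sketch counts whole-word, case-insensitive occurrences of each term in a piece of text. The term list and matching rules are assumptions for illustration; the actual detection is performed by Funnelback's filter at index time and may match differently.

```python
import re


def find_undesirable_terms(text, terms):
    """Count whole-word, case-insensitive occurrences of each term in the text."""
    found = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        count = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if count:
            found[term] = count
    return found


# Hypothetical term list mixing misspellings with an outdated product name.
terms = ["recieve", "seperately", "OldProduct 2000"]
text = "You will recieve OldProduct 2000 seperately. Recieve it by post."

print(find_undesirable_terms(text, terms))
```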

Duplicate content

The duplicate content report shows documents for which the content (or, if configured, some metadata) is duplicated by other documents. Duplicated content makes a site more difficult to navigate, and may also be penalized as a ranking factor by some search engines.

The ui.modern.content-auditor.collapsing-signature configuration parameter can be used to configure exactly what parts of documents are considered for duplication.
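The general idea of signature-based duplicate detection can be sketched as follows: derive a signature from the parts of each document that matter, then group documents that share a signature. The whitespace and case normalization and the use of SHA-256 here are assumptions for illustration only; this is not the format or algorithm used by the collapsing-signature setting.

```python
import hashlib
from collections import defaultdict


def content_signature(text):
    """Hash the text after collapsing whitespace and lowercasing (assumed normalization)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def duplicate_groups(documents):
    """Group URLs whose content produces the same signature."""
    groups = defaultdict(list)
    for url, text in documents.items():
        groups[content_signature(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical documents keyed by URL.
docs = {
    "/terms": "Terms and   conditions apply.",
    "/legal/terms": "terms and conditions apply.",
    "/privacy": "We respect your privacy.",
}

print(duplicate_groups(docs))
```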

Other Content Auditor reports

The overview tab, shown below, provides a snapshot of the top metadata within each configured facet of the collection, showing the four most common entries for each. Each category provides a link which can be used to 'drill down', allowing content audit reports to be created for chosen subsets of content. The example below shows a number of facets for a simple example collection.

Content-auditor-overview.png

From the overview page, you can navigate to the attributes tab which provides a complete list of metadata values found in each facet, and estimates of the count of matching documents. Again, clicking on one of the values in the list will restrict subsequent reports to documents containing that metadata value.

Content-auditor-custom-report.png

The third Content Auditor tab provides a list of currently matching search results, with easy links through to various Funnelback tools, as well as to CSV exports of the result list.

The final tab, shown in the example above with the number 15 beside it, shows any sets of duplicate content which were encountered within the collection, and allows this duplicate content to be shown as a result list.

Note also that the search box at the top of the Content Auditor interface allows auditing reports to be restricted based on any Funnelback query, in addition to the drill-down options.

Once a facet category has been selected, the constraints applied are displayed by Content Auditor as in the image below.

Content-auditor-clear-filters.png

The small 'x' links to the right of each constraint allow that constraint to be cleared if needed.

Configuring Content Auditor

Content Auditor can be configured in a number of ways to provide relevant reports for different data sets. Most customizations are made by setting results page configuration parameter keys.

The common customizations are:

The following is a full set of content auditor results page parameter keys:

ui.modern.content-auditor.collapsing-signature

Specify how results are determined to be duplicates within content auditor.

ui.modern.content-auditor.daat_limit

Specify how many results are examined in creating content auditor reports.

ui.modern.content-auditor.display-metadata.[metadataName]

Specify which metadata classes should be displayed within the content auditor’s search results tab.

ui.modern.content-auditor.facet-metadata.[metadata]

Specify which metadata classes should be displayed as facets.

ui.modern.content-auditor.max-metadata-facet-categories

Specify how many categories are displayed within each facet shown in content auditor.

ui.modern.content-auditor.num_ranks

Specify how many results are displayed in the content auditor search results tab.

ui.modern.content-auditor.overview-category-count

Specify how many facet categories are displayed for each facet.

ui.modern.content-auditor.preferred-facets

Specify which content auditor facets will be displayed in the content auditor panel of the marketing dashboard.

Further customizations can be implemented using search lifecycle plugins that target the Content Auditor search type.

The following is a full set of content auditor data source parameter keys:

filter.jsoup.undesirable_text-source.[key_name]

Specify sources of undesirable text strings to detect and present within content auditor.

ui.modern.content-auditor.count_urls

Define how deep into URLs Content Auditor users can navigate using facets.

ui.modern.content-auditor.date-modified.ok-age-years

Define how many years old a document may be before it is considered problematic.

ui.modern.content-auditor.duplicate_num_ranks

Define how many results should be considered in detecting duplicates for Content Auditor.

ui.modern.content-auditor.reading-grade.lower-ok-limit

Define the reading grade below which documents are considered problematic.

ui.modern.content-auditor.reading-grade.upper-ok-limit

Define the reading grade above which documents are considered problematic.
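As a sketch of how a few of these keys might be set together, the following renders a small group of settings as key=value lines. Both the values and the key=value layout are assumptions for illustration; check the individual setting documentation for accepted values, defaults and where each key is configured.

```python
def render_settings(settings):
    """Render a mapping of configuration keys to values as key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# Hypothetical values; only the key names come from the list above.
example_settings = {
    "ui.modern.content-auditor.date-modified.ok-age-years": 3,
    "ui.modern.content-auditor.reading-grade.lower-ok-limit": 6,
    "ui.modern.content-auditor.reading-grade.upper-ok-limit": 10,
}

print(render_settings(example_settings))
```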

© 2015- Squiz Pty Ltd