Built-in filters - Check if document contains text (TextDetectionFilterProvider)
The text detection filter classifies documents as textual or non-textual.
This filter can be used in content auditor reporting, but isn’t enabled by default.
Caveats
-
Plain text documents don’t allow HTML-style metadata to be attached. These documents will show up as (No Value) in the 'Textual vs Non-Textual' facet.
To set up textual vs. non-textual detection on a data source, two separate parts of Funnelback need to be configured: the data source configuration and the results page configuration.
After the setup below is complete, there should be a new 'Textual vs Non-Textual' widget on the content auditor overview page.
Data source configuration
After the data source configuration changes are complete, run a full update of the data source.
Filter chain
-
From the search dashboard, select the desired data source and edit the data source configuration.
-
Select edit data source configuration and add or edit the filter.classes parameter key. Add :TextDetectionFilterProvider to the end of the set of filters in the chain. The filter does not have to be at the end of the list of filters, but it must run after any text conversion (i.e. after the Tika filter).
Parameter key: filter.classes
Value: EXISTING_FILTERS:TextDetectionFilterProvider
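The edit to filter.classes described above can be sketched as a small Python helper. This is only an illustration and not part of Funnelback: the function names are hypothetical, and it assumes that the filter chain is a colon-separated list of steps, that a step may hold comma-separated alternatives, and that the Tika filter appears in the chain as TikaFilterProvider. Check these assumptions against your own filter.classes value before relying on it.

```python
TEXT_DETECTION = "TextDetectionFilterProvider"


def _steps(filter_classes):
    # Assumption: steps are separated by ':' and alternatives within a step by ','.
    return [step.split(",") for step in filter_classes.split(":") if step]


def append_text_detection(filter_classes, new_filter=TEXT_DETECTION):
    """Return the chain with new_filter appended as the last step.

    The chain is returned unchanged if new_filter is already present.
    """
    steps = _steps(filter_classes)
    if any(new_filter in alternatives for alternatives in steps):
        return ":".join(",".join(alternatives) for alternatives in steps)
    steps.append([new_filter])
    return ":".join(",".join(alternatives) for alternatives in steps)


def runs_after(filter_classes, later=TEXT_DETECTION, earlier="TikaFilterProvider"):
    """Check that `later` appears in a step after every step containing `earlier`.

    "TikaFilterProvider" is an assumed name for the Tika filter.
    """
    steps = _steps(filter_classes)
    later_positions = [i for i, alts in enumerate(steps) if later in alts]
    earlier_positions = [i for i, alts in enumerate(steps) if earlier in alts]
    if not later_positions:
        return False  # the text detection filter is not in the chain at all
    if not earlier_positions:
        return True  # no conversion step to be ordered against
    return min(later_positions) > max(earlier_positions)


if __name__ == "__main__":
    # Hypothetical existing value of filter.classes.
    existing = "TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider"
    updated = append_text_detection(existing)
    print(updated)
    print(runs_after(updated))
```

Appending at the end is the simplest way to satisfy the ordering requirement, since the new step then runs after every conversion step already in the chain.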
Results page configuration
Content auditor facet metadata
-
Log in to the search dashboard, switch to the results page that corresponds to the content auditor report and view the results page configuration.
-
Select edit results page configuration and add a ui.modern.content-auditor.facet-metadata.textContent parameter key.
Parameter key: ui.modern.content-auditor.facet-metadata.textContent
Value: Textual
Faceted navigation
-
Log in to the search dashboard, switch to the results page that corresponds to the content auditor report and view the results page configuration.
-
Select customize faceted navigation and add the following 'filter on single category' facet:
-
Name: Textual vs Non-Textual
-
Category - metadata field: textContent
-