Classifying textual vs. non-textual documents

Funnelback can be configured to guess, during a crawl, whether each document contains textual content, and to display the result later in the content auditor.

Caveats

  • Plain text documents don’t allow HTML-style metadata to be attached. These documents will show up as (No Value) in the 'Textual vs Non-Textual' facet.

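To set up textual vs. non-textual detection on a data source, a few separate parts of Funnelback need to be configured: the data source's filter chain and metadata mappings, and the results page's content auditor facet metadata and faceted navigation.

After the setup below is complete, there should be a new 'Textual vs Non-Textual' widget in the content auditor overview page.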

Data source configuration

After the data source configuration changes are complete, run a full update of the data source.

Filter chain

  1. From the administration dashboard, select the desired data source and edit the data source configuration.

  2. Select edit data source configuration and add or edit the filter.classes parameter key. Append :TextDetectionFilterProvider to the existing filter chain. The filter does not have to be last in the chain, but it must run after any text conversion (i.e. after the Tika filter).

    Parameter key     Value
    filter.classes    EXISTING_FILTERS:TextDetectionFilterProvider
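
The filter.classes edit can be sketched as a small string operation. This is only an illustration, assuming the chain is a colon-separated list as shown in the value above. The helper name append_text_detection and the example chain are hypothetical and are not part of Funnelback.

```python
def append_text_detection(filter_classes: str,
                          provider: str = "TextDetectionFilterProvider") -> str:
    """Return the filter chain with the provider appended once.

    Hypothetical helper: assumes the chain is colon-separated, as in
    EXISTING_FILTERS:TextDetectionFilterProvider.
    """
    # Split the chain into its steps, ignoring empty segments.
    steps = [step for step in filter_classes.split(":") if step]
    # Appending puts the filter after every existing step, including
    # any text conversion step such as the Tika filter.
    if provider not in steps:
        steps.append(provider)
    return ":".join(steps)


# Example chain for illustration only; use the data source's real value.
existing = "TikaFilterProvider:JSoupProcessingFilterProvider"
print(append_text_detection(existing))
# prints TikaFilterProvider:JSoupProcessingFilterProvider:TextDetectionFilterProvider
```

Running the helper on a chain that already ends with the filter returns it unchanged, so the value is safe to recompute before saving it as filter.classes.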

Metadata mappings

  1. From the administration dashboard, select the desired data source and edit the metadata mappings.

  2. Add the following metadata class:

    • Class name: textContent

    • Class type: text

    • Search behaviour: display only

    • Metadata source (HTML/HTTP header type): X-Funnelback-Textual

Results page configuration

Content auditor facet metadata

  1. Log in to the administration dashboard, switch to the results page that corresponds to the content auditor report and view the results page configuration.

  2. Select edit results page configuration and add a ui.modern.content-auditor.facet-metadata.textContent parameter key.

    Parameter key                                           Value
    ui.modern.content-auditor.facet-metadata.textContent    Textual

Faceted navigation

  1. Log in to the administration dashboard, switch to the results page that corresponds to the content auditor report and view the results page configuration.

  2. Select customise faceted navigation and add the following 'filter on single category' facet:

    • Name: Textual vs Non-Textual

    • Category - metadata field: textContent

© 2015- Squiz Pty Ltd