Plugin: Human-readable document type

Purpose

Use this plugin if you need to curate the file type names that are recorded for documents, for example when presenting file type information in facets or result summaries.

The file type metadata produced by the crawler does not provide user-friendly labels for detected file types, nor does it apply any logical grouping when assigning the type.

If used for faceted navigation, this results in a file type facet that is not very useful, with choices like text/html, doc, docx.

This plugin produces an alternate file type metadata field that is much better suited to faceted navigation and search result display, and that applies additional intelligence when detecting file types. The plugin also provides configuration options that allow you to override and define file type labels and groupings.

By default, the plugin uses the discovered type to attach sensible labels and group the types as follows:

Discovered by Tika | Saved in metadata
text/html | HTML
application/json | JSON
application/xml | XML
text/csv | CSV
image/jpeg | JPEG
application/vnd.ms-excel | Excel
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Excel
application/msword | Word
application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word
application/vnd.ms-powerpoint | PowerPoint
application/vnd.openxmlformats-officedocument.presentationml.slideshow | PowerPoint
application/pdf | PDF
application/zip | ZIP
application/rtf | RTF

The Discovered by Tika value is equivalent to the Parameter 1 value when setting the File type to assign configuration option. The Saved in metadata value is equivalent to the value you define for the File type to assign configuration option. For example, the first entry in this table is equivalent to the following plugin configuration:

Configuration key name | Parameter 1 | Value
File type to assign | text/html | HTML

Users can override any type and/or add definitions to the table.

The file type metadata can then be used in subsequent filters, or in front-end functionality such as faceted navigation or search result display.

The plugin uses Tika content detection to assign the file type.

When to use this plugin

Use this plugin:

  • If you want to provide a file type facet, and want to be able to change the file types presented to users (e.g. Rich text document instead of rtf), or you wish to group several file types into a single item (e.g. Microsoft Word document instead of docx, doc, dot).

  • If you want to display the file type in your search result entries, and be able to control the file type shown to users.

  • If you need to group different file types together so that you can implement conditional logic in your template based on this file type.

  • If you need to remove a set of documents matching a grouped file type from the search using a kill by query configuration, and you’re not easily able to exclude these documents by file name or extension.

Usage

Enable the plugin

  1. Select Plugins from the side navigation pane and click on the Human-readable document type tile.

  2. From the Location section, use the Select a data source select list to choose the data source on which you would like to enable this plugin.

The plugin takes effect once the setup steps are complete and an advanced > full update of the data source has finished.

Configuration settings

The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.

The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.

File type metadata field name

Configuration key: plugin.human-friendly-doc-type.config.metadata_name
Data type: string
Default value: documentType
Required: This setting is required

Defines the metadata field name where the document type metadata will be stored. This value must be mapped using the data source metadata mapping configuration.

File type to assign

Configuration key: plugin.human-friendly-doc-type.config.label.*
Data type: string
Required: This setting is optional

Defines the label that will be recorded when the detected file type matches the value in the parameter 1 field.

This is only applicable when the plugin is configured not to use the raw TIKA output.
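As a rough illustration of how a File type to assign entry might look when set manually, the sketch below builds a key and value pair from a content type and a label. It assumes that the Parameter 1 value replaces the * wildcard in the key name and that keys are written as key=value lines; the helper name label_config_line is hypothetical and is not part of the plugin.

```python
# Key pattern taken from the "File type to assign" setting.
LABEL_KEY_PATTERN = "plugin.human-friendly-doc-type.config.label.*"


def label_config_line(content_type: str, label: str) -> str:
    """Return a key=value line for one content type to label mapping.

    Assumption: Parameter 1 (the content type) is substituted for the
    "*" wildcard at the end of the configuration key.
    """
    key = LABEL_KEY_PATTERN.replace("*", content_type)
    return f"{key}={label}"


# Illustrative usage with values mentioned in this document.
print(label_config_line("text/html", "HTML"))
print(label_config_line("application/rtf", "Rich text document"))
```

Check the generated key against the data source configuration screen before relying on it, because the exact substitution rule is assumed here.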

Enable raw TIKA output

Configuration key: plugin.human-friendly-doc-type.config.raw_output
Data type: boolean
Default value: false
Required: This setting is required

If enabled, TIKA output will be used without further translation.

Apply predefined mapping

Configuration key: plugin.human-friendly-doc-type.config.apply_predefined_label
Data type: boolean
Default value: true
Required: This setting is required

If enabled, the pre-defined labels (see plugin documentation) for detected file types will be used.

This is only applicable when the plugin is configured not to use the raw TIKA output.

Default file type label

Configuration key: plugin.human-friendly-doc-type.config.custom_default_label
Data type: string
Default value: ++
Required: This setting is optional

Defines the label that will be set when there is no mapping for the detected file type.

This is only applicable when the plugin is configured not to use the raw TIKA output.

You must configure appropriate metadata mappings from the data source configure metadata mappings screen in order to use the file type labels that the plugin generates.

The metadata field that the plugin generates is configured as HTTP header type metadata.

All the configurable labels are only applicable if the plugin is configured NOT to use the raw TIKA output. If the plugin is configured to use the raw TIKA output, the labels will not be applied.

A custom label overrides any pre-defined label for the same file type. The default label is used for any file type that has neither a custom label nor a pre-defined label.
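The precedence rules above can be sketched as a small function. This is an illustrative model only, not the plugin's actual code: the function name resolve_label and its parameters are hypothetical, the pre-defined mapping is a subset of the table earlier on this page, and falling back to the raw detected value when no default label is set is an assumption based on the statement in the first example.

```python
# A subset of the pre-defined labels listed earlier in this document.
PREDEFINED_LABELS = {
    "text/html": "HTML",
    "application/pdf": "PDF",
    "application/msword": "Word",
}


def resolve_label(detected, custom_labels, raw_output=False,
                  apply_predefined=True, default_label=""):
    """Model of how a label might be chosen for a detected content type."""
    # Raw output bypasses every label setting.
    if raw_output:
        return detected
    # A custom label overrides a pre-defined label for the same type.
    if detected in custom_labels:
        return custom_labels[detected]
    # Pre-defined labels apply only when the option is enabled.
    if apply_predefined and detected in PREDEFINED_LABELS:
        return PREDEFINED_LABELS[detected]
    # Assumption: with no default label set, the raw value is kept.
    return default_label or detected


custom = {"text/html": "Web content"}
print(resolve_label("text/html", custom))
print(resolve_label("application/pdf", custom))
print(resolve_label("application/pdf", custom,
                    apply_predefined=False, default_label="Other Files"))
```

With the settings shown, the three calls model a custom label, a pre-defined label, and the default label respectively.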

If you are crawling PDF files, you must ensure that this plugin's filter (com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter) runs before TikaFilterProvider. TikaFilterProvider converts binary PDF files into text, so the plugin cannot detect the file type correctly if it runs after TikaFilterProvider.

Filter chain configuration

This plugin uses filters which are used to apply transformations to the gathered content.

The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.

Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.

Filter classes

This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter

Drag the com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter plugin filter to where you wish it to run in the filter chain sequence.
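The ordering requirement for PDF crawling can be checked with a short script once you have the filter chain as an ordered list of names. The sketch below is a minimal illustration that assumes the chain has already been extracted into a Python list; how the chain is stored or exported is not covered here, and the function name runs_before_tika and the placeholder SomeOtherFilter are hypothetical.

```python
PLUGIN_FILTER = (
    "com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter"
)
TIKA_FILTER = "TikaFilterProvider"


def runs_before_tika(filter_chain):
    """Return True if the plugin filter is present and is not placed
    after TikaFilterProvider in the ordered list of filter names."""
    if PLUGIN_FILTER not in filter_chain:
        return False
    if TIKA_FILTER not in filter_chain:
        # Nothing converts the binary content first, so order is fine.
        return True
    return filter_chain.index(PLUGIN_FILTER) < filter_chain.index(TIKA_FILTER)


# Illustrative chains; "SomeOtherFilter" is a made-up placeholder name.
good_chain = [PLUGIN_FILTER, TIKA_FILTER, "SomeOtherFilter"]
bad_chain = [TIKA_FILTER, PLUGIN_FILTER]
print(runs_before_tika(good_chain))
print(runs_before_tika(bad_chain))
```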

Examples

Example: Assign the pre-defined and some custom file type groupings

The following example will result in documents that have a content type of application/json having a fileGroup metadata set to JSON Files, and documents with a content type of text/html having a fileGroup metadata set to Web content. All other content types will be translated according to the mapping table provided above.

If the file type does not have a mapping defined, the raw value discovered by Tika will be used.

Configuration key name | Parameter 1 | Value
File type metadata field name | | fileGroup
File type to assign | application/json | JSON Files
File type to assign | text/html | Web content
Enable raw TIKA output | | false
Apply predefined mapping | | true

Example: Assign the default label and some custom file type groupings

The following example will result in documents with a content type of text/html having a fileGroup metadata set to Web content, and all other files being assigned the default label Other Files.

Configuration key name | Parameter 1 | Value
File type to assign | text/html | Web content
Enable raw TIKA output | | false
Apply predefined mapping | | false
Default file type label | | Other Files

Example: Assign the raw detected file types

This configuration stores the detected content type in a metadata class, rawFileType.

Configuration key name | Value
File type metadata field name | rawFileType
Enable raw TIKA output | true

If the file type metadata class already exists, the detected value will not override the first value. The field will hold the existing value as well as the value detected by this plugin.

Change log

[1.2.0]

Added

  • Added support for defining a default file type label, set for documents that do not match a more specific rule.

  • Added an option to apply a pre-defined set of labels.

Changed

  • Upgraded Tika core dependency from v2.7.0 to 3.2.2.

  • Upgraded JUnit dependency from v4.13.2 to 5.11.4.

[1.1.0]

Changed

  • Updated to the latest version of the plugin framework (Funnelback shared v16.20) to enable integration with the new plugin management dashboard.