Plugin: Human-readable document type

Purpose

Use this plugin if you need to curate the file type names recorded for documents, for example when presenting file type information in facets or result summaries.

The file type metadata produced by the crawler does not provide readable labels for detected file types, nor does it apply any logical grouping when assigning the type.

If used for faceted navigation, this results in a file type facet that is not very useful, offering choices like text/html, doc, docx.

This plugin produces an alternative file type metadata field that is better suited to faceted navigation and search result display, and applies additional intelligence when detecting file types. The plugin also provides configuration options that allow you to override and define file type labels and groupings.

By default, the plugin uses the discovered type to attach sensible labels and group the types as follows:

| Discovered by Tika | Saved in metadata |
|---|---|
| text/html | HTML |
| application/json | JSON |
| application/xml | XML |
| text/csv | CSV |
| image/jpeg | JPEG |
| application/vnd.ms-excel | Excel |
| application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Excel |
| application/msword | Word |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word |
| application/vnd.ms-powerpoint | PowerPoint |
| application/vnd.openxmlformats-officedocument.presentationml.slideshow | PowerPoint |
| application/pdf | PDF |
| application/zip | ZIP |
| application/rtf | RTF |

The Discovered by Tika value is equivalent to the Parameter 1 value when setting the File type to assign configuration option. The Saved in metadata value is equivalent to the value you define for the File type to assign configuration option. For example, the first entry in this table is equivalent to the following plugin configuration:

| Configuration key name | Parameter 1 | Value |
|---|---|---|
| File type to assign | text/html | HTML |

Users can override any type and/or add definitions to the table.
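The grouping effect of the default table can be sketched in a few lines of Python. This is only an illustration, not the plugin's implementation: the dictionary copies the defaults listed above, and facet_counts is a hypothetical helper name introduced here.

```python
from collections import Counter

# Default labels copied from the table above (Tika content type -> label).
DEFAULT_LABELS = {
    "text/html": "HTML",
    "application/json": "JSON",
    "application/xml": "XML",
    "text/csv": "CSV",
    "image/jpeg": "JPEG",
    "application/vnd.ms-excel": "Excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "Excel",
    "application/msword": "Word",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "Word",
    "application/vnd.ms-powerpoint": "PowerPoint",
    "application/vnd.openxmlformats-officedocument.presentationml.slideshow": "PowerPoint",
    "application/pdf": "PDF",
    "application/zip": "ZIP",
    "application/rtf": "RTF",
}


def facet_counts(content_types):
    """Count documents per label so related content types share one facet value.

    Content types without a label keep their raw value.
    """
    return Counter(DEFAULT_LABELS.get(ct, ct) for ct in content_types)


detected = [
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/pdf",
]
counts = facet_counts(detected)
print(counts["Word"], counts["PDF"])  # 2 1
```

Both Word content types collapse into a single Word facet value, which is the behaviour a file type facet usually needs.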

The file type metadata can then be used in subsequent filters, or in front-end functionality such as faceted navigation or search result display.

The plugin uses the Tika detector to assign the file type.

When to use this plugin

Use this plugin:

  • If you want to provide a file type facet, and want to be able to change the file types presented to users (e.g. Rich text document instead of rtf), or you wish to group several file types into a single item (e.g. Microsoft Word document instead of docx, doc, dot).

  • If you want to display the file type in your search result entries, and be able to control the file type shown to users.

  • If you need to group different file types together so that you can implement some conditional logic in your template based off this file type.

  • If you need to remove a set of documents matching a grouped file type from the search using a kill by query configuration, and you’re not easily able to exclude these documents by file name or extension.

Usage

Enable the plugin

  1. Select Plugins from the side navigation pane and click on the Human-readable document type tile.

  2. From the Location section, use the Select a data source select list to choose the data source on which you would like to enable this plugin.

The plugin takes effect once the setup steps are complete and an advanced > full update of the data source has finished.

Configuration settings

The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.

The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.

File type metadata class name

Configuration key: plugin.human-friendly-doc-type.config.metadata_name
Data type: string
Default value: documentType
Required: This setting is optional

Defines the metadata class name where the document type metadata will be stored.

File type to assign

Configuration key: plugin.human-friendly-doc-type.config.label.*
Data type: string
Required: This setting is optional

Defines the label that will be recorded when the detected file type matches the value in the parameter 1 field.
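When setting this key manually, the key name has to carry the Parameter 1 value. The sketch below assumes that the trailing * in the key is replaced by the Parameter 1 value and that the data source configuration stores entries as key=value lines. Both are assumptions to check against your installation, and label_key and config_line are hypothetical helper names.

```python
# Key prefix taken from the configuration key above, without the trailing "*".
KEY_PREFIX = "plugin.human-friendly-doc-type.config.label."


def label_key(content_type: str) -> str:
    """Build the key name, assuming "*" is replaced by the Parameter 1 value."""
    return KEY_PREFIX + content_type


def config_line(content_type: str, label: str) -> str:
    """Render one entry, assuming a key=value configuration format."""
    return f"{label_key(content_type)}={label}"


print(config_line("text/html", "HTML"))
# plugin.human-friendly-doc-type.config.label.text/html=HTML
```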

Enable raw TIKA output

Configuration key: plugin.human-friendly-doc-type.config.row_output
Data type: boolean
Default value: false
Required: This setting is optional

If enabled, the content type detected by Tika is recorded as-is, without being translated to a label.

It is recommended that you add a metadata mapping for the file type metadata class that you have defined in the plugin configuration so that the field is shown with other configured metadata when viewing the metadata configuration. Setting up a mapping also allows you to control if the metadata is treated as index or display only metadata. When configuring a metadata field, ensure you use the same class name as you have set in the plugin configuration, and set the metadata source name to something like human-friendly-filetype-plugin. The name here doesn’t really matter and is just a label that reminds you where the metadata in this class is coming from.

Filter chain configuration

This plugin uses filters, which apply transformations to the gathered content.

The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.

Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.

Filter classes

This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter

Drag the com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter plugin filter to where you wish it to run in the filter chain sequence.

Examples

Example: Assign the pre-defined and some custom file type groupings

In the following example, documents with a content type of application/json have their fileGroup metadata set to JSON Files, and documents with a content type of text/html have it set to Web content. All other content types are translated according to the default mapping table provided above.

If the file type does not have a mapping defined, the raw value discovered by Tika will be used.

| Configuration key name | Parameter 1 | Value |
|---|---|---|
| File type metadata class name | | fileGroup |
| File type to assign | application/json | JSON Files |
| File type to assign | text/html | Web content |
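The outcome of this example can be sketched as a lookup with three steps: custom labels first, then the default table, then the raw Tika value. This is an illustrative model under that assumed precedence, not the plugin's actual code. resolve_label is a hypothetical name, and DEFAULTS holds only a subset of the default table.

```python
# Custom labels from the example configuration above.
OVERRIDES = {
    "application/json": "JSON Files",
    "text/html": "Web content",
}

# Subset of the plugin's default table, for illustration only.
DEFAULTS = {
    "text/html": "HTML",
    "application/json": "JSON",
    "application/pdf": "PDF",
}


def resolve_label(content_type, overrides, defaults):
    """Return the custom label, else the default label, else the raw content type."""
    if content_type in overrides:
        return overrides[content_type]
    return defaults.get(content_type, content_type)


for ct in ("application/json", "text/html", "application/pdf", "text/plain"):
    # Value that would be stored in the fileGroup metadata class.
    print(ct, "->", resolve_label(ct, OVERRIDES, DEFAULTS))
```

Here text/plain stands in for any content type with no mapping, so it is stored unchanged.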

Example: Assign the raw detected file types

This configuration stores the detected content type in a metadata class, rawFileType.

| Configuration key name | Value |
|---|---|
| File type metadata class name | rawFileType |
| Enable raw TIKA output | true |

If the file type metadata class already holds a value, the detected value does not overwrite it. The field then holds both the existing value and the value detected by this plugin.
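The multi-value behaviour described above can be modelled as appending to a list. This sketch assumes document metadata can be represented as a dictionary mapping class names to lists of values. That representation, the record_file_type helper and the sample existing value are hypothetical and only illustrate the effect.

```python
def record_file_type(metadata, class_name, detected_value):
    """Append the detected value, keeping any values already in the class."""
    metadata.setdefault(class_name, []).append(detected_value)
    return metadata


# A document that already carries a value in the rawFileType class.
doc_metadata = {"rawFileType": ["pdf"]}
record_file_type(doc_metadata, "rawFileType", "application/pdf")
print(doc_metadata["rawFileType"])  # ['pdf', 'application/pdf']
```

If a single clean value is needed for a facet, choosing a metadata class name that is not already in use avoids the mixed values.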

Change log

[1.1.0]

Changed

  • Updated to the latest version of the plugin framework (Funnelback shared v16.20) to enable integration with the new plugin management dashboard.