Plugin: Human-readable document type

Other versions of this plugin may exist. Please ensure you are viewing the documentation for the version that you are currently using. If you are not running the latest version of your plugin we recommend upgrading. See: list of all available versions of this plugin.

Purpose

The plugin creates a metadata field containing a human-readable file type value. It uses the Tika detector (https://tika.apache.org/2.6.0/detection.html) to determine the file type. The detected type is then translated according to a predefined mapping table:

Discovered by TIKA                                                       Saved in metadata
-----------------------------------------------------------------------  -----------------
text/html                                                                HTML
application/json                                                         JSON
application/xml                                                          XML
text/csv                                                                 CSV
image/jpeg                                                               JPEG
application/vnd.ms-excel                                                 Excel
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        Excel
application/msword                                                       Word
application/vnd.openxmlformats-officedocument.wordprocessingml.document  Word
application/vnd.ms-powerpoint                                            PowerPoint
application/vnd.openxmlformats-officedocument.presentationml.slideshow   PowerPoint
application/pdf                                                          PDF
application/zip                                                          ZIP
application/rtf                                                          RTF

Users can override any type and/or add a new one to the table.
Created metadata can be used by implementers for filtering.

Usage

Enable the plugin

Enable the human-friendly-doc-type plugin on your data source from the Extensions screen in the administration dashboard, or add the following settings to the data source configuration:

plugin.human-friendly-doc-type.enabled=true
plugin.human-friendly-doc-type.version=1.0.0

Add to the filter chain:

filter.classes=<OTHER-FILTERS>:com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter:<OTHER-FILTERS>

Place the filter at an appropriate position in the filter chain. In most circumstances this is towards the end of the chain.

NOTE: The plugin takes effect only after a full update of the data source.

Plugin configuration settings

The following options can be set in the data source configuration to configure the plugin:

  • plugin.human-friendly-doc-type.config.metadata_name=<METADATA NAME>: (Optional) Name of the metadata field that will hold the document type. Default: documentType.

  • plugin.human-friendly-doc-type.config.row_output=[true/false]: (Optional) If set to true, the type detected by Tika is stored as-is, without translation. Default: false.

  • plugin.human-friendly-doc-type.config.label.<TIKA FILE TYPE>=<human readable label>: (Optional) Defines the translation of the detected <TIKA FILE TYPE> to <human readable label>. This either adds a new entry to the mapping table or overwrites an existing one.
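
To make the relationship between the three options concrete, here is a minimal Python sketch of how the keys could be read from a flat key/value configuration. It is an illustration only: the plugin itself is a Java filter, and the function name parse_plugin_settings and the shape of the returned dictionary are assumptions made for this example, not part of the plugin.

```python
# Illustrative sketch only; not the plugin's actual implementation.
PREFIX = "plugin.human-friendly-doc-type.config."


def parse_plugin_settings(config):
    """Collect the plugin options from a flat dict of configuration keys.

    `config` maps configuration keys to string values. The returned
    dictionary layout is a hypothetical structure chosen for this example.
    """
    settings = {
        "metadata_name": "documentType",  # documented default
        "row_output": False,              # documented default
        "labels": {},                     # user-defined label overrides
    }
    for key, value in config.items():
        if not key.startswith(PREFIX):
            continue  # e.g. the .enabled and .version keys
        option = key[len(PREFIX):]
        if option == "metadata_name":
            settings["metadata_name"] = value
        elif option == "row_output":
            settings["row_output"] = value.strip().lower() == "true"
        elif option.startswith("label."):
            # Everything after "label." is the Tika file type, e.g. "text/html".
            settings["labels"][option[len("label."):]] = value
    return settings
```

Keys outside the config. namespace, such as the enabled and version settings, are ignored by this sketch, and any option that is not set keeps its documented default.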

Example

The following example stores the content type application/json in the metadata field myMetaDataName as JSON Files, and text/html as Web Content. All other types are translated according to the mapping table above. If a file type has no mapping defined (neither in the table nor in collection.cfg), the plain value discovered by Tika is used.

plugin.human-friendly-doc-type.config.metadata_name=myMetaDataName
plugin.human-friendly-doc-type.config.label.application/json=JSON Files
plugin.human-friendly-doc-type.config.label.text/html=Web Content
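
The lookup order described for this example (user-defined label first, then the built-in table, then the plain Tika value) can be sketched in Python as follows. This is a hedged illustration rather than the plugin's code: the function human_readable_type, the raw_output parameter name, and the abbreviated DEFAULT_LABELS table are assumptions made for the example.

```python
# Illustrative sketch only; not the plugin's actual implementation.
# A subset of the documented mapping table.
DEFAULT_LABELS = {
    "text/html": "HTML",
    "application/json": "JSON",
    "application/xml": "XML",
    "text/csv": "CSV",
    "application/pdf": "PDF",
}


def human_readable_type(mime_type, overrides=None, raw_output=False):
    """Translate a detected MIME type into the value stored in metadata."""
    if raw_output:
        # Corresponds to the row_output option: no translation at all.
        return mime_type
    # User-defined labels take precedence over the built-in table.
    labels = {**DEFAULT_LABELS, **(overrides or {})}
    # Unmapped types fall back to the plain detected value.
    return labels.get(mime_type, mime_type)


overrides = {"application/json": "JSON Files", "text/html": "Web Content"}
print(human_readable_type("application/json", overrides))  # overridden label
print(human_readable_type("application/pdf", overrides))   # built-in label
print(human_readable_type("image/png", overrides))         # unmapped type
```

With the overrides from the example above, application/json resolves to the user-defined label, application/pdf to the built-in one, and a type that is absent from both (image/png here) passes through unchanged.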

The following example will result in the content type discovered by TIKA being provided in metadata cType without any further translation.

plugin.human-friendly-doc-type.config.metadata_name=cType
plugin.human-friendly-doc-type.config.row_output=true

If the metadata field already exists on a document, the value injected by the plugin does not override the existing value. Instead, the field holds both values in the resulting metadata map.
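
The multi-value behaviour can be pictured with a small Python sketch of a map from metadata name to a list of values. The data structure and the add_metadata helper are assumptions made for illustration; the plugin's internal representation may differ.

```python
# Illustrative sketch only; not the plugin's actual implementation.
from collections import defaultdict


def add_metadata(metadata, name, value):
    """Append a value to a metadata field without replacing existing values."""
    metadata[name].append(value)


metadata = defaultdict(list)
add_metadata(metadata, "cType", "existing value")  # already present on the document
add_metadata(metadata, "cType", "text/html")       # value injected by the plugin
print(metadata["cType"])
```

After both calls the cType field holds two entries, the pre-existing value followed by the injected one, so anything that filters on this metadata should expect more than one value per document.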

All versions of human-friendly-doc-type-filter