Plugin: Human-readable document type
Other versions of this plugin may exist. Please ensure you are viewing the documentation for the version that you are currently using. If you are not running the latest version of your plugin we recommend upgrading. See: list of all available versions of this plugin.
Purpose
The plugin creates a metadata field with a human-readable file type value. It uses the Tika detector (https://tika.apache.org/2.6.0/detection.html) to determine the file type. The discovered type is then translated according to a predefined mapping table:
Discovered by TIKA | Saved in metadata |
---|---|
text/html | HTML |
application/json | JSON |
application/xml | XML |
text/csv | CSV |
image/jpeg | JPEG |
application/vnd.ms-excel | Excel |
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Excel |
application/msword | Word |
application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word |
application/vnd.ms-powerpoint | PowerPoint |
application/vnd.openxmlformats-officedocument.presentationml.slideshow | PowerPoint |
application/pdf | |
application/zip | ZIP |
application/rtf | RTF |
Users can override any type and/or add a new one to the table.
Created metadata can be used by implementers for filtering.
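The translation step can be pictured as a dictionary lookup. The following is a minimal Python sketch, not the plugin's actual code: the names `DEFAULT_LABELS` and `to_label` are hypothetical, and the fallback to the raw detected type follows the behaviour described in the Example section.

```python
# Hypothetical sketch of the lookup; the real plugin is a Java filter.
# application/pdf is omitted because its label is not shown in the table above.
DEFAULT_LABELS = {
    "text/html": "HTML",
    "application/json": "JSON",
    "application/xml": "XML",
    "text/csv": "CSV",
    "image/jpeg": "JPEG",
    "application/vnd.ms-excel": "Excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "Excel",
    "application/msword": "Word",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "Word",
    "application/vnd.ms-powerpoint": "PowerPoint",
    "application/vnd.openxmlformats-officedocument.presentationml.slideshow": "PowerPoint",
    "application/zip": "ZIP",
    "application/rtf": "RTF",
}


def to_label(mime_type, overrides=None):
    """Translate a detected MIME type into a human-readable label.

    User-supplied overrides take precedence over the defaults; a type
    with no mapping at all is returned unchanged.
    """
    labels = {**DEFAULT_LABELS, **(overrides or {})}
    return labels.get(mime_type, mime_type)
```

For instance, `to_label("text/csv")` yields `CSV`, while a type that is absent from both the defaults and the overrides is passed through as detected.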
Usage
Enable the plugin
Enable the human-friendly-doc-type plugin on your data source from the Extensions screen in the administration dashboard, or add the following data source configuration:
plugin.human-friendly-doc-type.enabled=true
plugin.human-friendly-doc-type.version=1.0.0
Add the plugin's filter class to the data source's filter chain:
filter.classes=<OTHER-FILTERS>:com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter:<OTHER-FILTERS>
Plugin configuration settings
The following options can be set in the data source configuration to configure the plugin:
- plugin.human-friendly-doc-type.config.metadata_name=<METADATA NAME>: (Optional) Name of the metadata field that will hold the document type. Default name: documentType.
- plugin.human-friendly-doc-type.config.row_output=[true/false]: (Optional) If enabled, the TIKA output is used without further translation. Default: false.
- plugin.human-friendly-doc-type.config.label.<TIKA FILE TYPE>=<human readable label>: (Optional) Defines (creates new or overwrites existing) the translation of the discovered <TIKA FILE TYPE> to <human readable label>.
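To illustrate how the three options fit together, here is a minimal Python sketch that reads them from configuration text. It is an assumption-laden illustration rather than the plugin's own parser: `parse_settings` and the shape of the returned dictionary are hypothetical, and only the key names and defaults come from the list above.

```python
PREFIX = "plugin.human-friendly-doc-type.config."


def parse_settings(config_text):
    """Collect the plugin options from key=value configuration lines.

    Hypothetical helper: unknown keys are ignored and the documented
    defaults are used for options that are not set.
    """
    settings = {"metadata_name": "documentType", "row_output": False, "labels": {}}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so labels may contain "=" or spaces.
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if not key.startswith(PREFIX):
            continue
        option = key[len(PREFIX):]
        if option == "metadata_name":
            settings["metadata_name"] = value
        elif option == "row_output":
            settings["row_output"] = value.lower() == "true"
        elif option.startswith("label."):
            # Everything after "label." is the TIKA file type, e.g. text/html.
            settings["labels"][option[len("label."):]] = value
    return settings
```

Splitting on the first `=` only is a deliberate choice here, since labels such as `Web Content` contain spaces and could in principle contain further `=` characters.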
Example
The following example stores the content type application/json in the metadata field myMetaDataName as JSON Files, and text/html as Web Content. All other types are translated according to the mapping table above.
If a file type has no mapping defined (neither in the table nor in collection.cfg), the plain value discovered by TIKA is used.
plugin.human-friendly-doc-type.config.metadata_name=myMetaDataName
plugin.human-friendly-doc-type.config.label.application/json=JSON Files
plugin.human-friendly-doc-type.config.label.text/html=Web Content
The following example will result in the content type discovered by TIKA being provided in metadata cType without any further translation.
plugin.human-friendly-doc-type.config.metadata_name=cType
plugin.human-friendly-doc-type.config.row_output=true
If the metadata field already exists, the second (injected) value does not override the first value. Instead, the resulting map holds both values.
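The multi-value behaviour can be modelled as a map from metadata name to a list of values. This is a minimal Python sketch under that assumption; `add_metadata` and the sample values are hypothetical and do not reflect the plugin's internal data structures.

```python
from collections import defaultdict


def add_metadata(metadata, name, value):
    """Append a value to a metadata field, keeping any existing values."""
    metadata[name].append(value)
    return metadata


# Hypothetical document that already carries a documentType value.
metadata = defaultdict(list)
add_metadata(metadata, "documentType", "Report")  # pre-existing value
add_metadata(metadata, "documentType", "Word")    # value injected by the plugin
```

After both calls the field holds two values, in insertion order. Implementers who filter on this field may want to pick a metadata name that is not already in use.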