Plugin: Human-readable document type
| Other versions of this plugin may exist. Please ensure you are viewing the documentation for the version that you are currently using. If you are not running the latest version of your plugin we recommend upgrading. See: list of all available versions of this plugin. |
Purpose
The plugin creates a metadata field with a human-readable file type value. It uses the Tika detector (https://tika.apache.org/2.6.0/detection.html) to determine the file type.
The discovered type is translated according to a predefined mapping table:
| Discovered by TIKA | Saved in metadata |
|---|---|
| text/html | HTML |
| application/json | JSON |
| application/xml | XML |
| text/csv | CSV |
| image/jpeg | JPEG |
| application/vnd.ms-excel | Excel |
| application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | Excel |
| application/msword | Word |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word |
| application/vnd.ms-powerpoint | PowerPoint |
| application/vnd.openxmlformats-officedocument.presentationml.slideshow | PowerPoint |
| application/pdf | |
| application/zip | ZIP |
| application/rtf | RTF |
Users can override any type and/or add a new one to the table.
Implementers can use the created metadata for filtering.
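The lookup described in this section can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the plugin's actual code: `DEFAULT_LABELS` holds a subset of the table above, and the names `human_readable_type`, `overrides` and the `row_output` parameter (mirroring the setting of the same name described under Plugin configuration settings) are assumptions made for the example.

```python
# Subset of the default mapping table above (illustrative only).
DEFAULT_LABELS = {
    "text/html": "HTML",
    "application/json": "JSON",
    "application/msword": "Word",
    "application/zip": "ZIP",
}


def human_readable_type(detected, overrides=None, row_output=False):
    """Return the value to store for a Tika-detected media type."""
    if row_output:
        # Mirrors row_output=true: keep the detector's value untranslated.
        return detected
    # User-defined labels take precedence over the default table.
    labels = {**DEFAULT_LABELS, **(overrides or {})}
    # Types without a mapping fall back to the plain detected value.
    return labels.get(detected, detected)


overrides = {"text/html": "Web Content", "application/json": "JSON Files"}
print(human_readable_type("text/html", overrides))        # Web Content
print(human_readable_type("application/zip", overrides))  # ZIP
print(human_readable_type("video/mp4", overrides))        # video/mp4
```

User-defined labels are merged over the defaults, so the same dictionary both overrides existing entries and adds new ones.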
Usage
Enable the plugin
Enable the human-friendly-doc-type plugin on your data source from the plugins screen in the search dashboard, or add the following data source configuration:
plugin.human-friendly-doc-type.enabled=true
plugin.human-friendly-doc-type.version=1.0.0
Add com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter to the filter chain:
filter.classes=<OTHER-FILTERS>:com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter:<OTHER-FILTERS>
The com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter filter should be placed at an appropriate position in the filter chain. In most circumstances this should be located towards the end of the filter chain.
| The plugin will take effect after a full update of the data source. |
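If the filter chain is maintained by a script, adding the filter at the end can be sketched as follows. This is a hypothetical helper rather than part of the plugin or of any Funnelback tooling: `append_filter` is an invented name, `ExampleFilterA` and `ExampleFilterB` are placeholders, and the sketch assumes a chain whose steps are separated by colons as shown above.

```python
FILTER = "com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter"


def append_filter(chain, filter_class=FILTER):
    """Append filter_class to a colon-separated chain unless already present."""
    steps = [step for step in chain.split(":") if step]
    if filter_class not in steps:
        steps.append(filter_class)
    return ":".join(steps)


# ExampleFilterA and ExampleFilterB are placeholder names, not real filters.
print("filter.classes=" + append_filter("ExampleFilterA:ExampleFilterB"))
```

Calling the helper again on its own output leaves the chain unchanged, so the filter is not added twice.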
Plugin configuration settings
The following options can be set in the data source configuration to configure the plugin:
- plugin.human-friendly-doc-type.config.metadata_name=<METADATA NAME>: (Optional) Name of the metadata that will hold the document type. Default name: documentType.
- plugin.human-friendly-doc-type.config.row_output=[true/false]: (Optional) If enabled, TIKA output will be used without further translation. Default: false.
- plugin.human-friendly-doc-type.config.label.<TIKA FILE TYPE>=<human readable label>: (Optional) Defines a translation (new, or overwriting an existing one) from the discovered <TIKA FILE TYPE> to <human readable label>.
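To make the structure of these keys concrete, the sketch below reads key=value lines and collects the three kinds of setting. It is an illustration only, not how the plugin itself reads its configuration; `parse_plugin_settings` and the shape of the returned dictionary are assumptions. The defaults (`documentType` and `false`) are the ones listed above.

```python
PREFIX = "plugin.human-friendly-doc-type.config."


def parse_plugin_settings(lines):
    """Collect this plugin's settings from key=value configuration lines."""
    settings = {"metadata_name": "documentType", "row_output": False, "labels": {}}
    for line in lines:
        key, sep, value = line.strip().partition("=")
        if not sep or not key.startswith(PREFIX):
            continue  # not a setting for this plugin
        name = key[len(PREFIX):]
        if name == "metadata_name":
            settings["metadata_name"] = value
        elif name == "row_output":
            settings["row_output"] = value.strip().lower() == "true"
        elif name.startswith("label."):
            # Everything after "label." is the Tika media type.
            settings["labels"][name[len("label."):]] = value
    return settings


config = [
    "plugin.human-friendly-doc-type.config.metadata_name=myMetaDataName",
    "plugin.human-friendly-doc-type.config.label.application/json=JSON Files",
    "plugin.human-friendly-doc-type.config.label.text/html=Web Content",
]
print(parse_plugin_settings(config))
```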
Example
The following example stores the content type application/json in the metadata field myMetaDataName as JSON Files and text/html as Web Content. All other types are translated according to the mapping table above.
If a file type has no mapping defined (neither in the table nor in collection.cfg), the plain value discovered by TIKA is used.
plugin.human-friendly-doc-type.config.metadata_name=myMetaDataName
plugin.human-friendly-doc-type.config.label.application/json=JSON Files
plugin.human-friendly-doc-type.config.label.text/html=Web Content
The following example will result in the content type discovered by TIKA being provided in metadata cType without any further translation.
plugin.human-friendly-doc-type.config.metadata_name=cType
plugin.human-friendly-doc-type.config.row_output=true
| If the metadata already exists, the injected value does not override the existing value. Instead, the resulting map holds both values. |
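The note above describes multi-valued metadata. A minimal sketch, assuming the metadata is modelled as a map from field name to a list of values (the `add_metadata` helper and the sample values are invented for illustration):

```python
from collections import defaultdict


def add_metadata(metadata, name, value):
    """Append value under name, keeping any values that are already present."""
    metadata[name].append(value)
    return metadata


# A document that already carries a documentType value from another source.
metadata = defaultdict(list, {"documentType": ["Report"]})
add_metadata(metadata, "documentType", "Word")
print(dict(metadata))  # {'documentType': ['Report', 'Word']}
```

Anything that filters on this field should therefore expect more than one value when the chosen metadata name is already in use.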