Plugin: Human-readable document type
Purpose
Use this plugin to curate the file type names recorded for documents when you present file type information in facets or result summaries.
The file type metadata produced by the crawler does not provide user-friendly labels for detected file types, nor does it apply any logical grouping when assigning the type.
If used for faceted navigation, this results in a file type facet that is not very useful, with choices like text/html, doc, docx.
This plugin produces an alternate file type metadata field that is much better suited to faceted navigation and search results display, and that includes additional intelligence when detecting file types. The plugin also provides configuration options that allow you to override and define file type labels and groupings.
By default, the plugin uses the discovered type to attach sensible labels and group the types as follows:
Discovered by Tika | Saved in metadata |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The Discovered by Tika value is equivalent to the Parameter 1 value when setting the File type to assign configuration option. The Saved in metadata value is equivalent to the value you define for the File type to assign configuration option. For example, the first entry in this table is equivalent to a plugin configuration key of:
|
Users can override any type and/or add definitions to the table.
The file type metadata can then be used in subsequent filters, or in front-end functionality such as faceted navigation or search result display.
The plugin uses the Tika detector to assign the file type.
When to use this plugin
Use this plugin:
- If you want to provide a file type facet and be able to change the file types presented to users (e.g. Rich text document instead of rtf), or to group several file types into a single item (e.g. Microsoft Word document instead of docx, doc, dot).
- If you want to display the file type in your search result entries and control the file type shown to users.
- If you need to group different file types together so that you can implement conditional logic in your template based on this file type.
- If you need to remove a set of documents matching a grouped file type from the search using a kill by query configuration, and you cannot easily exclude these documents by file name or extension.
Usage
Enable the plugin
- Select Plugins from the side navigation pane and click on the Human-readable document type tile.
- From the Location section, use the Select a data source select list to choose the data source on which you would like to enable this plugin.
The plugin takes effect once the setup steps are complete and an advanced > full update of the data source has finished.
Configuration settings
The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.
The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.
File type metadata class name
Configuration key | |
Data type | string |
Default value | |
Required | This setting is optional |
Defines the metadata class name where the document type metadata will be stored.
File type to assign
Configuration key | |
Data type | string |
Required | This setting is optional |
Defines the label that will be recorded when the detected file type matches the value in the parameter 1 field.
Enable raw TIKA output
Configuration key | |
Data type | boolean |
Default value | |
Required | This setting is optional |
If enabled, the Tika output is used without further translation.
It is recommended that you add a metadata mapping for the file type metadata class defined in the plugin configuration, so that the field is shown alongside other configured metadata when viewing the metadata configuration. Setting up a mapping also lets you control whether the metadata is treated as index or display-only metadata. When configuring the metadata field, use the same class name as you set in the plugin configuration, and set the metadata source name to something like human-friendly-filetype-plugin. The source name does not affect behavior; it is just a label that reminds you where the metadata in this class comes from.
Filter chain configuration
This plugin uses filters, which apply transformations to the gathered content.
The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.
Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.
Filter classes
This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter
Drag the com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter plugin filter to where you wish it to run in the filter chain sequence.
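To illustrate why the position in the chain matters, here is a minimal Python sketch of filters running in sequence. It is not the plugin's implementation: the document is modeled as a plain dictionary, and `label_file_type`, `tag_web_pages`, `run_chain`, and the `isWeb` field are hypothetical names introduced only for this illustration.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]  # hypothetical stand-in for a gathered document
Filter = Callable[[Document], Document]


def label_file_type(doc: Document) -> Document:
    """Stand-in for the plugin filter: records a label in the fileGroup class."""
    labels = {"text/html": "Web Content"}  # illustrative mapping only
    out = dict(doc)
    content_type = str(doc["content_type"])
    out["fileGroup"] = labels.get(content_type, content_type)
    return out


def tag_web_pages(doc: Document) -> Document:
    """Hypothetical downstream filter that depends on the fileGroup label."""
    out = dict(doc)
    out["isWeb"] = doc.get("fileGroup") == "Web Content"
    return out


def run_chain(doc: Document, chain: List[Filter]) -> Document:
    """Apply each filter in order, passing the result to the next one."""
    for step in chain:
        doc = step(doc)
    return doc


page = {"content_type": "text/html"}
labelled_first = run_chain(page, [label_file_type, tag_web_pages])
labelled_last = run_chain(page, [tag_web_pages, label_file_type])
```

In this sketch the downstream filter only sees the label when the labelling filter runs before it, which is the kind of dependency to keep in mind when positioning the plugin filter.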
Examples
Example: Assign the pre-defined and some custom file type groupings
The following example will result in documents that have a content type of application/json having a fileGroup metadata set to JSON Files, and documents with a content type of text/html having a fileGroup metadata set to Web Content. All other content types will be translated according to the mapping table provided above.
If the file type does not have a mapping defined, the raw value discovered by Tika will be used.
Configuration key name | Parameter 1 | Value |
---|---|---|
File type metadata class name | | fileGroup |
File type to assign | application/json | JSON Files |
File type to assign | text/html | Web Content |
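The lookup behavior described in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's code: `resolve_label`, `CUSTOM_LABELS`, and `BUILT_IN_LABELS` are hypothetical names, and the single built-in entry is a placeholder rather than a copy of the plugin's default table.

```python
# Placeholder for the plugin's built-in mapping table (assumed entry, for illustration only).
BUILT_IN_LABELS = {"application/rtf": "Rich text document"}

# Custom definitions from the example: Parameter 1 -> Value.
CUSTOM_LABELS = {
    "application/json": "JSON Files",
    "text/html": "Web Content",
}


def resolve_label(detected_type, custom=CUSTOM_LABELS, built_in=BUILT_IN_LABELS):
    """Return the label to store for a detected content type.

    Custom definitions take precedence over built-in ones. If neither table
    has an entry, the raw detected value is returned unchanged.
    """
    if detected_type in custom:
        return custom[detected_type]
    return built_in.get(detected_type, detected_type)


json_label = resolve_label("application/json")
unmapped_label = resolve_label("application/x-unknown")
```

Checking the custom table first mirrors the statement that users can override any type, and the final fallback mirrors the rule that an unmapped type keeps the raw detected value.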
Example: Assign the raw detected file types
This configuration stores the detected content type in a metadata class, rawFileType.
Configuration key name | Value |
---|---|
File type metadata class name | rawFileType |
Enable raw TIKA output | true |
If the file type metadata class already exists on a document, the detected value does not replace the existing value. The field holds the existing value as well as the value detected by this plugin.
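The multi-value behavior can be pictured with a small Python sketch. It assumes, for illustration only, that document metadata is a mapping from class name to a list of values; `add_metadata_value` is a hypothetical helper, not part of the plugin.

```python
def add_metadata_value(metadata, class_name, value):
    """Append a value to a metadata class, keeping any values already present."""
    metadata.setdefault(class_name, []).append(value)
    return metadata


# A document that already carries a value in the rawFileType class.
doc_metadata = {"rawFileType": ["existing value"]}
add_metadata_value(doc_metadata, "rawFileType", "application/pdf")
```

After the call, the class holds both the existing value and the detected one, so choose a class name that is not already in use if you want the field to contain only the plugin's output.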