Plugin: Human-readable document type
Purpose
Use this plugin if you need to curate the file type names that are recorded for documents, for example when presenting file type information in facets or result summaries.
The file type metadata produced by the crawler does not provide user-friendly labels for detected file types, nor does it apply any logical grouping when assigning the type.
If used for faceted navigation, this results in a file type facet that is not very useful, with choices such as text/html, doc, docx.
This plugin produces an alternate file type metadata field that is much better suited to use in faceted navigation and search results display, and also includes additional intelligence when detecting the file types. The plugin also provides configuration options that allow you to override and define file type labels and groupings.
By default, the plugin uses the discovered type to attach sensible labels and group the types as follows:
Discovered by Tika | Saved in metadata |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The Discovered by Tika value is equivalent to the Parameter 1 value when setting the File type to assign configuration option. The Saved in metadata value is equivalent to the value you define for the File type to assign configuration option. For example, the first entry in this table is equivalent to a plugin configuration key of:
|
Users can override any type and/or add definitions to the table.
The file type metadata can then be used in subsequent filters, or for use in front-end functionality such as faceted navigation or search result display.
The plugin uses Tika content detection to assign the file type. |
When to use this plugin
Use this plugin:
- If you want to provide a file type facet, and want to be able to change the file types presented to users (e.g. Rich text document instead of rtf), or you wish to group several file types into a single item (e.g. Microsoft Word document instead of docx, doc, dot).
- If you want to display the file type in your search result entries, and be able to control the file type shown to users.
- If you need to group different file types together so that you can implement some conditional logic in your template based on this file type.
- If you need to remove a set of documents matching a grouped file type from the search using a kill by query configuration, and you’re not easily able to exclude these documents by file name or extension.
Usage
Enable the plugin
- Select Plugins from the side navigation pane and click on the Human-readable document type tile.
- From the Location section, use the Select a data source select list to choose the data source on which you would like to enable this plugin.
The plugin will take effect after the setup steps and an advanced > full update of the data source have completed. |
Configuration settings
The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.
The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value. |
File type metadata field name
Configuration key |
|
Data type |
string |
Default value |
|
Required |
This setting is required |
Defines the metadata field name where the document type metadata will be stored. This value must be mapped using the data source metadata mapping configuration.
File type to assign
Configuration key |
|
Data type |
string |
Required |
This setting is optional |
Defines the label that will be recorded when the detected file type matches the value in the parameter 1 field.
This is only applicable when the plugin is configured not to use the raw TIKA output.
Enable raw TIKA output
Configuration key |
|
Data type |
boolean |
Default value |
|
Required |
This setting is required |
If enabled, TIKA output will be used without further translation.
Apply predefined mapping
Configuration key |
|
Data type |
boolean |
Default value |
|
Required |
This setting is required |
If enabled, the pre-defined labels (see plugin documentation) for detected file types will be used.
This is only applicable when the plugin is configured not to use the raw TIKA output.
Default file type label
Configuration key |
|
Data type |
string |
Default value |
|
Required |
This setting is optional |
Defines the label that will be set when there is no mapping for the detected file type.
This is only applicable when the plugin is configured not to use the raw TIKA output.
You must configure appropriate metadata mappings from the data source configure metadata mappings screen in order to use the file type labels that the plugin generates. The metadata field that the plugin generates is configured as HTTP header type metadata. |
All the configurable labels apply only if the plugin is configured NOT to use the raw TIKA output. If the plugin is configured to use the raw TIKA output, the labels will not be applied. A custom label overrides any pre-defined label for the same file type. The default label is used for any file type that has neither a custom label nor a pre-defined label. |
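The precedence described above can be sketched in a few lines of Python. This is an illustration only, not the plugin's implementation: the function name `resolve_label`, its parameters, and the single `PREDEFINED_LABELS` entry are hypothetical stand-ins, and the fallback to the raw detected value when no default label is set follows the behaviour described in the examples on this page.

```python
from typing import Mapping, Optional

# Hypothetical stand-in for the plugin's predefined mapping table.
# The entry below is illustrative and not taken from the plugin.
PREDEFINED_LABELS = {"application/pdf": "PDF document"}


def resolve_label(
    detected: str,
    custom: Mapping[str, str],
    use_raw: bool,
    apply_predefined: bool,
    default_label: Optional[str] = None,
    predefined: Mapping[str, str] = PREDEFINED_LABELS,
) -> str:
    """Return the label to record for a detected file type."""
    # Raw output enabled: no label translation is applied at all.
    if use_raw:
        return detected
    # A custom label overrides any predefined label for the same type.
    if detected in custom:
        return custom[detected]
    # Predefined labels are consulted only when that option is enabled.
    if apply_predefined and detected in predefined:
        return predefined[detected]
    # The default label covers everything that is still unmapped.
    if default_label:
        return default_label
    # Assumed fallback: keep the raw detected value when nothing matches.
    return detected


custom_labels = {"application/json": "JSON Files", "text/html": "Web Content"}
print(resolve_label("text/html", custom_labels, use_raw=False, apply_predefined=True))
```

Calling the function with `use_raw=True` returns the detected type unchanged, regardless of which labels are configured.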
If you are crawling the PDF file, you must ensure that this plugin filter - com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter is running before |
Filter chain configuration
This plugin uses filters which are used to apply transformations to the gathered content.
The filters run in sequence and need to be set in an order that makes sense. The plugin supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.
Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation. |
Filter classes
This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter
Drag the com.funnelback.plugins.humanfriendlydoctype.HumanFriendlyDocTypeFilter plugin filter to where you wish it to run in the filter chain sequence.
Examples
Example: Assign the pre-defined and some custom file type groupings
The following example will result in documents that have a content type of application/json having a fileGroup metadata set to JSON Files, and documents with a content type of text/html having a fileGroup metadata set to Web Content. All other content types will be translated according to the mapping table provided above.
If the file type does not have a mapping defined, the raw value discovered by Tika will be used.
Configuration key name | Parameter 1 | Value |
---|---|---|
File type metadata class name |
|
|
File type to assign |
|
|
File type to assign |
|
|
Enable raw TIKA output |
|
|
Apply predefined mapping |
|
Example: Assign the default label and some custom file type groupings
The following example will result in documents with a content type of text/html having a fileGroup metadata set to Web Content and the rest of the files will be assigned to the default label Other Files.
Configuration key name | Parameter 1 | Value |
---|---|---|
File type to assign |
|
|
Enable raw TIKA output |
|
|
Apply predefined mapping |
|
|
Default file type label |
|
Example: Assign the raw detected file types
This configuration stores the detected content type in a metadata class, rawFileType.
Configuration key name | Value |
---|---|
File type metadata class name |
|
Enable raw TIKA output |
true |
If the file type metadata class already exists, the detected value will not override the first value. The field will hold the existing value as well as the value detected by this plugin. |
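The effect of this behaviour can be pictured as a multi-valued field. The snippet below is a minimal sketch that assumes metadata is held as a mapping from field name to a list of values; the function `add_detected_type` is hypothetical, and details such as value ordering or de-duplication are assumptions rather than documented plugin behaviour.

```python
from typing import Dict, List


def add_detected_type(
    metadata: Dict[str, List[str]], field: str, detected: str
) -> Dict[str, List[str]]:
    """Add the detected type to a field without replacing existing values."""
    # setdefault keeps any values already present and creates the field if absent.
    metadata.setdefault(field, []).append(detected)
    return metadata


# A document that already carries a value in the same metadata class.
doc_metadata = {"rawFileType": ["pdf"]}
add_detected_type(doc_metadata, "rawFileType", "application/pdf")
print(doc_metadata["rawFileType"])
```

If a single clean value is needed for a facet, choosing a metadata class that no other source populates avoids the mixed values shown here.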