Metadata classes and mapping
Funnelback reads metadata as part of the process of building the search indexes. The source and type of any metadata that should be included must be configured before it will be indexed. The configuration involves identifying the metadata source (e.g. a metadata field name), assigning a type and defining a mapping to a Funnelback metadata class. Once this has been configured and the index rebuilt, the metadata becomes available in the search index.
Metadata standards
There are many schemas for representing metadata. Funnelback has pre-configured support for the following metadata standards, and the default configuration automatically detects HTML pages that contain meta tags from these schemas.
- Basic web page metadata (author, keywords, description).
Metadata classes
Funnelback uses metadata classes to organise and access metadata. Each metadata class has a unique identifier and a type, and one or more metadata sources are mapped to it.
Metadata classes are also used when configuring other Funnelback features that rely on metadata, such as faceted navigation and contextual navigation.
Metadata class configuration
Metadata classes are configured using the customize metadata mappings function in the search dashboard. The tool allows for metadata classes to be created, edited and deleted, and for metadata sources to be mapped to the metadata classes.
Mapping metadata sources to classes
Metadata can come from a number of different sources, and these sources must be mapped to a metadata class in order for the metadata to be searchable by Funnelback.
Funnelback currently supports the following sources of metadata:
- HTML meta tags such as <meta name="dc.title" content="This is the title">
- HTML tags such as <h1> or <title>
- HTTP headers such as Content-Type or X-Funnelback-Custom-Header
- XML X-Paths such as /books/book/title, //author or //attributes@isbn. Note: not all X-Paths are supported.
A metadata source can only be mapped to a single Funnelback metadata class. However, multiple metadata sources can be mapped to the same class. This is done when different fields should all be treated as the same thing (e.g. dc.description, dcterms.description, description).
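The one-class-per-source rule can be pictured as a dictionary keyed by source name. The following is a minimal sketch for illustration only: `MetadataMappings`, `map_source`, `sources_for` and the class ID `description` are hypothetical names, not part of Funnelback's configuration format or API.

```python
class MetadataMappings:
    """Toy model: each source maps to one class; a class may have many sources."""

    def __init__(self):
        self._class_for_source = {}  # source name -> class ID

    def map_source(self, source, class_id):
        """Map a source to a class, rejecting a second, different class."""
        existing = self._class_for_source.get(source)
        if existing is not None and existing != class_id:
            raise ValueError(f"{source!r} is already mapped to class {existing!r}")
        self._class_for_source[source] = class_id

    def sources_for(self, class_id):
        """Return all sources mapped to the given class, sorted by name."""
        return sorted(s for s, c in self._class_for_source.items() if c == class_id)


mappings = MetadataMappings()
# Three differently named fields that should be treated as the same thing.
for source in ("dc.description", "dcterms.description", "description"):
    mappings.map_source(source, "description")  # hypothetical class ID
```

Keying the dictionary by source makes the rule structural: a source cannot point at two classes, while nothing prevents several sources from sharing one class.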
Metadata sources are grouped into two types:
- HTML type sources (HTML meta tags, HTML tags and HTTP headers) are only considered when indexing web pages and PDF documents. HTML source names are case insensitive (so DCTERMS.subject and dcterms.subject are both matched).
- XML type sources are only considered when indexing an XML document. XML source names are case sensitive (so /item/subject is not the same as /item/Subject).
Note: Funnelback uses XML to represent most non-web collection types, such as database and LDAP directory records, social media content, and JSON and CSV files. An XML source configuration is required when configuring metadata for these.
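The difference in case handling between the two source types can be sketched as a name-comparison helper. This is an illustration of the rule described above, not Funnelback's implementation; the function names and the `"html"` / `"xml"` type labels are assumptions made for the example.

```python
def normalise_source(name, source_type):
    """Return the key used to compare source names.

    HTML-type names are compared case-insensitively; XML-type names
    (X-Paths) are compared exactly as written.
    """
    if source_type == "html":
        return name.lower()
    if source_type == "xml":
        return name
    raise ValueError(f"unknown source type: {source_type!r}")


def same_source(a, b, source_type):
    """True if two source names refer to the same source for this type."""
    return normalise_source(a, source_type) == normalise_source(b, source_type)
```

With this helper, `same_source("DCTERMS.subject", "dcterms.subject", "html")` is true, while `same_source("/item/subject", "/item/Subject", "xml")` is false.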
Metadata class types
Funnelback supports five types of metadata classes:
- Text: The content of this class is a string of text.
- Number: The content of this class is a numeric value, which Funnelback will interpret as a number. This type should only be used if there is a need to use numeric operators when performing a search (e.g. X > 2050). If the field is only required for display within the search results, a text type metadata class is sufficient.
- Date: Funnelback supports a single date class and will use the values mapped to this class to determine a date for the document for the purposes of ranking, sorting and date range search. If additional dates are required, they should be configured as either text or number type metadata classes.
- Geospatial x/y coordinate: The content of this class is a decimal lat/long value in the format geo-x;geo-y (e.g. 40.6976684;-74.260555). This type should only be used if there is a need to perform a geospatial search (e.g. this point is within X km of another point). If the geospatial coordinate is only required for plotting items on a map, a text type metadata class is sufficient.
- Document permissions: The content of this class is a security lock string defining the document permissions. This type should only be used when working with an enterprise collection that includes document level security and requires a document permissions metadata field.
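Because the geospatial type expects a specific geo-x;geo-y layout, it can be useful to check values before they are indexed. The sketch below assumes, based on the example value above, that the first number is the latitude and the second the longitude; `parse_geo` and its range check are hypothetical and are not part of Funnelback.

```python
def parse_geo(value):
    """Parse a 'geo-x;geo-y' string into a (latitude, longitude) pair of floats.

    Assumption: latitude comes first, as in the example 40.6976684;-74.260555.
    """
    parts = value.split(";")
    if len(parts) != 2:
        raise ValueError(f"expected 'geo-x;geo-y', got {value!r}")
    lat, lon = (float(part.strip()) for part in parts)
    # Range check is an extra sanity test added for this sketch.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of range: {value!r}")
    return lat, lon
```

A value that uses a comma separator, or that swaps the order so the latitude falls outside the range -90 to 90, raises `ValueError` instead of being passed on silently.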
Metadata class search behaviour
Text-type metadata is included in the search index for one of two main reasons, and the reason determines how the metadata is treated when a user runs a search:
- Display only: The contents of this field will be indexed but are not considered document content by Funnelback and will not influence ranking when a query is run. The value can be used for display purposes. It can be searched only when the field is explicitly specified in the search query (e.g. author:shakespeare). Display only metadata will not be included in spelling suggestions unless spelling.suggestion_sources is configured to include the class.
- Searchable as content: The contents of this field will be considered part of the document content by Funnelback. Queries will match within this field and the value will contribute to the document’s ranking. The contents of this field will also be automatically considered for spelling and simple auto-completion suggestions. As with display only metadata, the field can also be searched explicitly and its content can be used for display purposes.
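The two behaviours can be illustrated with a toy matcher. This is a deliberately simplified sketch that uses substring matching and ignores ranking entirely; the document structure, the `matches` function and the class IDs `author` and `title` are hypothetical and do not reflect how Funnelback evaluates queries internally.

```python
def matches(doc, query):
    """Return True if the query matches the document in this toy model.

    A query of the form 'class:term' searches only that metadata class,
    whatever its behaviour. Any other query searches the page content plus
    metadata classes whose behaviour is 'content'.
    """
    class_id, sep, term = query.partition(":")
    if sep:  # fielded query such as author:shakespeare
        entry = doc["metadata"].get(class_id)
        return entry is not None and term.lower() in entry["value"].lower()
    texts = [doc["content"]]
    texts += [m["value"] for m in doc["metadata"].values() if m["behaviour"] == "content"]
    return any(query.lower() in text.lower() for text in texts)


doc = {
    "content": "To be, or not to be, that is the question.",
    "metadata": {
        "author": {"behaviour": "display", "value": "William Shakespeare"},
        "title": {"behaviour": "content", "value": "Hamlet"},
    },
}
```

In this model the plain query `shakespeare` does not match, because the author field is display only, whereas `author:shakespeare` and the plain query `hamlet` both match.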
Metadata field size limits
By default the indexer limits the size of a metadata field to 2048 characters. Metadata fields larger than this will be truncated. The indexer option -mdsfml can be used to control this value. See: indexer options for more information.
Metadata can also be truncated at query time if the metadata buffer is too small. The -MBL query processor option controls the size of the metadata buffer. See: query processor options for more information.
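To see which fields would be affected by the default index-time limit before changing any options, a quick check over the source data can help. The sketch below only mimics the effect of the limit; the function names and the sample record are hypothetical, and Python string length is used as an approximation of the character count.

```python
DEFAULT_FIELD_LIMIT = 2048  # default limit stated above


def truncate_field(value, limit=DEFAULT_FIELD_LIMIT):
    """Keep at most `limit` characters of a metadata value."""
    return value[:limit]


def oversized_fields(record, limit=DEFAULT_FIELD_LIMIT):
    """Return {field name: length} for values that would be truncated."""
    return {name: len(value) for name, value in record.items() if len(value) > limit}


# Hypothetical record with one field that exceeds the default limit.
record = {"title": "Short title", "abstract": "x" * 3000}
```

Running `oversized_fields(record)` on the data for a collection gives a rough indication of whether the default limit is adequate or whether the indexer option needs adjusting.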
Reserved and special classes
Reserved classes
The following metadata classes are reserved for internal use, and should not normally be used for other purposes.
Class ID | Function |
---|---|
| | Outgoing link target information. |
| | Image information ( |
| | Anchor text referring to the document (text within |
| | Email addresses within the document ( |
| | URL hostname information. |
| | URL path and filename information. |
| | User click information referring to the document. |
Note: Any metadata class that starts with Fun, or any upper or lower case variation, is also reserved. These classes are used for internal functionality such as content and accessibility auditing.
Special classes
The following metadata classes are treated specially by Funnelback. It may be appropriate to map metadata into them, but they will be treated differently internally as described below.
Metadata class | Explanation | Default mappings |
---|---|---|
| | Used for document date information. Date sources will be used when the document date is displayed and for recency related ranking. | |
| | Used for file format information. Will be used as the original type of a file (e.g. HTML, PDF, Word Document) where this information is displayed. Note: if you require a human-friendly result type to be indexed for documents, use the human-readable document type plugin. | |
| | Used for title information. The first title found will be used when the document title is displayed and all title content will be up-weighted by default in ranking. For html documents the title in | |
| | (At query time) match anywhere. In any metadata field or in the page content. | N/A |
Predefined classes
Funnelback predefines the following classes. This allows it to look for common metadata within HTML documents and display it on the search results page without heavy customization.
Class ID | Type | Behaviour | Explanation | Metadata fields included |
---|---|---|---|---|
| | text | display | | |
| | text | display | | |
| | text | content | Description | |
| | text | display | | |
| | date | Date | | |
| | text | display | Format | |
| | text | display | | |
| | text | display | Document MD5 | |
| | text | display | Availability/identifier | |
| | text | display | Image thumbnail URL | |
| | text | content | | |
| | text | display | | |
| | text | display | Legal mandate | |
| | text | display | | |
| | text | display | | |
| | text | display | | |
| | text | display | | |
| | text | content | Title | |
|
text |
display |
|