Metadata classes and mapping

Metadata is read by Funnelback as part of the process of building the search indexes. The source and type of any metadata that should be included need to be configured before it will be indexed. The configuration involves identifying the metadata source (e.g. a metadata field name), assigning a type and defining a mapping to a Funnelback metadata class. Once this has been configured, the metadata becomes available in the search index.

Metadata standards

There are many schemas for representing metadata. Funnelback has pre-configured support for several metadata standards, and HTML pages containing meta tags from these schemas are detected automatically with the default configuration. The fields covered are listed under the predefined classes below.

Metadata classes

Metadata classes are used by Funnelback to organise and access metadata. Each metadata class is given a unique identifier and a type, and one or more metadata sources are mapped to it.

Metadata classes are used when configuring other Funnelback features that use metadata, such as faceted navigation and contextual navigation.

Metadata class configuration

Metadata classes are configured using the customize metadata mappings function in the search dashboard. The tool allows for metadata classes to be created, edited and deleted, and for metadata sources to be mapped to the metadata classes.

Metadata class ID

Metadata classes are named using a unique identifier consisting of an ASCII alphanumeric (A-Za-z0-9) string up to 64 characters in length, which must NOT start with FUN in any combination of upper or lower case. Funnelback has some predefined metadata classes which should be used when possible.
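
The naming rule above can be checked with a few lines of Python. This is only a sketch of the rule as stated in this section; the function name is hypothetical, it does not check the single-letter reserved classes, and Funnelback's own validation may apply further checks.

```python
import re

# Assumption: the pattern restates the rule in the prose above
# (ASCII letters and digits only, between 1 and 64 characters).
_CLASS_ID_PATTERN = re.compile(r"[A-Za-z0-9]{1,64}")


def is_valid_class_id(class_id: str) -> bool:
    """Return True if class_id follows the documented naming rule."""
    if _CLASS_ID_PATTERN.fullmatch(class_id) is None:
        return False
    # IDs starting with "fun" in any mix of case are reserved.
    return not class_id.lower().startswith("fun")
```

For example, `is_valid_class_id("author")` is accepted, while `"FunAudit"` (reserved prefix) and `"dc.title"` (contains a dot) are rejected.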

Mapping metadata sources to classes

Metadata can come from a number of different sources, and these sources must be mapped to a metadata class in order for the metadata to be searchable by Funnelback.

Funnelback currently supports the following sources of metadata:

  • HTML meta tags such as <meta name="dc.title" content="This is the title">

  • HTML tags such as <h1> or <title>

  • HTTP headers such as Content-Type or X-Funnelback-Custom-Header

  • XML X-Paths such as /books/book/title, //author or //attributes@isbn. Note: not all X-Paths are supported.
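
To make the first source type concrete, the sketch below collects name/content pairs from HTML meta tags using Python's standard html.parser module. It illustrates what a metadata source looks like, not how the Funnelback indexer reads documents; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List


class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: Dict[str, List[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        # Skip meta tags that carry no name/content pair (e.g. charset).
        if name and content is not None:
            self.fields.setdefault(name, []).append(content)


def collect_meta_fields(html: str) -> Dict[str, List[str]]:
    """Return a mapping of meta tag name to the list of its content values."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return collector.fields
```

Each key in the returned dictionary (such as dc.title) is a candidate metadata source that could then be mapped to a metadata class.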

A metadata source can only be mapped to a single Funnelback metadata class. However, multiple metadata sources can be mapped to the same Funnelback metadata class. This is done when different fields should all be treated as the same thing (e.g. dc.description, dcterms.description, description).

Metadata sources are grouped into two types:

  • HTML type sources (HTML meta tags, HTML tags and HTTP headers) are only considered when indexing web pages and PDF documents. HTML source names are case insensitive (so DCTERMS.subject and dcterms.subject are both matched).

  • XML type sources are only considered when indexing an XML document. XML source names are case sensitive (so /item/subject is not the same as /item/Subject).
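
The mapping and case rules above can be modelled as two lookup tables. This is a minimal sketch: the table contents and the resolve_class function are hypothetical and stand in for whatever configuration the search dashboard stores.

```python
from typing import Optional

# Hypothetical mappings. Each source appears once (one source, one class),
# while several sources may point at the same class.
HTML_MAPPINGS = {
    "dc.description": "c",
    "dcterms.description": "c",
    "description": "c",
    "dcterms.subject": "keyword",
}
XML_MAPPINGS = {
    "/item/subject": "keyword",
}


def resolve_class(source: str, source_type: str) -> Optional[str]:
    """Return the metadata class a source maps to, or None if unmapped."""
    if source_type == "html":
        # HTML source names are case insensitive, so compare in lower case.
        return HTML_MAPPINGS.get(source.lower())
    if source_type == "xml":
        # XML source names are case sensitive, so compare exactly.
        return XML_MAPPINGS.get(source)
    raise ValueError(f"unknown source type: {source_type!r}")
```

With these tables, DCTERMS.subject and dcterms.subject both resolve to keyword, whereas /item/Subject does not match the mapping for /item/subject.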

Funnelback uses XML to represent most non-web collection types such as database and LDAP directory records, social media content, JSON and CSV files. The XML source configuration is required when configuring metadata for these.

Metadata class types

Funnelback supports five types of metadata classes:

  • Text: The content of this class is a string of text.

  • Number: The content of this field is a numeric value. Funnelback will interpret this as a number. This type should only be used if there is a need to use numeric operators when performing a search (e.g. X > 2050). If the field is only required for display within the search results a text type metadata class is sufficient.

  • Date: Funnelback supports a single date class and will use the values mapped to this class to determine a date for the document for the purpose of ranking, sorting and also date range search. If additional dates are required they should be configured as either text or number type metadata classes.

  • Geospatial x/y coordinate: The content of this field is a decimal latlong value in the following format: geo-x;geo-y (e.g. 40.6976684;-74.260555). This type should only be used if there is a need to perform a geospatial search (e.g. this point is within X km of another point). If the geospatial coordinate is only required for plotting items on a map then a text type metadata class is sufficient.

  • Document permissions: The content of this field is a security lock string defining the document permissions. This type should only be used when working with an enterprise collection that includes document level security and specifies the requirement of a document permissions metadata field.
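
As an illustration of the geospatial format and the kind of "within X km" test it enables, the sketch below parses a geo-x;geo-y value and compares two points with the haversine formula. It assumes the first number is the latitude and the second the longitude, uses a mean Earth radius of 6371 km, and all function names are hypothetical; it is not Funnelback's implementation.

```python
import math
from typing import Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def parse_geo(value: str) -> Tuple[float, float]:
    """Parse 'geo-x;geo-y' into (latitude, longitude) in decimal degrees."""
    x_text, y_text = value.split(";")
    return float(x_text), float(y_text)


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within_km(point: str, origin: str, radius_km: float) -> bool:
    """True if the two geo-x;geo-y values are at most radius_km apart."""
    return haversine_km(parse_geo(point), parse_geo(origin)) <= radius_km
```

A value that is only plotted on a map never needs this comparison, which is why a text type class is sufficient in that case.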

Metadata class search behaviour

Text-type metadata is included in the search index for one of two main reasons, and the reason affects how the metadata is treated when a user makes a search:

  • Display only: The contents of this field are indexed but are not considered document content by Funnelback and do not influence ranking when a query is made. The value can be used for display purposes. It can be searched only when the field is explicitly named in the search query (e.g. author:shakespeare). Display only metadata will not be included in spelling suggestions unless spelling.suggestion_sources is configured to include the class.

  • Searchable as content: The contents of this field are considered part of the document content by Funnelback. Queries will match within this field and the value will contribute to the document's ranking. The contents of this field are also automatically considered for spelling and simple auto-completion suggestions. As with display-only metadata, the field can be explicitly searched and its content can also be used for display purposes.
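
The difference between the two behaviours can be shown with a toy matcher. This is a deliberately simplified model under stated assumptions: the Document type, the BEHAVIOUR table and the matches function are hypothetical, matching is whole-word and case insensitive, and ranking is not modelled at all.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Document:
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical class configuration: class ID -> search behaviour.
BEHAVIOUR = {"author": "display", "c": "content"}


def matches(doc: Document, query: str) -> bool:
    """Return True if a single-term query matches the document."""
    class_id, sep, term = query.partition(":")
    if sep:
        # Fielded query (e.g. author:shakespeare): look only in that class,
        # whatever its behaviour.
        return term.lower() in doc.metadata.get(class_id, "").lower().split()
    # Unfielded query: page content plus "searchable as content" classes.
    searchable = [doc.content]
    searchable += [v for k, v in doc.metadata.items() if BEHAVIOUR.get(k) == "content"]
    return query.lower() in " ".join(searchable).lower().split()
```

With a document whose author class holds "William Shakespeare", the query author:shakespeare matches but the plain query shakespeare does not, because author is display only in this configuration.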

Metadata field size limits

By default the indexer limits the size of a metadata field to 2048 characters. Metadata fields larger than this will be truncated. An indexer option -mdsfml can be used to control this value. See: indexer options for more information.

Metadata can also be truncated at query time if the metadata buffer is too small. The -MBL query processor option controls the size of the metadata buffer. See: query processor options for more information.

Reserved and special classes

Reserved classes

The following metadata classes are reserved for internal use, and should not normally be used for other purposes.

Class ID | Function
-------- | --------
h | Outgoing link target information.
i | Image information (alt and src attributes of img tags).
k | Anchor text referring to the document (text within a tags).
m | Email addresses within the document (a tags using mailto: in href attributes).
u | URL hostname information.
v | URL path and filename information.
K | User click information referring to the document.
Fun* | Any metadata class that starts with Fun or any upper or lower case variation is also reserved. These classes are used for internal functionality such as content and accessibility auditing.

Special classes

The following metadata classes are treated specially by Funnelback. It may be appropriate to map metadata into them, but they will be treated differently internally as described below.

Metadata class | Explanation | Default mappings
-------------- | ----------- | ----------------
d | Used for document date information. Date sources will be used when the document date is displayed and for recency related ranking. | dc.date (and qualifications like dc.date.published not mentioned thereafter), dc.date.modified, dc.date.created, dc.date.issued, Last Modified Date (from HTTP headers), dc.date.expires, dc.date.valid, in order of decreasing priority. See supported date formats for more information.
f | Used for file format information. Will be used as the original type of a file (e.g. HTML, PDF, Word Document) where this information is displayed. Note: if you require a human-friendly result type to be indexed for documents, use the human-readable document type plugin. | dc.format, funnelback.format, text/html
t | Used for title information. The first title found will be used when the document title is displayed and all title content will be up-weighted by default in ranking. For HTML documents the title in <title> will typically be preferred. | <title>, dc.title, DCTERMS.title, trim.title, og:title, twitter:title, <h1>, <h2>, <h3>, <h4>
* | (At query time) match anywhere: in any metadata field or in the page content. | N/A
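
The "decreasing priority" rule for the d class can be read as a first-match lookup over an ordered list of sources. The sketch below shows that reading only; the pick_document_date function and the last-modified key (standing in for the Last Modified Date HTTP header) are hypothetical, qualified variants of dc.date are not handled, and no date parsing is attempted.

```python
from typing import Dict, Optional

# Default date sources for class d, highest priority first, as listed above.
DATE_SOURCE_PRIORITY = [
    "dc.date",
    "dc.date.modified",
    "dc.date.created",
    "dc.date.issued",
    "last-modified",  # hypothetical key for the Last Modified Date HTTP header
    "dc.date.expires",
    "dc.date.valid",
]


def pick_document_date(sources: Dict[str, str]) -> Optional[str]:
    """Return the value of the highest-priority date source that is present."""
    # HTML source names are case insensitive, so normalise the keys first.
    normalised = {name.lower(): value for name, value in sources.items()}
    for name in DATE_SOURCE_PRIORITY:
        value = normalised.get(name)
        if value:
            return value
    return None
```

For a page carrying both dc.date.created and dc.date.modified, this returns the dc.date.modified value because it appears earlier in the priority list.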

Predefined classes

Funnelback has predefined the following classes. This allows Funnelback to look for some metadata within HTML documents and display this data on the search results page without heavy customization.

Class ID | Type | Behaviour | Explanation | Metadata fields included
-------- | ---- | --------- | ----------- | ------------------------
audience | text | display | | agls.audience, AGLSTERMS.audience, DCTERMS.audience
author | text | display | | author, dc.author, dc.creator, DCTERMS.creator, dc.contributor, trim.authorloc, twitter:creator
c | text | content | Description | dc.description, DCTERMS.description, og:description, description, subject, twitter:description
coverage | text | display | | dc.coverage, DCTERMS.Coverage
d | date | | Date | AGLSTERMS.dateLicensed, dc.date, dc.date.issued, dc.date.created, dc.date.modified, dc.date.valid, dc.date.expires, dc.date.review, DCTERMS.date, date, lastSaved, creation-date, trim.datereg, dcterms.modified, article:published_time, article:modified_time, video:release_date
f | text | display | Format | dc.format, DCTERMS.format
function | text | display | | agls.function, AGLSTERMS.function
H | text | display | Document MD5 | X-Funnelback-MD5
identifier | text | display | Availability/identifier | dc.identifier, DCTERMS.identifier, trim.number, agls.availability, AGLSTERMS.availability
image | text | display | Image thumbnail URL | image, og:image, twitter:image, twitter:image0:src, twitter:image1:src, twitter:image2:src, twitter:image3:src
keyword | text | content | | article:tag, keywords, dc.subject, DCTERMS.subject, video:tag
language | text | display | | dc.language, DCTERMS.language, og:locale
mandate | text | display | Legal mandate | agls.mandate
publisher | text | display | | dc.publisher, DCTERMS.publisher
relation | text | display | | dc.relation, DCTERMS.relation
rights | text | display | | dc.rights, DCTERMS.rights
source | text | display | | dc.source, DCTERMS.source
t | text | content | Title | title, dc.title, DCTERMS.title, trim.title, og:title, twitter:title, <h1>, <h2>, <h3>, <h4>
type | text | display | | dc.type, og:type, DCTERMS.type, twitter:card