Implementer training - metadata
What is metadata?
Metadata is information about a document. Metadata can come from a number of sources - it can be defined in HTML <meta> tags within a web page (such as description metadata), other HTML tags (such as the document’s <title> element), or inferred from other information about an item (such as the size of the document in bytes, or the URL of the document).
In Funnelback, metadata is also used for fielded information, such as that sourced from XML, JSON, CSV or databases.
Why use metadata?
Metadata can be used in a number of different ways by Funnelback to enhance your search results and provide users with an enhanced search experience.
In general metadata can be used in the following ways:
- To provide additional keyword data for the purposes of improving ranking.
- To provide information that can be used for display purposes (such as for custom result summaries).
- As a way of classifying documents, allowing for enhanced functionality (such as filtering search results by fields of particular values).
- As additional information that Funnelback can use for other features (such as to generate structured auto-completion).
Metadata can also be incorporated from an external source (such as a database) by producing external metadata. External metadata is covered in a later exercise.
The screenshot below illustrates how metadata can be used to enhance your search results. For this search the metadata values contained within tag metadata are used to produce faceted navigation that allows the search results to be filtered based on the selected tags. A metadata field containing a URL to a thumbnail image is also used to add a thumbnail to each search result that has a defined thumbnail image.
The amount of enhancement that can be achieved is limited only by the metadata you have available. Metadata could be used, for example, to conditionally display things in the search results:
- If it’s a video then open a JavaScript lightbox to play the video when you click on the link.
- Display a curated description for items where description metadata has been authored.
- Display some extra fielded information about each item (type, location, position information).
- Provide a small map for each result that contains geospatial metadata, plotting the item on the map and populating an info bubble.
- Display file type icons next to each binary file search result.
Metadata classes and mapping
Metadata is read by Funnelback as part of the process of building the search indexes. The source and type of any metadata that should be included needs to be configured before it will be indexed.
The configuration involves identifying the metadata source (e.g. a metadata field name), assigning a type and defining a mapping to a Funnelback metadata class. Once this has been configured, metadata will become available in the search index.
Metadata classes
Metadata classes are used by Funnelback to organise and access metadata. Metadata classes are given a unique identifier and type and one or more metadata sources are mapped to each class.
Metadata classes are used when configuring other Funnelback features that use metadata such as faceted navigation or contextual navigation.
Mapping metadata sources to classes
Metadata can come from a number of different sources, and these sources must be mapped to a metadata class in order for the metadata to be searchable by Funnelback.
Funnelback currently supports the following sources of metadata:
- HTML meta tags such as <meta name="dc.title" content="This is the title">.
- HTML tags such as <h1> or <title>. Note: CSS-class style selectors are not supported.
- HTTP headers such as Content-Type or X-Funnelback-Custom-Header.
- XML X-Paths such as /books/book/title or //author. Note: not all X-Paths are supported.
A metadata source can only be mapped to a single Funnelback metadata class; however, multiple metadata fields can be mapped to the same Funnelback metadata class. This is useful if there are different fields that contain the same functional information (e.g. dc.description, dcterms.description, description).
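The mapping rule above can be pictured with a small Python sketch. This is only an illustrative model, not Funnelback code: the MetadataMappings class and its method names are hypothetical, and mapping the three description fields to the c class is an assumption made for the example.

```python
class MetadataMappings:
    """Toy model: each source maps to one class; a class may have many sources."""

    def __init__(self):
        # source name -> metadata class name
        self._class_for_source = {}

    def map_source(self, source, metadata_class):
        """Map a source to a class, refusing to map it to a second class."""
        existing = self._class_for_source.get(source)
        if existing is not None and existing != metadata_class:
            raise ValueError(f"{source!r} is already mapped to class {existing!r}")
        self._class_for_source[source] = metadata_class

    def sources_for(self, metadata_class):
        """Return every source mapped to the given class, sorted by name."""
        return sorted(
            s for s, c in self._class_for_source.items() if c == metadata_class
        )


mappings = MetadataMappings()
# Several fields carrying the same functional information share one class.
for source in ("dc.description", "dcterms.description", "description"):
    mappings.map_source(source, "c")
```

In this model, trying to map description to a second class raises an error until it is removed from the first, which mirrors the remove-then-add workflow used when customizing mappings.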
Metadata sources are grouped into two types:
- HTML type sources (HTML <meta> tags, HTML tags and HTTP headers) are only considered when indexing web pages and PDF documents.
- XML type sources are only considered when indexing an XML document.
Funnelback uses XML to represent most non-web data source types such as database and LDAP directory records, social media content, JSON and CSV files. The XML source configuration is required when configuring metadata for these.
Metadata class types
Funnelback supports five types of metadata classes:
- Text: The content of this class is a string of text.
- Geospatial x/y coordinate: The content of this field is a decimal latlong value in the following format: geo-x;geo-y (e.g. 2.5233;-0.95). This type should only be used if there is a need to perform a geospatial search (e.g. this point is within X km of another point). If the geospatial coordinate is only required for plotting items on a map then a text type field is sufficient.
- Number: The content of this field is a numeric value. Funnelback will interpret this as a number. This type should only be used if there is a need to use numeric operators when performing a search (e.g. X > 2050). If the field is only required for display within the search results a text type metadata class is sufficient.
- Document permissions: The content of this field is a security lock string defining the document permissions. This type should only be used when working with an enterprise search that includes document level security.
- Date: A single metadata class supports a date, which is used as the document’s date for the purpose of relevance and date sorting. Additional dates for the purpose of display can be indexed as either a text or number type metadata class.
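As a rough illustration of the geospatial format described above, the following Python sketch splits a geo-x;geo-y string into two decimal numbers. The parse_geo helper is hypothetical and makes no claim about which component is latitude and which is longitude; it only shows the shape of the value.

```python
def parse_geo(value):
    """Split a 'geo-x;geo-y' string into a pair of floats."""
    parts = value.split(";")
    if len(parts) != 2:
        raise ValueError(f"expected two values separated by ';', got {value!r}")
    return float(parts[0]), float(parts[1])


# The example value given for the geospatial class type.
point = parse_geo("2.5233;-0.95")
```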
Metadata class search behavior
Text-type metadata is included in the search index for two main reasons - and this affects how the metadata is considered when a user makes a search.
- Display only: The contents of this field are indexed but are not considered as document content by Funnelback and will not influence the ranking when making a query. The value can be used for display purposes. It can be searched over only when the field is explicitly defined in the search query (e.g. author:shakespeare).
- Searchable as content: The contents of this field are considered a part of the document content by Funnelback. Queries will match within this field and the value will contribute to the document’s ranking. The contents of this field will also be automatically considered for spelling and simple auto-completion suggestions. As for display-only metadata, the field can be explicitly searched and the content of this field can also be used for display purposes.
Configuring metadata mappings
Metadata mappings are configured for a data source, with search packages and results pages inheriting the metadata classes from all the attached data sources.
Metadata mappings in Funnelback are defined using the search dashboard configure metadata mappings tool, which can be accessed via the data source configuration.
The metadata mapping configuration is accessed from the settings panel of the data source management screen.
Run an initial update of your data source before configuring the metadata mappings. This is because the indexer auto-detects available metadata sources when it scans the content. This in turn makes setting up the metadata mappings a lot simpler because the detected sources can be picked from a list of available sources.
The metadata mappings screen lists the metadata mappings that are defined for the current data source. Funnelback’s default configuration also includes commonly-used metadata from HTML and binary file content. Metadata fields from the Dublin Core, AGLS and Open Graph standards have pre-defined mappings as part of the default configuration.
The metadata mappings listing also provides information on the number of documents where the metadata is detected and allows the list to be filtered by keyword, or metadata content type (HTML/XML).
Additional mappings are created by clicking the add new button, and existing mappings can be modified by clicking on the metadata class name in the listing. The metadata mapping editor screen allows configuration of the mapping, allowing multiple metadata sources to be mapped to a single metadata class.
The metadata sources screen presents a list of metadata sources that were detected during indexing as well as commonly used HTML tags. In addition, metadata sources can be manually added as a source.
Tutorial: Configuring metadata
- Log in to the search dashboard where you are doing your training.
  See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
- Locate the silent films search package.
- Run a search for death against the silent films search results page. A set of results similar to those below will be returned. The results are returned using Funnelback’s default search results template.
- Click on the search result for The car of death. The page will load in your browser.
- View the page source and inspect the meta tags in the head region of the document.
- Note down any metadata fields that are listed that contain information that might be useful for the search. Useful information is information that you might use in the search results (e.g. a good description, or URL to a thumbnail image) or keyword metadata (that can be used to contribute to the discoverability of the document). Inspect a few other result pages as well to see if the same fields exist in other pages and if they also seem useful. In the above example the open graph (og) tags appear to have useful content that could add value to the search results listing.
- Log in to the search dashboard and open the silent films - website data source management screen.
- Open the metadata mappings configuration by selecting configure metadata mappings from the settings section.
- The currently defined mappings are listed. The default list contains pre-defined mappings of commonly used HTML metadata fields. For many searches this will be sufficient without adding any additional mappings.
- The search result summaries could be enhanced by utilizing text sourced from the title, description and image metadata, from the tags that we identified (og:title, og:description and og:image). Confirm that the og:title, og:description and og:image metadata sources are mapped to classes by typing each into the filter metadata mappings box above the metadata listing. Because these tags all appear in the current set of mappings, the metadata has already been included in the search index and is available for use in the search results.
- In order to use metadata in your search results template four conditions need to be satisfied:
  - The metadata source needs to exist in the content that is being indexed (e.g. the HTML page needs to have meta tags populated with values).
  - The metadata source needs to be mapped to a Funnelback metadata class.
  - The metadata class needs to be included in the list of results page summary fields to display (if this is not set anywhere it will default to returning the author, c, publisher and keyword metadata classes) and the results page must not have summaries disabled (SM=off).
  - The template needs to be configured to print out the metadata appropriately.
- Make a note of the class names to which the og:title, og:description and og:image sources are mapped. You will need these for future steps.
- Return to the search results for the search for death and view the JSON response (reminder: edit the URL so that search.html reads search.json). Inspect the results element that sits underneath the response element and observe the available result item fields and how they are used in the HTML search results page.
- Inspect the result for The car of death (we previously inspected its HTML source for suitable metadata tags). Observe all the fields that are available within the JSON (or XML) response. Each of these fields can be returned within the result set. Available metadata is returned within the listMetadata sub element, which is currently displaying metadata from the c metadata class. Only metadata in the c, author, keyword or publisher metadata classes will be displayed by default. Recall that the list of metadata classes returned by Funnelback can be controlled using the summary fields (SF) display option.
- Return to the results page management screen for silent films search and select edit results page configuration from the customize section. Add a query_processor_options parameter (which sets the display options) and add a summary fields option. The summary fields value (SF), which takes a comma separated list of fields as the parameter value, defines the fields to return (the default value is -SF=[author,c,publisher,keyword]). The SF parameter also supports regular expressions that match the metadata classes. Checking our earlier notes we find that c (description), t (title) and image (thumbnail URL) are the internal classes used for the fields that we are interested in. Add -SF=[c,t,image] to the query processor options. The order of classes in the list is not important, but the comma separated list must sit within square brackets. Save the changes.
- Return to the JSON output and refresh the page. Observe that c, t and image metadata values are now returned. You may notice that the t value appears to be duplicated - this is because there are multiple metadata fields mapped to the same class (e.g. title and og:title), and the value is duplicated across those fields. The image class also contains multiple values as there are multiple fields mapped to the same class (og:image and twitter:image).
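The relationship between the summary fields option and the metadata returned for a result can be sketched in Python. This is a minimal illustration only: both helper functions are hypothetical, the result dictionary is a hand-written stand-in for one result item (a real response contains many more fields), and the image URLs are placeholders.

```python
def summary_fields_option(classes):
    """Format metadata class names as a summary fields (SF) display option."""
    return "-SF=[" + ",".join(classes) + "]"


def first_values(result, classes):
    """Return the first value of each requested class found in listMetadata."""
    list_metadata = result.get("listMetadata", {})
    return {c: list_metadata[c][0] for c in classes if list_metadata.get(c)}


# Hypothetical result item; listMetadata maps a class name to a list of values.
result = {
    "title": "The car of death",
    "listMetadata": {
        "t": ["The car of death", "The car of death"],
        "image": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    },
}

option = summary_fields_option(["c", "t", "image"])
firsts = first_values(result, ["c", "t", "image"])
```

Taking only the first value of each class is one simple way to cope with the duplicated values noted above; classes with no values for a result (c in this example) are left out.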
Fields containing multiple values
Metadata fields can hold multiple values - this occurs automatically if multiple sources are mapped to the same metadata class and exist in the document being indexed (e.g. description and dc.description may map to the same metadata class, and both fields may appear in an HTML document). When multiple values are stored inside a metadata field they are automatically delimited with a vertical bar character (|).
A single metadata field within an HTML document can also hold multiple values and the delimiter often varies and may use commas or semicolons. If you have control over the metadata field you should update the CMS templates that generate the metadata fields to use a vertical bar for optimal behavior in Funnelback.
It is possible to change the delimiter used for a specific metadata field using the metadata delimiters plugin, available from the extensions screen.
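When working with a delimited metadata value outside of Funnelback (for example in a script that post-processes search output), splitting on the delimiter is straightforward. The sketch below is a hypothetical helper, not part of Funnelback; it also drops blank and repeated values, which is an assumption about what is usually wanted for display.

```python
def split_values(field_value, delimiter="|"):
    """Split a multi-valued metadata string, dropping blanks and repeats."""
    seen = []
    for part in field_value.split(delimiter):
        part = part.strip()
        if part and part not in seen:
            seen.append(part)
    return seen


# Two sources mapped to the same class carrying the same value.
values = split_values("The car of death|The car of death")
# A field authored with semicolons instead of vertical bars.
keywords = split_values("drama; silent ;crime", delimiter=";")
```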
Spelling suggestions and simple auto-completion
Metadata field content is used for spelling and simple auto-completion suggestions in the following circumstances:
- Text type metadata that is searchable as content is added to the general page content and will be considered for spelling and simple auto-complete suggestions if threshold conditions are met.
- Text metadata that is display only can also be added as an explicit metadata source for spelling and auto-complete suggestions by setting the spelling.suggestion_sources data source configuration option.
Metadata sources for contextual navigation
Metadata field content can also be used as a source for contextual navigation (or related search) suggestions.
By default, the keyword and c metadata classes are considered for contextual navigation suggestions. This list of fields can be adjusted by changing the contextual_navigation.summary_fields list within the contextual navigation settings.
The settings are editable from within the search package management screen, by selecting edit data source configuration and choosing contextual navigation from the sidebar (or see the previous exercise on configuring contextual navigation).
Tutorial: Customize metadata field mappings
In the previous example we discovered that the default mappings for both the title and image result in duplication in the metadata as a result of the website containing variants of the tags with the same value.
In this exercise we will alter the mappings so that we select one of the fields as the one that will be used for printing and map it to a unique field. We will use the og:title, og:description, og:image and also the og:video values as the fields for display.
- Log in to the search dashboard where you are doing your training.
  See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
- Locate the silent films search package.
- Open the data source management screen for the silent films - website data source.
- Edit the metadata mappings (by selecting configure metadata mappings from the settings section).
- Locate the entry for og:description by using the keyword filter. Metadata sources can only be mapped to a single metadata class, so it needs to be removed from the existing mapping before being set up as a new class.
- Edit the c metadata class entry by clicking on the row. Five different sources are mapped to the c metadata class - remove the mapping for og:description by locating it in the sources list and clicking on the delete icon that appears when you hover over the row. After removing the source click the save mapping button to update the configuration.
- You are returned to the listing screen. Create a new class called displayText by clicking the add new button. The field should be created as a text type field (it is a description) and be searchable as content (this field contains information that’s useful to consider as content). Click the add new button from the sources panel. This opens a window that shows all the detected metadata sources that have not been mapped to anything, along with controls to filter the list. Locate the og:description source and tick the checkbox under the add source column. Multiple sources can be added by checking multiple boxes. Click the save button to add the source.
- Optionally add a description (describing the metadata mapping) before clicking add new mapping.
- Locate the entry for og:title. Update it so that it’s mapped to a class called displayTitle.
- Locate the entry for og:image. Update it so that it’s mapped to a class called displayImage.
- There isn’t an existing entry for og:video. Add a mapping for og:video to a class named videoUrl. Set this as display only metadata, noting that the value of the URL doesn’t provide any useful information for the purposes of ranking the document. Click the save button once you are done.
- The index will need to be rebuilt because the metadata mappings have changed. This is done by running an advanced update to rebuild the index. Click on the silent films - website breadcrumb tail item to return to the data source management screen, then click the start advanced update item in the update panel.
- Select the rebuild live index option then click the update button.
- Once the re-index has completed, update the display options so that displayTitle, displayText, displayImage and videoUrl are returned in the search results, in addition to the existing metadata fields. (Hint: amend the SF parameter for the query processor options on the silent films search results page to something like -SF=[c,t,image,displayTitle,displayText,displayImage,videoUrl]).
- Search for death again and inspect the JSON or XML output and confirm that the metadata values are returned in the result metadata.
Tutorial: Add metadata to search result summaries
In this exercise the template will be configured to print the metadata in the search result summary if available.
- Log in to the search dashboard where you are doing your training.
  See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
- Locate the silent films search package.
- Load the results page management screen for the silent films search results page.
- Edit the template (by selecting edit results page templates from the templates panel).
- The default template will load in the editor. This template includes a lot more Freemarker than previous examples, which were stripped back to the bare bones required to format the search results. It is designed to handle most Funnelback features if the relevant data is returned in the data model. Locate the results block and add the following code to print the metadata. Insert the code immediately after the <cite> tag containing the document URL (approx. line 530), then click the save button.

    <#if s.result.listMetadata["displayTitle"]?first??><p>Display title: ${s.result.listMetadata["displayTitle"]?first}</p></#if>
    <#if s.result.listMetadata["displayText"]?first??><p>Display text: ${s.result.listMetadata["displayText"]?first}</p></#if>
    <#if s.result.listMetadata["displayImage"]?first??><p><img src="${s.result.listMetadata["displayImage"]?first}"/></p></#if>

- Rerun the search for death and confirm that the metadata and thumbnails are appearing.
Review questions
- Can you map a metadata field to multiple metadata classes?
- Why would you want to map different metadata fields to the same Funnelback metadata class?
Extended exercises: metadata
- Change the result formatting so that the thumbnail is displayed to the left of the text.
- Map the video duration and add this information to the search results summary.
- Hyperlink the thumbnail image so that it loads the video when clicked upon.
- Alter the video link to load the video using a JavaScript lightbox.
- Update the summary fields (SF) expression to use a regular expression to match all the metadata classes that start with display. Hint: the expression to use is display.+
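To see which class names the hinted expression would select, it can be tried out in Python. This is only a sanity check using Python's re module with whole-name matching, which is an assumption; Funnelback's own regular expression handling may differ in detail. The matching_classes helper is hypothetical.

```python
import re


def matching_classes(pattern, classes):
    """Return the class names that the pattern matches in full."""
    return [c for c in classes if re.fullmatch(pattern, c)]


# Class names used in the tutorials above.
classes = ["c", "t", "image", "displayTitle", "displayText", "displayImage", "videoUrl"]
selected = matching_classes(r"display.+", classes)
```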
Configuring external metadata
External metadata provides administrators with a method of attaching metadata from an external source to documents within a search index based on URL rules.
The metadata defined in the external metadata file is attached to all URLs that start with a specific URL base. This can be used to quickly apply some metadata structure to a site with a human friendly URL scheme.
Any metadata applied this way is accessed in the same manner as standard metadata configured via the metadata mappings configuration tool once the index is built.
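The URL-prefix behaviour can be sketched in Python. This is a simplified model rather than Funnelback's implementation: it assumes each line of the file holds a URL prefix followed by field:value pairs (with double quotes around values that contain spaces), as in the tutorial file for this section, and it assumes that fields from every matching prefix are merged, with later lines overriding earlier ones. The function names and the CC_example URL are hypothetical.

```python
import shlex


def parse_external_metadata(text):
    """Parse lines of the form: <url-prefix> field:value field:"quoted value"."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # shlex keeps a quoted value together with the field name before it.
        tokens = shlex.split(line)
        prefix, pairs = tokens[0], tokens[1:]
        fields = dict(pair.split(":", 1) for pair in pairs)
        rules.append((prefix, fields))
    return rules


def metadata_for(url, rules):
    """Merge the fields of every rule whose prefix the URL starts with."""
    merged = {}
    for prefix, fields in rules:
        if url.startswith(prefix):
            merged.update(fields)
    return merged


SAMPLE = """
https://archive.org/details filmType:"Silent film"
https://archive.org/details/CC_ director:"Charlie Chaplin"
https://archive.org/details/Downhill_1927 director:"Alfred Hitchcock" year:1927
"""

rules = parse_external_metadata(SAMPLE)
downhill = metadata_for("https://archive.org/details/Downhill_1927", rules)
chaplin = metadata_for("https://archive.org/details/CC_example", rules)
```

In this model both URLs pick up filmType from the first line, while the more specific prefixes add the director and year.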
Tutorial: External metadata
- Log in to the search dashboard where you are doing your training.
  See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
- Locate the silent films search package.
- Load the data source management screen for the silent films - website data source.
- Navigate to the file manager (select manage data source configuration files from the settings panel).
- Create an external_metadata.cfg file by clicking the add new button, selecting external_metadata.cfg from the drop-down menu, then clicking the save button.
- Edit the external_metadata.cfg by clicking on the file name. A blank file editor screen will load.
  External metadata will be defined for a set of specific URLs.
  The first line attaches common metadata to all the documents in the data source - a filmType of Silent film. This is applied to all documents that have a URL starting with https://archive.org/details (which is every document in this data source).
  The second line attaches common metadata, a director of Charlie Chaplin, to all URLs that start with https://archive.org/details/CC_ which is a subset of films featuring Charlie Chaplin.
  The rest of the lines attach specific metadata to the URLs listed - again all these URLs are treated as left matching - but as they are specific the pattern will only match a single URL.
  Copy and paste the following into the editor and click the save button:

    https://archive.org/details filmType:"Silent film"
    https://archive.org/details/CC_ director:"Charlie Chaplin"
    https://archive.org/details/Downhill_1927 director:"Alfred Hitchcock" year:1927 language:English runningTime:82
    https://archive.org/details/EasyVirtue1928 director:"Alfred Hitchcock" year:1928 language:English runningTime:85
    https://archive.org/details/CannedHarmony director:"Alice Guy Blaché" year:1912 runningTime:13
    https://archive.org/details/TheBurstrupHomesMurderCase director:"Alice Guy Blaché" year:1911 runningTime:18

- When defining new fields as external metadata it is good practice to ensure any external metadata fields have metadata classes defined in the metadata class definitions. This allows you to see what fields are defined for the data source and, more importantly, to tell the indexer what type of metadata the field values contain. Edit the metadata mappings (select edit metadata mappings from the settings panel) and add the following definitions to the mappings:

    Metadata class name   Source                          Type
    filmType              EXTERNAL_METADATA_filmType      Text (display only)
    director              EXTERNAL_METADATA_director      Text (searchable as content)
    year                  EXTERNAL_METADATA_year          Text (display only)
    runningTime           EXTERNAL_METADATA_runningTime   Text (display only)
    language              EXTERNAL_METADATA_language      Text (display only)

  When defining the mappings ensure the class name exactly matches the field name (including any capitals) used in the external_metadata.cfg.
  When adding the field mapping add the class as for a normal metadata field, but when adding the source(s) to the mapping you will need to type in the source name instead of picking from a list of detected sources. The actual value you type here doesn’t really matter - best practice is to use a label similar to EXTERNAL_METADATA_<field_name> where <field_name> corresponds to the field from the external metadata file. e.g. setting up a mapping for the filmType field (appearing as filmType:value in the external metadata file) would look similar to:
  Observe the typed value of the name (filmType):
  Add metadata entries for the other fields sourced from external metadata. You will find that language is already defined - for this field add the source to the existing field instead of creating a new field.
- Rebuild the index. (Select start advanced update from the update panel, then select re-index live view and click the update button).
- Add the new metadata fields to the display options configured for the results page (edit results page configuration option from the customize panel). Add the new metadata classes to the summary fields: -SF=[displayText,displayTitle,displayImage,videoUrl,filmType,director,runningTime,language,year]
- Run a search for hitchcock, view the JSON page source and confirm that additional metadata is returned in the data model with the result summaries (edit the URL to change search.html to search.json).