Implementer training - metadata

What is metadata?

Metadata is information about a document. Metadata can come from a number of sources - it can be defined in HTML <meta> tags within a web page (such as description metadata), other HTML tags (such as the document’s <title> element), or inferred from other information about an item (such as the size of the document in bytes, or the URL of the document).
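
The sources described above can be illustrated with a minimal sketch that is independent of Funnelback. It uses Python's standard html.parser module to collect <meta> values and the <title> text from a page, then adds two inferred values (the URL and the size in bytes). The sample page, the example.com URL and the names MetadataExtractor and extract_metadata are made up for illustration; this is not how Funnelback's indexer is implemented.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collect <meta name/property content> values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.metadata = {}       # source name -> list of values
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph tags use "property"; most other tags use "name".
            name = attrs.get("name") or attrs.get("property")
            content = attrs.get("content")
            if name and content:
                self.metadata.setdefault(name, []).append(content)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.metadata.setdefault("title", []).append(data.strip())


def extract_metadata(html, url):
    parser = MetadataExtractor()
    parser.feed(html)
    parser.close()
    found = dict(parser.metadata)
    # Inferred metadata: derived from the item rather than authored in it.
    found["url"] = [url]
    found["size_bytes"] = [len(html.encode("utf-8"))]
    return found


# Made-up sample page, for illustration only.
SAMPLE_PAGE = """<html><head>
<title>The car of death</title>
<meta name="description" content="A silent film.">
<meta property="og:title" content="The car of death">
</head><body><p>Hello</p></body></html>"""

page_metadata = extract_metadata(SAMPLE_PAGE, "https://example.com/films/car-of-death")
for source, values in page_metadata.items():
    print(source, values)
```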

In Funnelback, metadata is also used for fielded information, such as that sourced from XML, JSON, CSV or databases.

Why use metadata?

Funnelback can use metadata in a number of different ways to improve your search results and give users a richer search experience.

In general metadata can be used in the following ways:

  • To provide additional keyword data for the purposes of improving ranking

  • To provide information that can be used for display purposes (such as for custom result summaries)

  • As a way of classifying documents, allowing for enhanced functionality (such as filtering search results by fields of particular values)

  • As additional information that Funnelback can use for other features (such as to generate structured auto-completion).

Metadata can also be incorporated from an external source (such as a database) by producing external metadata. External metadata is covered in a later exercise.

The screenshot below illustrates how metadata can be used to enhance your search results. For this search the metadata values contained within tag metadata are used to produce faceted navigation that allows the search results to be filtered based on the selected tags. A metadata field containing a URL to a thumbnail image is also used to add a thumbnail to each search result that has a defined thumbnail image.

why use metadata 01

The amount of enhancement that can be achieved is limited only by the metadata you have available. Metadata could be used, for example, to conditionally display things in the search results:

  • If it’s a video then open a JavaScript lightbox to play the video when you click on the link.

  • Display a curated description for items where description metadata has been authored.

  • Display some extra fielded information about each item (type, location, position information)

  • Provide a small map for each result that contains geospatial metadata plotting the item on the map and populating an info bubble.

  • Display file type icons next to each binary file search result.

Metadata classes and mapping

Metadata is read by Funnelback as part of the process of building the search indexes. The source and type of any metadata that should be included needs to be configured before it will be indexed.

The configuration involves identifying the metadata source (e.g. a metadata field name), assigning a type and defining a mapping to a Funnelback metadata class. Once this has been configured, metadata will become available in the search index.

Metadata classes

Metadata classes are used by Funnelback to organise and access metadata. Metadata classes are given a unique identifier and type and one or more metadata sources are mapped to each class.

Metadata classes are used when configuring other Funnelback features that use metadata such as faceted navigation or contextual navigation.

Mapping metadata sources to classes

Metadata can come from a number of different sources, and these sources must be mapped to a metadata class in order for the metadata to be searchable by Funnelback.

Funnelback currently supports the following sources of metadata:

  • HTML meta tags such as <meta name="dc.title" content="This is the title">.

  • HTML tags such as <h1> or <title>. Note: CSS-class style selectors are not supported.

  • HTTP headers such as Content-Type or X-Funnelback-Custom-Header.

  • XML XPaths such as /books/book/title or //author. Note: not all XPath expressions are supported.

A metadata source can only be mapped to a single Funnelback metadata class; however, multiple metadata sources can be mapped to the same Funnelback metadata class. This is useful if there are different fields that contain the same functional information (e.g. dc.description, dcterms.description, description).
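
The mapping rule above can be sketched as a small registry. This is a toy model that assumes nothing about Funnelback's internals; the class MetadataMappings and its methods are hypothetical names used only to show the constraint that one source maps to one class while many sources may share a class.

```python
class MetadataMappings:
    """Toy registry: one source -> one class, many sources -> one class."""

    def __init__(self):
        self._class_for_source = {}

    def map_source(self, source, metadata_class):
        existing = self._class_for_source.get(source)
        if existing is not None and existing != metadata_class:
            raise ValueError(
                f"{source!r} is already mapped to {existing!r}; "
                "remove that mapping before adding a new one"
            )
        self._class_for_source[source] = metadata_class

    def unmap_source(self, source):
        self._class_for_source.pop(source, None)

    def sources_for(self, metadata_class):
        return sorted(
            s for s, c in self._class_for_source.items() if c == metadata_class
        )


mappings = MetadataMappings()
# Several sources carrying the same functional information share one class.
for source in ("dc.description", "dcterms.description", "description"):
    mappings.map_source(source, "c")

# Mapping an already-mapped source to a second class is rejected.
try:
    mappings.map_source("description", "displayText")
    conflict = None
except ValueError as error:
    conflict = str(error)

print(mappings.sources_for("c"))
print(conflict)
```

To move a source to a different class in this model, call unmap_source first and then map_source, which mirrors removing the source from its existing mapping before adding it to a new one.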

Metadata sources are grouped into two types:

  • HTML type sources (HTML <meta> tags, HTML tags and HTTP headers) are only considered when indexing web pages and PDF documents.

  • XML type sources are only considered when indexing an XML document.

Funnelback uses XML to represent most non-web data source types such as database and LDAP directory records, social media content, JSON and CSV files. The XML source configuration is required when configuring metadata for these.

Metadata class types

Funnelback supports five types of metadata classes:

  • Text: The content of this class is a string of text.

  • Geospatial x/y coordinate: The content of this field is a decimal latlong value in the following format: geo-x;geo-y (e.g. 2.5233;-0.95). This type should only be used if there is a need to perform a geospatial search (e.g. this point is within X km of another point). If the geospatial coordinate is only required for plotting items on a map then a text type field is sufficient.

  • Number: The content of this field is a numeric value. Funnelback will interpret this as a number. This type should only be used if there is a need to use numeric operators when performing a search (e.g. X > 2050). If the field is only required for display within the search results a text type metadata class is sufficient.

  • Document permissions: The content of this field is a security lock string defining the document permissions. This type should only be used when working with an enterprise search that includes document level security.

  • Date: A single metadata class supports a date, which is used as the document’s date for the purpose of relevance and date sorting. Additional dates for the purpose of display can be indexed as either a text or number type metadata class.
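
To make the geospatial type more concrete, the sketch below parses a value in the geo-x;geo-y format and answers a "within X km" question using the haversine formula. It is an illustration only: it assumes that geo-x is the latitude and geo-y the longitude (the description above does not state the order), and the function names are hypothetical rather than part of Funnelback.

```python
import math

EARTH_RADIUS_KM = 6371.0


def parse_geo(value):
    """Parse 'geo-x;geo-y' into a pair of floats."""
    x_text, y_text = value.split(";")
    return float(x_text), float(y_text)


def haversine_km(point_a, point_b):
    """Great-circle distance in km; points are assumed to be (lat, lon)."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within_km(value_a, value_b, radius_km):
    """True when two 'geo-x;geo-y' values are no more than radius_km apart."""
    return haversine_km(parse_geo(value_a), parse_geo(value_b)) <= radius_km


print(parse_geo("2.5233;-0.95"))
print(within_km("0;0", "0;1", 120))
```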

Metadata class search behavior

Text-type metadata is included in the search index in one of two ways, and the choice affects how the metadata is treated when a user runs a search.

  • Display only: The contents of this field will be indexed but are not considered document content by Funnelback and will not influence ranking when a query is run. The value can be used for display purposes. It can be searched only when the field is explicitly specified in the search query (e.g. author:shakespeare).

  • Searchable as content: The contents of this field will be considered a part of the document content by Funnelback. Queries will match within this field and the value will contribute to the document’s ranking. The contents of this field will also be automatically considered for spelling and simple auto-completion suggestions. As with display-only metadata, the field can be explicitly searched and its content can also be used for display purposes.
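
The difference between the two behaviors can be shown with a toy matching rule. The sketch below is a simplified model, not Funnelback's query processor: it uses plain substring matching, ignores ranking entirely, and the names IndexedDocument and matches are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDocument:
    body: str
    content_metadata: dict = field(default_factory=dict)  # searchable as content
    display_metadata: dict = field(default_factory=dict)  # display only


def matches(doc, term, metadata_class=None):
    term = term.lower()
    if metadata_class is not None:
        # Fielded query (e.g. author:shakespeare): either kind of class can match.
        value = (doc.content_metadata.get(metadata_class)
                 or doc.display_metadata.get(metadata_class, ""))
        return term in value.lower()
    # Unfielded query: body text plus metadata that is searchable as content.
    searchable = [doc.body, *doc.content_metadata.values()]
    return any(term in text.lower() for text in searchable)


# Made-up document for illustration.
doc = IndexedDocument(
    body="To be, or not to be",
    content_metadata={"c": "A famous soliloquy"},
    display_metadata={"author": "William Shakespeare"},
)

print(matches(doc, "shakespeare"))             # display-only class, unfielded query
print(matches(doc, "shakespeare", "author"))   # same class, fielded query
print(matches(doc, "soliloquy"))               # content class, unfielded query
```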

Configuring metadata mappings

Metadata mappings are configured for a data source, with search packages and results pages inheriting the metadata classes from all the attached data sources.

Metadata mappings in Funnelback are defined using the search dashboard configure metadata mappings tool, which can be accessed via the data source configuration.

The metadata mapping configuration is accessed from the settings panel of the data source management screen.

manage data source panel settings
Run an initial update of your data source before configuring the metadata mappings. The indexer auto-detects available metadata sources when it scans the content, which makes setting up the metadata mappings a lot simpler because the detected sources can be picked from a list of available sources.

The metadata mappings screen lists the metadata mappings that are defined for the current data source. Funnelback’s default configuration also includes commonly-used metadata from HTML and binary file content. Metadata fields from the Dublin Core, AGLS and Open Graph standards have pre-defined mappings as part of the default configuration.

configuring metadata mappings 02

The metadata mappings listing also provides information on the number of documents where the metadata is detected and allows the list to be filtered by keyword, or metadata content type (HTML/XML).

Additional mappings are created by clicking the add new button, and existing mappings can be modified by clicking on the metadata class name in the listing. The metadata mapping editor screen is used to configure the mapping, and allows multiple metadata sources to be mapped to a single metadata class.

configuring metadata mappings 03

The metadata sources screen presents a list of metadata sources that were detected during indexing, as well as commonly used HTML tags. In addition, metadata sources can be added manually.

configuring metadata mappings 04

Tutorial: Configuring metadata

  1. Log in to the search dashboard where you are doing your training.

    See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
  2. Locate the silent films search package.

  3. Run a search for death against the silent films search results page. A set of results similar to those below will be returned. The results are returned using Funnelback’s default search results template.

    exercise configuring metadata 01
  4. Click on the search result for The car of death. The page will load in your browser.

    exercise configuring metadata 02
  5. View the page source and inspect the meta tags in the head region of the document.

    exercise configuring metadata 03
  6. Note down any listed metadata fields that contain information that might be useful for the search. Useful information is information that you might use in the search results (e.g. a good description, or URL to a thumbnail image) or keyword metadata (that can be used to contribute to the discoverability of the document). Inspect a few other result pages as well to see if the same fields exist in other pages and if they also seem useful. In the above example the Open Graph (og) tags appear to have useful content that could add value to the search results listing.

  7. Log in to the search dashboard and open the silent films - website data source management screen.

  8. Open the metadata mappings configuration by selecting configure metadata mappings from the settings section.

    manage data source panel settings
  9. The currently defined mappings are listed. The default list contains pre-defined mappings of commonly used HTML metadata fields. For many searches this will be sufficient without adding any additional mappings.

    exercise configuring metadata 05
  10. The search result summaries could be enhanced by utilizing text sourced from the title, description and image metadata - from tags that we identified (og:title, og:description and og:image). Confirm that the og:title, og:description and og:image metadata sources are mapped to classes by typing each into the filter metadata mappings box above the metadata listing. Because these tags all appear in the current set of mappings, the metadata has already been included in the search index and is available for use in the search results.

  11. In order to use metadata in your search results template four conditions need to be satisfied:

    • The metadata source needs to exist in the content that is being indexed. (e.g. the HTML page needs to have meta tags populated with values).

    • The metadata source needs to be mapped to a Funnelback metadata class.

    • The metadata class needs to be included in the list of results page summary fields to display (if this is not set anywhere it will default to returning the author, c, publisher and keyword metadata classes) and the results page must not have summaries disabled (SM=off).

    • The template needs to be configured to print out the metadata appropriately.

  12. Make a note of the class names to which the og:title, og:description and og:image sources are mapped. You will need these for future steps.

  13. Return to the search results for the search for death and view the JSON response (reminder: edit the URL so the search.html reads search.json). Inspect the results element that sits underneath the response element and observe the available result item fields and how they are used in the HTML search results page.

    exercise configuring metadata 06
  14. Inspect the result for The car of death (we previously inspected the HTML source of this page for suitable metadata tags). Observe all the fields that are available within the JSON (or XML) response. Each of these fields can be returned within the result set. Available metadata is returned within the listMetadata sub element, which is currently displaying metadata from the c metadata class. Only metadata in the c, author, keyword or publisher metadata classes will be displayed by default. Recall that the list of metadata classes returned by Funnelback can be controlled using the summary fields (SF) display option.

  15. Return to the results page management screen for silent films search and select edit results page configuration from the customize section. Add a query_processor_options parameter (which sets the display options) and add a summary fields option. The summary fields (SF) value, which takes a comma separated list of fields, defines the fields to return (the default value is -SF=[author,c,publisher,keyword]). The SF parameter also supports regular expressions that match the metadata class names. Checking our earlier notes we find that c (description), t (title) and image (thumbnail URL) are the internal classes used for the fields that we are interested in. Add -SF=[c,t,image] to the query processor options.

    The order of classes in the list is not important, but the comma separated list must sit within square brackets. Save the changes.
    exercise configuring metadata 07
  16. Return to the JSON output and refresh the page. Observe that c, t and image metadata values are now returned. You may notice that the t value appears to be duplicated - this is because there are multiple metadata fields mapped to the same class (e.g. title and og:title), and that value is duplicated across those fields. The image class also contains multiple values as there are multiple fields mapped to the same class (og:image and twitter:image).

    exercise configuring metadata 08
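
The JSON inspection steps in the tutorial above can also be scripted. The sketch below rewrites a search.html URL to search.json and lists the listMetadata values for each result in a parsed response. Treat the details as assumptions: the host name, the collection parameter value and the sample response are made up, and the path response > resultPacket > results is assumed from the description of the results element sitting underneath the response element.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def to_json_endpoint(search_url):
    """Swap search.html for search.json, keeping the query string."""
    parts = urlsplit(search_url)
    path = parts.path
    if path.endswith("search.html"):
        path = path[: -len("search.html")] + "search.json"
    return urlunsplit(parts._replace(path=path))


def list_metadata_by_title(payload):
    """Return {result title: listMetadata} for each result in the response."""
    # Assumed structure: response -> resultPacket -> results.
    results = payload["response"]["resultPacket"]["results"]
    return {r["title"]: r.get("listMetadata", {}) for r in results}


# Trimmed, made-up response for illustration only.
SAMPLE_RESPONSE = json.loads("""
{"response": {"resultPacket": {"results": [
  {"title": "The car of death",
   "listMetadata": {
     "c": ["Sample description"],
     "t": ["The car of death", "The car of death"]}}
]}}}
""")

json_url = to_json_endpoint(
    "https://search.example.com/s/search.html?collection=silent-films&query=death"
)
print(json_url)
for title, metadata in list_metadata_by_title(SAMPLE_RESPONSE).items():
    print(title, sorted(metadata))
```

Against a live server, the parsed payload could be obtained by passing the rewritten URL to urllib.request.urlopen and loading the body with json.load.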

Fields containing multiple values

Metadata fields can hold multiple values - this occurs automatically if multiple sources are mapped to the same metadata class and exist in the document being indexed (e.g. description and dc.description may map to the same metadata class, and both fields may appear in an HTML document). When multiple values are stored inside a metadata field they are automatically delimited with a vertical bar character (|).

A single metadata field within an HTML document can also hold multiple values. The delimiter used varies between sites, and is often a comma or semicolon. If you have control over the metadata field you should update the CMS templates that generate the metadata fields to use a vertical bar for optimal behavior in Funnelback.

It is possible to change the delimiter used for a specific metadata field using the metadata delimiters plugin, available from the extensions screen.
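
As a rough illustration of the delimiter handling described above, the sketch below splits a vertical-bar delimited value into its parts and rewrites a comma- or semicolon-delimited value to use vertical bars. It is a standalone example with hypothetical function names, of the kind that might run in a CMS template or a pre-processing script; it is not the metadata delimiters plugin.

```python
import re


def split_values(field_value):
    """Split a vertical-bar delimited metadata value into clean values."""
    return [part.strip() for part in field_value.split("|") if part.strip()]


def normalise_delimiters(raw_value, delimiters=",;"):
    """Rewrite a comma- or semicolon-delimited value to use vertical bars."""
    pattern = "[" + re.escape(delimiters) + "]"
    parts = [part.strip() for part in re.split(pattern, raw_value)]
    return "|".join(part for part in parts if part)


# Made-up keyword value for illustration.
raw_keywords = "drama; silent, 1912"
normalised = normalise_delimiters(raw_keywords)
print(normalised)
print(split_values(normalised))
```

Note that this simple approach would wrongly split any value that legitimately contains a comma or semicolon, which is one reason to emit vertical bars at the source where possible.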

Spelling suggestions and simple auto-completion

Metadata field content is used for spelling and simple auto-completion suggestions in the following circumstances:

  • Text type metadata that is searchable as content is added to the general page content and will be considered for spelling and simple auto-complete suggestions if threshold conditions are met.

  • Text metadata that is display only can also be added as an explicit metadata source for spelling and auto-complete suggestions by setting the spelling.suggestion_sources data source configuration option.

Metadata sources for contextual navigation

Metadata field content can also be used as a source for contextual navigation (or related search) suggestions. By default, the keyword and c metadata classes are considered for contextual navigation suggestions. This list of fields can be adjusted by changing the contextual_navigation.summary_fields list within the contextual navigation settings.

The settings are editable from within the search package management screen, by selecting edit data source configuration and choosing contextual navigation from the sidebar (or see the previous exercise on configuring contextual navigation).

Tutorial: Customize metadata field mappings

In the previous example we discovered that the default mappings for both the title and image result in duplication in the metadata as a result of the website containing variants of the tags with the same value.

In this exercise we will alter the mappings so that we select one of the fields as the one that will be used for printing and map it to a unique field. We will use the og:title, og:description, og:image and also the og:video values as the fields for display.

  1. Log in to the search dashboard where you are doing your training.

    See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
  2. Locate the silent films search package.

  3. Open the data source management screen for the silent films - website data source.

  4. Edit the metadata mappings (by selecting configure metadata mappings from the settings section).

  5. Locate the entry for og:description by using the keyword filter. Metadata sources can only be mapped to a single metadata class, so it needs to be removed from the existing mapping before being set up as a new class.

    exercise customise metadata field mappings 01
  6. Edit the c metadata class entry by clicking on the row. Five different sources are mapped to the c metadata class - remove the mapping for og:description by locating it in the sources list and clicking on the delete icon that appears when you hover over the row. After removing the source click the save mapping button to update the configuration.

    exercise customise metadata field mappings 02
  7. You are returned to the listing screen. Create a new class called displayText by clicking the add new button. The field should be created as a text type field (it is a description) and be searchable as content (this field contains information that’s useful to consider as content). Click the add new button from the sources panel. This opens a window that shows all the detected metadata sources that have not been mapped to anything, along with controls to filter the list. Locate the og:description source and tick the checkbox under the add source column. Multiple sources can be added by checking multiple boxes. Click the save button to add the source.

    exercise customise metadata field mappings 03
  8. Optionally add a description (describing the metadata mapping) before clicking add new mapping.

    exercise customise metadata field mappings 04
  9. Locate the entry for og:title. Update it so that it’s mapped to a class called displayTitle.

  10. Locate the entry for og:image. Update it so that it’s mapped to a class called displayImage.

  11. There isn’t an existing entry for og:video. Add a mapping for og:video to a class named videoUrl. Set this as display only metadata, noting that the value of the URL doesn’t provide any useful information for the purposes of ranking the document. Click the save button once you are done.

  12. The index will need to be rebuilt because the metadata mappings have changed. This is done by running an advanced update to rebuild the index. Click on the silent films - website breadcrumb tail item to return to the data source management screen, then click the start advanced update item in the update panel.

    manage data source panel update
  13. Select the rebuild live index option then click the update button.

    exercise customise metadata field mappings 06
  14. Once the re-index has completed update the display options so that displayTitle, displayText, displayImage and videoUrl are returned in the search results, in addition to the existing metadata fields. (Hint: amend the SF parameter for the query processor options on the silent films search results page to something like -SF=[c,t,image,displayTitle,displayText,displayImage,videoUrl]).

    exercise customise metadata field mappings 07
  15. Search for death again and inspect the JSON or XML output and confirm that the metadata values are returned in the result metadata.

    exercise customise metadata field mappings 08
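
When the list of summary fields grows, it can help to generate the SF option rather than type it. The helper below is a hypothetical convenience function, not part of Funnelback: it assembles the square-bracketed, comma separated value described in the tutorials and rejects names that would break the list.

```python
def build_sf_option(metadata_classes):
    """Build a summary fields option such as -SF=[c,t,image]."""
    unique = list(dict.fromkeys(metadata_classes))  # drop repeats, keep order
    for name in unique:
        if not name or "," in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid metadata class name: {name!r}")
    return "-SF=[" + ",".join(unique) + "]"


existing = ["c", "t", "image"]
added = ["displayTitle", "displayText", "displayImage", "videoUrl"]
sf_option = build_sf_option(existing + added)
print(sf_option)
```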

Tutorial: Add metadata to search result summaries

In this exercise the template will be configured to print the metadata in the search result summary if available.

  1. Log in to the search dashboard where you are doing your training.

    See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
  2. Locate the silent films search package.

  3. Load the results page management screen for the silent films search results page.

  4. Edit the template (by selecting edit results page templates from the templates panel)

  5. The default template will load in the editor. This template includes a lot more FreeMarker than previous examples, which were stripped back to the bare bones required to format the search results. It is designed to handle most Funnelback features if the relevant data is returned in the data model. Locate the results block (at approximately line 530) and add the following code to print the metadata. Insert the code immediately after the <cite> tag containing the document URL, then click the save button.

    <#if s.result.listMetadata["displayTitle"]?first??><p>Display title: ${s.result.listMetadata["displayTitle"]?first}</p></#if>
    <#if s.result.listMetadata["displayText"]?first??><p>Display text: ${s.result.listMetadata["displayText"]?first}</p></#if>
    <#if s.result.listMetadata["displayImage"]?first??><p><img src="${s.result.listMetadata["displayImage"]?first}"/></p></#if>
    exercise add metadata to search results 01
  6. Rerun the search for death and confirm that the metadata and thumbnails are appearing.

    exercise add metadata to search results 02
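
For readers less familiar with FreeMarker, the logic of the template snippet in step 5 can be restated in Python: take the first value of a metadata class if one exists, and print nothing otherwise. This is only an explanatory sketch with hypothetical function names; the HTML escaping shown here is an addition for safety in the Python version and says nothing about how the Funnelback template escapes values.

```python
from html import escape


def first_value(list_metadata, metadata_class):
    """Return the first value for a class, or None when absent or empty."""
    values = list_metadata.get(metadata_class) or []
    return values[0] if values else None


def render_summary(list_metadata):
    """Build the optional summary lines for one result."""
    lines = []
    title = first_value(list_metadata, "displayTitle")
    if title is not None:
        lines.append(f"<p>Display title: {escape(title)}</p>")
    text = first_value(list_metadata, "displayText")
    if text is not None:
        lines.append(f"<p>Display text: {escape(text)}</p>")
    image = first_value(list_metadata, "displayImage")
    if image is not None:
        lines.append(f'<p><img src="{escape(image, quote=True)}"/></p>')
    return "\n".join(lines)


# Made-up result metadata: a title is present, the image list is empty,
# and displayText is missing altogether.
sample_metadata = {"displayTitle": ["The car of death"], "displayImage": []}
summary_html = render_summary(sample_metadata)
print(summary_html)
```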

Review questions

  • Can you map a metadata field to multiple metadata classes?

  • Why would you want to map different metadata fields to the same Funnelback metadata class?

Extended exercises: metadata

  • Change the result formatting so that the thumbnail is displayed to the left of the text.

  • Map the video duration and add this information to the search results summary.

  • Hyperlink the thumbnail image so that it loads the video when clicked upon.

  • Alter the video link to load the video using a JavaScript lightbox.

  • Update the summary fields (SF) expression to use a regular expression to match all the metadata classes that start with display. Hint: the expression to use is display.+.
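
The effect of a regular expression in the summary fields list can be previewed with a short sketch. It assumes that each SF entry must match the whole metadata class name; how Funnelback anchors the expression is not stated here, so treat that as an assumption, and the function name select_classes is hypothetical.

```python
import re


def select_classes(available, sf_entries):
    """Return the available classes matched by any SF entry (whole-name match)."""
    patterns = [re.compile(entry) for entry in sf_entries]
    return [
        name for name in available
        if any(pattern.fullmatch(name) for pattern in patterns)
    ]


available_classes = [
    "c", "t", "image", "displayTitle", "displayText", "displayImage", "videoUrl",
]
display_only = select_classes(available_classes, ["display.+"])
print(display_only)
print(select_classes(available_classes, ["c", "display.+", "videoUrl"]))
```

Because .+ requires at least one character, a class named exactly display would not be selected by display.+ under this assumption.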

Configuring external metadata

External metadata provides administrators with a method of attaching metadata from an external source to documents within a search index based on URL rules.

The metadata defined in the external metadata file is attached to all URLs that start with a specific URL base. This can be used to quickly apply some metadata structure to a site with a human friendly URL scheme.

Any metadata applied this way is accessed in the same manner as standard metadata configured via the metadata mappings configuration tool once the index is built.
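
The left-matching behavior can be sketched in a few lines of Python. The example parses lines in the format used by the external_metadata.cfg file in the tutorial that follows (a URL prefix followed by field:value pairs, with quoted values allowed) and collects the metadata for a given URL. It assumes that values from every matching line accumulate; how Funnelback resolves a field set by more than one matching line is not covered here. The function names and the Charlie Chaplin URL are made up.

```python
import shlex


def parse_external_metadata(config_text):
    """Parse lines of the form: <url prefix> field:"value" field:value ..."""
    rules = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # shlex keeps quoted values such as "Silent film" together.
        url_prefix, *pairs = shlex.split(line)
        fields = {}
        for pair in pairs:
            name, _, value = pair.partition(":")
            fields.setdefault(name, []).append(value)
        rules.append((url_prefix, fields))
    return rules


def metadata_for(url, rules):
    """Merge the fields of every rule whose URL prefix left-matches the URL."""
    merged = {}
    for url_prefix, fields in rules:
        if url.startswith(url_prefix):
            for name, values in fields.items():
                merged.setdefault(name, []).extend(values)
    return merged


SAMPLE_CONFIG = '''
https://archive.org/details filmType:"Silent film"
https://archive.org/details/CC_ director:"Charlie Chaplin"
https://archive.org/details/Downhill_1927 director:"Alfred Hitchcock" year:1927 language:English runningTime:82
'''

rules = parse_external_metadata(SAMPLE_CONFIG)
print(metadata_for("https://archive.org/details/Downhill_1927", rules))
print(metadata_for("https://archive.org/details/CC_example_film", rules))
```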

Tutorial: External metadata

  1. Log in to the search dashboard where you are doing your training.

    See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
  2. Locate the silent films search package.

  3. Load the data source management screen for the silent films - website data source.

  4. Navigate to the file manager (select manage data source configuration files from the settings panel).

  5. Create an external_metadata.cfg file by clicking the add new button, selecting external_metadata.cfg from the drop-down menu, then clicking the save button.

    exercise external metadata 01
  6. Edit the external_metadata.cfg by clicking on the file name. A blank file editor screen will load.

    External metadata will be defined for a set of specific URLs.

    The first line attaches common metadata to all the documents in the data source - filmType:"Silent film". This is applied to all documents that have a URL starting with https://archive.org/details (which is every document in this data source).

    The second line attaches common metadata of director:"Charlie Chaplin" to all URLs that start with https://archive.org/details/CC_, which is a subset of films featuring Charlie Chaplin.

    The rest of the lines attach specific metadata to the URLs listed - again all these URLs are treated as left matching - but as they are specific the pattern will only match a single URL.

    Copy and paste the following into the editor and click the save button:

    https://archive.org/details filmType:"Silent film"
    https://archive.org/details/CC_ director:"Charlie Chaplin"
    https://archive.org/details/Downhill_1927 director:"Alfred Hitchcock" year:1927 language:English runningTime:82
    https://archive.org/details/EasyVirtue1928 director:"Alfred Hitchcock" year:1928 language:English runningTime:85
    https://archive.org/details/CannedHarmony director:"Alice Guy Blaché" year:1912 runningTime:13
    https://archive.org/details/TheBurstrupHomesMurderCase director:"Alice Guy Blaché" year:1911 runningTime:18
    exercise external metadata 02
  7. When defining new fields as external metadata it is good practice to ensure any external metadata fields have metadata classes defined in the metadata mappings. This allows you to see what fields are defined for the data source and, more importantly, to tell the indexer what type of metadata the field values contain. Edit the metadata mappings (select configure metadata mappings from the settings panel) and add the following definitions to the mappings:

    Metadata class name   Source                          Type
    filmType              EXTERNAL_METADATA_filmType      Text (display only)
    director              EXTERNAL_METADATA_director      Text (searchable as content)
    year                  EXTERNAL_METADATA_year          Text (display only)
    runningTime           EXTERNAL_METADATA_runningTime   Text (display only)
    language              EXTERNAL_METADATA_language      Text (display only)

    When defining the mappings ensure the class name exactly matches the field name (including any capitals) used in the external_metadata.cfg.

    When adding the field mapping add the class as for a normal metadata field, but when adding the source(s) to the mapping you will need to type in the source name instead of picking from a list of detected sources. The actual value you type here doesn’t really matter - best practice is to use a label similar to EXTERNAL_METADATA_<field_name> where <field_name> corresponds to the field from the external metadata file. For example, setting up a mapping for the filmType field (appearing as filmType:value in the external metadata file) would look similar to:

    exercise external metadata 03

    Observe the typed value of the name (filmType):

    exercise external metadata 04

    Add metadata entries for the other fields sourced from external metadata. You will find that language is already defined - for this field add the source to the existing field instead of creating a new field.

  8. Rebuild the index. (Select start advanced update from the update panel, then select the rebuild live index option and click the update button).

  9. Add the new metadata fields to the display options configured for the results page. Add the new metadata classes to the summary fields (-SF=[displayText,displayTitle,displayImage,videoUrl,filmType,director,runningTime,language,year]) using the edit results page configuration option from the customize panel.

  10. Run a search for hitchcock and view the JSON page source and confirm that additional metadata is returned in the data model with the result summaries (edit the URL to change search.html to search.json).

    exercise external metadata 05
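
Because the class name must exactly match the field name used in external_metadata.cfg, including capitals, a quick consistency check can catch mistakes before the index is rebuilt. The sketch below is a hypothetical helper, not a Funnelback tool: it compares the field names used in the external metadata file with the defined class names and suggests a source label that follows the EXTERNAL_METADATA_<field_name> convention from the tutorial.

```python
def unmapped_fields(external_fields, class_names):
    """Return external metadata fields with no exactly matching class name."""
    defined = set(class_names)
    return [name for name in external_fields if name not in defined]


def source_label(field_name):
    # Follows the labelling convention suggested in the tutorial; the exact
    # text of the label is not significant.
    return f"EXTERNAL_METADATA_{field_name}"


fields = ["filmType", "director", "year", "language", "runningTime"]
# A deliberately mistyped class list: note the lower-case "t" in runningtime.
classes = ["filmType", "director", "year", "language", "runningtime"]

missing = unmapped_fields(fields, classes)
for name in missing:
    print(f"add class {name} with source {source_label(name)}")
```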