External metadata

External metadata lets the indexer attach additional metadata to documents based on each document’s URL. The metadata is supplied via configuration rather than in the documents themselves. For example, a single line in the external metadata file can make all pages in a particular website match the query genre:comedy.

external_metadata.cfg
www.example.org/ genre:comedy

Targets of the metadata are identified by their URL as supplied by the web server when the page was originally crawled.

Metadata can be supplied for any of the allowable metadata classes, but take care when reusing reserved or special classes that have special behaviour (e.g. d is used for dates).

Supply external metadata

To supply external metadata for a data source, create an external_metadata.cfg file for the data source. The file can be created through the search dashboard using the Configuration file manager.

When indexing commences, external_metadata.cfg is checked for validity and data structures are set up to enable efficient lookup. If an error is detected, an appropriate error message will be printed, scanning of the file will cease and the documents will be indexed without external metadata. (See Step-Index.log using the browse log files option in the search dashboard.)

External metadata file format

The external metadata file must be a text file delimited into lines by linefeed (\n, hex 0x0A) characters. Each line consists of a URL-prefix followed by a list of metadata elements which apply to all URLs which start with that prefix (unless overridden by a more specific URL-prefix).

URL prefixes must include a full hostname. It is permissible to commence the prefix with "http://", "https://" or "www".

If you find the URL matching for external metadata too restrictive, you can use the add metadata to URL plugin to add metadata based on advanced pattern matching against the URL.

Each metadata element consists of a metadata field specifier, followed by a colon, followed by a word or a string of text in double quotes. Metadata elements are separated by whitespace. Punctuation should only be present within quoted strings. For example: genre:"Historical Drama" director:Costner. Metadata supplied in the t class, such as t:"example title", will be used as the document title.

Blank lines within the external metadata file are ignored.

The external metadata file format does not support any comments.
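The line format described above can be illustrated with a small parser. This is a minimal sketch in Python, not the indexer's own parser: the function name parse_line and the regular expression are assumptions made for illustration, and the sketch does no validation (a later entry for the same field on a line simply replaces an earlier one).

```python
import re

# One metadata element: a field name, a colon, then either a
# double-quoted string or a single bare word.
ELEMENT = re.compile(r'(\w+):(?:"([^"]*)"|(\S+))')

def parse_line(line):
    """Split one line into (url_prefix, {field: value}).

    Returns None for blank lines, which the file format ignores.
    Hypothetical helper for illustration only.
    """
    line = line.strip()
    if not line:
        return None
    parts = line.split(None, 1)          # URL prefix, then the rest
    prefix = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    metadata = {}
    for match in ELEMENT.finditer(rest):
        field, quoted, bare = match.groups()
        metadata[field] = quoted if quoted is not None else bare
    return prefix, metadata
```

For instance, parse_line('www.example.org genre:"Historical Drama" director:Costner') yields the prefix www.example.org with the fields genre and director.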

Below is an example of an external metadata file:

external_metadata.cfg
www.example.org publisher:"Movies Inc."
www.example.org/comedy/ genre:Comedy
www.example.org/historical-drama/ genre:"Historical Drama"
www.example.org/comedy/romance/ genre:Romance year:2012

These records have the effect that:

  1. Any page within the www.example.org site, e.g. www.example.org/movies/ or www.example.org/about.htm will be indexed with the metadata "publisher" = "Movies Inc."

  2. Any page within www.example.org/comedy/, e.g. www.example.org/comedy/movies.htm or www.example.org/comedy/x/y/z.pdf, will be indexed with "genre" = "Comedy" and "publisher" = "Movies Inc."

  3. Any page within www.example.org/comedy/romance/, e.g. www.example.org/comedy/romance/movies.htm, will be indexed with the metadata "genre" = "Romance", "year" = "2012" and "publisher" = "Movies Inc.". The "Comedy" genre will not be inherited from the second line because it was overridden in the fourth line.

If multiple lines in the external metadata file start with an identical prefix, only the last valid entry will be effective.
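The inheritance and override behaviour described above can be modelled roughly as follows. This is a sketch under stated assumptions, not the indexer's actual lookup structure: the names RULES and resolve are hypothetical, and matching is done by plain string prefix comparison.

```python
# The example file from above, as (prefix, metadata) pairs in file order.
RULES = [
    ("www.example.org", {"publisher": "Movies Inc."}),
    ("www.example.org/comedy/", {"genre": "Comedy"}),
    ("www.example.org/historical-drama/", {"genre": "Historical Drama"}),
    ("www.example.org/comedy/romance/", {"genre": "Romance", "year": "2012"}),
]

def resolve(url, rules):
    """Return the metadata that applies to url (illustrative only)."""
    # If a prefix appears more than once, the last entry wins.
    by_prefix = {}
    for prefix, metadata in rules:
        by_prefix[prefix] = metadata

    merged = {}
    # Apply shorter prefixes first so that more specific prefixes
    # override fields set by less specific ones.
    for prefix in sorted(by_prefix, key=len):
        if url.startswith(prefix):
            merged.update(by_prefix[prefix])
    return merged
```

With these rules, resolve("www.example.org/comedy/romance/movies.htm", RULES) returns the publisher from the first line together with the genre and year from the fourth line, matching point 3 above.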

Default page handling

URL default pages (e.g. index) are stripped from both the prefix given in the external metadata file and the URL being checked from the collection. When a default page is stripped from a URL prefix, exact matching is applied. Hence example.org/index would match any default page on example.org, but would not match other pages or subdirectories on example.org. Default pages are defined as any page called index, welcome, home, default or main followed by a dot and a three or four character extension (e.g. htm or html).
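One way to picture the default page rule is with a regular expression. The following Python sketch is an approximation built only from the description above: the names DEFAULT_PAGE, strip_default_page and matches are hypothetical, and details such as case sensitivity or query strings are not covered by the description and are not handled here.

```python
import re

# index, welcome, home, default or main, then a dot and a three or
# four character extension, at the end of the URL.
DEFAULT_PAGE = re.compile(r'/(?:index|welcome|home|default|main)\.\w{3,4}$')

def strip_default_page(url):
    """Return (url_without_default_page, was_stripped)."""
    stripped, count = DEFAULT_PAGE.subn('/', url)
    return stripped, count == 1

def matches(rule_prefix, url):
    """Approximate the matching rule for one prefix and one URL."""
    prefix, prefix_had_default = strip_default_page(rule_prefix)
    target, _ = strip_default_page(url)
    if prefix_had_default:
        # A prefix that named a default page only matches exactly.
        return target == prefix
    return target.startswith(prefix)
```

Under these assumptions, a rule for www.example.org/index.html applies to www.example.org/welcome.htm but not to www.example.org/about.htm or anything under www.example.org/comedy/.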

Metadata mapping and types

Metadata classes from external metadata do not need explicit metadata mappings and will be created automatically. However, automatically created classes are assumed to be text type, for display only. It is recommended that the metadata classes be defined explicitly so that the type of each class is set.

For example, an external_metadata.cfg might include:

external_metadata.cfg
http://www.example.org/ Genre:"Comedy"

A metadata mapping should be configured (from the metadata mapping screen) as follows (if the mapping is not already defined):

  • metadata class name: Genre

  • metadata class type: (as appropriate)

  • source: add a custom source with a 'fake' name, such as EXTERNAL_METADATA_Genre. The source is actually ignored when the metadata comes from external metadata. However, we recommend including a 'fake' source value that reminds you of the metadata’s source.

External metadata validation and error handling

The following validation and error handling is applied to external metadata that is supplied via an external_metadata.cfg. This does not apply to external metadata supplied via the external metadata API, or using the external metadata fetcher plugin.

The indexer will ignore some minor errors in the external metadata file when building the index.

An invalid entry is any external metadata entry (a line in the file) where any error is detected. An invalid entry will not be processed, and any metadata configured for the URL by the entry will not be indexed for the URL.

Once enough errors are detected, the indexer will fail the update. This failure is based on an error rate, calculated from the number of valid and invalid entries in the external metadata file.

The error rate is the number of lines containing errors divided by the total number of lines containing metadata rules in the external metadata file.

If the calculated error rate is higher than a configurable threshold (set by the -externalMetaErrorThreshold option), the update fails. The default threshold is 10%.
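The calculation can be sketched in a few lines of Python. The function name update_should_fail is hypothetical, and expressing the threshold as a fraction (0.10 for 10%) is an assumption; the units accepted by -externalMetaErrorThreshold are not stated here.

```python
def update_should_fail(valid_entries, invalid_entries, threshold=0.10):
    """Return True if the error rate exceeds the threshold.

    threshold is a fraction (0.10 means 10%); this representation
    is an assumption made for the sketch.
    """
    total = valid_entries + invalid_entries
    if total == 0:
        # No metadata rules at all: nothing to count as an error.
        return False
    error_rate = invalid_entries / total
    return error_rate > threshold
```

For example, a file with 95 valid and 5 invalid entries has an error rate of 5% and the update proceeds, while 8 valid and 2 invalid entries give 20% and the update fails.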

The following are examples of minor errors in the external metadata:

external_metadata.cfg
www.example.org type "Movies" (1)
www.example.org type:Movies Inc (2)
www.example.org/comedy genre: Comedy (3)
www.example.org/historical-drama/ genregenregenregenregenregenregenregenregenregenregenregenre:"Drama" (4)
www.example.org/historical-drama/historical-drama/historical-drama...(more than 4,096 characters) genre:Comedy (5)
www.example.org type:"Movies (6)
example.org type:"Movies" (7)
type:Movies genre:Comedy (8)
1 Missing the colon between the metadata key and value.
2 Whitespace exists in the metadata value. The value must be enclosed in double quotes if it contains any whitespace.
3 Whitespace exists between the metadata key with colon and value.
4 The metadata class name is too long, more than 64 characters.
5 The URL is too long.
6 Missing the trailing double quotation mark.
7 Invalid URL. The URL must start with http://, https:// or www.
8 Missing the URL.
A space is used as the delimiter by the indexer. Any values that contain spaces must be enclosed in double quotes.
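The checks listed above could be approximated as follows. This Python sketch is not the indexer's validation code: the function check_entry, the regular expression, and the 4,096 character URL limit (taken from example 5) are assumptions for illustration. It also accepts a value with a trailing but no leading double quote, rather than treating it as an error.

```python
import re

MAX_URL_LENGTH = 4096    # assumption, taken from example (5)
MAX_CLASS_LENGTH = 64    # from callout (4)

# One element: class name, colon, then a quoted string or a bare word
# (optionally with a stray trailing quote), then whitespace or end of line.
ELEMENT = re.compile(r'\s*(\w+):(?:"[^"]*"|[^\s"]+"?)(?:\s+|$)')

def check_entry(line):
    """Return None if the line looks valid, else a short reason."""
    parts = line.split(None, 1)
    if not parts:
        return None  # blank lines are ignored, not errors
    url = parts[0]
    if not url.startswith(("http://", "https://", "www.")):
        return "invalid or missing URL"
    if len(url) > MAX_URL_LENGTH:
        return "URL too long"
    rest = parts[1] if len(parts) > 1 else ""
    pos = 0
    while pos < len(rest):
        match = ELEMENT.match(rest, pos)
        if match is None:
            return "malformed metadata element"
        if len(match.group(1)) > MAX_CLASS_LENGTH:
            return "metadata class name too long"
        pos = match.end()
    return None
```

The sketch reports examples 1, 2, 3 and 6 with the same generic reason; separating them would need a more detailed tokenizer.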

The following issues are not counted as errors.

  • The external metadata file is empty. The indexer ignores the file and logs a warning.

  • An entry in the external metadata file contains a URL but no metadata. The indexer ignores the entry and logs a warning. This entry also does not count towards the error rate calculation.

  • A value contains a trailing double quote but not leading double quote. e.g.

    external_metadata.cfg
    www.example.org type:Movies" (missing the leading double quotation mark)

    In this case, type:Movies" will be indexed as type:Movies.

Values must be enclosed in double quotes if they contain any whitespace. If there is no whitespace in the value then double quotes are optional. e.g. Genre:Action and Genre:"Action" are both valid (and equivalent) external metadata values.

External metadata date format

Dates must be specified as 8 digit integers in the format YYYYMMDD, and must be associated with the d metadata class. For any other class name, the value is interpreted as the type defined in the metadata mapping configuration, or as a string if no type is defined.

external_metadata.cfg
http://www.example.org/movies/2004/ d:20040101 maxDate:20041231

This results in the documents under http://www.example.org/movies/2004/ having an internal date of 20040101 and an additional metadata field "maxDate" = "20041231".
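Before adding d values to the file, it can help to check that they are well-formed YYYYMMDD dates. A minimal Python sketch follows; the helper name parse_metadata_date is hypothetical and the check is independent of how the indexer itself parses dates.

```python
from datetime import datetime

def parse_metadata_date(value):
    """Return a date for an 8 digit YYYYMMDD string, or None if invalid."""
    if len(value) != 8 or not value.isdigit():
        return None
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except ValueError:
        # Eight digits, but not a real calendar date (e.g. month 13).
        return None
```

A value such as 20040101 is accepted, while 2004-01-01 or 20041331 is rejected.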

Fetching external metadata from a URL

The external metadata fetcher plugin can be used to source an external_metadata.cfg from an external URL.

External metadata API

External metadata can be edited via the Funnelback admin API.

To see and interact with the available API calls, select the API UI option from the system menu within the search dashboard.

The individual API calls are documented in the External metadata section.