External metadata allows the indexer to attach additional metadata, supplied via configuration, to documents based on their URL. For example, all pages in a particular website can be made to match the query genre:comedy by adding a single line to the external metadata file.
Targets of the metadata are identified by their URL as supplied by the web server when the page was originally crawled.
Metadata can be supplied for any of the allowable metadata classes, but take care when reusing reserved or special classes that have special behaviour (e.g. d is used for dates).
To supply external metadata for a data source, create an external_metadata.cfg file for the data source. The file can be created through the Funnelback administration interface using the Perl file manager.
When indexing commences, external_metadata.cfg is checked for validity and data structures are set up to enable efficient lookup. If an error is detected, an appropriate error message is printed, scanning of the file ceases and the documents are indexed without external metadata. (See Step-Index.log using the browse log files option in the administration dashboard.)
The external metadata file must be a text file delimited into lines by linefeed (0x0A) characters. Each line consists of a URL prefix followed by a list of metadata elements which apply to all URLs that start with that prefix (unless overridden by a more specific URL prefix).
URL prefixes must include a full hostname. It is permissible to commence the prefix with "http://". If no protocol is specified then "http://" is assumed.
|If you find the URL matching for external metadata to be too restrictive, you can use the add metadata to URL plugin to add metadata based on advanced pattern matching against the URL.|
Each metadata element consists of a metadata field specifier, followed by a colon, followed by a word or a string of text in double quotes. Metadata elements are separated by whitespace. Punctuation should only be present within quoted strings, e.g. genre:"Historical Drama" director:Costner
Metadata supplied in the t class, such as t:"example title", will be used as the document title.
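The record syntax described above can be illustrated with a small parser. This is a minimal sketch, not Funnelback's own parser: the function name `parse_record` is hypothetical, and Python's `shlex` quoting rules are assumed to be a close enough stand-in for the double-quote handling described here.

```python
import shlex


def parse_record(line):
    """Split one external metadata record into (url_prefix, [(field, value), ...]).

    Assumption: shell-style tokenising approximates the real quoting rules.
    """
    tokens = shlex.split(line)  # keeps "quoted strings" together, drops the quotes
    if not tokens:
        raise ValueError("empty record")
    prefix, elements = tokens[0], []
    for token in tokens[1:]:
        # Each element is field:value; split on the first colon only.
        field, sep, value = token.partition(":")
        if not sep or not field or not value:
            raise ValueError(f"malformed metadata element: {token!r}")
        elements.append((field, value))
    return prefix, elements


if __name__ == "__main__":
    print(parse_record('www.example.org/historical-drama/ genre:"Historical Drama" director:Costner'))
```

Because only the first token is treated as the URL prefix, a prefix that starts with "http://" is not mistaken for a metadata element.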
Below is an example of an external metadata file:
www.example.org publisher:"Movies Inc."
www.example.org/comedy/ genre:Comedy
www.example.org/historical-drama/ genre:"Historical Drama"
www.example.org/comedy/romance/ genre:Romance year:2012
These records have the effect that:
Any page within www.example.org, such as www.example.org/about.htm, will be indexed with the metadata "publisher" = "Movies Inc."
Any page within www.example.org/comedy/, such as www.example.org/comedy/x/y/z.pdf, will be indexed with "genre" = "Comedy" and "publisher" = "Movies Inc."
Any page within www.example.org/comedy/romance/, such as www.example.org/comedy/romance/movies.htm, will be indexed with the metadata "genre" = "Romance", "year" = "2012" and "publisher" = "Movies Inc.". The "Comedy" genre will not be inherited from the second line because it was overridden in the fourth line.
|If multiple lines in the metadata file start with an identical prefix, only the last will be effective.|
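The inheritance and override behaviour described above can be sketched as a prefix lookup. This is an illustration only, under stated assumptions: the helper names (`normalise`, `build_table`, `metadata_for`) are hypothetical, a simple linear scan stands in for the indexer's efficient lookup structures, and matching is a plain string-prefix test.

```python
def normalise(url):
    """Assume http:// when no protocol is given, as described for URL prefixes."""
    return url if "://" in url else "http://" + url


def build_table(records):
    """records: iterable of (prefix, {field: value}) pairs.

    If the same prefix appears more than once, the last record wins.
    """
    table = {}
    for prefix, fields in records:
        table[normalise(prefix)] = dict(fields)
    return table


def metadata_for(url, table):
    """Merge metadata from every matching prefix, shortest first, so that
    a more specific prefix overrides fields set by a less specific one."""
    url = normalise(url)
    merged = {}
    for prefix in sorted(table, key=len):
        if url.startswith(prefix):
            merged.update(table[prefix])
    return merged


# The example file from this page, expressed as Python data.
EXAMPLE = build_table([
    ("www.example.org", {"publisher": "Movies Inc."}),
    ("www.example.org/comedy/", {"genre": "Comedy"}),
    ("www.example.org/historical-drama/", {"genre": "Historical Drama"}),
    ("www.example.org/comedy/romance/", {"genre": "Romance", "year": "2012"}),
])

if __name__ == "__main__":
    print(metadata_for("www.example.org/comedy/romance/movies.htm", EXAMPLE))
```

Note that under this reading a URL such as www.example.org/comedies/x.pdf receives only the publisher metadata, because it does not start with the www.example.org/comedy/ prefix.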
URL default pages (e.g. index) are stripped from both the prefix given in the external metadata file and the URL being checked from the collection. When a default page is stripped from a URL prefix, exact matching is applied: example.org/index would match any default page on example.org, but would not match other pages or sub-directories on example.org. Default pages are defined as any page called index, welcome, home, default or main followed by a dot and a three or four character extension (e.g. htm or html).
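The default-page rule can be sketched as follows. This reflects one reading of the description above rather than the indexer's actual code: the function names are hypothetical, the extension is assumed to consist of ASCII letters or digits, and matching is assumed to be case-sensitive.

```python
import re

# index, welcome, home, default or main + "." + a 3 or 4 character extension.
# Assumption: extension characters are ASCII letters or digits.
DEFAULT_PAGE = re.compile(r"(?:index|welcome|home|default|main)\.[A-Za-z0-9]{3,4}")


def strip_default_page(url):
    """Return (url_without_default_page, was_stripped)."""
    head, slash, last = url.rpartition("/")
    if slash and DEFAULT_PAGE.fullmatch(last):
        return head + slash, True
    return url, False


def prefix_matches(prefix, url):
    """Exact match when the prefix named a default page, prefix match otherwise."""
    stripped_prefix, prefix_was_default = strip_default_page(prefix)
    stripped_url, _ = strip_default_page(url)
    if prefix_was_default:
        return stripped_url == stripped_prefix
    return stripped_url.startswith(stripped_prefix)
```

With these definitions, a prefix of example.org/index.html matches example.org/home.htm (both reduce to example.org/) but not example.org/about.htm or example.org/sub/index.html.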
Metadata classes from external metadata do not need explicit metadata mappings; they are created automatically. However, automatically created classes are assumed to be of text type, for display only. It is recommended that the metadata classes be defined explicitly so that the type of each class is set.
For example, if external_metadata.cfg supplies metadata in a Genre class, a metadata mapping should be configured as follows (if the mapping is not already defined):
metadata class name: Genre
metadata class type: (as appropriate)
Source: add a custom source with a fake name, such as EXTERNAL_METADATA_Genre
The indexer actually ignores source when the metadata comes from external metadata. Only the metadata class name and type are taken into account.
Dates must be specified as 8 digit integers in the format YYYYMMDD. The document's internal date is mapped to the d metadata class; any other date field is treated as a string (unless configured otherwise in the metadata mappings):
http://www.example.org/movies/2004/ d:20040101 maxDate:20041231
This results in the documents under http://www.example.org/movies/2004/ having an internal date of 20040101 and an additional metadata field "maxDate" = "20041231".
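Before adding a d value to the file, it can be worth checking that it really is a valid YYYYMMDD date. The following is a minimal sketch of such a check; the function name `parse_yyyymmdd` is hypothetical and this is not part of Funnelback itself.

```python
from datetime import datetime


def parse_yyyymmdd(value):
    """Return a datetime.date for an 8 digit YYYYMMDD string, else raise ValueError."""
    # Check length and digits first: strptime alone tolerates some shorter forms.
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        raise ValueError(f"expected 8 digits in YYYYMMDD format, got {value!r}")
    # strptime rejects impossible calendar dates such as 20040230.
    return datetime.strptime(value, "%Y%m%d").date()


if __name__ == "__main__":
    print(parse_yyyymmdd("20040101"))
```

The explicit length and digit check rejects values such as 2004-01-01 up front, leaving `strptime` to catch dates that are well formed but do not exist.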
The external metadata fetcher plugin can be used to source an external_metadata.cfg from an external URL.
External metadata can be edited via the Funnelback admin API.
To see and interact with the available API calls, select the view API UI option from the system menu within the administration dashboard.
The individual API calls are documented in the External metadata section.