External metadata

Introduction

External metadata is metadata which can be applied to pages in a web collection, without actually modifying the pages in any way. For example, it is possible to make all pages in a particular website match the query c:partner by adding a single line to the external metadata file.

site.com/ c:partner

Targets of the metadata are identified by their URL as supplied by the webserver when the page was originally crawled. Note: The external metadata mechanism was designed for use only with collections containing URL data (i.e. Web collections).

Metadata can be supplied for any of the allowable metadata classes (a-z, A-Z, 0-9), but take care when reusing classes that have special behaviour (e.g. d is used for dates).

Activate External Metadata

To activate the external metadata for a collection, create the file external_metadata.cfg in the collection's conf subdirectory. The file can be created through the Funnelback administration interface using the file-manager.

When indexing commences, external_metadata.cfg is checked for validity and data structures are set up to enable efficient lookup. If an error is detected, an error message is printed, scanning of the file ceases, and the documents are indexed without external metadata. (See index.log using the "Browse Log Files" option in the administration interface.)

External metadata file format

The external metadata file must be a text file delimited into lines by linefeed (octal 012) characters. Each line consists of a URL-prefix followed by a list of metadata elements which apply to all URLs which start with that prefix (unless overridden by a more specific URL-prefix).

URL prefixes must include a full hostname. It is permissible to commence the prefix with "http://". If no protocol is specified then "http://" is assumed.

Each metadata element consists of a metadata field specifier (a single metadata class character followed by a colon) followed by a word or a string of text in double quotes. Metadata elements are separated by whitespace. Punctuation should only be present within quoted strings, e.g. s:"Hell's Bells" p:CSIRO. Metadata in the t class is used as the document title.
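The line format described above can be illustrated with a small parser. This is only a sketch under stated assumptions, not Funnelback's own implementation: the function name parse_line and the regular expression are hypothetical, and the sketch does not reproduce the validity checks performed at indexing time.

```python
import re

# One metadata element: a single class character, a colon, then either a
# double-quoted string or a bare word. (Hypothetical pattern for illustration.)
ELEMENT = re.compile(r'([A-Za-z0-9]):(?:"([^"]*)"|(\S+))')


def parse_line(line):
    """Split one external metadata line into (url_prefix, [(class, value), ...]).

    Returns None for a blank line. If the prefix names no protocol,
    "http://" is assumed, as the format description states.
    """
    parts = line.strip().split(None, 1)
    if not parts:
        return None
    prefix = parts[0]
    if "://" not in prefix:
        prefix = "http://" + prefix
    rest = parts[1] if len(parts) > 1 else ""
    elements = []
    for match in ELEMENT.finditer(rest):
        quoted, bare = match.group(2), match.group(3)
        elements.append((match.group(1), quoted if quoted is not None else bare))
    return prefix, elements


if __name__ == "__main__":
    print(parse_line('www.example.com/ s:"Hell\'s Bells" p:CSIRO'))
```

The host www.example.com is a placeholder. Quoted values are kept as a single string here; how the indexer splits them into words is not modelled.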

Here is an example of an external metadata file:

www.anu.edu.au w:ACT
www.anu.edu.au/chem/ x:knowledge x:science x:chemistry
www.anu.edu.au/physics/ x:"knowledge science physics" y:external y:funding
www.anu.edu.au/physics/molecular/ x:knowledge x:science x:chemistry

These records have the effect that:

  1. Any page within the www.anu.edu.au site, e.g. www.anu.edu.au/~jim/index.html or www.anu.edu.au/physics/staff.htm, will be indexed with act_w.
  2. Any page within www.anu.edu.au/chem/, e.g. www.anu.edu.au/chem/chem.htm or www.anu.edu.au/chem/x/y/p.pdf, will be indexed with knowledge_x, science_x and chemistry_x (as well as act_w).
  3. Any page within www.anu.edu.au/physics/molecular/, e.g. www.anu.edu.au/physics/molecular/index.htm, will inherit act_w from the first record, knowledge_x, science_x and chemistry_x from the fourth, and external_y and funding_y from the third. It will NOT inherit physics_x from the third record, because the x metadata from the third record has been overridden by the x metadata in the fourth.

Note: If multiple lines in the metadata file start with an identical prefix, only the first will be effective.
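The inheritance and override behaviour described above can be sketched as follows. This is an illustrative model only, assuming that for each metadata class the most specific (longest) matching prefix wins and that only the first record for a repeated prefix counts. The function name resolve and the data layout are hypothetical, and lower-casing of indexed terms (ACT becoming act_w) is not modelled.

```python
def resolve(url, records):
    """Return {metadata_class: [values]} that apply to url.

    records is a list of (url_prefix, [(class, value), ...]) in file order.
    """
    # Only the first record for any given prefix is effective.
    first_by_prefix = {}
    for prefix, elements in records:
        first_by_prefix.setdefault(prefix, elements)

    result = {}
    # Apply matching prefixes from least to most specific, so that a longer
    # prefix replaces the values of any class it also supplies.
    for prefix in sorted(first_by_prefix, key=len):
        if not url.startswith(prefix):
            continue
        by_class = {}
        for cls, value in first_by_prefix[prefix]:
            by_class.setdefault(cls, []).append(value)
        result.update(by_class)
    return result


RECORDS = [
    ("http://www.anu.edu.au", [("w", "ACT")]),
    ("http://www.anu.edu.au/chem/",
     [("x", "knowledge"), ("x", "science"), ("x", "chemistry")]),
    ("http://www.anu.edu.au/physics/",
     [("x", "knowledge science physics"), ("y", "external"), ("y", "funding")]),
    ("http://www.anu.edu.au/physics/molecular/",
     [("x", "knowledge"), ("x", "science"), ("x", "chemistry")]),
]

if __name__ == "__main__":
    print(resolve("http://www.anu.edu.au/physics/molecular/index.htm", RECORDS))
```

For the molecular physics page, the w values come from the first record, the y values from the third and the x values from the fourth, matching item 3 in the list above.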

External Metadata Date Format

Two types of date metadata can be applied using the external metadata mechanism: document date ('d' field) and document expiry date ('X' field). Dates must be specified as 8 digit integers in the format YYYYMMDD. For example:

http://www.xxx.gov.au/notices/2004/ d:20040101 X:20041231

specifies that all documents in the nominated web directory were published on 01 Jan 2004 and cease to be valid after 31 Dec 2004.
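Before adding date values to the file, it can be useful to confirm that they are well-formed. The following is a minimal sketch of such a check, assuming only the YYYYMMDD rule stated above. The helper name is_valid_date is hypothetical and is not part of Funnelback.

```python
import re
from datetime import datetime


def is_valid_date(value):
    """Return True if value is exactly 8 ASCII digits forming a real calendar date."""
    # Check the shape first: strptime alone would tolerate some shorter inputs.
    if not re.fullmatch(r"[0-9]{8}", value):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ("20040101", "20041231", "20040230", "2004-01-01"):
        print(candidate, is_valid_date(candidate))
```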
