Metadata for binary documents in Squiz Matrix

Background

This article describes how to assign metadata to binary documents (such as PDFs) stored within Squiz Matrix when indexing with Funnelback.

Details

Every Squiz Matrix site indexed by Funnelback should include an asset listing that exposes the metadata stored within Squiz Matrix for its binary documents (PDF, RTF and MS Office documents).

When a document is uploaded to Squiz Matrix, metadata such as the title and author is captured about the document. Users will expect the information entered at upload to be reflected in the search results.

When Funnelback crawls a Squiz Matrix site it follows all links it encounters and downloads the documents. For PDF and office documents, an additional step extracts the text from the document. The downloaded file is all that Funnelback has access to when creating the index.

The additional metadata entered when uploading a document to Squiz Matrix is stored within the Squiz Matrix database (and not within the metadata fields contained within the PDF or office document). This means that Funnelback won’t see this metadata, and the titles shown in search results will be whatever happens to be stored in the document’s internal metadata, often something quite useless such as 'Template no. 3'.

This can be rectified by creating an asset listing that lists both the document’s URL and any associated metadata that you wish to include in the index. This listing is formatted in Funnelback’s external_metadata.cfg format.

Tutorial: Creating an external metadata asset listing

  1. Create an asset listing asset named external metadata underneath the corresponding site asset in Squiz Matrix. Create the asset listing in the section of the site where other Funnelback assets are stored, normally a type 2 folder asset titled Funnelback that sits amongst the first level of assets in the site. This will create an asset listing with a URL similar to: http://www.example.com/funnelback/external-metadata

  2. Configure the asset types and root nodes of the asset listing to target the file assets that should have external metadata applied in your Funnelback collection’s index. Common practice is to target both the content site and media site as root nodes, and to select the following asset types:

    • PDF File

    • MS Excel Document

    • MS PowerPoint Document

    • MS Word Document

  3. Remove the contents of the empty results body copy, and ensure that the content type of every content container in the asset listing has been set to raw HTML.

  4. In the default format of the asset listing, use the %asset_url% keyword to list the asset’s URL, then for each external metadata element you would like applied, use the following format:

    http://example.com/path/to/document.pdf Funnelback_metadata_class:"corresponding asset metadata keyword" Funnelback_metadata_class2:"corresponding asset metadata keyword2"
  5. Your default format will look similar to:

    %asset_url% c:"%asset_metadata_description%" s:"%asset_metadata_dc.subject%" d:"%asset_published_short%" <LINE BREAK>

    Resulting in output like:

    http://example.com/url1.html c:"This is the description" s:"keyword1|keyword2|keyword3" d:"2015-01-24"
    http://example.com/url2.html c:"This is another description" s:"keyword1|keyword2|keyword3" d:"2015-04-14"
    You should ensure that each printed field is cleaned to remove any quotes and line breaks, and that fields containing multiple values are delimited with a vertical bar character '|'.
  6. Ensure the design parse file is configured to return the external metadata file as plain text. To set this up, create a design asset titled Plain text format and, on its parse file screen, enter the following markup:

    <MySource_PRINT id_name="__global__" var="content_type" content_type="text/plain" />
    <MySource_AREA id_name="page_body" design_area="body" />
  7. Apply this design to the asset listing asset so that it is returned as plain text.

  8. Preview your external metadata listing. The listing should be returned as UTF-8 plain text and should be similar to the example below. You may need to view the page source to see the text with line breaks. Dates should be formatted as short ISO-8601 dates (e.g. 2010-12-16).

    http://example.com/__data/assets/pdf_file/0009/10314/example-1.pdf description:"example description" keyword:"example subjects" d:"2018-05-24"
    http://example.com/__data/assets/pdf_file/0017/10466/example-2.pdf description:"example description" keyword:"example subjects" d:"2002-10-02"
    http://example.com/__data/assets/pdf_file/0003/10596/example-3.pdf description:"example description" keyword:"example subjects" d:"2008-03-03"
    http://example.com/__data/assets/pdf_file/0019/10945/example-4.pdf description:"example description" keyword:"example subjects" d:"2014-02-25"
  9. Ensure that access to the external metadata listing is disallowed in robots.txt. For the URL above (http://www.example.com/funnelback/external-metadata) you would add the following to your robots.txt:

    robots.txt
    Disallow: /funnelback/
  10. Log in to the Funnelback administration interface and switch to the collection that contains the Squiz Matrix site content.

  11. Review the metadata mappings for the Funnelback collection. Select the administer tab then click on edit metadata mappings.

  12. Ensure that each Funnelback metadata class used in the external metadata file has a corresponding definition in the metadata mappings, so that the type of each metadata field is explicitly defined. If a field only exists via external metadata, create an entry in the metadata mappings and type a source field similar to EXTERNAL-METADATA:

  13. Edit the collection.cfg and add a pre-index workflow command that retrieves the external metadata from Squiz Matrix and saves it to the collection’s configuration folder, so that the metadata is available to Funnelback when indexing. Add the following as a pre_index_command, updating the URL to point to the external metadata listing in Squiz Matrix. $SEARCH_HOME and $COLLECTION_NAME can be left as they are; these are special variables, similar to Matrix keyword modifiers, that will be filled in by Funnelback:

    collection.cfg
    pre_index_command=curl --connect-timeout 60 --retry 3 --retry-delay 20 'http://www.example.com/funnelback/external-metadata/_nocache' -o $SEARCH_HOME/conf/$COLLECTION_NAME/external_metadata.cfg
It is a good idea to access an uncached version of the URL when downloading the external metadata file to ensure that the external metadata includes all the latest information.
Funnelback expects the external metadata configuration file to be valid. If the file includes any errors in formatting the update will fail.
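
The cleaning rule from step 5 can be illustrated outside Squiz Matrix. The following Python sketch is not part of Funnelback or Squiz Matrix: clean_value and format_line are hypothetical helper names, and it assumes that only double quotes (straight or curly) need to be removed while apostrophes can stay. In Squiz Matrix itself the equivalent cleaning would be applied in the asset listing format.

```python
import re


def clean_value(value):
    """Remove double quotes and collapse line breaks and whitespace runs.

    Assumption: apostrophes are harmless and are kept.
    """
    for quote in ('"', "“", "”"):
        value = value.replace(quote, "")
    return re.sub(r"\s+", " ", value).strip()


def format_line(url, fields):
    """Build one external metadata line: URL followed by class:"value" pairs.

    A list or tuple value is treated as a multi-valued field and joined
    with the vertical bar character.
    """
    parts = [url]
    for name, value in fields.items():
        if isinstance(value, (list, tuple)):
            value = "|".join(clean_value(item) for item in value)
        else:
            value = clean_value(value)
        parts.append('%s:"%s"' % (name, value))
    return " ".join(parts)
```

For example, format_line("http://example.com/url1.html", {"c": "This is the description", "s": ["keyword1", "keyword2", "keyword3"], "d": "2015-01-24"}) produces the first output line shown in step 5.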

Common external metadata errors

Syntax errors in the external_metadata.cfg file

Funnelback expects any external metadata configuration to be valid. If any errors are detected, indexing will fail and an error similar to the following will appear in the Step-Index.log:

Step-Index.log
Using external metadata from /opt/funnelback/data/shakespeare/live/tmp/external_metadata.cfg6294184464704914621
Error: Missing double quote at line 1
Ext. metadata requested but not available in required form
 - taking early retirement.
Command finished with exit code: 1

The error message will attempt to provide information on the cause.

Common causes of syntax errors include:

  • Unclosed quoted metadata values. e.g.

    external_metadata.cfg
    Author:"Shakespeare Type:Classics
  • Missing quotes on multi-word metadata values. e.g.

    external_metadata.cfg
    Author:William Shakespeare Type:Classics
  • Fancy quote characters. e.g.

    external_metadata.cfg
    Author:“William Shakespeare” Type:Classics
  • Line breaks or double quotes appearing within a metadata field. e.g.

    external_metadata.cfg
    Author:"John Smith" Description:"This book draws together all of Shakespeare's output into a single volume.
    "The collected works of Shakespeare" is something that should be included in all libraries." Type:Classics

Syntax errors can be mitigated by cleaning the variables as they are printed to the template using Squiz Matrix keyword modifiers.
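
It can also help to check the downloaded file before Funnelback reads it. The following Python sketch looks for the common causes listed above. It is an illustration only, not Funnelback's own parser: check_line and check_text are hypothetical names, and the sketch assumes that every line starts with an http or https URL and that unquoted single-word values (such as Type:Classics) are acceptable.

```python
import re

# A field is a class name, a colon, then either a straight-quoted value with
# no embedded double quotes, or a single unquoted word.
FIELD = re.compile(r'[^\s:"“”]+:(?:"[^"]*"|[^\s"“”]+)')


def check_line(line):
    """Return a list of problems found in one external metadata line."""
    parts = line.split(None, 1)
    if not parts:
        return []  # blank lines are ignored in this sketch
    url = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    problems = []
    if not url.startswith(("http://", "https://")):
        # Typically caused by a line break inside a metadata value.
        problems.append("line does not start with a URL")
    if rest.count('"') % 2:
        problems.append("unbalanced double quotes")
    pos = 0
    while pos < len(rest):
        if rest[pos].isspace():
            pos += 1
            continue
        match = FIELD.match(rest, pos)
        if match is None:
            # Multi-word values without quotes, curly quotes used as
            # delimiters, and embedded double quotes all end up here.
            problems.append("cannot parse field near: " + rest[pos:pos + 30])
            break
        pos = match.end()
    return problems


def check_text(text):
    """Return (line_number, problem) pairs for a whole file's contents."""
    return [
        (number, problem)
        for number, line in enumerate(text.splitlines(), 1)
        for problem in check_line(line)
    ]
```

A check like this could run after the download and before the file is written to the collection's configuration folder, so that a broken listing does not replace a working external_metadata.cfg. Funnelback's indexer remains the final authority on what is valid.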

Squiz Matrix and large asset listings

If the asset listing returns a large number of items, it may be necessary to paginate the listing when generating the external metadata.

This will require additional work on the Funnelback side to fetch each of the pages, stitch the external metadata lines together and construct a single external_metadata.cfg. This could be implemented in the Funnelback pre_index_command by writing a shell or Perl script that compiles all of the pages into a single external_metadata.cfg and writes it to the collection’s configuration folder.
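
The stitching step could be sketched as follows. Python is used here purely for illustration in place of the shell or Perl script mentioned above. The "{page}" placeholder in the URL template and the function names are hypothetical; the real pagination URL depends on how the asset listing is configured in Squiz Matrix. The sketch assumes that a page past the end of the listing returns either nothing or lines that have already been seen, and stops at that point.

```python
import os
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its text (assumes UTF-8 output)."""
    with urlopen(url, timeout=60) as response:
        return response.read().decode("utf-8")


def stitch_pages(page_url, fetch_page=fetch, max_pages=1000):
    """Collect metadata lines from pages 1, 2, ... into one string.

    page_url is a hypothetical template containing "{page}". Fetching
    stops at the first page that contributes no new lines, or after
    max_pages pages as a safety limit.
    """
    seen = set()
    lines = []
    for page in range(1, max_pages + 1):
        new_lines = 0
        for line in fetch_page(page_url.format(page=page)).splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                lines.append(line)
                new_lines += 1
        if new_lines == 0:
            break
    return "\n".join(lines) + "\n" if lines else ""


def write_config(text, path):
    """Write to a temporary file first, then move it into place."""
    temporary = path + ".tmp"
    with open(temporary, "w", encoding="utf-8") as handle:
        handle.write(text)
    os.replace(temporary, path)
```

Writing to a temporary file and renaming it means a failed or partial download never leaves a half-written external_metadata.cfg in the collection's configuration folder. A script built on these functions could then be called from the pre_index_command in place of the single curl request.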

Using HTTP headers instead of asset listings to provide metadata

This method requires no additional configuration within Funnelback (aside from defining appropriate metadata mappings) and has the added benefit that binary documents have sensible SEO-friendly URLs.
  1. Configure Squiz Matrix to manage all binary document types (i.e. so that PDFs and MS Office documents are served by Matrix instead of Apache and receive Squiz Matrix URLs instead of the __data URLs).

  2. Configure Squiz Matrix to send the associated PDF metadata as a series of HTTP headers returned with the PDF. E.g.

    X-Document-Title=PDF document title
    X-Document-Keywords=Keyword 1|Keyword 2|Keyword 3
    X-Document-Description=This is the description
    X-Document-Author=John Smith
    X-Document-Date=2015-07-22
  3. Access a PDF document via HTTP and view the HTTP headers to confirm that they are being sent when the document is requested.

  4. Configure metadata mappings in Funnelback to index the HTTP headers.

  5. Run a full update in Funnelback and check to see that the metadata is detected and indexed.
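
The header check in step 3 can be scripted. The following Python sketch lists the X-Document-* headers returned for a document. The function names are hypothetical, the header prefix is taken from the example in step 2, and the sketch assumes the server answers a HEAD request with the same headers it would send for a normal download, which may not hold for every configuration.

```python
from urllib.request import Request, urlopen


def document_headers(header_pairs, prefix="x-document-"):
    """Keep only the headers whose name starts with the given prefix.

    header_pairs is an iterable of (name, value) tuples. Matching is
    case-insensitive because HTTP header names are case-insensitive.
    """
    return {
        name: value
        for name, value in header_pairs
        if name.lower().startswith(prefix)
    }


def fetch_document_headers(url):
    """Request the headers for url and return the metadata headers."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=60) as response:
        return document_headers(response.headers.items())
```

If fetch_document_headers returns an empty dictionary for a PDF URL, the headers are not being sent and the configuration in step 2 should be reviewed before running the Funnelback update.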