Cache.cgi

Introduction

The cache.cgi script is responsible for displaying the cached documents that have been gathered by Funnelback during an update. It will highlight the search terms when it displays the document.

Syntax

The URL to cache.cgi is simply:

http://host/search/cache.cgi?collection=ID&doc=path

Where:

  • ID is the collection ID; and
  • path, the value of the doc parameter, is the file path to the document (see below).
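
As a rough illustration of the syntax above, the following Python sketch assembles a cache.cgi URL from its parts. The helper name build_cache_url and the sample values are hypothetical and are not part of Funnelback; the exact form that cache.cgi expects for the doc value is not specified here, so treat the argument as a placeholder.

```python
from urllib.parse import urlencode


def build_cache_url(host: str, collection: str, doc: str) -> str:
    """Return a cache.cgi URL for one cached document.

    Hypothetical helper: only the URL shape
    http://host/search/cache.cgi?collection=ID&doc=path
    comes from the documentation.
    """
    # urlencode percent-encodes reserved characters in each value,
    # so a slash in the document path becomes %2F and a space becomes +.
    query = urlencode({"collection": collection, "doc": doc})
    return f"http://{host}/search/cache.cgi?{query}"


if __name__ == "__main__":
    # Sample values are assumptions chosen for illustration only.
    print(build_cache_url("host", "shakespeare", "romeo_juliet/index.html"))
```

Encoding the query values avoids broken links when a document path contains spaces, ampersands, or other reserved characters.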

Cached documents

During an update, documents retrieved from the data source are written to the server's disk using a path similar to:

$SEARCH_HOME/data/<collection>/offline/data/<path>

where <path> is the path to the document. For example, the document from:

http://host/shakespeare/romeo_juliet/index.html

will be downloaded by the crawler and written to (on a Unix installation):

/opt/funnelback/data/collection/offline/data/host/shakespeare/romeo_juliet/index.html

If the update is successful, then the offline and live views will be swapped.
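
The mapping from a document URL to its cached file can be sketched in Python as follows. This is a minimal illustration that follows the path template $SEARCH_HOME/data/<collection>/offline/data/<path> given above, not the crawler's actual logic: the function name cached_path is hypothetical, the view parameter and its "live" value are assumptions based on the offline and live views mentioned above, and how the real crawler treats ports, query strings, or unusual characters is not covered.

```python
import posixpath
from urllib.parse import urlsplit


def cached_path(search_home: str, collection: str, url: str,
                view: str = "offline") -> str:
    """Return the assumed on-disk location of a crawled document.

    Hypothetical helper: it only mirrors the documented template
    $SEARCH_HOME/data/<collection>/offline/data/<path>.
    """
    parts = urlsplit(url)
    # The cached path is the host name followed by the URL path,
    # as in the romeo_juliet example.
    relative = parts.netloc + parts.path
    return posixpath.join(search_home, "data", collection, view, "data", relative)


if __name__ == "__main__":
    print(cached_path("/opt/funnelback", "collection",
                      "http://host/shakespeare/romeo_juliet/index.html"))
```

Passing view="live" would give the presumed location after a successful update has swapped the two views.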

Normally the /data directory is not served by your web server, so Funnelback uses cache.cgi to read the cached files from disk and display them in the end user's browser.

Document types

During an update, "binary" document types, such as PDF, are converted to plain text, and that text is kept as the cached document. As a result, cache.cgi will display:

  • HTML documents;
  • Plain text, extracted from binary documents; or
  • XML documents, possibly via an XSLT.

Security

If you do not want people to access the cached documents, then you can either:

  • turn off cached copies for the collection; or
  • delete the cached documents from disk.

These two methods will prevent people from accessing the cached documents even if they manually enter a URL that uses cache.cgi. However, deleting the documents will cause incremental updates to download the documents again, instead of re-using the cached versions that have not changed.

NB: Even if you turn off cached copies, people can still see the document snippets in the query results.

See also

collection.cfg options
