Cache controller

Every page that a search engine indexes is stored locally by Funnelback. This cached version of the page can be viewed from the search results, with the search query terms highlighted within the content. The cached version of a search result is accessed from the downward caret icon beside the URL of a search result.

[Image: cached results 01]

Clicking this menu item opens a cached version of the page with the queried keywords highlighted within the content. The cached version is the content as it was when Funnelback crawled the URL. The example below shows the cached version of this result when the user searched for the keyword quinoa.

[Image: cached results 02]

Syntax

The cache controller will respond to any request with the name cache, regardless of the extension:

/s/cache?collection=...&url=...
/s/cache.xml?collection=...&url=...
/s/cache.custom?collection=...&url=...

This allows you to customize the cache URL depending on the type of content that’s returned.

The following URL parameters are accepted:

collection

ID of the collection containing the cached document

profile

Profile to use to select the FreeMarker cache template

form

Name of the FreeMarker cache template (without .ftl)

url

URL of the document to display

doc

As an alternative to url, the caller can specify the path to the cached document to display, relative to the collection’s live data folder

offset

When doc is used, the offset to start reading content from the document (in bytes)

length

When doc is used, the length of the document content that should be read (in bytes)

hl

Regular expression used to highlight the user’s query. Optional.
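The parameters above are passed as an ordinary query string, so values such as a full document URL need to be percent-encoded. The following Groovy sketch shows one way a caller might assemble such a request. The buildCacheUrl helper and the example-collection ID are made up for illustration; the host name is the placeholder used elsewhere on this page.

```groovy
// Build a cache controller URL from a map of parameters.
// buildCacheUrl is a hypothetical helper, not part of Funnelback.
String buildCacheUrl(String baseUrl, Map<String, String> params) {
    // Percent-encode each value so a full document URL survives as a query parameter
    String query = params.collect { key, value ->
        "${key}=${URLEncoder.encode(value, 'UTF-8')}"
    }.join('&')
    return "${baseUrl}/s/cache?${query}"
}

String cacheUrl = buildCacheUrl('http://search.example.com', [
    collection: 'example-collection',   // hypothetical collection ID
    url       : 'http://www.example.com/recipes?id=1&lang=en',
    hl        : 'quinoa'                // highlight expression, optional
])
println cacheUrl
```

Without the encoding step, the & and = characters inside the document URL would be read as separate parameters of the cache request.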

URL vs DOC parameter

It’s usually better to use the url parameter to access a cached document, as Funnelback will fetch the document regardless of the storage mechanism (WARC files, XML files, MirrorStore, etc). However, in some circumstances the document cannot be retrieved from its URL alone:

  • Collections that weren’t gathered using a Funnelback built-in component

  • Collections for which the URL of the document as seen by the indexer is different from the URL used to store the document at gathering time. This can happen if the document URL is customized by setting a document URL in the XML configuration.

If both url and doc are present, the cache controller will first attempt a lookup by URL and, if that fails, will try to locate the document using the doc parameter.

The doc parameter is expected to be a file path relative to the collection live data folder ($SEARCH_HOME/data/<collection>/live/data/), for example ...&doc=folder/sub-folder/file. Any other form of path will be rejected (absolute paths, paths with parent folder specifiers like ..).
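The lookup order and the path rule described above can be sketched in Groovy as follows. This only mirrors the behaviour as documented here; it is not Funnelback's implementation. The function names and the two lookup callbacks are hypothetical.

```groovy
// Returns true when a doc value has the accepted shape: a relative path
// with no parent-folder specifiers.
boolean isAcceptableDocPath(String doc) {
    if (!doc) {
        return false                      // null or empty
    }
    if (doc.startsWith('/') || doc.startsWith('\\') || doc ==~ /^[A-Za-z]:.*/) {
        return false                      // absolute path (leading separator or drive letter)
    }
    // Reject any ".." path segment, whichever separator is used
    return !doc.split('[/\\\\]').any { it == '..' }
}

// Lookup order when both parameters are present: URL first, then doc.
// lookupByUrl and lookupByDoc are hypothetical callbacks that return the
// stored document, or null when nothing is found.
def locateDocument(String url, String doc, Closure lookupByUrl, Closure lookupByDoc) {
    def found = url ? lookupByUrl(url) : null
    if (found == null && isAcceptableDocPath(doc)) {
        found = lookupByDoc(doc)
    }
    return found
}

println isAcceptableDocPath('folder/sub-folder/file')
println isAcceptableDocPath('folder/../../secret')
```

The first call prints true and the second prints false, because the second path contains a parent-folder segment.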

Cache controller restrictions and limitations

Certain documents will not be returned by the cache controller:

  • Any document that was flagged as noarchive or nosnippet via robots meta tags.

  • XML documents generated as a result of padre’s XML document splitting (where the document field is set in the XML index settings).

  • All documents in an index where document level security is enabled.

Advanced features

The Modern UI cache controller provides flexible features to manipulate the document content before it’s sent back to the user.

Hook script

A pre_cache hook script will be run if present. This hook script should be created in $SEARCH_HOME/conf/<collection>/hook_pre_cache.groovy.

Data model

The data model available from within the hook script contains the following variables:

collection

A collection object identical to the one available in the search lifecycle hook scripts under the question.collection node. Collection configuration can be accessed through collection.configuration.value("key")

document

A composite object of class RecordAndMetadata with two fields: record contains the actual content of the cached document, and metadata is a map containing the document metadata from the storage layer (i.e. the metadata written by the gathering component that stored the document, such as the web crawler).

The document.record object is of class Record and contains the following fields:

primaryKey

Key used to store the record at gathering time. This is usually a URL, but could be something else like an ID from a database collection.

content

The actual content. This content can be of a different type depending on the type of data that was stored at gathering time.

Common content types are:

  • An array of bytes (Java type byte[]), usually from Web collections

  • An XML document (Java type org.w3c.dom.Document), usually from database, directory and connector collections
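Because the content field can hold different Java types, a hook script usually has to check the type before working with it. The sketch below shows one way to branch on the two common types. The describeContent function and the sample data are invented for illustration, and decoding the bytes as UTF-8 is an assumption, since the stored encoding may differ.

```groovy
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory

// Summarise a record's content according to its runtime type.
String describeContent(Object content) {
    if (content instanceof byte[]) {
        // Assumption: the bytes are UTF-8 encoded text
        String text = new String((byte[]) content, StandardCharsets.UTF_8)
        return "bytes: ${text.length()} characters"
    }
    if (content instanceof org.w3c.dom.Document) {
        return "xml: root element <${content.documentElement.nodeName}>"
    }
    return "other: ${content?.getClass()?.simpleName}"
}

// Stand-ins for what document.record.content might hold
byte[] webContent = '<html><body>Quinoa salad</body></html>'.getBytes(StandardCharsets.UTF_8)
def xmlContent = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream('<record><title>Quinoa</title></record>'.getBytes(StandardCharsets.UTF_8)))

println describeContent(webContent)
println describeContent(xmlContent)
```

In a real hook script the same checks would be applied to document.record.content rather than to locally built values.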

Return values, document modifications

The hook script can optionally return a boolean value. If it returns false, access to the cached document is denied and a 403 HTTP status code is returned. If it returns true, or returns no value, access to the cached document is granted.

The hook script can modify the fields of the document object to replace the content and metadata of the document. The replaced content will be used as the cached document content instead of the original document. This is achieved by assigning new values to the fields of the document object, e.g. document.record = new XmlRecord(...)
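The return-value rule can be summarised as a single predicate. This is a sketch rather than Funnelback's own code: isAccessGranted is a hypothetical name, and treating non-boolean values as granted is an assumption, since the text above only covers true, false and no return value.

```groovy
// Interpret the value produced by a hook script.
// Only an explicit boolean false denies access (a 403 is then returned);
// every other value, including null, is treated as "granted" here (assumption).
boolean isAccessGranted(Object hookResult) {
    return !Boolean.FALSE.equals(hookResult)
}

println isAccessGranted(false)   // denied
println isAccessGranted(true)    // granted
println isAccessGranted(null)    // no return value: granted
```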

Examples

Example 1: Prevent users from accessing cached copies for a specific site section

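// The primary key is usually the URL of the document
def url = document.record.primaryKey
// Deny access (403) to anything under /protected/, allow everything else
if (url =~ /.*\/protected\/.*/) {
  return false
} else {
  return true
}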

Example 2: Build a completely new HTML document to use as a cached copy, ignoring the original document

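// Keep the original primary key so the record still maps to the same URL
def url = document.record.primaryKey
def html = "A new document"
// Replace the stored record; the new content is served as the cached copy
document.record = new StringRecord(url, html)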

This last example makes use of the StringRecord class, a type of Record where the content is a simple string (as opposed to an array of bytes or an XML document as discussed previously).

FreeMarker template and Jsoup

Similar to search templates, the cache controller makes use of FreeMarker templates to display the cached content.

Cache template naming and location

The cache templates must be created in the profile folders for a collection, and must end in .cache.ftl. The default template name is simple, i.e. simple.cache.ftl. Multiple templates can be created and selected using the form URL parameter. For example with the URL ...&form=mobile&profile=PROFILE... the following FreeMarker template will be used: $SEARCH_HOME/conf/<collection>/PROFILE/mobile.cache.ftl.

If no cache template exists, the default one used is $SEARCH_HOME/conf/simple.cache.ftl. It is a good starting point: copy it into the profile folder of a collection when you want to create a customized template.
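The naming rules above amount to a small path-resolution step, sketched here in Groovy. This illustrates the documented convention only, not the controller's actual code. The resolveCacheTemplate function, the fileExists callback, the /opt/funnelback install path and the example-collection ID are all hypothetical.

```groovy
// Resolve which cache template file would be used for a request.
// searchHome stands for $SEARCH_HOME; fileExists is a hypothetical predicate
// so that the sketch runs without a Funnelback installation.
String resolveCacheTemplate(String searchHome, String collection, String profile,
                            String form, Closure<Boolean> fileExists) {
    String formName = form ?: 'simple'   // default template name
    String candidate = "${searchHome}/conf/${collection}/${profile}/${formName}.cache.ftl"
    if (fileExists(candidate)) {
        return candidate
    }
    return "${searchHome}/conf/simple.cache.ftl"   // installation-wide default
}

// Pretend only the mobile template has been created for this profile
def existing = ['/opt/funnelback/conf/example-collection/PROFILE/mobile.cache.ftl'] as Set

println resolveCacheTemplate('/opt/funnelback', 'example-collection', 'PROFILE', 'mobile', { existing.contains(it) })
println resolveCacheTemplate('/opt/funnelback', 'example-collection', 'PROFILE', null, { existing.contains(it) })
```

The first call resolves to the profile's mobile.cache.ftl. The second falls back to the default template because no simple.cache.ftl exists in the profile folder of this made-up setup.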

Data model and Jsoup

The following variables are available from the data model in the FreeMarker template:

url

URL of the document

collection

A collection object identical to the one available in the search lifecycle hook scripts under the question.collection node

profile

ID of the current profile

form

Name of the current cache form

requestURL

Current URL (URL used to access the cache controller, e.g. http://search.example.com/s/cache?collection=...)

metaData

Document storage metadata as a Map

httpRequest

Java’s HttpServletRequest for the current HTTP request

document

HTML document as a Jsoup object. Jsoup is a powerful HTML parser providing an easy API to modify HTML content. To use the Jsoup API within a FreeMarker template you’ll need to assign your API calls to temporary FreeMarker variables, as FreeMarker does not provide a way to call a function that doesn’t return a value. For example, if you wanted to append a CSS file to the HEAD of the document, use:

<#assign tmp = doc.head().append("<link rel=\"stylesheet\" type=\"text/css\" href=\"...\">") />

Once the document has been manipulated as desired, its HTML source must be written to the template in order to be rendered in the user’s browser. To do so, the last line of your template should be ${doc}.

For more examples, please consult the default cache template $SEARCH_HOME/conf/simple.cache.ftl.dist.

XML and XSL

By default, if a document is detected as XML, it will be returned as-is with an appropriate content type. Documents are detected as XML when:

  • They have been gathered by a component that stores XML records (database, directory collections)

  • They have been retrieved via the doc parameter (rather than url) and have a suffix of .xml

The cache controller also supports transforming XML documents using XSL stylesheets. XSL transformation will be automatically applied if a stylesheet named template.xsl exists under the current profile folder for the collection, e.g. $SEARCH_HOME/conf/<collection>/<profile>/template.xsl.

Note that the default content type for the XSL output is text/xml. If the XSL generates HTML output, the content type must be set to text/html with ui.modern.cache.form.[formName].content_type; otherwise the web browser will attempt to interpret the generated HTML as XML and is likely to display an XML syntax error.
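To preview what a template.xsl would produce before deploying it, the transformation can be reproduced with the standard Java XSLT API. The Groovy sketch below assumes a made-up XML record and a minimal stylesheet. It shows a generic XSL transformation, not the cache controller's internal code path.

```groovy
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

// Apply an XSL stylesheet to an XML document and return the output as a string.
String applyStylesheet(String xml, String xsl) {
    def transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)))
    def output = new StringWriter()
    transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(output))
    return output.toString()
}

// Made-up record and stylesheet, for illustration only
String xml = '<record><title>Quinoa</title></record>'
String xsl = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/record">
    <html><body><h1><xsl:value-of select="title"/></h1></body></html>
  </xsl:template>
</xsl:stylesheet>'''

println applyStylesheet(xml, xsl)
```

Because this stylesheet emits HTML, serving its output through the cache controller would require the text/html content type setting described above.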

Custom content type

A custom content type can be set on a per-form basis, similar to search forms. This is done using the ui.modern.cache.form.[formName].content_type setting.

Custom headers

Custom headers can be set using the ui.modern.cache.form.[formName].headers.[key] setting.