Cache controller
Every page that a search engine indexes is stored locally by Funnelback. This cached version of the page can be viewed from the search results, with the search query terms highlighted within the content. The cached version of a search result is accessed from the downward caret icon beside the URL of a search result.
Clicking this menu item opens a cached version of the page with the queried keywords highlighted within the content. The cached version is the content as it was when Funnelback crawled the URL. The example below shows the cached version of this result when the user searched for the keyword quinoa.
Syntax
The cache controller will respond to any request with the name cache, regardless of the extension:
/s/cache?collection=...&url=...
/s/cache.xml?collection=...&url=...
/s/cache.custom?collection=...&url=...
This allows you to customize the cache URL depending on the type of content that’s returned.
The following URL parameters are accepted:
collection
ID of the collection containing the cached document
profile
Profile to use to select the FreeMarker cache template
form
Name of the FreeMarker cache template (without .ftl)
url
URL of the document to display
doc
As an alternative to url, the caller can specify the path to the cached document to display, relative to the collection’s live data folder
offset
When doc is used, the offset to start reading content from the document (in bytes)
length
When doc is used, the length of the document content that should be read (in bytes)
hl
Regular expression used to highlight the user’s query. Optional.
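As an illustration of how these parameters combine, here is a minimal Python sketch that assembles a cache URL. The helper name build_cache_url and the collection ID are hypothetical, and the plain keyword passed as hl is only an assumption about what a simple highlighting expression could look like. The sketch only shows the query string structure and percent-encoding of values.

```python
from urllib.parse import urlencode

def build_cache_url(base, collection, url, profile=None, form=None, hl=None):
    """Assemble a cache controller URL from the parameters described above."""
    params = {"collection": collection, "url": url}
    # Optional parameters are only added when a value is supplied.
    for name, value in (("profile", profile), ("form", form), ("hl", hl)):
        if value is not None:
            params[name] = value
    # urlencode percent-encodes reserved characters in each value.
    return f"{base}/s/cache?{urlencode(params)}"

cache_url = build_cache_url(
    "http://search.example.com",   # search host, as in the requestURL example
    "example-collection",          # hypothetical collection ID
    "http://www.example.com/recipes/quinoa.html",  # hypothetical document URL
    hl="quinoa",                   # assumed: a plain keyword used as the regex
)
print(cache_url)
```

Encoding the url value matters because the document URL itself contains characters such as : and / that would otherwise be ambiguous inside a query string.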
URL vs DOC parameter
It’s usually better to use the url parameter to access a cached document, as Funnelback will fetch the document regardless of the storage mechanism (WARC files, XML files, MirrorStore, etc). However, in some circumstances the document cannot be retrieved from its URL alone:
- Collections that weren’t gathered using a Funnelback built-in component
- Collections for which the URL of the document as seen by the indexer is different from the URL used to store the document at gathering time. That can happen if the document URL is customized by setting a document URL in the XML configuration.
If both url and doc are present, the cache controller will first attempt a lookup by URL and, if it fails, will try to locate the document from the doc parameter.
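The lookup order described above can be pictured with a short Python sketch. This is not Funnelback's implementation: the function locate_cached_document and the two lookup callables are hypothetical stand-ins for the storage layer, used only to show the "URL first, then doc" fallback.

```python
def locate_cached_document(url, doc, lookup_by_url, lookup_by_doc):
    """Try the URL first, then fall back to the doc path."""
    if url is not None:
        found = lookup_by_url(url)
        if found is not None:
            return found
    if doc is not None:
        return lookup_by_doc(doc)
    return None

# Hypothetical in-memory stores standing in for the real storage mechanisms.
by_url = {"http://www.example.com/a": "content A"}
by_doc = {"folder/file": "content B"}

hit_by_url = locate_cached_document(
    "http://www.example.com/a", "folder/file", by_url.get, by_doc.get)
hit_by_doc = locate_cached_document(
    "http://www.example.com/missing", "folder/file", by_url.get, by_doc.get)
```

In this sketch the first call is answered by the URL lookup, while the second only succeeds because the doc path is supplied as a fallback.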
The doc parameter is expected to be a file path relative to the collection live data folder ($SEARCH_HOME/data/<collection>/live/data/), for example ...&doc=folder/sub-folder/file. Any other form of path will be rejected (absolute paths, paths with parent folder specifiers like ..).
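To make the path rule concrete, the following Python sketch checks a doc value the way the paragraph above describes: relative paths are accepted, absolute paths and paths containing a .. segment are not. The function name is_acceptable_doc_path is hypothetical and the check is only an approximation of the rule as stated, not the controller's actual validation code.

```python
from pathlib import PurePosixPath

def is_acceptable_doc_path(doc):
    """Return True for a relative path with no parent folder specifiers."""
    if not doc:
        return False
    path = PurePosixPath(doc)
    if path.is_absolute():
        return False
    # PurePosixPath keeps ".." segments as-is, so they can be detected here.
    return ".." not in path.parts

ok = is_acceptable_doc_path("folder/sub-folder/file")
absolute = is_acceptable_doc_path("/etc/passwd")
parent = is_acceptable_doc_path("folder/../../file")
```

A caller building cache links could run a check like this before adding the doc parameter, so that links which would be rejected are never generated.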
Cache controller restrictions and limitations
Certain documents will not be returned by the cache controller:
- Any document that was flagged as noarchive or nosnippet via robots meta tags.
- XML documents generated as a result of padre’s XML document splitting (where the document field is set in the XML index settings).
- All documents in an index where document level security is enabled.
Advanced features
The Modern UI cache controller provides flexible features to manipulate the document content before it’s sent back to the user.
Hook script
A pre_cache hook script will be run if present. This hook script should be created in $SEARCH_HOME/conf/<collection>/hook_pre_cache.groovy.
Data model
The data model available from within the hook script contains the following variables:
collection
A collection object identical to the one available in the search lifecycle hook scripts under the question.collection node. Collection configuration can be accessed through collection.configuration.value("key").
document
A composite object of class RecordAndMetadata with two fields: record contains the actual content of the cached document, and metadata is a map containing the document metadata from the storage layer (i.e. the metadata specifically written by the gathering component that stored the document, such as the web crawler).
The document.record object is of class Record and contains the following fields:
primaryKey
Key used to store the record at gathering time. This is usually a URL, but could be something else like an ID from a database collection.
content
The actual content. This content can be of a different type depending on the type of data that was stored at gathering time.
Common content types are:
- An array of bytes (Java type byte[]), usually from Web collections
- An XML document (Java type org.w3c.dom.Document), usually from database, directory and connector collections
Return values, document modifications
The hook script can optionally return a boolean value. If false is returned, access to the cached document is denied and a 403 HTTP status code is returned. If true is returned, or if there is no return value, access to the cached document is granted.
The hook script can modify the fields of the document object to replace the content and metadata of the document. The replaced content will be used as the cached document content instead of the original document. This is achieved by assigning new values to the fields of the document object, e.g. document.record = new XmlRecord(...).
Examples
Example 1: Prevent users from accessing cached copies for a specific site section
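// The primary key is usually the URL of the stored document
def url = document.record.primaryKey
// Returning false denies access (HTTP 403) for anything under /protected/
if (url =~ /.*\/protected\/.*/) {
    return false
} else {
    return true
}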
Example 2: Build a completely new HTML document to use as a cached copy, ignoring the original document
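// Reuse the original primary key so the new record keeps the same identity
def url = document.record.primaryKey
def html = "A new document"
// Replace the stored record; the original content is discarded
document.record = new StringRecord(url, html)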
This last example makes use of the StringRecord class, a type of Record where the content is a simple string (as opposed to an array of bytes or an XML document as discussed previously).
FreeMarker template and Jsoup
Similar to search templates, the cache controller makes use of FreeMarker templates to display the cached content.
Cache template naming and location
The cache templates must be created in the profile folders for a collection, and must end in .cache.ftl. The default template name is simple, i.e. simple.cache.ftl. Multiple templates can be created and selected using the form URL parameter. For example, with the URL ...&form=mobile&profile=PROFILE... the following FreeMarker template will be used: $SEARCH_HOME/conf/<collection>/PROFILE/mobile.cache.ftl.
If no cache template exists, the default one used is $SEARCH_HOME/conf/simple.cache.ftl. It’s usually a good starting point, and copying it into the profile folder of a collection is the recommended way to begin a customized template.
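The naming rules above can be summarized in a small Python sketch that computes which template file would be selected for a given collection, profile and form. The function resolve_cache_template and its exists argument are hypothetical, and falling back to the global default whenever the profile-level file is missing is an assumption based on the description above rather than a statement of the controller's exact behavior.

```python
import os
import posixpath

def resolve_cache_template(search_home, collection, profile, form="simple",
                           exists=os.path.exists):
    """Return the profile-level template if present, else the global default."""
    candidate = posixpath.join(
        search_home, "conf", collection, profile, form + ".cache.ftl")
    if exists(candidate):
        return candidate
    # Assumed fallback: the default template shipped under conf/.
    return posixpath.join(search_home, "conf", "simple.cache.ftl")

# Pretend only the "mobile" template exists in this hypothetical installation.
present = {"/opt/funnelback/conf/example/PROFILE/mobile.cache.ftl"}

mobile = resolve_cache_template(
    "/opt/funnelback", "example", "PROFILE", "mobile", exists=present.__contains__)
default = resolve_cache_template(
    "/opt/funnelback", "example", "PROFILE", exists=present.__contains__)
```

Here /opt/funnelback stands in for $SEARCH_HOME and example for the collection ID. Both are placeholders chosen for the illustration.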
Data model and Jsoup
The following variables are available from the data model in the FreeMarker template:
url
URL of the document
collection
A collection object identical to the one available in the search lifecycle hook scripts under the question.collection node
profile
ID of the current profile
form
Name of the current cache form
requestURL
Current URL (URL used to access the cache controller, e.g. http://search.example.com/s/cache?collection=...)
metaData
Document storage metadata as a Map
httpRequest
Java’s HttpServletRequest for the current HTTP request
document
HTML document as a Jsoup object. Jsoup is a powerful HTML parser providing an easy API to modify HTML content. To use the Jsoup API within a FreeMarker template you’ll need to assign your API calls to temporary FreeMarker variables, as FreeMarker does not provide a way to call a function that doesn’t return a value. For example, if you wanted to append a CSS file to the HEAD of the document, use:
<#assign tmp = doc.head().append("<link rel=\"stylesheet\" type=\"text/css\" href=\"...\">") />
Once the document has been manipulated as desired, its HTML source must be written to the template in order to be rendered in the user’s browser. To do so, the last line of your template should be ${doc}.
For more examples, please consult the default cache template $SEARCH_HOME/conf/simple.cache.ftl.dist.
XML and XSL
By default, if a document is detected as XML, it will be returned as-is with an appropriate content type. Documents are detected as XML when:
- They have been gathered by a component that stores XML records (database, directory collections)
- They have been retrieved via the doc parameter (rather than url) and have a suffix of .xml
The cache controller also supports transforming XML documents using XSL stylesheets. XSL transformation is applied automatically if a stylesheet named template.xsl exists under the current profile folder for the collection, e.g. $SEARCH_HOME/conf/<collection>/<profile>/template.xsl. Note that the default content type for the XSL output is text/xml. If the XSL generates HTML output, the content type must be set to text/html with ui.modern.cache.form.[formName].content_type, otherwise the web browser will attempt to interpret the generated HTML as XML and is likely to display an XML syntax error.
Custom content type
A custom content type can be set on a per-form basis, similar to search forms. This is done using the ui.modern.cache.form.[formName].content_type setting.
Custom headers
Custom headers can be set using the ui.modern.cache.form.[formName].headers.[key] setting.