Configuring instant updates

Background

Instant updates are a way of adding and removing content from web collections between full or incremental updates.

Update process

An instant update is initiated in one of the following ways:

  • via the admin UI - select advanced update and choose one of the four available start new instant update options (note: web and filecopy collections only).

  • via the feed API (Funnelback 15.12 and earlier) - see: Instant updates: Funnelback feeds

  • running update.pl from the command line and specifying an instant update mode.

When the instant update runs, new content (generated by one of the add update types) is handled separately from the files associated with the main index. Specifically:

  • Data is downloaded to a secondary-data folder (and stored in a MirrorStore, regardless of the storage class specified in the collection.cfg)

  • Indexes are written to the live/idx folder with an index stem of index_secondary ($SEARCH_HOME/data/$COLLECTION_NAME/live/idx/index_secondary)

When content is added to a secondary index, a check is made to see whether it already exists in the current live index; if it does, the existing copy is killed in the live index.

An instant delete operation is equivalent to killing the document from the live index.

Instant update phases

Instant update

Purpose: Add or re-add a site to the index.

Update phases:

  • instant-gather

  • instant-convert

  • instant-index

Instant delete

Purpose: Remove sites from the index.

Update phases:

  • delete-prefix

Instant document add

Purpose: Add or re-add individual URLs to the index.

Update phases:

  • instant-gather

  • instant-convert

  • instant-index

Instant document delete

Purpose: Remove individual URLs from the index.

Update phases:

  • delete-list
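The four update types and their phases can be summarised as a small lookup table. The following is only an illustrative sketch: the dictionary and helper names are hypothetical, and the type labels are simply the headings used above, not identifiers accepted by any Funnelback tool.

```python
# Phases run by each instant update type, transcribed from the lists above.
# The keys are this page's headings, not command-line or API values.
INSTANT_UPDATE_PHASES = {
    "instant update": ["instant-gather", "instant-convert", "instant-index"],
    "instant delete": ["delete-prefix"],
    "instant document add": ["instant-gather", "instant-convert", "instant-index"],
    "instant document delete": ["delete-list"],
}


def phases_for(update_type):
    """Return the ordered phases for one instant update type."""
    return list(INSTANT_UPDATE_PHASES[update_type])


def all_phases():
    """Return every distinct phase, in first-seen order."""
    seen = []
    for phases in INSTANT_UPDATE_PHASES.values():
        for phase in phases:
            if phase not in seen:
                seen.append(phase)
    return seen
```

all_phases() gives the full set of phases that workflow commands may need to cover: the three add phases plus delete-prefix and delete-list.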

Configuring instant updates

Workflow

For instant updates to run correctly, any workflow commands applied to a normal update must also be configured for the instant update, using the equivalent instant update workflow commands attached to the correct instant update phases (as listed above). This means that you will need to maintain workflow commands for both standard and instant updates.

As a general rule:

  • Convert any pre/post_gather commands to pre/post_instant-gather commands and add these to the collection.cfg (leave the existing pre/post commands as these are still needed for normal updates).

  • Convert any pre/post_index commands to pre/post_instant-index commands (leave the existing pre/post commands as these are still needed for normal updates).

  • Convert any index stems in the added commands to use $SEARCH_HOME/data/$COLLECTION_NAME/live/idx/index_secondary instead of $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index.

  • Replace any uses of $CURRENT_VIEW in the instant update commands with live.

  • Post update (and post swap, pre/post meta dependencies etc.) workflow should be run as post instant-index, post delete-prefix and post delete-list commands, as instant updates only have the phases listed above (there is no swap or meta dependencies phase, and hence no post-update phase).

e.g.

# Post index command, runs on normal updates
post_index_command=$SEARCH_HOME/bin/padre-gs $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index $SEARCH_HOME/conf/$COLLECTION_NAME/gscopes.cfg
# Equivalent instant update command
post_instant-index_command=$SEARCH_HOME/bin/padre-gs $SEARCH_HOME/data/$COLLECTION_NAME/live/idx/index_secondary $SEARCH_HOME/conf/$COLLECTION_NAME/gscopes.cfg
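The conversion rules above are mechanical, so they can be sketched as a small helper. This is a minimal sketch and not a Funnelback tool: the function and constant names are hypothetical, and the pre_gather, post_gather and pre_index key names are assumed to follow the same pattern as the post_index_command example shown above. It only derives the instant-update line; the original line should stay in collection.cfg for normal updates.

```python
import re

# Normal-update key -> instant-update key, following the rules above.
# Key names other than post_index_command are assumed from the same pattern.
KEY_MAP = {
    "pre_gather_command": "pre_instant-gather_command",
    "post_gather_command": "post_instant-gather_command",
    "pre_index_command": "pre_instant-index_command",
    "post_index_command": "post_instant-index_command",
}

MAIN_STEM = "$SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index"
SECONDARY_STEM = "$SEARCH_HOME/data/$COLLECTION_NAME/live/idx/index_secondary"

# Match the main stem only when it is not part of a longer name.
STEM_PATTERN = re.compile(re.escape(MAIN_STEM) + r"(?![\w-])")


def to_instant_command(line):
    """Return the instant-update equivalent of a key=value line, or None."""
    key, sep, value = line.partition("=")
    key = key.strip()
    if not sep or key not in KEY_MAP:
        return None  # comment, blank line, or a key with no listed equivalent
    # Swap the index stem first, then any remaining $CURRENT_VIEW references.
    value = STEM_PATTERN.sub(lambda match: SECONDARY_STEM, value)
    value = value.replace("$CURRENT_VIEW", "live")
    return f"{KEY_MAP[key]}={value}"
```

Applied to the post_index_command line above, the helper produces the post_instant-index_command line shown beneath it. Any generated command should still be reviewed by hand, since a command may depend on files that only exist after a full update.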

Administration interface

The following admin UI permissions are available:

  • sec.instant.update - controls the ability to run any sort of update. There is currently no option to grant access to instant updates only.

  • sec.perform.feed - controls the user’s ability to access the feed interface.

If you wish an administration interface user to have access to instant updates you will need to grant one or both of these permissions by editing their .ini file (located under $SEARCH_HOME/admin/users/<USERNAME>.ini).

You also need to set the user type to custom.

e.g.

# Allow access to instant updating in the admin UI
sec.instant.update = yes
# Allow access to instant updating via the API/feed interface
sec.perform.feed = yes

Logs and indexes

  • High level update messages are written to the standard update-<COLLECTION NAME>.log.

  • Detailed logs are written to the collection’s live/secondary-logs folder.

  • Indexes are written to the collection’s live/secondary-index folder while being built (a bit like offline) then moved to the live/idx folder with a stem of index_secondary.

Caution / gotchas

  • It is possible to link the feed interface into the workflow of external programs. For example, it might seem desirable to call the feed interface from a CMS as part of the workflow that runs when a page is published. Calling it once per published page should be avoided, because it can raise too many requests at a single point in time (e.g. if a bulk change is made). Instead, the CMS should be configured to batch up changes and submit them to the feed interface at a regular interval.

  • It is not possible to start an instant update if the collection is already updating.

  • Running an instant update will lock the collection from updating (so a standard update, or another instant update, can’t be run while an instant update is running).

  • Instant updates apply only to web and filecopy collections.
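The batching advice above can be sketched on the CMS side as follows. This is a minimal sketch under stated assumptions: ChangeBatcher and its submit callable are hypothetical names, and how a batch is actually sent to the feed interface is deliberately left to the caller. Because an instant update cannot start while the collection is updating, the sketch puts a batch back on the queue if the submission fails so that it can be retried at the next interval.

```python
import threading


class ChangeBatcher:
    """Collect changed URLs and hand them over as one batch per interval."""

    def __init__(self, submit):
        # submit is a hypothetical callable that takes a list of URLs and
        # raises an exception if the batch could not be submitted.
        self._submit = submit
        self._pending = {}  # dict de-duplicates and keeps insertion order
        self._lock = threading.Lock()

    def record(self, url):
        """Called by the publish workflow; does not contact the feed."""
        with self._lock:
            self._pending[url] = None

    def flush(self):
        """Submit everything recorded so far; return the batch size."""
        with self._lock:
            batch = list(self._pending)
            self._pending = {}
        if not batch:
            return 0
        try:
            self._submit(batch)
        except Exception:
            # Re-queue ahead of anything recorded meanwhile, then re-raise.
            with self._lock:
                requeued = dict.fromkeys(batch)
                requeued.update(self._pending)
                self._pending = requeued
            raise
        return len(batch)
```

The publish workflow would call record() for each changed page, while a scheduled job calls flush() at a fixed interval, so a bulk change of many pages results in a single submission.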