update.pl

update.pl is a wrapper around the entire collection update process, and calls the appropriate update subscripts.

update.pl <collection config> [update type]
update.pl <collection config> "-instant-update" <start url> [include pattern] [exclude pattern]
update.pl <collection config> "-instant-document-add" [file_1] [file_2] [file_3] ... [file_n]
update.pl <collection config> "-instant-document-delete" [file_1] [file_2] [file_3] ... [file_n]

Arguments

  • The collection configuration file must be specified, and must be a filesystem path to an existing, readable and valid collection configuration file.

  • Update type must be one of:

    • "-full"
    • "-incremental" (only valid for web and database collections)
    • "-checkpoint" (only valid for web collections)
    • "-inc_checkpoint" (only valid for web collections)
    • "-index"
    • "-swap-views"
    • "-reindex"
    • "-gather-only"
    • "-index-only"
    • "-instant-update" (only valid for web and filecopy collections)
    • "-instant-document-add" (only valid for web and filecopy collections)
    • "-instant-document-delete" (only valid for web and filecopy collections)
  • If no update type is specified, "-full" is assumed.
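
The argument rules above can be sketched as a small pre-flight check. This is an illustrative Python sketch, not part of update.pl: the function name resolve_update_type and the use of the plain strings "web", "database" and "filecopy" as collection type names are assumptions made for the example.

```python
# Update types restricted to certain collection types, as listed above.
# The collection type names are assumed to be these plain strings.
RESTRICTED = {
    "-incremental": {"web", "database"},
    "-checkpoint": {"web"},
    "-inc_checkpoint": {"web"},
    "-instant-update": {"web", "filecopy"},
    "-instant-document-add": {"web", "filecopy"},
    "-instant-document-delete": {"web", "filecopy"},
}

# Update types listed above without any collection type restriction.
UNRESTRICTED = {
    "-full", "-index", "-swap-views", "-reindex",
    "-gather-only", "-index-only",
}


def resolve_update_type(collection_type, update_type=None):
    """Return the effective update type, or raise ValueError if not allowed."""
    if update_type is None:
        return "-full"  # documented default when no update type is given
    if update_type in UNRESTRICTED:
        return update_type
    allowed = RESTRICTED.get(update_type)
    if allowed is None:
        raise ValueError(f"unknown update type: {update_type}")
    if collection_type not in allowed:
        raise ValueError(
            f"{update_type} is not valid for {collection_type} collections"
        )
    return update_type
```

A wrapper script could call such a check before invoking update.pl, so that an unsupported combination fails early with a readable message.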

Function

Broadly, update.pl will update a collection, in a manner relevant to the collection's type. The exact type of update depends on the update type specified.

"-full" (or no update type specified)

The collection will be fully updated: documents will be gathered to the offline directory, indexed, and then swapped into the live directory.

"-incremental"

The collection will be incrementally updated. Exactly what this means varies by collection type, but in most cases bandwidth and I/O are conserved as far as possible by reusing previously gathered data. For example, the crawler checks with web servers to see whether already crawled documents have changed; if a document has not changed, the previously crawled copy is reused instead of being downloaded again. Documents that have already been filtered are not filtered again. In all other respects, an incremental update is identical to a full update.

"-index"

This update type starts an update from the index stage: the gather stage is skipped, and the update then proceeds to the swap stage. This is useful if an update crashes or is halted after gathering has taken place but before indexing has finished.

"-swap-views"

This update type starts an update from the final swap stage, skipping all other stages. This is useful if an update crashes just after indexing but before swapping views. It can also be used to swap in an older copy of the data.

"-reindex"

This update type reindexes the current live directory, and does not perform any swapping of views. It can be useful for reindexing data that has been manually added to the live directory, although this is not a recommended way of updating.

"-gather-only"

This update type performs the gather stage of updating only. Documents will be downloaded or copied and filtered, but they will not be indexed or swapped.

"-index-only"

This update type performs the indexing stage of updating only. Documents will not be gathered, filtered or swapped.
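
The stages run by the update types described so far can be summarised as a lookup table. The following Python sketch only restates the descriptions above; the stage names "gather", "index" and "swap" and the helper stages_for are labels chosen for this example, not identifiers used by update.pl.

```python
# Stages performed by each update type, in order, per the descriptions above.
STAGES = {
    "-full": ("gather", "index", "swap"),
    "-incremental": ("gather", "index", "swap"),  # reuses earlier data
    "-index": ("index", "swap"),
    "-swap-views": ("swap",),
    "-reindex": ("index",),       # indexes the live directory, no swap
    "-gather-only": ("gather",),  # gathering includes filtering
    "-index-only": ("index",),
}


def stages_for(update_type="-full"):
    """Return the tuple of stages run for the given update type."""
    return STAGES[update_type]


if __name__ == "__main__":
    for name, stages in STAGES.items():
        print(f"{name:15s} {' -> '.join(stages)}")
```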

"-checkpoint"

Restarts a web update from a checkpoint if the web update has halted unexpectedly.

"-inc_checkpoint"

Restarts an incremental web update from a checkpoint if the web update has halted unexpectedly.
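
Several of the update types above exist to resume an interrupted update. As a rough sketch of how one might choose between them, the hypothetical Python helper below maps the stage in which an update halted to the update type that the descriptions suggest. The function name, its parameters and the stage labels are assumptions for illustration; no restart option is documented here for a non-web collection that halts while gathering, so that case returns None.

```python
def recovery_update_type(halted_during, collection_type, incremental=False):
    """Suggest an update type for resuming a halted update.

    halted_during is one of "gather", "index" or "swap" (labels assumed
    for this sketch). Returns None where no restart option is documented.
    """
    if halted_during == "gather":
        if collection_type != "web":
            return None  # checkpoints are only valid for web collections
        return "-inc_checkpoint" if incremental else "-checkpoint"
    if halted_during == "index":
        return "-index"       # skip gathering, then index and swap
    if halted_during == "swap":
        return "-swap-views"  # only the final swap remains
    raise ValueError(f"unknown stage: {halted_during}")
```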

"-instant-update"

If "-instant-update" is specified for a web collection, the next parameter is interpreted as a start URL, and documents under this URL will be gathered and added to the collection's live data and query index, without a full reindex being performed. An include pattern may optionally be specified after the start URL, and an exclude pattern may optionally be specified after the include pattern.
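
Because the start URL and the two patterns are positional, an exclude pattern can presumably only be given when an include pattern precedes it. The Python sketch below builds the argument list under that assumption. The helper instant_update_args is hypothetical, the flag uses the hyphenated spelling from this section's heading, and the example path, URL and patterns are placeholders.

```python
def instant_update_args(config_path, start_url, include=None, exclude=None):
    """Build an argument list for an instant update of a web collection."""
    if exclude is not None and include is None:
        # Positional arguments: the exclude slot comes after the include slot.
        raise ValueError("an exclude pattern requires an include pattern")
    args = ["update.pl", config_path, "-instant-update", start_url]
    if include is not None:
        args.append(include)
    if exclude is not None:
        args.append(exclude)
    return args


if __name__ == "__main__":
    # Placeholder values, not taken from a real installation.
    print(instant_update_args(
        "/path/to/collection.cfg", "http://example.com/news/", "/news/", "/archive/"
    ))
```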

"-instant-document-add"

If "-instant-document-add" is specified, the files listed after it on the command line will be added to the collection's live data directory and query index, without a full reindex being performed. This can be used to quickly add files to the collection's index.

"-instant-document-delete"

If "-instant-document-delete" is specified, the files listed after it on the command line will be deleted from the collection's live data directory and query index, without a full reindex being performed. This can be used to quickly delete files from a collection.
