Update the index of a data source

Start or stop an update of a data source

Update from the search package data source listing

  1. From the search dashboard home page, locate your search package in the main listing.

  2. Locate the data source that you wish to update and select update (or advanced update) from the update menu associated with the data source.

Update from the data source configuration screen

  1. From the data source configuration screen, select update (or advanced update) from the update section.

Update from the data source management screen

  1. From the data source management screen, locate the data source that you wish to update in the main listing.

  2. Select update (or advanced update) from the update menu associated with the data source.

Scheduling data source index updates

Data sources can be scheduled to update at pre-determined times using the OS task scheduler. See: Scheduling updates

Data source update phases

Updating a data source occurs in four phases:

  • Gather: Collects the data to be indexed. For example, web data sources use a web crawler to download web pages and follow the links found in those pages. Binary documents, such as PDFs, are processed to extract their plain text.

  • Index: Processes the gathered documents and indexes words, phrases, HTML anchor text, and so on.

  • Report: Scans the documents that have been gathered and filtered, producing reports on their content.

  • Swap: All of the above work occurs in an offline view so that the live view, which is used for query processing, is not disrupted. If the update completes successfully, the live and offline views are swapped, making the new indexes available for querying.
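
The four phases can be pictured with a minimal Python sketch. This is a toy model for illustration only, not the product's implementation: the `View` and `DataSource` classes, the `run_update` function and the `fetch_all` callback are all hypothetical names. It shows the key property described above: work happens in an offline view, and the views are only swapped if every earlier phase succeeds.

```python
from dataclasses import dataclass, field


@dataclass
class View:
    """One copy of a data source's gathered documents and indexes (hypothetical)."""
    documents: dict = field(default_factory=dict)  # URL -> extracted plain text
    index: dict = field(default_factory=dict)      # word -> set of URLs
    report: dict = field(default_factory=dict)     # summary of the content


@dataclass
class DataSource:
    """Holds the view used for queries (live) and the one being built (offline)."""
    live: View = field(default_factory=View)
    offline: View = field(default_factory=View)


def run_update(source, fetch_all):
    """Run gather, index, report and swap. Return True if the views were swapped.

    fetch_all is an assumed callback that returns a mapping of URL -> text.
    """
    offline = View()
    try:
        # Gather: collect the data to be indexed.
        offline.documents = dict(fetch_all())
        # Index: record which documents contain each word.
        for url, text in offline.documents.items():
            for word in text.lower().split():
                offline.index.setdefault(word, set()).add(url)
        # Report: summarise what was gathered.
        offline.report = {
            "documents": len(offline.documents),
            "terms": len(offline.index),
        }
    except Exception:
        # A failed phase leaves the live view untouched; no swap happens.
        source.offline = offline
        return False
    # Swap: the new view goes live and the previous live view becomes offline.
    source.offline, source.live = source.live, offline
    return True
```

Because the previous live view is kept as the offline view after a swap, swapping again restores it, which is the idea behind the "Swap live and offline view" mode.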

Advanced updates

A data source has various advanced update modes that offer the ability to start a special type of update, or to resume a stopped or failed update from a specific update phase. The set of available modes will vary depending on the type of data source.

The available advanced update modes are:

  • Full update: Performs a complete gather (e.g. a web crawl or database export) from scratch. Documents are gathered, filtered and then indexed before they are swapped into the live area. This is the default setting.

  • Incremental update (Web/Database data sources only): Performs a complete web crawl, but rather than downloading all documents, the web crawler checks each document against the previous crawl and does not re-download it if it is unchanged. When crawling has completed, new documents are filtered and all documents are indexed before being swapped into the live area.

  • Refresh update (Web data source only): This is similar to an incremental update. It operates by copying all data from the live view to the offline view and then crawling on top of that. The crawl is then configured to cover less than the usual full or incremental crawl, so that it "refreshes" a subset of the data. URLs which generate an exception (e.g. "404 Not Found") during the crawl are removed from the store. For some store types, at the end of the crawl the list of crawled files (manifest) is merged with the previous manifest to ensure that all files are indexed in crawl order. This update type differs from an "Instant Update" in that it still uses the (potentially large) list of include/exclude patterns for the collection, rather than a restricted list defined for an instant update.

  • Add or re-add the specified site/directory (Instant update) (Web/Filecopy data sources only): Performs a limited crawl and re-index restricted to the start URL or directory and include/exclude patterns provided. This option is for including new content into an index as quickly as possible, without taking as long as a full update might take to complete. It is strongly recommended that full updates still be used regularly.

  • Remove the specified site/directory (Instant update) (Web/Filecopy data sources only): Removes the specified site or directory from the index, so that resources within this site or directory will no longer be returned as search results. Note that this works on a URL or path prefix, which must not contain any spaces.

  • Add or re-add the specified URLs/files (Instant update) (Web/Filecopy data sources only): Performs a crawl and re-index of the specified URLs or files only (i.e. links will not be followed). This option is for updating important pages in the index quickly, without taking as long as a full update might take. It is strongly recommended that full updates still be used regularly.

  • Remove the specified URLs/files (Instant update): Removes the specified URLs or files from the index, so that these items will no longer be returned as search results.

  • Restart crawl (Web data source only): Restarts a full web crawl from the latest checkpoint. When crawling has completed, the documents are filtered and indexed before being swapped into the live area. You can use this to resume a halted or crashed crawl from the most recent checkpoint.

  • Restart incremental crawl (Web data source only): As above, but restarts the web crawl as an incremental update rather than a full update.

  • Index: Indexes the existing offline data and then swaps the live and offline views. You can use this to index data from an update that has crashed or been halted.

  • Swap live and offline view: Swaps the live and offline views. This activates the offline view and deactivates the live view. You can use this to restore the previous view in the event that one of the above update types proves undesirable or if you wish to revert to the previous crawl and index.

  • Re-index live view: Indexes the existing live data and then activates the new indexes directly. This can be used to bring indexing configuration changes into effect directly without re-gathering data.
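
Two of the modes above are easier to follow with a small example. The Python sketch below is a conceptual model only. The function names are hypothetical, and the use of a per-document fingerprint to detect unchanged documents is an assumption made for illustration, since this page does not say how the crawler compares a document with the previous crawl. `incremental_gather` reuses unchanged documents rather than downloading them again, and `remove_by_prefix` drops every document whose URL starts with a given prefix, as the site/directory removal mode does.

```python
def incremental_gather(previous, remote_fingerprints, download):
    """Gather every document, downloading only those that changed.

    previous:            URL -> (fingerprint, content) from the previous crawl
    remote_fingerprints: URL -> fingerprint currently reported for each document
                         (an assumed change-detection mechanism)
    download:            callback that fetches the content for a URL
    Returns the new store and the list of URLs that were actually downloaded.
    """
    gathered = {}
    downloaded = []
    for url, fingerprint in remote_fingerprints.items():
        cached = previous.get(url)
        if cached is not None and cached[0] == fingerprint:
            # Unchanged since the previous crawl: reuse the stored copy.
            gathered[url] = cached
        else:
            # New or changed: download it again.
            gathered[url] = (fingerprint, download(url))
            downloaded.append(url)
    return gathered, downloaded


def remove_by_prefix(documents, prefix):
    """Return the documents whose URL or path does not start with prefix."""
    if " " in prefix:
        raise ValueError("prefix must not contain spaces")
    return {url: doc for url, doc in documents.items()
            if not url.startswith(prefix)}
```

For example, with a previous crawl holding two pages, one of which has since changed, and one newly published page, only the changed and new pages are downloaded. All three are still present in the result and would be indexed.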