crawl.pl

crawl.pl gathers documents for web data sources.

$ crawl.pl <data source config> [-check | -incremental | -instant-update]

Arguments

  • The data source configuration file is required: a filesystem path to an existing, readable and valid data source configuration file (see the example below).
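
For example, assuming an illustrative data source named example-web:

$ crawl.pl $SEARCH_HOME/conf/example-web/collection.cfg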

Optional arguments (example invocations follow this list):

  • -check: Resume crawling from a previously stored checkpoint if one is available. This also implies that the offline view will not be emptied before the gather starts.

  • -incremental: Switch on incremental crawling mode. Documents are copied from the live view into the offline view when header information from the web server indicates that a document has not changed since the last crawl, saving bandwidth and I/O.

  • -instant-update: Assume that the gather is collecting new data that will form a small subset of the total stored data. In this case, the offline view is not emptied before the gather begins.
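
For example (the configuration path is illustrative; substitute the location of your own data source configuration file):

$ crawl.pl $SEARCH_HOME/conf/example-web/collection.cfg -check
$ crawl.pl $SEARCH_HOME/conf/example-web/collection.cfg -incremental
$ crawl.pl $SEARCH_HOME/conf/example-web/collection.cfg -instant-update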

Function

crawl.pl uses the Funnelback web crawler to gather web pages for a web data source. Unless -check or -instant-update is specified, the update process first empties the offline view data folder ($SEARCH_HOME/data/$COLLECTION_NAME/offline/data). It then crawls the documents specified by the start URLs file and the include and exclude pattern settings, storing the retrieved documents in that same offline view data folder.
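
A minimal sketch of the crawl-related settings in a data source configuration file is shown below. The key names and values are illustrative assumptions; check the documentation for your Funnelback version for the exact option names.

# Illustrative crawl settings (key names assumed, not verified against
# any particular Funnelback release).
start_url=https://www.example.com/
include_patterns=www.example.com
exclude_patterns=/login,/search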

See also