crawl.pl gathers documents from web collections.
crawl.pl <collection config> [update type: -check, -incremental, -instant-update]
- The collection configuration file must be specified, and must be a filesystem path to an existing, readable and valid collection configuration file.
- "-check" may be specified
- "-incremental" may be specified
- "-instant-update" may be specified
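As a rough illustration of the argument rules above, the following Python sketch assembles a crawl.pl command line. The helper name build_crawl_command, the default script name, and the choice to accept at most one update type per call are assumptions made for this example; they are not part of crawl.pl. The sketch only checks that the configuration file exists and is readable, not that its contents are valid.

```python
import os

# The three update types named in the usage line above.
UPDATE_TYPES = ("-check", "-incremental", "-instant-update")


def build_crawl_command(config_path, update_type=None, script="crawl.pl"):
    """Return an argv list for crawl.pl (hypothetical helper, not part of crawl.pl)."""
    # The collection config must be an existing, readable file.
    # Whether its contents are valid is left to crawl.pl itself.
    if not (os.path.isfile(config_path) and os.access(config_path, os.R_OK)):
        raise FileNotFoundError(f"collection config not readable: {config_path}")

    argv = [script, config_path]
    if update_type is not None:
        # Assumption: one update type per invocation.
        if update_type not in UPDATE_TYPES:
            raise ValueError(f"unknown update type: {update_type}")
        argv.append(update_type)
    return argv
```

The returned list could be passed to subprocess.run if crawl.pl is on the PATH; the location of the script and of the collection configuration file depends on the installation.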
crawl.pl uses the Funnelback crawler to gather web pages for a web collection. With no other options, web-specific and collection-specific settings are read from the collection configuration file and used to control the gather process. The update process empties the offline view (as specified by the collection configuration data_root setting), and then crawls the documents specified by the start URLs file, include patterns and exclude patterns settings, storing the retrieved documents in the data_root. See web collection for more information.
Optional arguments are described below.
If the -check option is used, crawling will resume from a previously stored checkpoint if possible. This also implies that the offline view will not be emptied before the gather starts.
If the -incremental option is used, the crawler switches on incremental crawling mode: a document is copied from the live view into the offline view if header information from the web server indicates that it has not changed since the last crawl. This saves bandwidth and I/O.
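The text does not say which headers the crawler inspects. As a sketch only, the decision could look like the following Python, which assumes the ETag and Last-Modified response headers are compared against the values stored from the previous crawl. The function names and the action labels are hypothetical and do not describe the Funnelback crawler's actual implementation.

```python
def is_unchanged(previous, current):
    """Guess whether a document is unchanged, given two header dicts."""
    # Prefer ETag when both the stored and the new response carry one.
    if previous.get("ETag") and current.get("ETag"):
        return previous["ETag"] == current["ETag"]
    # Otherwise fall back to comparing the Last-Modified values verbatim.
    if previous.get("Last-Modified") and current.get("Last-Modified"):
        return previous["Last-Modified"] == current["Last-Modified"]
    # With neither header available, treat the document as changed.
    return False


def plan_action(previous, current):
    """Copy from the live view when unchanged, otherwise download again."""
    return "copy-from-live" if is_unchanged(previous, current) else "download"
```

Treating a document with no usable headers as changed is the conservative choice here: it costs a download but never keeps a stale copy.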
If -instant-update is specified, then the crawler assumes that it is gathering new data that will form a small subset of the total stored data. In this case, the offline view is not emptied before the gather begins.
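The effect of each option on the offline view, as stated in the descriptions above, can be summarised in a small lookup. This Python sketch is illustrative only; the names are hypothetical, and the -incremental case returns None because the text does not say whether the offline view is emptied in that mode.

```python
# True: offline view is emptied before the gather; False: it is kept.
# The None key stands for running crawl.pl with no update type.
EMPTIES_OFFLINE_VIEW = {
    None: True,
    "-check": False,
    "-instant-update": False,
}


def empties_offline_view(update_type=None):
    """Return True/False per the descriptions above, or None if unstated."""
    return EMPTIES_OFFLINE_VIEW.get(update_type)
```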