Funnelback can store gathered content in a single WARC file for indexing; this is the default storage option for most collection types. Because the content is compressed and is not spread across a large number of files and directories, this approach saves storage space. It also simplifies transferring gathered data to another machine in a multi-server setup.
Configuring WARC Storage
To use a WARC store, set the relevant option in your collection.cfg file to the appropriate Java class. The option to set depends on the collection type:
- crawler.classes.URLStore for web collections
- store.raw-bytes.class for filecopy collections
- store.xml.class for database, connector and directory collections
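The mapping above can be sketched as a small Python helper that picks the option name for a collection type and produces a line for collection.cfg. This is only an illustration: the function name, the `key=value` line format, and the class name `com.example.HypotheticalWarcStore` are assumptions made for the example, not values taken from Funnelback. Check the documentation for each option for the real class name.

```python
# Collection type -> collection.cfg option, as listed above.
WARC_STORE_KEYS = {
    "web": "crawler.classes.URLStore",
    "filecopy": "store.raw-bytes.class",
    "database": "store.xml.class",
    "connector": "store.xml.class",
    "directory": "store.xml.class",
}


def warc_store_setting(collection_type, store_class):
    """Return a 'key=value' line for collection.cfg (format assumed)."""
    try:
        key = WARC_STORE_KEYS[collection_type]
    except KeyError:
        raise ValueError(f"No WARC store option known for {collection_type!r}")
    return f"{key}={store_class}"


# The class name below is a hypothetical placeholder, not a real Funnelback class.
print(warc_store_setting("filecopy", "com.example.HypotheticalWarcStore"))
```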
Several scripts are available for working with WARC files:
- warc.pl: General utility routines.
- index-warc-files.pl: Creates an index (table of contents) for a WARC file. Used in system upgrades.
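To illustrate what a table of contents for a WARC file can look like, the following Python sketch walks an uncompressed WARC stream and records the byte offset of each record. It is not the implementation of index-warc-files.pl, and it assumes plain, uncompressed records in the standard WARC layout (a version line, headers, a blank line, then `Content-Length` bytes of content). Funnelback's own files may be compressed or laid out differently. The function name `index_warc` is hypothetical.

```python
import io


def index_warc(stream):
    """Return a list of (offset, warc_type, target_uri, content_length).

    Assumes an uncompressed, seekable binary stream of WARC records.
    """
    entries = []
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"Unexpected data at offset {offset}")

        # Read the record headers up to the blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()

        # Skip over the content block without reading it.
        length = int(headers["content-length"])
        stream.seek(length, io.SEEK_CUR)
        entries.append(
            (offset, headers.get("warc-type"), headers.get("warc-target-uri"), length)
        )
    return entries
```

With such an index, a single record can be fetched by seeking straight to its offset instead of scanning the whole file.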