WarcCat
This feature is not available in the Squiz DXP.
WarcCat is a tool that can be used to inspect and perform basic operations on warc files.
Usage
WarcCat <Matcher> <Print options>
To print the usage of WarcCat, run the following from $SEARCH_HOME:
$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*" com.funnelback.warc.util.WarcCat
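Because every invocation repeats the same java path and classpath, a small shell function can shorten the commands. The following is only a sketch: the function name warccat is hypothetical (it is not shipped with the product), and it assumes SEARCH_HOME is set in the environment.

```shell
# Hypothetical wrapper (not part of the product): forwards all of its
# arguments to the WarcCat class using the invocation shown above.
# Assumes SEARCH_HOME is set in the environment.
warccat() {
    "$SEARCH_HOME/linbin/java/bin/java" \
        -classpath "$SEARCH_HOME/lib/java/all/*" \
        com.funnelback.warc.util.WarcCat "$@"
}
```

With the function defined, running warccat with no arguments is equivalent to the usage command above, and any WarcCat options can be appended after the function name.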
Warc stem
- -stem <warc stem>: The input stem of the warc file to be displayed, e.g. /foo/bar for the warc file /foo/bar.warc.
Matcher
Specifies the records within the warc file that will be selected.
One of the following should be specified:
- -matcher MatchAll: (Default) Match every record in the warc file.
- -matcher Bounded -MF start=<N> -MF end=<N>: Match a range of records from the warc file. The start and end values specify the range, where <N>=1 is the first record in the warc file.
- -matcher HeaderFieldRegex -MF headerFieldName=<NAME> -MF regex=<EXPR>: Match a set of values from a specified header using a regular expression.
- -matcher MatchURI -MF uri=<URI>: Match a specific URI.
- -matcher RegexURI -MF regex=<EXPR>: Match a set of URIs using a regular expression.
- -matcher MatchStartOfURI -MF prefix=<PREFIX>: Match a set of URIs using a URI prefix.
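As an illustration of how a matcher and its -MF fields are passed on the command line, the sketch below selects every record whose URI starts with a given prefix. The function name warc_records_under is hypothetical, SEARCH_HOME is assumed to be set, and no printer is given, so the default printer applies.

```shell
# Hypothetical helper: select all records whose URI starts with a prefix.
#   $1 = warc stem (path of the warc file without the .warc extension)
#   $2 = URI prefix to match
# No -printer is given, so WarcCat falls back to its default printer.
warc_records_under() {
    "$SEARCH_HOME/linbin/java/bin/java" \
        -classpath "$SEARCH_HOME/lib/java/all/*" \
        com.funnelback.warc.util.WarcCat \
        -stem "$1" \
        -matcher MatchStartOfURI -MF "prefix=$2"
}
```

For example, warc_records_under data/COLLECTION/live/data/funnelback-web-crawl http://docs.funnelback.com/ would select the records for that site. Quoting the -MF value keeps characters such as ? or & in a URI from being interpreted by the shell.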
Print options
Specifies how the selected records should be printed.
One of the following should be specified:
- -printer All -PF newLineBreakBetween=<B>: (Default) Print the warc headers and uncompressed content for the matching records from the warc file.
- -printer AllCompressed -PF newLineBreakBetween=<B>: Print the warc headers and compressed content for the matching records from the warc file.
- -printer ContentUncompressed -PF newLineBreakBetween=<B>: Print the content (uncompressed) only for the matching records from the warc file.
- -printer HeaderOnly -PF newLineBreakBetween=<B>: Print the warc headers only for the matching records from the warc file.
- -printer SplitIntoFiles -PF prefix=<PREFIX> -PF recordsPerFile=<N> -PF overwrite=<B>: Split a warc file into n-document chunks saved as separate warc files. prefix is the file name stem of the output warc files. recordsPerFile sets the maximum number of documents to include in each split warc file. When overwrite is set to true, existing output warc files will be overwritten.
The default value for newLineBreakBetween is false.
An additional option controlling the overall warc file header can also be specified:
- -printWarcInfo <B>: (Default = true) Print the warc file header. This should be printed at the start of any warc file. Set to false when appending records to an existing warc file.
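To show a printer combined with a matcher, the sketch below prints only the uncompressed content stored for a single URI. The function name warc_doc_content is hypothetical and SEARCH_HOME is assumed to be set. The -PF newLineBreakBetween field is omitted, so its default applies.

```shell
# Hypothetical helper: print only the stored content for one URI.
#   $1 = warc stem (path of the warc file without the .warc extension)
#   $2 = exact URI of the document
warc_doc_content() {
    "$SEARCH_HOME/linbin/java/bin/java" \
        -classpath "$SEARCH_HOME/lib/java/all/*" \
        com.funnelback.warc.util.WarcCat \
        -stem "$1" \
        -matcher MatchURI -MF "uri=$2" \
        -printer ContentUncompressed
}
```

A call such as warc_doc_content data/COLLECTION/live/data/funnelback-web-crawl http://funnelback.com/ can be piped into a pager or redirected to a file for inspection.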
Examples
View a warc file
Display everything in the warc file $SEARCH_HOME/data/COLLECTION/live/data/funnelback-web-crawl.warc (run from $SEARCH_HOME, since the stem is given as a relative path):
$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*" com.funnelback.warc.util.WarcCat -stem data/COLLECTION/live/data/funnelback-web-crawl
Create a warc file containing documents from other warc files
Create a warc file which consists of two documents from another warc file. First we will extract one document:
$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*" com.funnelback.warc.util.WarcCat -stem data/COLLECTION/live/data/funnelback-web-crawl -matcher MatchURI -MF "uri=http://funnelback.com/" -printer AllCompressed -printWarcInfo true > /tmp/newWarcFile.warc
Breaking down that command: the matcher is set to MatchURI, which requires the URI to be supplied with -MF uri=<doc URI>. The printer is set to AllCompressed, which prints both the headers and the content, compressing the content to save space. Finally, -printWarcInfo is set to true, which prepends the warc file header to the output. To append the second document to the warc file we run:
$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*" com.funnelback.warc.util.WarcCat -stem data/COLLECTION/live/data/funnelback-web-crawl -matcher MatchURI -MF "uri=http://docs.funnelback.com/" -printer AllCompressed -printWarcInfo false >> /tmp/newWarcFile.warc
This time we set -printWarcInfo to false, as the warc file already has a warc file header.
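The same pattern extends to any number of documents: print the warc file header with the first document only, then append the rest. The loop below is a sketch under that assumption. The function name warc_collect is hypothetical, SEARCH_HOME is assumed to be set, and the output file is overwritten if it already exists.

```shell
# Hypothetical helper: copy the records for several URIs into one new warc file.
#   $1 = input warc stem, $2 = output warc file, remaining arguments = URIs
warc_collect() {
    local stem=$1 out=$2 uri print_info=true
    shift 2
    : > "$out"    # start from an empty output file (overwrites any existing file)
    for uri in "$@"; do
        "$SEARCH_HOME/linbin/java/bin/java" \
            -classpath "$SEARCH_HOME/lib/java/all/*" \
            com.funnelback.warc.util.WarcCat \
            -stem "$stem" \
            -matcher MatchURI -MF "uri=$uri" \
            -printer AllCompressed -printWarcInfo "$print_info" >> "$out" || return 1
        print_info=false    # later documents are appended without the file header
    done
}
```

Calling warc_collect data/COLLECTION/live/data/funnelback-web-crawl /tmp/newWarcFile.warc http://funnelback.com/ http://docs.funnelback.com/ issues the same two WarcCat invocations as the commands above.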
Split a warc file into several smaller files
The SplitIntoFiles printer takes an input warc file (indicated by the -stem parameter) and splits it into multiple files, each containing at most n records (set by the recordsPerFile parameter).
$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*" com.funnelback.warc.util.WarcCat -stem $SEARCH_HOME/data/COLLECTION/live/data/funnelback-web-crawl -matcher MatchAll -printer SplitIntoFiles -PF prefix=$SEARCH_HOME/data/COLLECTION/live/data/funnelback-web-crawl-split -PF recordsPerFile=100000 -PF overwrite=true
The Bounded matcher can also be used to extract a range of records from a warc file.
# Extract the first 100000 records from funnelback-web-crawl.warc and write them to funnelback-web-crawl-1.warc
$SEARCH_HOME/linbin/java/bin/java -classpath "$SEARCH_HOME/lib/java/all/*:target/funnelback-warc-library.jar" com.funnelback.warc.util.WarcCat -stem $SEARCH_HOME/data/COLLECTION/live/data/funnelback-web-crawl -matcher Bounded -MF start=1 -MF end=100000 -printer AllCompressed > $SEARCH_HOME/data/COLLECTION/live/data/funnelback-web-crawl-1.warc
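Later ranges follow from simple arithmetic on the start and end fields documented for the Bounded matcher. The sketch below computes them for a given chunk number. The function name warc_chunk is hypothetical, SEARCH_HOME is assumed to be set, and end is assumed to be inclusive, as the "first 100000 records" example suggests. How WarcCat behaves when end exceeds the number of records in the file is not covered here.

```shell
# Hypothetical helper: extract the k-th chunk of a warc file with the Bounded matcher.
#   $1 = input warc stem, $2 = chunk number (1-based), $3 = records per chunk
# Writes <stem>-<chunk>.warc next to the input file.
warc_chunk() {
    local stem=$1 chunk=$2 size=$3
    local start=$(( (chunk - 1) * size + 1 ))   # record numbers are 1-based
    local end=$(( chunk * size ))               # assumed to be inclusive
    "$SEARCH_HOME/linbin/java/bin/java" \
        -classpath "$SEARCH_HOME/lib/java/all/*" \
        com.funnelback.warc.util.WarcCat \
        -stem "$stem" \
        -matcher Bounded -MF "start=$start" -MF "end=$end" \
        -printer AllCompressed > "${stem}-${chunk}.warc"
}
```

For example, warc_chunk $SEARCH_HOME/data/COLLECTION/live/data/funnelback-web-crawl 2 100000 would request records 100001 to 200000 and write them to funnelback-web-crawl-2.warc.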