Web crawler logs
The Funnelback web crawler writes its log files to the $SEARCH_HOME/data/<collection>/offline/log directory. Details on the main log files produced are given below.
crawler.log
The main web crawler log file, which details the overall progress and status of the crawl.
crawl.log.N.gz
Individual crawler thread logs, where N runs from 0 to one less than the number of crawler threads (19 by default).
...
Rejected: http://www.funnelback.com/css/styles.css
Unacceptable: http://www.squiz.net
Cached: http://www.funnelback.com/our-products
<DOCHDR>
<BASE HREF="http://www.funnelback.com/our-products">
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Vary: Accept-Encoding
Date: Fri, 16 May 2014 01:40:35 GMT
Cache-Control: private
X-UA-Compatible: IE=edge
Content-Length: 67764
X-FRAME-OPTIONS: SAMEORIGIN
X-Funnelback-Stored-Length: 67679
X-Funnelback-Last-Modification-Seen: 2014:05:16T11:40:35
X-Funnelback-Num-Times-Unchanged: 0
X-Funnelback-Num-Times-Copied: 0
X-Funnelback-Num-Times-Revisit-Skipped: 0
</DOCHDR>
Frontier Delay: 138 ms for frontier which contained URL: http://www.funnelback.com/our-products
Process: http://www.funnelback.com/our-products
Contacting http://www.funnelback.com/our-products [11:40:35:743]
GET connected to http://www.funnelback.com/our-products
GET Request Bytes: 274 Response Header Bytes: 221 URL: http://www.funnelback.com/our-products
GET from http://www.funnelback.com/our-products [text/html; charset=utf-8] [62013]
Signalled frontier for host: www.funnelback.com
Parsing http://www.funnelback.com/our-products
Parsed http://www.funnelback.com/our-products
Content Bytes: 61918 URL: http://www.funnelback.com/our-products
Scanner: http://www.funnelback.com/our-products
Extracted_Text: ...
MD5/Hash: 59d188b6f6a8bde267b796e2c9f1660f 1327251562 http://www.funnelback.com/our-products
http://www.funnelback.com/our-products sig_cache_size: 32
...
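As a rough illustration of how a thread log could be summarised, the following Python sketch counts lines by their leading keyword. The three prefixes are taken from the excerpt above; that every thread log uses exactly these prefixes, and the function names used here, are assumptions rather than documented behaviour.

```python
import gzip
from collections import Counter

# Prefixes copied from the sample excerpt; other prefixes may exist in real logs.
PREFIXES = ("Rejected:", "Unacceptable:", "Cached:")

def count_outcomes(lines):
    """Count lines that start with one of the known prefixes."""
    counts = Counter()
    for line in lines:
        for prefix in PREFIXES:
            if line.startswith(prefix):
                counts[prefix.rstrip(":")] += 1
                break
    return counts

def count_outcomes_in_file(path):
    """Same count, reading a gzip-compressed thread log (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return count_outcomes(fh)

sample = [
    "Rejected: http://www.funnelback.com/css/styles.css",
    "Unacceptable: http://www.squiz.net",
    "Cached: http://www.funnelback.com/our-products",
    "Process: http://www.funnelback.com/our-products",
]
print(count_outcomes(sample))
```

To run this over a real log, pass the path of one crawl.log.N.gz file to count_outcomes_in_file.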
stored.log
Lists all the URLs which were successfully stored by the crawler in chronological order.
http://www.funnelback.com
http://www.funnelback.com/our-products
http://www.funnelback.com/our-products/enterprise-search
...
url_errors.log
Records every error encountered while fetching URLs during the crawl, including HTTP error statuses, network exceptions and link extraction failures.
E http://www.funnelback.com/missing-page [404 Not Found] [2014:06:16:09:57:53]
E http://www.funnelback.com/large-file.pdf [Exceeds max_download_size: 104405535] [2014:01:09:11:46:53]
E http://www.funnelback.com/secure-section [403 Forbidden] [2014:06:16:09:57:53]
E http://www.funnelback.com/gone [410 Gone] [2014:06:16:09:57:53]
E http://www.funnelback.com/blogs [Can't scan root page] [2014:03:31:21:20:53]
E http://www.funnelback.com/popular-page [Net Error: Read timed out] [2014:06:03:10:11:16]
E http://www.funnelback.com/popular-page-2 [Net Error: Connection reset] [2014:06:03:19:46:44]
E xttp://www.funnelback.com/ [Link Extraction: java.net.MalformedURLException: unknown protocol: xttp] [2014:06:03:11:28:38]
...
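The entries above appear to follow the shape "E <url> [<reason>] [<timestamp>]", one per line. Assuming that shape holds throughout the file (it is inferred from the sample, not from a format specification), a minimal Python sketch for grouping errors by reason might look like this. The function names are illustrative only.

```python
import re
from collections import Counter

# Assumed entry shape, inferred from the sample: E <url> [<reason>] [<timestamp>]
ENTRY = re.compile(
    r"^E\s+(?P<url>\S+)\s+\[(?P<reason>.*)\]\s+\[(?P<ts>[^\]]+)\]\s*$"
)

def parse_url_errors(lines):
    """Yield (url, reason, timestamp) for each line matching the assumed shape."""
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            yield m.group("url"), m.group("reason"), m.group("ts")

def summarise(lines):
    """Count errors by reason, keeping only the text before the first colon
    so that e.g. all 'Net Error: ...' entries fall into one bucket."""
    counts = Counter()
    for _url, reason, _ts in parse_url_errors(lines):
        counts[reason.split(":", 1)[0]] += 1
    return counts

sample = [
    "E http://www.funnelback.com/missing-page [404 Not Found] [2014:06:16:09:57:53]",
    "E http://www.funnelback.com/large-file.pdf [Exceeds max_download_size: 104405535] [2014:01:09:11:46:53]",
    "E http://www.funnelback.com/popular-page [Net Error: Read timed out] [2014:06:03:10:11:16]",
    "E http://www.funnelback.com/popular-page-2 [Net Error: Connection reset] [2014:06:03:19:46:44]",
]
for reason, n in summarise(sample).most_common():
    print(n, reason)
```

Lines that do not match the assumed shape are skipped silently, which keeps the sketch tolerant of entries in other formats.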
redirects.txt
All encountered redirects, including HTTP redirects, HTML meta refresh, canonical URL references and server aliases.
# H = HTTP Redirect, M = Meta Refresh (HTML) Redirect, A = Aliased Server URL, D = Duplicate (based on MD5 of extracted text), C = Canonical Link Directive
A http://www2.funnelback.com/ -> http://www.funnelback.com/
H http://www.funnelback.com/rss -> http://www.funnelback.com/feeds/rss
D http://www.funnelback.com/news/latest -> http://www.funnelback.com/news
C http://www.funnelback.com/news?id=1234 -> http://www.funnelback.com/news/2014/03/02/-title
M http://www.funnelback.com/searchbetter -> http://www.funnelback.com/files/white-paper/search-better.pdf
...
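To show how the type codes in the legend can be used, here is a small Python sketch that groups redirect entries by code. It assumes each entry has the form "<code> <source> -> <target>" on its own line and that comment lines start with "#", as in the sample; the function names are hypothetical.

```python
from collections import defaultdict

# Type codes as given in the legend line of the sample.
KINDS = {
    "H": "HTTP redirect",
    "M": "meta refresh",
    "A": "aliased server",
    "D": "duplicate",
    "C": "canonical link",
}

def parse_redirects(lines):
    """Yield (code, source, target) for each entry, skipping comments and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        kind, _, rest = line.partition(" ")
        source, sep, target = rest.partition(" -> ")
        if kind in KINDS and sep:
            yield kind, source.strip(), target.strip()

def group_by_kind(lines):
    """Return a dict mapping each type code to its list of (source, target) pairs."""
    grouped = defaultdict(list)
    for kind, source, target in parse_redirects(lines):
        grouped[kind].append((source, target))
    return dict(grouped)

sample = [
    "# H = HTTP Redirect, M = Meta Refresh (HTML) Redirect, A = Aliased Server URL",
    "A http://www2.funnelback.com/ -> http://www.funnelback.com/",
    "H http://www.funnelback.com/rss -> http://www.funnelback.com/feeds/rss",
    "D http://www.funnelback.com/news/latest -> http://www.funnelback.com/news",
]
for kind, pairs in group_by_kind(sample).items():
    print(KINDS[kind], len(pairs))
```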
servers.log
Lists each server (sub-domain) encountered during the crawl, along with the number of documents left in its frontier when the crawl ended and the number of documents stored.
# Server                      Frontier  Stored
http://www.funnelback.com/    3         456
http://docs.funnelback.com/   0         1234
https://docs.funnelback.com/  0         1
...
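One use for this file is spotting servers that still had URLs queued when the crawl stopped. The Python sketch below assumes three whitespace-separated columns (server, frontier count, stored count) and a "#" comment header, as in the sample; the function name is illustrative.

```python
def parse_servers(lines):
    """Return a list of (server, frontier, stored) tuples.

    Comment lines and lines that do not have exactly three columns
    with integer counts are skipped.
    """
    rows = []
    for line in lines:
        if line.lstrip().startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3:
            continue
        server, frontier, stored = parts
        try:
            rows.append((server, int(frontier), int(stored)))
        except ValueError:
            continue
    return rows

sample = [
    "# Server Frontier Stored",
    "http://www.funnelback.com/ 3 456",
    "http://docs.funnelback.com/ 0 1234",
    "https://docs.funnelback.com/ 0 1",
]
rows = parse_servers(sample)
unfinished = [server for server, frontier, _stored in rows if frontier > 0]
total_stored = sum(stored for _server, _frontier, stored in rows)
print("Servers with URLs left in the frontier:", unfinished)
print("Total documents stored:", total_stored)
```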