Web crawler logs
The Funnelback web crawler writes its log files to the $SEARCH_HOME/data/<collection>/offline/log
directory. The main log files are described below.
crawler.log
The main web crawler log file, which details the overall progress and status of the crawl.
crawl.log.N.gz
Individual crawler thread logs, where N ranges from 0 to the number of crawler threads minus 1 (19 by default).
    ...
    Rejected: http://www.funnelback.com/css/styles.css
    Unacceptable: http://www.squiz.net
    Cached: http://www.funnelback.com/our-products
    <DOCHDR>
    <BASE HREF="http://www.funnelback.com/our-products">
    HTTP/1.1 200 OK
    Content-Type: text/html; charset=utf-8
    Vary: Accept-Encoding
    Date: Fri, 16 May 2014 01:40:35 GMT
    Cache-Control: private
    X-UA-Compatible: IE=edge
    Content-Length: 67764
    X-FRAME-OPTIONS: SAMEORIGIN
    X-Funnelback-Stored-Length: 67679
    X-Funnelback-Last-Modification-Seen: 2014:05:16T11:40:35
    X-Funnelback-Num-Times-Unchanged: 0
    X-Funnelback-Num-Times-Copied: 0
    X-Funnelback-Num-Times-Revisit-Skipped: 0
    </DOCHDR>
    Frontier Delay: 138 ms for frontier which contained URL: http://www.funnelback.com/our-products
    Process: http://www.funnelback.com/our-products
    Contacting http://www.funnelback.com/our-products
    [11:40:35:743] GET connected to http://www.funnelback.com/our-products
    GET Request Bytes: 274 Response Header Bytes: 221 URL: http://www.funnelback.com/our-products
    GET from http://www.funnelback.com/our-products [text/html; charset=utf-8] [62013]
    Signalled frontier for host: www.funnelback.com
    Parsing http://www.funnelback.com/our-products
    Parsed http://www.funnelback.com/our-products
    Content Bytes: 61918 URL: http://www.funnelback.com/our-products
    Scanner: http://www.funnelback.com/our-products
    Extracted_Text: ...
    MD5/Hash: 59d188b6f6a8bde267b796e2c9f1660f 1327251562 http://www.funnelback.com/our-products http://www.funnelback.com/our-products
    sig_cache_size: 32
    ...
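As a rough illustration of how a thread log might be inspected, the following Python sketch tallies log lines by their leading keyword. The keywords come only from the sample above, so a real log may contain other line types. The function names are hypothetical and not part of Funnelback, and the sketch assumes one log entry per line.

```python
import gzip
from collections import Counter

# Line prefixes taken from the sample above; this list is an assumption,
# not an exhaustive catalogue of what the crawler writes.
PREFIXES = ("Rejected:", "Unacceptable:", "Cached:", "Process:", "Parsed")


def count_events(lines):
    """Count lines by leading keyword, ignoring lines with no known prefix."""
    counts = Counter()
    for line in lines:
        for prefix in PREFIXES:
            if line.startswith(prefix):
                counts[prefix.rstrip(":")] += 1
                break
    return counts


def count_events_in_file(path):
    """Read a gzip-compressed thread log (e.g. crawl.log.0.gz) as text."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_events(handle)
```

Comparing the "Rejected" and "Unacceptable" counts across thread logs can give a quick sense of how much of the crawl was filtered out before any content was fetched.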
stored.log
Lists every URL that the crawler stored successfully, in chronological order.
    http://www.funnelback.com
    http://www.funnelback.com/our-products
    http://www.funnelback.com/our-products/enterprise-search
    ...
url_errors.log
Details every error encountered while fetching URLs during the crawl, such as HTTP error statuses, network exceptions and link extraction failures.
    E http://www.funnelback.com/missing-page [404 Not Found] [2014:06:16:09:57:53]
    E http://www.funnelback.com/large-file.pdf [Exceeds max_download_size: 104405535] [2014:01:09:11:46:53]
    E http://www.funnelback.com/secure-section [403 Forbidden] [2014:06:16:09:57:53]
    E http://www.funnelback.com/gone [410 Gone] [2014:06:16:09:57:53]
    E http://www.funnelback.com/blogs [Can't scan root page] [2014:03:31:21:20:53]
    E http://www.funnelback.com/popular-page [Net Error: Read timed out] [2014:06:03:10:11:16]
    E http://www.funnelback.com/popular-page-2 [Net Error: Connection reset] [2014:06:03:19:46:44]
    E xttp://www.funnelback.com/ [Link Extraction: java.net.MalformedURLException: unknown protocol: xttp] [2014:06:03:11:28:38]
    ...
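The entries in the sample above appear to follow the pattern `E <url> [<message>] [<timestamp>]`, one per line. Assuming that pattern holds, a minimal Python sketch for parsing entries and counting errors by message could look like this. The layout and the timestamp format are inferred from the sample rather than from a specification, and all names are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed layout: E <url> [<message>] [<YYYY:MM:DD:HH:MM:SS>]
ENTRY = re.compile(r"^E\s+(\S+)\s+\[(.*)\]\s+\[([^\]]*)\]\s*$")


class UrlError(NamedTuple):
    url: str
    message: str
    when: datetime


def parse_entry(line: str) -> Optional[UrlError]:
    """Return a parsed entry, or None if the line does not fit the pattern."""
    match = ENTRY.match(line)
    if match is None:
        return None
    url, message, stamp = match.groups()
    try:
        # Timestamp format inferred from the sample entries.
        when = datetime.strptime(stamp, "%Y:%m:%d:%H:%M:%S")
    except ValueError:
        return None
    return UrlError(url, message, when)


def summarise(lines):
    """Count how often each error message occurs."""
    entries = (parse_entry(line) for line in lines)
    return Counter(entry.message for entry in entries if entry is not None)
```

Passing an open `url_errors.log` file object to `summarise` yields a tally such as the number of `404 Not Found` entries, which can help separate broken links from network problems.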
redirects.txt
Lists all redirects encountered, including HTTP redirects, HTML meta refreshes, canonical link directives and server aliases, as well as duplicates detected from the MD5 of the extracted text.
    # H = HTTP Redirect, M = Meta Refresh (HTML) Redirect, A = Aliased Server URL, D = Duplicate (based on MD5 of extracted text), C = Canonical Link Directive
    A http://www2.funnelback.com/ -> http://www.funnelback.com/
    H http://www.funnelback.com/rss -> http://www.funnelback.com/feeds/rss
    D http://www.funnelback.com/news/latest -> http://www.funnelback.com/news
    C http://www.funnelback.com/news?id=1234 -> http://www.funnelback.com/news/2014/03/02/-title
    M http://www.funnelback.com/searchbetter -> http://www.funnelback.com/files/white-paper/search-better.pdf
    ...
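Each entry in the sample above pairs a one-letter type code with a source and target URL separated by `->`. The Python sketch below shows one way such entries might be parsed, assuming one entry per line and the five codes listed in the header comment. The names are hypothetical.

```python
from typing import NamedTuple, Optional

# Type codes as described in the header comment of the sample.
KINDS = {
    "H": "HTTP redirect",
    "M": "meta refresh (HTML) redirect",
    "A": "aliased server URL",
    "D": "duplicate (based on MD5 of extracted text)",
    "C": "canonical link directive",
}


class Redirect(NamedTuple):
    kind: str
    source: str
    target: str


def parse_redirect(line: str) -> Optional[Redirect]:
    """Parse '<code> <source> -> <target>'; return None for other lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or header comment
    code, _, rest = line.partition(" ")
    source, arrow, target = rest.partition(" -> ")
    if code not in KINDS or not arrow:
        return None
    return Redirect(code, source.strip(), target.strip())


def targets_by_kind(lines, kind):
    """Collect source -> target pairs for a single type code, e.g. 'D'."""
    parsed = (parse_redirect(line) for line in lines)
    return {r.source: r.target for r in parsed if r is not None and r.kind == kind}
```

Calling `targets_by_kind(lines, "D")`, for example, lists the URLs treated as duplicates and the URL each was folded into.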
servers.log
Lists every sub-domain encountered, along with the number of documents left in the frontier at the end of the crawl and the number stored.
    # Server                       Frontier  Stored
    http://www.funnelback.com/     3         456
    http://docs.funnelback.com/    0         1234
    https://docs.funnelback.com/   0         1
    ...
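A non-zero frontier count suggests the crawl ended before every discovered URL on that server was processed. The following Python sketch reads rows in the form shown above and reports such servers. It assumes whitespace-separated columns (server, frontier, stored) with one row per line, and the names are hypothetical.

```python
from typing import List, NamedTuple


class ServerStats(NamedTuple):
    server: str
    frontier: int  # documents left in the frontier at the end of the crawl
    stored: int    # documents stored


def parse_servers(lines) -> List[ServerStats]:
    """Parse 'server frontier stored' rows, skipping comments and other lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3:
            continue
        server, frontier, stored = parts
        if not (frontier.isdigit() and stored.isdigit()):
            continue
        rows.append(ServerStats(server, int(frontier), int(stored)))
    return rows


def unfinished(rows) -> List[str]:
    """Servers that still had URLs queued when the crawl ended."""
    return [row.server for row in rows if row.frontier > 0]
```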