Web crawler logs

The Funnelback web crawler writes its log files to the $SEARCH_HOME/data/<collection>/offline/log directory. Details on the main log files produced are given below.



The main web crawler log file, which details the overall progress and status of the crawl.


Logs messages from filters that run as part of the crawl.


Individual crawler thread logs, one per crawler thread, where N runs from 0 to the number of crawlers minus 1 (so the highest N is 19 by default).

  Rejected: http://www.funnelback.com/css/styles.css
  Unacceptable: http://www.squiz.net
  Cached: http://www.funnelback.com/our-products

<BASE HREF="http://www.funnelback.com/our-products">
     HTTP/1.1 200 OK
     Content-Type: text/html; charset=utf-8
     Vary: Accept-Encoding
     Date: Fri, 16 May 2014 01:40:35 GMT
     Cache-Control: private
     X-UA-Compatible: IE=edge
     Content-Length: 67764
     X-Funnelback-Stored-Length: 67679
     X-Funnelback-Last-Modification-Seen: 2014:05:16T11:40:35
     X-Funnelback-Num-Times-Unchanged: 0
     X-Funnelback-Num-Times-Copied: 0
     X-Funnelback-Num-Times-Revisit-Skipped: 0

   Frontier Delay: 138 ms for frontier which contained URL:
   Process: http://www.funnelback.com/our-products
   Contacting http://www.funnelback.com/our-products [11:40:35:743]
   GET connected to http://www.funnelback.com/our-products
   GET Request Bytes: 274 Response Header Bytes: 221 URL:
   GET from http://www.funnelback.com/our-products [text/html; charset=utf-8] [62013]
   Signalled frontier for host: www.funnelback.com
   Parsing http://www.funnelback.com/our-products
   Parsed http://www.funnelback.com/our-products
   Content Bytes: 61918 URL: http://www.funnelback.com/our-products
   Scanner: http://www.funnelback.com/our-products
   Extracted_Text: ...
   MD5/Hash: 59d188b6f6a8bde267b796e2c9f1660f 1327251562  http://www.funnelback.com/our-products
   http://www.funnelback.com/our-products sig_cache_size: 32


Lists, in chronological order, all URLs that the crawler stored successfully.



Details every error encountered while attempting to fetch URLs during the crawl, including HTTP error statuses, network exceptions, downloads exceeding the size limit and link extraction failures.

  E http://www.funnelback.com/missing-page [404 Not Found] [2014:06:16:09:57:53]
  E http://www.funnelback.com/large-file.pdf [Exceeds max_download_size: 104405535] [2014:01:09:11:46:53]
  E http://www.funnelback.com/secure-section [403 Forbidden] [2014:06:16:09:57:53]
  E http://www.funnelback.com/gone [410 Gone] [2014:06:16:09:57:53]
  E http://www.funnelback.com/blogs [Can't scan root page] [2014:03:31:21:20:53]
  E http://www.funnelback.com/popular-page [Net Error: Read timed out] [2014:06:03:10:11:16]
  E http://www.funnelback.com/popular-page-2 [Net Error: Connection reset] [2014:06:03:19:46:44]
  E xttp://www.funnelback.com/ [Link Extraction: java.net.MalformedURLException: unknown protocol: xttp] [2014:06:03:11:28:38]
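
Lines in this log can be split into fields with a short script, which is handy when counting errors by type. The following is a minimal sketch only: the line shape (E, URL, bracketed reason, bracketed timestamp) is inferred from the sample above rather than taken from a format specification, and the names ERROR_LINE and parse_error_line are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line shape, inferred from the sample: E <url> [<reason>] [<timestamp>]
ERROR_LINE = re.compile(
    r"^E\s+(?P<url>\S+)\s+\[(?P<reason>.*)\]\s+\[(?P<when>[\d:]+)\]$"
)

def parse_error_line(line):
    """Return (url, reason, datetime) for an error line, or None if it does not match."""
    match = ERROR_LINE.match(line.strip())
    if match is None:
        return None
    # Timestamps in the sample look like 2014:06:16:09:57:53.
    when = datetime.strptime(match.group("when"), "%Y:%m:%d:%H:%M:%S")
    return match.group("url"), match.group("reason"), when

sample = """\
  E http://www.funnelback.com/missing-page [404 Not Found] [2014:06:16:09:57:53]
  E http://www.funnelback.com/popular-page [Net Error: Read timed out] [2014:06:03:10:11:16]
  E xttp://www.funnelback.com/ [Link Extraction: java.net.MalformedURLException: unknown protocol: xttp] [2014:06:03:11:28:38]
"""

parsed = [p for p in map(parse_error_line, sample.splitlines()) if p]

# Use the text before the first colon of the reason as a rough error category.
by_category = Counter(reason.split(":", 1)[0] for _, reason, _ in parsed)
for category, count in by_category.items():
    print(category, count)
```

Grouping on the text before the first colon is a rough heuristic; HTTP errors end up keyed by their full status text (for example "404 Not Found").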


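Records all redirects encountered during the crawl, including HTTP redirects, HTML meta refreshes, canonical URL references, server aliases and duplicate documents. The first column of each line identifies the redirect type.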

  # H = HTTP Redirect, M = Meta Refresh (HTML) Redirect, A = Aliased Server URL, D = Duplicate (based on MD5 of extracted text), C = Canonical Link Directive
  A http://www2.funnelback.com/ -> http://www.funnelback.com/
  H http://www.funnelback.com/rss -> http://www.funnelback.com/feeds/rss
  D http://www.funnelback.com/news/latest -> http://www.funnelback.com/news
  C http://www.funnelback.com/news?id=1234 -> http://www.funnelback.com/news/2014/03/02/-title
  M http://www.funnelback.com/searchbetter -> http://www.funnelback.com/files/white-paper/search-better.pdf
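
The type codes make this log easy to filter, for example to list only canonical or duplicate mappings. The sketch below assumes every data line has the shape shown in the sample (a one-letter code, the source URL, " -> ", then the target URL); REDIRECT_TYPES and parse_redirect_line are hypothetical names, and the code meanings are copied from the comment line in the sample.

```python
# Code meanings as given in the comment line of the sample log.
REDIRECT_TYPES = {
    "H": "HTTP redirect",
    "M": "meta refresh (HTML) redirect",
    "A": "aliased server URL",
    "D": "duplicate (based on MD5 of extracted text)",
    "C": "canonical link directive",
}

def parse_redirect_line(line):
    """Return (code, source, target), or None for comments, blanks and unknown shapes."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split(None, 1)
    if len(parts) != 2 or parts[0] not in REDIRECT_TYPES:
        return None
    source, arrow, target = parts[1].partition(" -> ")
    if not arrow:
        return None
    return parts[0], source.strip(), target.strip()

sample = [
    "# H = HTTP Redirect, M = Meta Refresh (HTML) Redirect, ...",
    "  A http://www2.funnelback.com/ -> http://www.funnelback.com/",
    "  H http://www.funnelback.com/rss -> http://www.funnelback.com/feeds/rss",
    "  D http://www.funnelback.com/news/latest -> http://www.funnelback.com/news",
]

for entry in filter(None, map(parse_redirect_line, sample)):
    code, source, target = entry
    print(f"{REDIRECT_TYPES[code]}: {source} -> {target}")
```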


Lists each individual sub-domain encountered during the crawl, along with the number of documents left in its frontier at the end of the crawl and the number of documents stored.

  # Server Frontier Stored
  http://www.funnelback.com/ 3 456
  http://docs.funnelback.com/ 0 1234
  https://docs.funnelback.com/ 0 1


The same as servers.log, but aggregated per domain: the frontier and stored counts of all sub-domains are summed into a single line for each domain.

  # Domain Frontier Stored
  funnelback.com 3 1691
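
The relationship between the per-server and per-domain figures can be checked by summing the servers.log sample by hand or with a short script. The sketch below is illustrative only: aggregate_by_domain is a hypothetical helper, and it assumes the domain is simply the last two labels of the host name, which is not necessarily how the crawler derives it and would be wrong for suffixes such as .co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def aggregate_by_domain(server_lines):
    """Sum the frontier and stored counts of servers.log-style lines per domain."""
    totals = defaultdict(lambda: [0, 0])
    for line in server_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        server, frontier, stored = line.split()
        host = urlsplit(server).hostname
        # Assumption: the domain is the last two labels of the host name.
        domain = ".".join(host.split(".")[-2:])
        totals[domain][0] += int(frontier)
        totals[domain][1] += int(stored)
    return {domain: tuple(counts) for domain, counts in totals.items()}

servers_sample = [
    "# Server Frontier Stored",
    "http://www.funnelback.com/ 3 456",
    "http://docs.funnelback.com/ 0 1234",
    "https://docs.funnelback.com/ 0 1",
]

print(aggregate_by_domain(servers_sample))
```

For the sample data this yields a frontier count of 3 and a stored count of 1691 for funnelback.com, matching the domain line shown above.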

See also