Troubleshooting web data source gathering
The web crawler is used by web data sources to gather content via HTTP requests. It is mostly used to gather content from websites, but can also gather content from REST APIs and other content delivered via HTTP URLs.
General advice
Key log files
- The crawler.log captures the top-level information for the web crawler and is a good place to start when debugging issues with a web data source.
- The crawler-central.log is useful when debugging issues relating to filtering.
- The url_errors.log captures most of the errors that were logged during a web crawl. Note: these are normally non-fatal (see below).
- The crawl.<N>.log files capture the thread-level messages recorded by the web crawler and contain the most detail.
See: Web gatherer log files below for details on the full set of web crawler log files.
Non-fatal errors
The web crawler will encounter many non-fatal errors during a web crawl. These errors will not cause a web data source to record a failed update.
Non-fatal errors include:
- HTTP requests that result in a non-200 status code being returned
- Documents that failed to filter correctly
- Documents that are larger than the crawler's configured maximum download size
- Crawler traps
These are mostly logged to the url_errors.log.
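When a crawl stores fewer documents than expected, a quick tally of url_errors.log often shows which kind of non-fatal error dominates. The following is a minimal sketch using standard shell tools; the exact line format of url_errors.log varies between versions, so the pattern (here, any three-digit number, intended to pick up HTTP status codes) is an assumption you should adjust to match your logs:

```
# Tally three-digit numbers (intended: HTTP status codes) in url_errors.log.
# The regex is a guess at the log format - adjust it to suit your version.
grep -oE '\b[0-9]{3}\b' url_errors.log | sort | uniq -c | sort -rn | head
```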
Fatal errors
The following errors are considered fatal and will result in an update failure:
- Storing 0 documents: see URLStore reported no documents stored
- Web crawler running out of memory: see Gatherer out of memory: java.lang.OutOfMemoryError: Java heap space
Debug API
The debug API allows you to trace an HTTP request made by the web crawler. It follows redirects and shows all the HTTP headers and responses, along with the returned data. This is very helpful for investigating issues with form-based authentication, as well as general issues relating to web crawler requests.
The debug API is available as part of the Funnelback admin API, accessed from the API-UI link in the menu.
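Outside of the admin API, you can approximate the same trace with curl: the sketch below follows redirects and prints the request and response headers, which is often enough to spot redirect loops or unexpected status codes. The user agent shown is a placeholder, not the crawler's real value; copy the actual user agent from your data source configuration if you want to reproduce the crawler's requests faithfully.

```
# Follow redirects (-L), show request/response headers (-v), discard the body.
# "ExampleCrawler/1.0" is a placeholder user agent - substitute the value
# your data source is configured to send.
curl -v -L -A 'ExampleCrawler/1.0' -o /dev/null 'https://example.com/some/page'
```

Note that this only approximates the debug API; for form-based authentication issues the debug API remains the better tool, since it uses the crawler's own request logic.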
Web gatherer log files
The following log files are produced by the web crawler.
binaries.log
- Location: Data source - live/offline logs
- Purpose: List of binary files stored (e.g. PDF, DOC, etc.). Appended to on restart from checkpoint.
copied_urls.log
- Location: Data source - live/offline logs
- Purpose: All URLs whose content was copied from the previous crawl, as they had not changed and so were not downloaded again. Only output if crawler.incremental_logging=true.
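Since copied_urls.log only appears when incremental logging is enabled, a minimal configuration sketch (assuming the usual key=value data source configuration syntax) is:

```
# Enable incremental logging so copied_urls.log and new_urls.log are written.
crawler.incremental_logging=true
```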
crawler.log
- Location: Data source - live/offline logs
- Purpose: Contains filter messages.
crawler_logs_checkpoint_sizes.dat
- Location: Data source - live/offline logs
- Purpose: Records the sizes of crawler logs at the time a checkpoint occurred. This allows truncating them back to that size if the crawler is restarted from a checkpoint, to avoid inconsistency.
gather.log
- Location: Data source - live/offline logs
- Purpose: Top-level log for the web crawler.
crawl.<N>.log
- Location: Data source - live/offline logs
- Purpose: Logs individual messages for each running crawler thread. One log per thread.
domains.log
- Location: Data source - live/offline logs
- Purpose: Frontier and stored document counts for each domain encountered during the crawl.
frontier_dump.log
- Location: Data source - live/offline logs
- Purpose: Contains a dump of the crawl frontier (the set of known but not yet crawled URLs).
headers.log
- Location: Data source - live/offline logs
- Purpose: Captures HTTP headers recorded during the crawl. Only output if crawler.header_logging=true.
manifest.txt
- Location: Data source - live/offline logs
- Purpose: Records the order in which bundles were created.
monitor.log
- Location: Data source - live/offline logs
- Purpose: Records various crawl statistics.
new_urls.log
- Location: Data source - live/offline logs
- Purpose: Lists new URLs, where a new URL is defined as one which was not stored in the previous crawl. Only output if crawler.incremental_logging=true.
redirects.txt
- Location: Data source - live/offline logs
- Purpose: Captures redirects. Appended to on restart from checkpoint.
servers.log
- Location: Data source - live/offline logs
- Purpose: Frontier and stored document counts for each server encountered during the crawl.
stored.log
- Location: Data source - live/offline logs
- Purpose: All URLs stored during a crawl (also used for refresh updates), in chronological order. Appended to on restart from checkpoint.
store-messages.log
- Location: Data source - live/offline logs
- Purpose: Records URLs stored into a WARC/Mirror store, as well as edit distance calculation logs and certain error/warning states of the MirrorStore.
url_errors.log
- Location: Data source - live/offline logs
- Purpose: Records errors while processing URLs.
url_no_content.log
- Location: Data source - live/offline logs
- Purpose: Contains URLs which were stored despite having no content. Documents with no content are usually the result of a filter returning an empty document. These URLs are only stored (and logged) when crawler.store_empty_content_urls=true.
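As with incremental logging, the headers.log and url_no_content.log files described above only appear when their corresponding settings are enabled. A minimal sketch, again assuming key=value data source configuration syntax:

```
# Write HTTP headers seen during the crawl to headers.log.
crawler.header_logging=true
# Store URLs with empty filtered content and list them in url_no_content.log.
crawler.store_empty_content_urls=true
```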
url_titles.log
- Location: Data source - live/offline logs
- Purpose: Lists URLs and their titles.
BroadMIMETypeStatistic.stat
- Location: Data source - live/offline logs
- Purpose: As for MIMETypeStatistic.stat below, but with the MIME types grouped into broad categories.
BroadWebServerTypeStatistic.stat
- Location: Data source - live/offline logs
- Purpose: Records web server types in general categories (anything after a / or ( is truncated), e.g. it will capture Apache without the version.
CrawlSizeStatistic.stat
- Location: Data source - live/offline logs
- Purpose: Records the total URLs stored by a web crawl.
FileSizeByDocumentTypeStatistic.stat
- Location: Data source - live/offline logs
- Purpose: The file sizes by document type report displays statistics on file sizes found, divided by content type.
FileSizeStatistic.stat
- Location: Data source - live/offline logs
- Purpose: The file sizes report displays statistics on content sizes found.
MIMETypeStatistic.stat
- Location: Data source - live/offline logs
- Purpose: The file types by MIME report displays statistics on document types as reported by the web server. A significant difference between the document types reported here and the document types reported by the types by suffix report may indicate a web server serving documents with an incorrect content type.
ReferencedFileTypeStatistic.stat
- Location: Data source - live/offline logs
- Purpose: Records the file extensions of URLs seen in href/src HTML attributes, even if those URLs would not be crawled.
SuffixTypeStatistic.stat
- Location: Data source - live/offline logs
- Purpose: The file types by suffix report displays statistics on document types as identified by checking the URL suffix (file extension).
URLlengthStatistic.stat
- Location: Data source - live/offline logs
- Purpose: Records the number of URLs of different lengths seen during a crawl.
WebServerTypeStatistic.stat
- Location: Data source - live/offline logs
- Purpose: Records statistics on the web server types (including versions) seen during the crawl.