Troubleshooting web data source gathering

The web crawler is used by web data sources to gather content via HTTP requests. It is mostly used to gather content from websites, but is also used to fetch content from REST APIs and other content delivered via HTTP URLs.

General advice

Key log files

  • The crawler.log captures top-level information for the web crawler and is a good place to start when debugging issues with a web data source.

  • The crawler-central.log is useful when debugging issues relating to filtering.

  • The url_errors.log captures most of the errors logged during a web crawl. Note that these are normally non-fatal (see below).

  • The crawl.log.X.log files capture the thread-level messages recorded by the web crawler and contain the most detail.

See the Web gatherer log files section below for details on the full set of web crawler log files.
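A quick way to get oriented is to check which logs exist for a data source and how large they are. The following minimal Python sketch (an illustration, not a Funnelback tool) lists the crawler logs by size; the directory layout and the example-web data source name are assumptions and should be adjusted for your installation.

    import os
    from pathlib import Path

    # Assumed layout: logs live under the data source's live (or offline)
    # view. Adjust this path to match your installation and data source.
    log_dir = (Path(os.environ.get("SEARCH_HOME", "/opt/funnelback"))
               / "data" / "example-web" / "live" / "log")

    # List each log with its size, largest first, so the busiest logs
    # (often the crawl.log.X.log thread logs and url_errors.log) stand out.
    for path in sorted(log_dir.iterdir(), key=lambda p: p.stat().st_size, reverse=True):
        print(f"{path.stat().st_size:>12,}  {path.name}")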

Non-fatal errors

The web crawler will encounter many non-fatal errors during a web crawl. These errors will not cause a web data source to record a failed update.

Non-fatal errors include:

  • HTTP requests that result in a non-200 status code being returned

  • Documents that failed to filter correctly

  • Documents that are larger than the crawler’s configured maximum download size

  • Crawler traps

These errors are mostly logged to the url_errors.log; a quick way to summarise that log is sketched below.
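When url_errors.log is large, a summary can be more useful than reading it line by line. The Python sketch below tallies lines by HTTP status code. The assumption that each error line contains a three-digit status code is purely illustrative; check a few lines of your own url_errors.log and adjust the pattern to its actual format.

    import re
    from collections import Counter
    from pathlib import Path

    # Illustrative assumption: each error line mentions a three-digit
    # HTTP status code (e.g. 404, 503). Adjust to the real line format.
    STATUS = re.compile(r"\b([1-5]\d{2})\b")

    counts = Counter()
    for line in Path("url_errors.log").read_text(errors="replace").splitlines():
        match = STATUS.search(line)
        if match:
            counts[match.group(1)] += 1

    # Most common status codes first, e.g. "404 1234".
    for code, total in counts.most_common():
        print(code, total)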

Fatal errors

The following errors are considered fatal and will result in an update failure:

Debug API

The debug API allows you to trace an HTTP request made by the web crawler. It follows redirects and shows all the HTTP headers and responses, along with the returned data. This is very helpful when investigating issues with form-based authentication, as well as general issues relating to web crawler requests.

The debug API is available as part of the Funnelback admin API, accessed from the API-UI link in the menu.
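The endpoint details are documented in the admin API and are not reproduced here. As a rough illustration of the kind of trace the debug API provides, the Python sketch below follows redirects manually and prints the status and headers at each hop. It uses the third-party requests library and a placeholder URL, and is not the debug API itself.

    from urllib.parse import urljoin
    import requests

    url = "https://example.com/login"  # placeholder: the URL under investigation
    session = requests.Session()

    # Follow redirects manually so the status and headers of every hop
    # are visible, similar in spirit to the debug API's trace output.
    for hop in range(10):  # cap the hop count in case of a redirect loop
        response = session.get(url, allow_redirects=False, timeout=30)
        print(f"--- hop {hop}: HTTP {response.status_code} for {url}")
        for name, value in response.headers.items():
            print(f"  {name}: {value}")
        if not response.is_redirect:
            print(response.text[:500])  # start of the returned data
            break
        url = urljoin(response.url, response.headers["Location"])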

Web gatherer log files

The following log files are produced by the web crawler.

binaries.log

  • Location: Data source - live/offline logs

  • Purpose: List of binary files stored (e.g. PDF, DOC). Appended to on restart from checkpoint.

copied_urls.log

  • Location: Data source - live/offline logs

  • Purpose: All URLs whose content was copied from the previous crawl because they had not changed and so were not downloaded again. Only output if crawler.incremental_logging=true.

crawler-central.log

  • Location: Data source - live/offline logs

  • Purpose: Contains filter messages.

crawler_logs_checkpoint_sizes.dat

  • Location: Data source - live/offline logs

  • Purpose: Records the sizes of crawler logs at the time a checkpoint occurred. This allows truncating them back to that size if the crawler is restarted from a checkpoint to avoid inconsistency.

crawler.log

  • Location: Data source - live/offline logs

  • Purpose: Top-level log for the web crawler.

crawl.log.X.log

  • Location: Data source - live/offline logs

  • Purpose: Logs individual messages for each running crawler thread. One log per thread.

domains.log

  • Location: Data source - live/offline logs

  • Purpose: Frontier and stored document counts for each domain encountered during the crawl.

frontier_dump.log

  • Location: Data source - live/offline logs

  • Purpose: Contains a dump of the crawl frontier (set of known, uncrawled URLs).

headers.log

  • Location: Data source - live/offline logs

  • Purpose: Captures HTTP headers recorded during the crawl. Only output if crawler.header_logging=true.

manifest.txt

  • Location: Data source - live/offline logs

  • Purpose: Records the order in which bundles were created.

monitor.log

  • Location: Data source - live/offline logs

  • Purpose: Records various crawl statistics.

new_urls.log

  • Location: Data source - live/offline logs

  • Purpose: Lists new URLs, where a new URL is one that was not stored in the previous crawl. Only output if crawler.incremental_logging=true.

redirects.txt

  • Location: Data source - live/offline logs

  • Purpose: Captures redirects encountered during the crawl. Appended to on restart from checkpoint.

servers.log

  • Location: Data source - live/offline logs

  • Purpose: Frontier and stored document counts for each server encountered during the crawl.

stored.log

  • Location: Data source - live/offline logs

  • Purpose: All URLs stored during a crawl (also used for refresh updates), in chronological order. Appended to on restart from checkpoint.

store-messages.log

  • Location: Data source - live/offline logs

  • Purpose: Records URLs stored into a WARC/Mirror store as well as edit distance calculation logs and certain error/warning states of MirrorStore.

url_errors.log

  • Location: Data source - live/offline logs

  • Purpose: Records errors while processing URLs.

url_no_content.log

  • Location: Data source - live/offline logs

  • Purpose: Contains URLs that were stored despite having no content. Documents with no content are usually the result of a filter returning an empty document. These URLs are only stored when crawler.store_empty_content_urls=true.

url_titles.log

  • Location: Data source - live/offline logs

  • Purpose: Lists URLs and their titles.

BroadMIMETypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: As for MIMETypeStatistic.stat (below), but with MIME types grouped into broad categories.

BroadWebServerTypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: Records web server types in broad categories (anything after a / or ( is truncated), e.g. Apache is captured without its version number.

CrawlSizeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: Records the total number of URLs stored by a web crawl.

FileSizeByDocumentTypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: The file sizes by document type report displays statistics on file sizes found, divided by content type.

FileSizeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: The file sizes report displays statistics on content sizes found.

MIMETypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: The file types by MIME report displays statistics on document types as reported by the web server. A significant difference between the document types reported here and the document types reported by the types by suffix report may indicate a web server serving documents with an incorrect content type.

ReferencedFileTypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: Records the file extensions of URLs seen in href/src HTML attributes even if those URLs would not be crawled.

SuffixTypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: The file types by suffix report displays statistics on document types as identified by checking the URL suffix (file extension).

URLlengthStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: Records the number of URLs of different lengths seen during a crawl.

WebServerTypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: Records statistics on the web server types (including versions) seen during the crawl.