Web crawler - common URL-related error messages
When debugging web crawls you may see some log messages similar to the following in the crawler thread (crawl.log.X) log files:
- Rejected: <URL>
-
This usually means that the URL was rejected due to a match against robots.txt.
- Unacceptable: <URL>
-
This usually means that the URL was rejected due to a match against an exclude pattern or not matching any of the include patterns.
- Unwanted type: [type] <URL>
-
This means that the URL was rejected due to a match against an unwanted MIME type.
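If you want a rough breakdown of why URLs are being dropped, these messages can be tallied directly from the thread logs. The following is a minimal sketch in Python, assuming the logs are plain text and that the path and glob pattern used below (log/crawl.log.*) match where your crawl.log.X files actually live; adjust both for your own setup.

    # Tally "Rejected", "Unacceptable" and "Unwanted type" messages across
    # the crawler thread logs. The log location is an assumption; point the
    # glob at wherever your crawl.log.X files are kept.
    import glob
    from collections import Counter

    MARKERS = ("Rejected:", "Unacceptable:", "Unwanted type:")

    counts = Counter()
    for path in glob.glob("log/crawl.log.*"):
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                for marker in MARKERS:
                    if marker in line:
                        counts[marker] += 1
                        break

    for marker, n in counts.most_common():
        print(f"{n:8d}  {marker}")

The counts are usually enough to tell whether the next thing to check is robots.txt, the include/exclude patterns, or the MIME type configuration.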
The url_errors.log file reports specific errors that occurred (such as HTTP errors) when accessing URLs. The log lines follow formats similar to those below and are fairly self-explanatory:
- E <URL> [Exceeds max_download_size: <SIZE>] [<TIMESTAMP>]
-
This means that the document was skipped because it was larger than the configured maximum file size (default is 10MB). The crawler.max_download_size setting can be used to increase this.
- E <URL> [Can’t scan root page] [<TIMESTAMP>]
-
This means that the crawler was unable to scan the root page because it was prevented from fetching it.
- E <URL> [Can’t get text signature] [<TIMESTAMP>]
-
Indicates that no text could be extracted when parsing the contents of the URL, meaning that the crawler didn’t find any content in the page to index.
- E <URL> [<HTTP STATUS CODE>] [<TIMESTAMP>]
-
A 4xx or 5xx HTTP status code was returned when fetching this page.
- Address <URL> exceeds maximum URL length
-
Indicates this URL was not gathered because the URL was too long. The default limit of 256 characters can be changed by adjusting the crawler.max_url_length setting.
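The error lines above share the same basic shape (E <URL> [reason] [timestamp]), so a short script can summarise which errors dominate a crawl. The following is a minimal sketch based only on the examples shown here; the regular expression and the log path are assumptions rather than an official parser.

    # Summarise url_errors.log by error reason, based on the
    # "E <URL> [reason] [timestamp]" shape shown above.
    import re
    from collections import Counter

    LINE_RE = re.compile(r"^E\s+(\S+)\s+\[([^\]]+)\]\s+\[[^\]]+\]")

    reasons = Counter()
    with open("log/url_errors.log", encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINE_RE.match(line.strip())
            if m:
                reasons[m.group(2)] += 1

    for reason, n in reasons.most_common(10):
        print(f"{n:8d}  {reason}")

HTTP status codes, size limits and the other reasons listed above all appear as the bracketed reason, so the output shows at a glance whether, for example, many documents are hitting the crawler.max_download_size limit.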
Crawler trap prevention (restricted areas)
There are various conditions that are set up to prevent the web crawler from getting into a crawler trap, which occurs when the crawler gets into a situation where it might follow links indefinitely. This usually happens on generated pages. An example of a crawler trap is a calendar, where each page links to the next and previous months, and these pages are generated from the current month with nothing to limit how far in time you can browse.
The restricted area messages indicate crawling of the identified URL was halted to prevent a crawler trap.
- New Restricted Area (size)
-
Indicates crawling of files starting with this URL prefix was halted because it is likely to be a crawler trap (such as a calendar). By default, 10,000 documents from a folder (or passed as parameters to a URL) will be stored before the URL is restricted. This can be adjusted using the crawler.max_files_per_area setting.
- New Restricted Area (max dir depth)
-
Indicates that a maximum directory/folder depth was reached and crawling of documents within these folders was restricted. By default, Funnelback will crawl documents up to 15 directories deep before restricting the crawling. The crawler.max_directory_depth setting can be used to adjust this limit.
- New Restricted Area (repeating elements)
-
Indicates a likely crawler trap due to repeating elements was detected and crawling of URLs matching this element was halted. This can occur in dynamically generated sites where you might end up with URLs like http://example.com/a/a/a/a/a/a/a/a/a/a (a sketch of this kind of check is shown after this list). By default, a path element can repeat five times before it is restricted. The crawler.max_url_repeating_elements setting can be used to adjust this limit.
- New Restricted Server
-
Indicates crawling of URLs associated with this server has been halted because the document limit for this server has been reached. The crawler.max_files_per_server setting can adjust this value. By default, no limit is applied to the number of documents for any given server (apart from any configured limit for the data source, or an overall document limit for the license).
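To make the repeating-elements restriction more concrete, the following is a small illustrative sketch of that kind of check: it flags a URL when any path segment occurs more than an allowed number of times. The threshold mirrors the default of five mentioned above, but the logic is an assumption for illustration and not the crawler’s actual implementation.

    # Illustrative check for repeating path elements in a URL. Not the
    # crawler's real implementation -- it simply counts how often each
    # path segment occurs and flags the URL when any segment exceeds
    # the allowed number of repeats.
    from collections import Counter
    from urllib.parse import urlparse

    def has_repeating_elements(url: str, max_repeats: int = 5) -> bool:
        segments = [s for s in urlparse(url).path.split("/") if s]
        return any(n > max_repeats for n in Counter(segments).values())

    print(has_repeating_elements("http://example.com/a/b/c"))            # False
    print(has_repeating_elements("http://example.com/a/a/a/a/a/a/a/a"))  # True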