Crawl runs fine for a while then everything times out
This article provides advice on investigating a crawl that runs normally then starts to time out.
This could be caused by a number of things including:
- The internet connection goes down during the crawl. Try to run another update later and see if the same behavior is exhibited.
- A network device or policy within your network infrastructure may detect the crawl as a denial of service attack, or there might be a policy that only allows a certain number of requests from an IP address within a given time period. When this happens the request packets are often silently dropped, resulting in request timeouts. You'll see perfectly normal requests in the log file, then suddenly all requests (usually to the same domain) will time out for the rest of the crawl. Try increasing the crawler's request delay (set `crawler.request_delay` in the data source configuration) and then run another crawl (note that you might have to wait for a period of time before you'll be allowed to crawl again). If this is successful, try to get the Funnelback crawler's IP address (or user agent) added to the allow list for the network device. If this is not possible you can keep using the configuration option to set the request delay, or, if it's a multi-domain crawl and you only want to increase the delay for the affected domain, you can use site profiles to set the delay for that domain only (a configuration sketch is provided after this list). Note that increasing `crawler.request_delay` on a large multi-domain crawl will increase the time it takes to gather all the content.
- A session cookie required for the crawl has expired. If you are using form-based authentication you may need to set the `crawler.monitor_authentication_cookie_renewal_interval` option in the data source configuration (see the sketch after this list).
- The crawler has run out of memory. Check the web crawler logs for out of memory errors. If you've run out of memory you may need to increase the allocated memory by setting `gather.max_heap_size` in the data source configuration (see the example after this list).
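As a sketch of the rate-limiting remedy above, the request delay is a key-value setting in the data source configuration. The value of 2000 below is purely illustrative, and the assumption that the delay is expressed in milliseconds should be confirmed against the documentation for your Funnelback version:

```
# Illustrative only: slow the crawler down between requests.
# Assumes the delay is in milliseconds; 2000 is an example value, not a recommendation.
crawler.request_delay=2000
```

On a large multi-domain crawl this delay applies to every site unless you scope it to the affected domain with a site profile, which is why the overall gather time increases.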
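For the expired session cookie case, a minimal sketch of the corresponding setting is below. Both the value and its unit are assumptions for illustration; check the option's documentation before relying on either:

```
# Illustrative only: renew the form-based authentication cookie periodically
# during the crawl. The value (and its unit) shown here are assumed.
crawler.monitor_authentication_cookie_renewal_interval=30
```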
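For the out-of-memory case, a sketch of raising the crawler's heap allocation is below. The figure of 2048 is an example only, and the assumption that the option is expressed in megabytes should be verified for your version:

```
# Illustrative only: give the web crawler a larger Java heap.
# Assumes the value is in megabytes; size it to the memory available on the server.
gather.max_heap_size=2048
```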