crawler.classes.Frontier
Specifies the Java class used for the frontier (a list of URLs not yet visited).
Key: crawler.classes.Frontier
Type: String
Can be set in: collection.cfg
Description
The web crawler’s frontier is the queue of URLs that are waiting to be processed. The following are the main frontier types available:
- com.funnelback.common.frontier.MultipleRequestsFrontier: (default) Supports sending multiple parallel requests to servers, as specified in the site profiles configuration.
- com.funnelback.common.frontier.HostFrontier: Uses a separate queue for each host. The host frontier guarantees that the crawler will never have more than one active request with any host at any time.
- com.funnelback.common.frontier.FIFOFrontier: A first-in first-out (FIFO) management scheme. Using this frontier allows the crawler to create multiple simultaneous connections to a target host. Using the FIFO frontier with 5 crawler threads (num_crawlers=5) means that a host could receive up to 5 simultaneous requests from Funnelback at any given time.
Note: The FIFO frontier can generate heavy load on a target web server.
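The difference between a per-host frontier and a plain FIFO frontier can be illustrated with a minimal sketch. This is not Funnelback's implementation: the class name PerHostFrontierSketch and its methods add, next and done are hypothetical, and the sketch only assumes the behaviour described above, namely that a host with a request in flight receives no further URLs until that request completes.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical illustration only; not the Funnelback HostFrontier class. */
class PerHostFrontierSketch {
    // One FIFO queue per host, kept in first-seen order.
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();
    // Hosts that currently have a request in flight.
    private final Set<String> busyHosts = new HashSet<>();

    /** Queues a URL under its host. */
    void add(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    /** Returns the next URL whose host is idle, or null if there is none. */
    String next() {
        for (Map.Entry<String, Queue<String>> entry : queues.entrySet()) {
            boolean idle = !busyHosts.contains(entry.getKey());
            if (idle && !entry.getValue().isEmpty()) {
                busyHosts.add(entry.getKey());
                return entry.getValue().poll();
            }
        }
        return null;
    }

    /** Marks the request for this URL as finished, freeing its host. */
    void done(String url) {
        busyHosts.remove(URI.create(url).getHost());
    }
}
```

With two URLs queued for one host and one URL for another, two consecutive calls to next() return URLs for different hosts, and the second URL of the first host is only handed out after done() is called for the first. A plain FIFO queue has no such check, which is why several crawler threads can hit the same host at once.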
Default Value
crawler.classes.Frontier=com.funnelback.common.frontier.MultipleRequestsFrontier:com.funnelback.common.frontier.DiskFIFOFrontier:1000
This specifies that a MultipleRequestsFrontier should be used, which in turn makes use of disk-based FIFOFrontiers of size 1000 each. Here, size refers to the number of URLs per frontier: in the example above, each disk-based FIFOFrontier stores up to 1000 URLs. When a frontier fills up, a new one is created, so that all the URLs in the entire frontier are stored in a chain of backing frontiers.
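The chaining of fixed-size backing frontiers can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions, not Funnelback's code: the class ChainedFrontierSketch and its methods are hypothetical, and each in-memory segment stands in for one disk-based frontier.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; segments stand in for disk-based frontiers. */
class ChainedFrontierSketch {
    private final int segmentSize;
    // Chain of segments; URLs are read from the head and appended to the tail.
    private final Deque<Deque<String>> chain = new ArrayDeque<>();

    ChainedFrontierSketch(int segmentSize) {
        if (segmentSize <= 0) {
            throw new IllegalArgumentException("segmentSize must be positive");
        }
        this.segmentSize = segmentSize;
    }

    /** Appends a URL, starting a new segment when the tail segment is full. */
    void add(String url) {
        Deque<String> tail = chain.peekLast();
        if (tail == null || tail.size() >= segmentSize) {
            tail = new ArrayDeque<>();
            chain.addLast(tail);
        }
        tail.addLast(url);
    }

    /** Removes and returns the oldest URL, or null if the frontier is empty. */
    String poll() {
        Deque<String> head = chain.peekFirst();
        if (head == null) {
            return null;
        }
        String url = head.pollFirst();
        if (head.isEmpty()) {
            chain.removeFirst(); // drop exhausted segments from the chain
        }
        return url;
    }

    /** Number of segments currently in the chain. */
    int segmentCount() {
        return chain.size();
    }
}
```

With a segment size of 1000, adding 2500 URLs to this sketch produces a chain of three segments (1000, 1000 and 500 URLs), and URLs are still returned in the order in which they were added.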