crawler.classes.Frontier
Background
The web crawler’s frontier is the queue of URLs that are waiting to be processed. The following are the main frontier types available:
- com.funnelback.common.frontier.MultipleRequestsFrontier: (default) Supports sending multiple parallel requests to servers, as specified in the site profiles configuration.
- com.funnelback.common.frontier.HostFrontier: Uses a separate queue for each host. The host frontier guarantees that the crawler never has more than one active request to any host at any time.
- com.funnelback.common.frontier.FIFOFrontier: A first-in first-out (FIFO) management scheme. This frontier allows the crawler to create multiple simultaneous connections to a target host. For example, using the FIFO frontier with 5 crawler threads (num_crawlers=5) means that a host could receive up to 5 simultaneous requests from Funnelback at any given time.
The FIFO frontier can generate heavy load on a target web server.
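The difference between the FIFO and host frontiers can be illustrated with a small model. The following is a minimal sketch only, not Funnelback's implementation: the class names FifoFrontierModel and HostFrontierModel, the helper simultaneous_requests, and the example.com URLs are all hypothetical, and the real frontier classes are Java classes whose internals are not described here.

```python
from collections import deque
from urllib.parse import urlsplit


class FifoFrontierModel:
    """Hypothetical toy FIFO frontier: URLs are handed out in arrival order."""

    def __init__(self):
        self._queue = deque()

    def add(self, url):
        self._queue.append(url)

    def next_url(self):
        # No per-host check: any idle crawler thread gets the next URL,
        # even if another thread is already talking to the same host.
        return self._queue.popleft() if self._queue else None


class HostFrontierModel:
    """Hypothetical toy host frontier: one queue per host, one active request per host."""

    def __init__(self):
        self._queues = {}   # host -> deque of pending URLs
        self._busy = set()  # hosts with a request currently in flight

    def add(self, url):
        host = urlsplit(url).netloc
        self._queues.setdefault(host, deque()).append(url)

    def next_url(self):
        # Hand out a URL only for a host that has no request in flight.
        for host, queue in self._queues.items():
            if queue and host not in self._busy:
                self._busy.add(host)
                return queue.popleft()
        return None

    def done(self, url):
        # Called when the request for `url` has finished.
        self._busy.discard(urlsplit(url).netloc)


def simultaneous_requests(frontier, num_crawlers):
    """Collect the URLs that num_crawlers idle threads would fetch at the same moment."""
    batch = []
    for _ in range(num_crawlers):
        url = frontier.next_url()
        if url is not None:
            batch.append(url)
    return batch


urls = [f"https://example.com/page{i}" for i in range(5)] + ["https://example.org/"]
fifo, host = FifoFrontierModel(), HostFrontierModel()
for u in urls:
    fifo.add(u)
    host.add(u)

fifo_batch = simultaneous_requests(fifo, 5)  # five URLs, all on example.com
host_batch = simultaneous_requests(host, 5)  # one URL per host, so only two
```

In this model, five crawler threads drawing from the FIFO frontier all hit example.com at once, while the host frontier releases only one URL per host until done() is called for that host.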
Setting the key
Set this configuration key in the search package or data source configuration.
Use the configuration key editor to add or edit the crawler.classes.Frontier key, and set the value. This can be set to any valid String value.
Default value
crawler.classes.Frontier=com.funnelback.common.frontier.MultipleRequestsFrontier:com.funnelback.common.frontier.DiskFIFOFrontier:1000
This specifies that a MultipleRequestsFrontier should be used, which in turn uses disk-based FIFOFrontiers of size 1000 each. Here, size refers to the number of URLs per frontier; that is, in the example above each disk-based FIFOFrontier stores up to 1000 URLs. When a frontier fills up, a new one is created, so that all the URLs in the entire frontier are stored in a chain of backing frontiers.
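The structure of the default value and the chaining behaviour can be sketched as follows. This is an illustrative model under stated assumptions, not the product's code: parse_frontier_value and ChainedFrontierModel are hypothetical names, the colon-separated "outer:backing:size" layout is inferred from the default value shown above, and the backing frontiers here are held in memory rather than on disk.

```python
from collections import deque

DEFAULT_VALUE = (
    "com.funnelback.common.frontier.MultipleRequestsFrontier:"
    "com.funnelback.common.frontier.DiskFIFOFrontier:1000"
)


def parse_frontier_value(value):
    """Split 'outer:backing:size' into its parts (layout assumed from the default value)."""
    outer, backing, size = value.split(":")
    return outer, backing, int(size)


class ChainedFrontierModel:
    """Hypothetical in-memory stand-in for a chain of fixed-size backing frontiers."""

    def __init__(self, size):
        self.size = size
        self.chain = deque([deque()])  # start with one empty backing frontier

    def add(self, url):
        # When the newest backing frontier is full, start a new one.
        if len(self.chain[-1]) >= self.size:
            self.chain.append(deque())
        self.chain[-1].append(url)

    def next_url(self):
        # Drop exhausted backing frontiers at the head of the chain.
        while len(self.chain) > 1 and not self.chain[0]:
            self.chain.popleft()
        return self.chain[0].popleft() if self.chain[0] else None


outer, backing, size = parse_frontier_value(DEFAULT_VALUE)
frontier = ChainedFrontierModel(size)
for i in range(2500):
    frontier.add(f"https://example.com/{i}")

backing_sizes = [len(q) for q in frontier.chain]  # [1000, 1000, 500]
```

With a size of 1000, queueing 2500 URLs produces three backing frontiers in this model: two full ones and a third holding the remaining 500.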