crawler.classes.Frontier

Background

The web crawler’s frontier is the queue of URLs that are waiting to be processed. The following are the main frontier types available:

  • com.funnelback.common.frontier.MultipleRequestsFrontier: (default) Supports sending multiple parallel requests to a server, as specified in the site profiles configuration.

  • com.funnelback.common.frontier.HostFrontier: Uses separate queues for each host. The host frontier guarantees that the crawler will never have more than one active request with any host at any time.

  • com.funnelback.common.frontier.FIFOFrontier: A first-in first-out (FIFO) management scheme. This frontier allows the crawler to open multiple simultaneous connections to a target host. For example, with 5 crawler threads (num_crawlers=5), a single host could receive up to 5 simultaneous requests from Funnelback at any given time.

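The FIFO frontier can generate heavy load on a target web server, because nothing limits the number of simultaneous requests sent to a single host other than the number of crawler threads.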
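The difference between a per-host frontier and a plain FIFO queue can be illustrated with a small sketch. This is a toy model in Python, not Funnelback's implementation: the class PerHostFrontier, its methods, and the example URLs are all hypothetical and exist only to show the "one active request per host" guarantee described for the host frontier.

```python
from collections import deque
from urllib.parse import urlsplit


class PerHostFrontier:
    """Toy model (hypothetical): one FIFO queue per host,
    with at most one active request per host."""

    def __init__(self):
        self.queues = {}      # host -> deque of pending URLs
        self.active = set()   # hosts that currently have a request in flight

    def add(self, url):
        host = urlsplit(url).netloc
        self.queues.setdefault(host, deque()).append(url)

    def next_url(self):
        """Return a URL for an idle host, or None if every pending host is busy."""
        for host, queue in self.queues.items():
            if queue and host not in self.active:
                self.active.add(host)
                return queue.popleft()
        return None

    def done(self, url):
        """Mark the request for this URL as finished, freeing its host."""
        self.active.discard(urlsplit(url).netloc)


frontier = PerHostFrontier()
for u in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    frontier.add(u)

first = frontier.next_url()    # a.example is idle, so its first URL is handed out
second = frontier.next_url()   # a.example is busy, so b.example is served instead
third = frontier.next_url()    # only a.example/2 is pending and a.example is busy
frontier.done(first)           # the request to a.example finishes
fourth = frontier.next_url()   # a.example is idle again
```

A plain FIFO queue would have handed out http://a.example/2 as the second URL, which is how several crawler threads can end up requesting from the same host at once.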

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the crawler.classes.Frontier key, and set the value. This can be set to any valid String value.

Default value

crawler.classes.Frontier=com.funnelback.common.frontier.MultipleRequestsFrontier:com.funnelback.common.frontier.DiskFIFOFrontier:1000

This specifies that a MultipleRequestsFrontier should be used, which in turn makes use of disk-based FIFO frontiers (DiskFIFOFrontier) of size 1000. Here, size refers to the number of URLs per frontier: in the example above, each disk-based FIFO frontier stores up to 1000 URLs. When a frontier fills up, a new one is created, so that all the URLs in the overall frontier are stored in a chain of backing frontiers.
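The chaining behaviour can be sketched as follows. This is an in-memory Python illustration only: the real backing frontiers are disk-based, and the names parse_frontier_value and ChainedFrontier are hypothetical. The sketch also assumes that the value is always made up of three colon-separated parts (outer frontier class, backing frontier class, size), which is inferred from the default value rather than documented here.

```python
from collections import deque


def parse_frontier_value(value):
    """Split 'Outer:Backing:size' into its parts.
    The three-part format is an assumption based on the default value."""
    outer, backing, size = value.split(":")
    return outer, backing, int(size)


class ChainedFrontier:
    """Toy model (hypothetical): a FIFO queue stored as a chain
    of fixed-capacity backing queues."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.segments = [deque()]

    def add(self, url):
        if len(self.segments[-1]) >= self.segment_size:
            self.segments.append(deque())  # last backing queue is full: start a new one
        self.segments[-1].append(url)

    def pop(self):
        while len(self.segments) > 1 and not self.segments[0]:
            self.segments.pop(0)  # drop an exhausted backing queue from the head
        return self.segments[0].popleft() if self.segments[0] else None


value = ("com.funnelback.common.frontier.MultipleRequestsFrontier:"
         "com.funnelback.common.frontier.DiskFIFOFrontier:1000")
outer, backing, size = parse_frontier_value(value)

frontier = ChainedFrontier(size)
for i in range(2500):
    frontier.add(f"http://example.com/page{i}")

# 2500 URLs with a size of 1000 need three backing queues: 1000 + 1000 + 500
segment_lengths = [len(s) for s in frontier.segments]
oldest = frontier.pop()  # FIFO order is preserved across the chain
```

With a size of 1000, adding 2500 URLs produces two full backing queues and a third holding the remaining 500, while URLs are still returned in the order they were added.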