crawler.max_files_per_area

Background

This option limits the number of files the crawler will download from a single area. An "area" is either a static directory or a generator, e.g. index.asp?doc=123.

This parameter was previously called crawler.max_dir_size; the name was changed to reflect that generators are also covered by the limit.

If the crawler encounters an area on a site with the URL:

http://www.example.com/white_papers/

and downloads multiple files from within this directory, it will stop downloading any further content from the directory once the configured limit is reached.
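The per-directory limit can be illustrated with a short Python sketch. The area key and counting structure here are purely illustrative, not the crawler's actual implementation:

```python
from collections import defaultdict
from urllib.parse import urlsplit

MAX_FILES_PER_AREA = 10000  # mirrors the crawler.max_files_per_area setting

area_counts = defaultdict(int)  # files downloaded so far, per area

def directory_area(url):
    """Illustrative area key: the host plus the parent directory path."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return parts.netloc + directory

def should_download(url):
    """True while the URL's area is still under the configured limit."""
    area = directory_area(url)
    if area_counts[area] >= MAX_FILES_PER_AREA:
        return False
    area_counts[area] += 1
    return True
```

With this sketch, every file under http://www.example.com/white_papers/ counts against the same area, and should_download starts returning False once that area reaches the limit.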

A similar approach is used for generators. For example, if the crawler encounters a generator such as:

http://www.example.com/index.asp?doc=1

and has downloaded multiple URLs produced by this index.asp script, it will stop requesting further content from this generator once the limit is reached.
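For generators, one plausible way to group URLs into an area is to drop the query string, so every URL produced by the same script maps to the same area. This is an illustrative sketch, not the crawler's actual grouping rule:

```python
from urllib.parse import urlsplit

def generator_area(url):
    """Illustrative area key for generators: the host plus the script
    path, with the query string discarded."""
    parts = urlsplit(url)
    return parts.netloc + parts.path
```

Under this rule, index.asp?doc=1, index.asp?doc=2 and so on all count against the single index.asp area.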

Lotus Notes generator scripts (.nsf) produce URLs that look like directories, e.g.

http://www.example.com/publish.nsf/content/doc123/

In this example, if "publish.nsf" generates more files than the limit allows, the crawler will not request further content from it, even though the URL suggests there are separate directories or areas beneath it.
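The .nsf case can be sketched by truncating the path at the first .nsf segment, so everything the script generates counts as one area regardless of the directory-like path that follows. Again, this is an illustration of the idea, not the crawler's implementation:

```python
from urllib.parse import urlsplit

def nsf_area(url):
    """Illustrative: truncate the path at the first .nsf segment, so all
    content generated by that script falls into a single area."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        if segment.lower().endswith(".nsf"):
            return parts.netloc + "/".join(segments[:i + 1])
    # No .nsf segment: fall back to the full path
    return parts.netloc + parts.path
```

Here both .../publish.nsf/content/doc123/ and .../publish.nsf/content/doc999/ resolve to the same publish.nsf area.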

If you are crawling a dynamically generated site where a large amount of content comes from a single generator, and you are not getting back as much content as you expect, you may need to increase the default value of this parameter.
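For example, to allow up to 50,000 files per area (the value here is purely illustrative), set:

```
crawler.max_files_per_area=50000
```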

Setting the key

Set this configuration key in the search package or data source configuration.

Use the configuration key editor to add or edit the crawler.max_files_per_area key, and set the value. This can be set to any valid Integer value.

Default value

crawler.max_files_per_area=10000

See also