Site profiles

Defines web crawler interactions for sites within a web data source.

To access the site profiles configuration, open the data source configuration screen, then select configure site profiles.

Site profiles customize how the web crawler interacts with a particular set of web sites. This is useful when the sites being crawled run on a variety of hardware and software configurations.

Fields

The meaning of the fields is as follows:

Field                      Value
Server or domain           Exact server name or domain name, in a valid URL or IP format. Partial URL patterns are not accepted, and wildcards and regular expressions are not supported.
Request delay              The number of milliseconds to wait before making another request to a single host.
Maximum parallel requests  The value specified here overrides the default number of parallel requests specified for the crawl.
Revisit policy             The revisit policy the web crawler uses when making network calls to the server or domain of the site profile.
Username                   Optional username used when crawling password-protected websites on specific servers.
Password                   Optional password used when crawling password-protected websites on specific servers.
Maximum Files Stored       Optional maximum number of files the web crawler should download during the crawl.
Comments                   Optional comments.
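
The rule for the Server or domain field (an exact name or IP address, with no partial URL patterns, wildcards or regular expressions) can be sketched as a small check. This is only an illustration, not the validation the product performs: the function name is hypothetical, and it assumes that a valid value is a bare host name such as docs.example.com or a literal IP address.

```python
import ipaddress
import re

# Assumption: a host name label is 1-63 letters, digits or hyphens,
# and does not start or end with a hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_server_or_domain(value: str) -> bool:
    """Hypothetical check: accept a bare host name or an IP address only."""
    value = value.strip()
    if not value:
        return False
    try:
        ipaddress.ip_address(value)  # accepts IPv4 and IPv6 literals
        return True
    except ValueError:
        pass
    # Wildcards, schemes and paths contain characters that no label allows,
    # so "*.example.com" and "example.com/docs" are rejected here.
    return all(_LABEL.fullmatch(label) for label in value.split("."))
```

For example, docs.example.com and 192.168.0.1 pass this check, while *.example.com and example.com/docs do not.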

Examples

  1. If your data source includes a website that uses HTTP basic authentication, you can create a site-profile entry to specify rules for that particular domain.

Field                      Value
Server or domain           docs.example.com
Request delay              100
Maximum parallel requests  4
Revisit policy             AlwaysRevisitPolicy
Username                   johnsmith
Password                   pass1234
Maximum Files Stored       1000
Comments
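
For context, HTTP basic authentication sends the credentials as a Base64-encoded username:password pair in the Authorization request header. The sketch below shows that encoding for the example credentials in the table above. It is not the crawler's own code, and the function name is hypothetical.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Authorization header for basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


# Example values taken from the site profile above.
header = basic_auth_header("johnsmith", "pass1234")
```

Because Base64 is an encoding and not encryption, credentials sent this way are only protected when the site is served over HTTPS.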

  2. If a site within a data source has a slow response time, you can specify a request delay in a site-profile entry for that particular site.

Field                      Value
Server or domain           www.example.com
Request delay              200
Maximum parallel requests  1
Revisit policy             SimpleRevisitPolicy
Username
Password
Maximum Files Stored
Comments
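
One way to picture the request delay is as a minimum gap between consecutive requests to the same host. The following is a minimal sketch under that assumption. The HostThrottle class and its methods are hypothetical and do not describe how the web crawler is implemented.

```python
import time
from typing import Callable, Dict


class HostThrottle:
    """Hypothetical helper that tracks the last request time per host."""

    def __init__(self, delays_ms: Dict[str, int],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.delays_ms = delays_ms          # request delay per host, in ms
        self.clock = clock                  # injectable so it can be tested
        self.last_request: Dict[str, float] = {}

    def wait_needed(self, host: str) -> float:
        """Return the seconds to wait before requesting from host again."""
        last = self.last_request.get(host)
        if last is None:
            return 0.0                      # no earlier request to this host
        delay_s = self.delays_ms.get(host, 0) / 1000.0
        return max(0.0, delay_s - (self.clock() - last))

    def record_request(self, host: str) -> None:
        """Remember when the latest request to host was sent."""
        self.last_request[host] = self.clock()


# Delay taken from the site profile above: 200 ms for www.example.com.
throttle = HostThrottle({"www.example.com": 200})
```

A caller would sleep for wait_needed(host) seconds, send the request, and then call record_request(host).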

  3. You can override the default maximum parallel requests limit by specifying a new value in a site-profile entry for a site within your data source.

Field                      Value
Server or domain           server.example.com
Request delay              500
Maximum parallel requests  4
Revisit policy             AlwaysRevisitPolicy
Username                   user_one
Password                   password_one
Maximum Files Stored
Comments

  4. You can enter an optional value for maximum files stored to specify the maximum number of files the web crawler should download during the crawl.

Field                      Value
Server or domain           funnelback.com
Request delay              200
Maximum parallel requests  2
Revisit policy             SimpleRevisitPolicy
Username
Password
Maximum Files Stored       200
Comments
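
The examples above all follow the same pattern: a site profile replaces crawl-wide settings for one server or domain, and every other host keeps the defaults. A minimal sketch of that lookup is shown below. The class names, the function name and the default delay of 250 ms are assumptions made for illustration, and only exact host matches are modelled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CrawlSettings:
    request_delay_ms: int
    max_parallel_requests: int


@dataclass
class SiteProfile:
    # None means "not set in the profile, use the crawl default".
    request_delay_ms: Optional[int] = None
    max_parallel_requests: Optional[int] = None


def effective_settings(host: str, defaults: CrawlSettings,
                       profiles: Dict[str, SiteProfile]) -> CrawlSettings:
    """Return the settings for host, with profile values taking priority."""
    profile = profiles.get(host)  # assumption: exact host match only
    if profile is None:
        return defaults
    delay = profile.request_delay_ms
    parallel = profile.max_parallel_requests
    return CrawlSettings(
        request_delay_ms=defaults.request_delay_ms if delay is None else delay,
        max_parallel_requests=(defaults.max_parallel_requests
                               if parallel is None else parallel),
    )


# Assumed crawl defaults; the default delay value is invented for the example.
defaults = CrawlSettings(request_delay_ms=250, max_parallel_requests=1)
profiles = {
    "server.example.com": SiteProfile(request_delay_ms=500,
                                      max_parallel_requests=4),
}
```

With these values, server.example.com is crawled with a delay of 500 ms and four parallel requests, while any host without a profile keeps the assumed defaults.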

Notes

  • The values in the list are used to override the default values that have been specified for the crawl as a whole. For example, the default number of parallel requests to each server is usually one, to try to be as polite as possible.

  • You will need to set the Frontier to be a MultipleRequestsFrontier for the max parallel requests setting to be taken into account.

  • The optional username and password can be used to specify user account details for crawling password-protected content on specific servers.

  • The max_files_stored value must be the 7th field on the line, so if you specify it you may need to include empty values for the optional username and password fields.
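
The last note can be illustrated by building such a line in code. The sketch below rests on two assumptions that this page does not confirm: that the fields are separated by commas, and that they appear in the same order as the Fields table. The function name is hypothetical.

```python
from typing import Optional

FIELD_SEPARATOR = ","  # Assumption: the delimiter is not stated on this page.


def profile_line(server: str, request_delay: int, max_parallel: int,
                 revisit_policy: str, username: str = "", password: str = "",
                 max_files_stored: Optional[int] = None) -> str:
    """Join the profile fields so max_files_stored lands in position 7."""
    fields = [server, str(request_delay), str(max_parallel), revisit_policy,
              username, password]  # fields 5 and 6 may be empty strings
    if max_files_stored is not None:
        fields.append(str(max_files_stored))
    return FIELD_SEPARATOR.join(fields)


line = profile_line("funnelback.com", 200, 2, "SimpleRevisitPolicy",
                    max_files_stored=200)
# line == "funnelback.com,200,2,SimpleRevisitPolicy,,,200"
```

The two consecutive separators before 200 are the empty username and password placeholders that keep max_files_stored in the 7th position.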

RESTful API

The API is documented using Swagger, which can also be used to interact with the API. To access the Swagger API user interface, go to the administration home page, open the System drop-down menu and click View API UI. This opens the Swagger UI, where you can then click Site Profiles.