Site profiles
Defines web crawler interactions for sites within a web data source.
To access the site profiles configuration, open the data source configuration screen, then select configure site profiles.
Site profiles can be used to customize how the web crawler interacts with a particular set of websites. This is useful in environments where the crawler must interact with a variety of different hardware and software configurations.
Fields
The meaning of the fields is as follows:
Field | Value |
---|---|
Server or domain | The exact server name or domain name, in a valid URL or IP format. Partial URL patterns are not accepted, and wildcards and regular expressions are not supported. |
Request delay | The number of milliseconds to wait before making another request to the same host. |
Maximum parallel requests | Overrides the default maximum number of parallel requests to each server that has been specified for the crawl. |
Revisit policy | The revisit policy the web crawler uses when making requests to the server or domain of the site profile. |
Username | Optional username, used for crawling password-protected websites on specific servers. |
Password | Optional password, used for crawling password-protected websites on specific servers. |
Maximum files stored | Optional limit on the maximum number of files the web crawler should download during the crawl. |
Comments | Optional comments. |
Examples
- If your data source includes a website protected by HTTP basic authentication, you can create a site profile entry to specify rules for that particular domain.

Field | Value |
---|---|
Server or domain | docs.example.com |
Request delay | 100 |
Maximum parallel requests | 4 |
Revisit policy | AlwaysRevisitPolicy |
Username | johnsmith |
Password | pass1234 |
Maximum files stored | 1000 |
Comments | |
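The notes at the end of this article indicate that each site profile is stored as a single line of positional fields (with max_files_stored in the 7th position). As a sketch only, assuming comma-separated fields in the order server, request delay, maximum parallel requests, revisit policy, username, password, maximum files stored, comments (the delimiter is an assumption; check your installation's configuration file for the exact format), this entry would map to a line like:

```
docs.example.com,100,4,AlwaysRevisitPolicy,johnsmith,pass1234,1000,
```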
- If a site within a data source responds slowly, you can specify a request delay in a site profile entry for that particular site.

Field | Value |
---|---|
Server or domain | www.example.com |
Request delay | 200 |
Maximum parallel requests | 1 |
Revisit policy | SimpleRevisitPolicy |
Username | |
Password | |
Maximum files stored | |
Comments | |
- You can override the default maximum parallel requests limit by specifying a new value in a site profile entry for a site within your data source.

Field | Value |
---|---|
Server or domain | server.example.com |
Request delay | 500 |
Maximum parallel requests | 4 |
Revisit policy | AlwaysRevisitPolicy |
Username | user_one |
Password | password_one |
Maximum files stored | |
Comments | |
- You can enter an optional value for maximum files stored to specify the maximum number of files the web crawler should download during the crawl.

Field | Value |
---|---|
Server or domain | funnelback.com |
Request delay | 200 |
Maximum parallel requests | 2 |
Revisit policy | SimpleRevisitPolicy |
Username | |
Password | |
Maximum files stored | 200 |
Comments | |
Notes
- The values in a site profile override the default values that have been specified for the crawl as a whole. For example, the default number of parallel requests to each server is usually one, to be as polite as possible.
- The frontier must be set to a MultipleRequestsFrontier for the maximum parallel requests setting to be taken into account.
- The optional username and password can be used to specify user account details for crawling password-protected content on specific servers.
- If you wish to specify the max_files_stored value, it needs to be the 7th field on the line, so you may need to include empty values for the optional username and password fields.
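To illustrate the last note with a sketch: a profile that sets max_files_stored but supplies no credentials still needs placeholders in the username and password positions. Again assuming comma-separated fields (an assumption, as above), the fourth example above would be written with two empty fields before the file limit:

```
funnelback.com,200,2,SimpleRevisitPolicy,,,200,
```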
RESTful API
The API is documented using Swagger, which can also be used to interact with the API. To access the Swagger UI, go to the administration home page, open the System drop-down menu and click View API UI; when the Swagger UI opens, click Site Profiles.
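The exact endpoints, parameters and payloads are defined in the Swagger specification, so consult the API UI for the authoritative definitions. As a minimal sketch of driving such an API from Python, with a hypothetical base URL and endpoint path (not the documented API):

```python
import requests

# NOTE: the base URL and endpoint path below are hypothetical placeholders;
# consult the Swagger UI for the real site profiles API definition.
BASE_URL = "https://admin.example.com/api"

def list_site_profiles(data_source: str, token: str) -> list:
    """Fetch the site profile entries of a data source (hypothetical endpoint)."""
    response = requests.get(
        f"{BASE_URL}/site-profiles/{data_source}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()

if __name__ == "__main__":
    for profile in list_site_profiles("example-web", token="..."):
        print(profile)
```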