Site profiles
Defines web crawler interactions for sites within a web data source.
To access the site profiles configuration, open the data source configuration screen, then select configure site profiles.
Site profiles can be used to customize how the web crawler interacts with a particular set of websites. This is useful in environments where the crawler must interact with a variety of different hardware and software configurations.
Fields
The meaning of the fields is as follows:
Field | Value |
---|---|
Server or domain | The exact server name or domain name, in a valid URL or IP format. Partial URL patterns are not accepted, and wildcards and regular expressions are not supported. |
Request delay | The number of milliseconds to wait before making another request to the same host. |
Maximum parallel requests | Overrides the default maximum number of parallel requests to each server that has been specified for the crawl. |
Revisit policy | The revisit policy the web crawler uses when making requests to the server or domain of the site profile. |
Username | Optional username, used for crawling password-protected websites on specific servers. |
Password | Optional password, used for crawling password-protected websites on specific servers. |
Maximum files stored | Optional limit on the maximum number of files the web crawler should download during the crawl. |
Comments | Optional comments. |
Examples
- If your data source includes a website protected by HTTP basic authentication, you can create a site profile entry to specify rules for that particular domain.

Field | Value |
---|---|
Server or domain | docs.example.com |
Request delay | 100 |
Maximum parallel requests | 4 |
Revisit policy | AlwaysRevisitPolicy |
Username | johnsmith |
Password | pass1234 |
Maximum files stored | 1000 |
Comments | |
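The notes at the end of this article indicate that each site profile is stored as a single line of positional fields (with max_files_stored in the 7th position). As a sketch only, assuming comma-separated fields in the order server, request delay, maximum parallel requests, revisit policy, username, password, maximum files stored, comments (the delimiter is an assumption; check your installation's configuration file for the exact format), this entry would map to a line like:

```
docs.example.com,100,4,AlwaysRevisitPolicy,johnsmith,pass1234,1000,
```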
- If a site within a data source responds slowly, you can specify a request delay in a site profile entry for that particular site.

Field | Value |
---|---|
Server or domain | www.example.com |
Request delay | 200 |
Maximum parallel requests | 1 |
Revisit policy | SimpleRevisitPolicy |
Username | |
Password | |
Maximum files stored | |
Comments | |
- You can override the default maximum parallel requests limit by specifying a new value in a site profile entry for a site within your data source.

Field | Value |
---|---|
Server or domain | server.example.com |
Request delay | 500 |
Maximum parallel requests | 4 |
Revisit policy | AlwaysRevisitPolicy |
Username | user_one |
Password | password_one |
Maximum files stored | |
Comments | |
- You can enter an optional value for maximum files stored to specify the maximum number of files the web crawler should download during the crawl.

Field | Value |
---|---|
Server or domain | funnelback.com |
Request delay | 200 |
Maximum parallel requests | 2 |
Revisit policy | SimpleRevisitPolicy |
Username | |
Password | |
Maximum files stored | 200 |
Comments | |
Notes
- The values in a site profile override the default values that have been specified for the crawl as a whole. For example, the default number of parallel requests to each server is usually one, to be as polite as possible.
- The frontier must be set to a MultipleRequestsFrontier for the maximum parallel requests setting to be taken into account.
- The optional username and password can be used to specify user account details for crawling password-protected content on specific servers.
- If you wish to specify the max_files_stored value, it needs to be the 7th field on the line, so you may need to include empty values for the optional username and password fields.
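To illustrate the last note with a sketch: a profile that sets max_files_stored but supplies no credentials still needs placeholders in the username and password positions. Again assuming comma-separated fields (an assumption, as above), the fourth example above would be written with two empty fields before the file limit:

```
funnelback.com,200,2,SimpleRevisitPolicy,,,200,
```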
RESTful API
The API is documented using Swagger, which can also be used to interact with the API. To access the Swagger UI, go to the administration home page, open the System drop-down menu and click View API UI; when the Swagger UI opens, click Site Profiles.
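The exact endpoints, parameters and payloads are defined in the Swagger specification, so consult the API UI for the authoritative definitions. As a minimal sketch of driving such an API from Python, with a hypothetical base URL and endpoint path (not the documented API):

```python
import requests

# NOTE: the base URL and endpoint path below are hypothetical placeholders;
# consult the Swagger UI for the real site profiles API definition.
BASE_URL = "https://admin.example.com/api"

def list_site_profiles(data_source: str, token: str) -> list:
    """Fetch the site profile entries of a data source (hypothetical endpoint)."""
    response = requests.get(
        f"{BASE_URL}/site-profiles/{data_source}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()

if __name__ == "__main__":
    for profile in list_site_profiles("example-web", token="..."):
        print(profile)
```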