Plugin: Generate start URLs
Purpose
Use this plugin to generate a list of start URLs for a data source that crawls a paginated API, where the total number of pages is read from an API response.
The plugin reads the total from a JSON or XML field, or from an HTTP response header, and uses it to generate the page numbers (or record offsets, in Offset mode) that are inserted into the URLs passed to the API.
The field containing the total count must include only the count, which corresponds to the total number of pages in the paginated response. Fields containing compound information or multiple values are not supported. The plugin also does not support APIs that require you to supply a record start/end or range where the values are not constant for each URL.
For example, an API call like http://example.com/api/v1/events/?start=51&items-per-page=10 won’t work because you can’t generate this URL by inserting a page value. However, a URL like http://example.com/api/v1/events/?page=5&items-per-page=10 should work because the page parameter increments by 1 for each URL that is generated.
Authenticated feeds
The plugin includes support for using Basic HTTP authentication when accessing the URLs for start URL generation.
The plugin does not currently support other forms of authentication, and does not have access to any authentication configuration for the web crawler (such as form interaction or HTTP Basic/NTLM authentication configured for the crawler).
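For reference, HTTP Basic authentication sends the username and password as a Base64-encoded username:password pair in the Authorization request header. The following is a minimal Python sketch of that encoding only; it is not the plugin's implementation, and the function name and credentials are hypothetical.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the Authorization header used by HTTP Basic authentication."""
    # The credentials are joined with a colon and Base64-encoded (RFC 7617).
    pair = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(pair).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical credentials, for illustration only.
headers = basic_auth_header("api-user", "s3cret")
```

Because Base64 is an encoding rather than encryption, Basic authentication is normally only appropriate over HTTPS.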
Usage
Enable the plugin
- Select Plugins from the side navigation pane and click on the Generate start URLs tile.
- From the Location section, use the Select a data source select list to choose the data source on which you want to enable this plugin.
The plugin takes effect after the setup steps are complete and an advanced > full update of the data source has completed.
Configuration settings
The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.
The configuration key names below are only needed if you are configuring this plugin manually. The keys are set in the data source configuration; when setting them manually, type in (or copy and paste) the key name and value.
Total page API URL
Configuration key |
|
Data type |
string |
Value format |
Allowed values must match the regular expression:
|
Required |
This setting is required |
Set this to the API URL that is used to obtain the total number of pages.
Total page API type
Configuration key |
|
Data type |
string |
Allowed values |
JSON,HTTP header,XML |
Required |
This setting is required |
This option sets the part of the API response (JSON body, XML body, or HTTP header) from which the total number of pages is read.
Total page mode
Configuration key |
|
Data type |
string |
Default value |
|
Allowed values |
Page,Offset |
Required |
This setting is required |
This option sets whether the value read from the API is treated as a total number of pages (Page mode) or a total number of records (Offset mode).
HTTP header field name
Configuration key |
|
Data type |
string |
Value format |
Allowed values must match the regular expression:
|
Required |
This setting is optional |
Defines the HTTP header which contains the total number of pages (Page mode) or total number of records (Offset mode), when using an HTTP header as the API type.
This field is required if the Total page API type is set to HTTP header.
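As a rough illustration of reading the count from a response header, the Python sketch below looks up a header by name and parses it as an integer. The function name and the sample headers are hypothetical; the header name X-Total-Page-Count is borrowed from the examples later on this page, and the sketch does not describe how the plugin itself is implemented.

```python
def total_from_header(headers: dict[str, str], field_name: str) -> int:
    """Return the integer count held in the named response header."""
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    key = field_name.lower()
    if key not in lowered:
        raise KeyError(f"Response has no {field_name!r} header")
    # The header must contain only the count, so a plain int() parse is enough.
    return int(lowered[key].strip())


# Hypothetical response headers from the total page API.
response_headers = {"Content-Type": "application/json", "x-total-page-count": "5"}
total_pages = total_from_header(response_headers, "X-Total-Page-Count")
```

A header that holds anything other than a single integer makes int() raise ValueError, which mirrors the restriction that compound values are not supported.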
JSON total pages field
Configuration key |
|
Data type |
string |
Required |
This setting is optional |
Defines the JSON field (as a JSONPath expression) which contains the total number of pages (Page mode) or total number of records (Offset mode), when using JSON as the API type.
Supports expressions in dot-notation or bracket-notation. For example, if the JSON response is {"total_pages": 10}, the corresponding JSONPath expression should be $.total_pages or $['total_pages']. For more information on JSONPath, see RFC 9535.
This field is required if the Total page API type is set to JSON.
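To make the JSONPath example concrete, the sketch below reads the same top-level field with Python's standard json module. Python's standard library has no JSONPath support, so a plain dictionary lookup stands in for $.total_pages here; the function name is hypothetical and only top-level fields are handled.

```python
import json


def total_from_json(body: str, field_name: str) -> int:
    """Return a top-level integer field, the equivalent of $.<field_name>."""
    document = json.loads(body)
    # A nested field would need one lookup per path segment instead.
    return int(document[field_name])


total_pages = total_from_json('{"total_pages": 10}', "total_pages")
```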
XML total pages field
Configuration key |
|
Data type |
string |
Required |
This setting is optional |
Defines the XML field (as an XPath expression) which contains the total number of pages (Page mode) or total number of records (Offset mode), when using XML as the API type.
Supports standard XPath notation. For example, if the XML response is <results><total_pages>10</total_pages></results>, the XPath expression should be //total_pages. XPath 3.0 is supported.
This field is required if the Total page API type is set to XML.
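The XPath example can be tried out with Python's standard xml.etree.ElementTree module, as sketched below. Note the assumptions: ElementTree implements only a limited subset of XPath (not XPath 3.0) and requires the relative form .//total_pages rather than //total_pages, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def total_from_xml(body: str, path: str) -> int:
    """Return the integer text of the first element matching the path."""
    root = ET.fromstring(body)
    text = root.findtext(path)
    if text is None:
        raise ValueError(f"No element matches {path!r}")
    return int(text.strip())


xml_body = "<results><total_pages>10</total_pages></results>"
# ElementTree needs a path relative to the root element, hence the leading dot.
total_pages = total_from_xml(xml_body, ".//total_pages")
```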
Record per page
Configuration key |
|
Data type |
integer |
Default value |
|
Value format |
Allowed values must match the regular expression:
|
Required |
This setting is required |
Defines the number of records on each page.
Start URL template
Configuration key |
|
Data type |
string |
Value format |
Allowed values must match the regular expression:
|
Required |
This setting is required |
Template used to define the start URLs that will be generated. When the URL list is generated, ${page} (Page mode) or ${offset} (Offset mode) in the template is replaced with the page number or offset value.
For example, if the URL is http://example.com/result?page=1 (Page mode) or http://example.com/result?offset=10 (Offset mode), the template should be http://example.com/result?page=${page} (Page mode) or http://example.com/result?offset=${offset} (Offset mode).
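The placeholder syntax happens to match Python's string.Template, so the substitution can be sketched in a few lines. This only illustrates the idea of filling in ${page} or ${offset}; the helper name is hypothetical and the plugin's own substitution logic may differ.

```python
from string import Template


def fill_template(template: str, **values: int) -> str:
    """Replace ${name} placeholders in a start URL template."""
    # substitute() raises KeyError if a placeholder has no value supplied.
    return Template(template).substitute(**values)


page_url = fill_template("http://example.com/result?page=${page}", page=1)
offset_url = fill_template("http://example.com/result?offset=${offset}", offset=10)
```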
Total page API authentication type
Configuration key |
|
Data type |
string |
Default value |
|
Allowed values |
None,HTTP Basic Authentication |
Required |
This setting is required |
This option configures the API authentication type.
Only HTTP Basic authentication is supported.
HTTP basic authentication username
Configuration key |
|
Data type |
string |
Required |
This setting is optional |
Defines the HTTP basic authentication username.
This field is required if the Total page API authentication type is set to HTTP Basic Authentication.
HTTP basic authentication password
Configuration key |
|
Data type |
Encrypted string |
Required |
This setting is optional |
Defines the HTTP basic authentication password.
This field is required if the Total page API authentication type is set to HTTP Basic Authentication.
Examples
Page mode
Generate start URLs from an HTTP header
Assume https://www.example.com/tutorial-lists is an API that returns the total number of pages in the HTTP header field X-Total-Page-Count.
Configuration key name | Value |
---|---|
Total page API |
|
Total page mode |
|
Total page API Type |
|
HTTP header field name |
|
Start URL template |
|
If X-Total-Page-Count contains a value of 5, the following list will be generated as start URLs for the data source:
- https://www.example.com/tutorial-list?page=1
- https://www.example.com/tutorial-list?page=2
- https://www.example.com/tutorial-list?page=3
- https://www.example.com/tutorial-list?page=4
- https://www.example.com/tutorial-list?page=5
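The Page mode expansion in this example can be sketched as a loop from 1 to the total page count. The template string below is an assumption inferred from the generated URLs listed above, and the function name is hypothetical; this is an illustration, not the plugin's code.

```python
from string import Template


def page_mode_urls(template: str, total_pages: int) -> list[str]:
    """Generate one start URL per page, numbering pages from 1."""
    url = Template(template)
    return [url.substitute(page=page) for page in range(1, total_pages + 1)]


# Assumed template, inferred from the URLs in the example above.
start_urls = page_mode_urls("https://www.example.com/tutorial-list?page=${page}", 5)
```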
Generate start URLs from JSON
Assume https://api.example.com/tutorial-lists?format=json is an API that returns the total number of pages in the JSON response field totalPageCount, as in the following JSON response:
{
  "tutorials": [
    {
      "title": "Tutorial 1",
      "url": "https://www.example.com/tutorial-1"
    },
    {
      "title": "Tutorial 2",
      "url": "https://www.example.com/tutorial-2"
    }
  ],
  "totalPageCount": 5,
  "currentPage": 1
}
Configuration key name | Value |
---|---|
Total page API |
|
Total page mode |
|
Total page API Type |
|
JSONPath |
|
Start URL template |
|
The following list will be generated as start URLs for the data source:
- https://api.example.com/tutorial-lists?format=json&page=1
- https://api.example.com/tutorial-lists?format=json&page=2
- https://api.example.com/tutorial-lists?format=json&page=3
- https://api.example.com/tutorial-lists?format=json&page=4
- https://api.example.com/tutorial-lists?format=json&page=5
Generate start URLs from XML
Assume https://api.example.com/tutorial-lists?format=xml is an API that returns the total number of pages in the XML response field totalPage, as in the following XML response:
<?xml version="1.0" encoding="UTF-8"?>
<tutorials>
  <tutorial>
    <title>Tutorial 1</title>
    <url>https://www.example.com/tutorial-1</url>
  </tutorial>
  <tutorial>
    <title>Tutorial 2</title>
    <url>https://www.example.com/tutorial-2</url>
  </tutorial>
  <totalPage>5</totalPage>
  <currentPage>1</currentPage>
</tutorials>
Configuration key name | Value |
---|---|
Total page API |
|
Total page mode |
|
Total page API Type |
|
XPath |
|
Start URL template |
|
The following list will be generated as start URLs for the data source:
- https://api.example.com/tutorial-lists?format=xml&page=1
- https://api.example.com/tutorial-lists?format=xml&page=2
- https://api.example.com/tutorial-lists?format=xml&page=3
- https://api.example.com/tutorial-lists?format=xml&page=4
- https://api.example.com/tutorial-lists?format=xml&page=5
Offset mode
Generate start URLs from an HTTP header
Assume https://www.example.com/tutorial-lists is an API that returns the total number of records in the HTTP header field X-Total-Records-Count.
Configuration key name | Value |
---|---|
Total page API |
|
Total page mode |
|
Total page API Type |
|
HTTP header field name |
|
Record per page |
|
Start URL template |
|
If X-Total-Records-Count contains a value of 50, the following list will be generated as start URLs for the data source:
- https://www.example.com/tutorial-list?offset=0
- https://www.example.com/tutorial-list?offset=10
- https://www.example.com/tutorial-list?offset=20
- https://www.example.com/tutorial-list?offset=30
- https://www.example.com/tutorial-list?offset=40
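The Offset mode arithmetic behind this list can be sketched as stepping from 0 up to the total record count in increments of the records-per-page value. The records-per-page value of 10 and the template string are assumptions inferred from the URLs above, and the function name is hypothetical. How the plugin treats a total that is not an exact multiple of the page size is not stated here; this sketch simply emits one extra URL for the final partial page.

```python
from string import Template


def offset_mode_urls(template: str, total_records: int, records_per_page: int) -> list[str]:
    """Generate one start URL per page of records, with offsets starting at 0."""
    if records_per_page <= 0:
        raise ValueError("records_per_page must be a positive integer")
    url = Template(template)
    # Offsets advance by one page of records at a time: 0, 10, 20, ...
    offsets = range(0, total_records, records_per_page)
    return [url.substitute(offset=offset) for offset in offsets]


# Assumed template and page size, inferred from the URLs in the example above.
start_urls = offset_mode_urls("https://www.example.com/tutorial-list?offset=${offset}", 50, 10)
```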
Generate start URLs from JSON
Assume https://api.example.com/tutorial-lists?format=json is an API that returns the total number of records in the JSON response field total, as in the following JSON response:
{
  "count": 5,
  "tutorials": [
    {
      "title": "Tutorial 1",
      "url": "https://www.example.com/tutorial-1"
    },
    {
      "title": "Tutorial 2",
      "url": "https://www.example.com/tutorial-2"
    },
    {
      "title": "Tutorial 3",
      "url": "https://www.example.com/tutorial-3"
    },
    {
      "title": "Tutorial 4",
      "url": "https://www.example.com/tutorial-4"
    },
    {
      "title": "Tutorial 5",
      "url": "https://www.example.com/tutorial-5"
    }
  ],
  "total": 25
}
Configuration key name | Value |
---|---|
Total page API |
|
Total page mode |
|
Total page API Type |
|
JSONPath |
|
Record per page |
|
Start URL template |
|
The following list will be generated as start URLs for the data source:
- https://api.example.com/tutorial-lists?format=json&offset=0
- https://api.example.com/tutorial-lists?format=json&offset=5
- https://api.example.com/tutorial-lists?format=json&offset=10
- https://api.example.com/tutorial-lists?format=json&offset=15
- https://api.example.com/tutorial-lists?format=json&offset=20
Generate start URLs from XML
Assume https://api.example.com/tutorial-lists?format=xml is an API that returns the total number of records in the XML response field total, as in the following XML response:
<?xml version="1.0" encoding="UTF-8"?>
<tutorials>
  <tutorial>
    <title>Tutorial 1</title>
    <url>https://www.example.com/tutorial-1</url>
  </tutorial>
  <tutorial>
    <title>Tutorial 2</title>
    <url>https://www.example.com/tutorial-2</url>
  </tutorial>
  <tutorial>
    <title>Tutorial 3</title>
    <url>https://www.example.com/tutorial-3</url>
  </tutorial>
  <tutorial>
    <title>Tutorial 4</title>
    <url>https://www.example.com/tutorial-4</url>
  </tutorial>
  <tutorial>
    <title>Tutorial 5</title>
    <url>https://www.example.com/tutorial-5</url>
  </tutorial>
  <total>25</total>
  <count>5</count>
</tutorials>
Configuration key name | Value |
---|---|
Total page API |
|
Total page mode |
|
Total page API Type |
|
XPath |
|
Record per page |
|
Start URL template |
|
The following list will be generated as start URLs for the data source:
- https://api.example.com/tutorial-lists?format=xml&offset=0
- https://api.example.com/tutorial-lists?format=xml&offset=5
- https://api.example.com/tutorial-lists?format=xml&offset=10
- https://api.example.com/tutorial-lists?format=xml&offset=15
- https://api.example.com/tutorial-lists?format=xml&offset=20