Plugin: Generate start URLs
Purpose
Use this plugin to generate a list of start URLs for a paginated API, where the total number of pages is read from an API response.
The plugin reads a field that contains the total number of pages, then generates one start URL for each page by inserting the page number into a URL template.
The total page count can be sourced from a JSON or XML field, or from an HTTP response header.
The field containing the total count must include only the count, which corresponds to the total number of pages in the paginated response. Fields containing compound information or multiple values are not supported. The plugin also does not support APIs that require you to supply a record start/end or a range, where the values are not constant for each URL.
For example, an API call like http://example.com/api/v1/events/?start=51&items-per-page=10 won't work because you can't generate this URL by inserting a page value. However, a URL like http://example.com/api/v1/events/?page=5&items-per-page=10 should work because the page parameter increments by 1 for each URL that is generated.
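The "count only" rule can be illustrated with a small check. This is a minimal sketch of the constraint described above, not the plugin's actual parsing code; `parse_total_pages` is a hypothetical helper name introduced for illustration.

```python
def parse_total_pages(raw: str) -> int:
    """Parse a field value that is expected to contain only a page count.

    Hypothetical helper: it illustrates the documented constraint and is
    not taken from the plugin itself.
    """
    text = raw.strip()
    # Reject compound values such as "5 of 10" or "1-5": only plain
    # ASCII digits are accepted.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a plain page count: {raw!r}")
    return int(text)
```

With this check, a value such as `"5"` is accepted, while a compound value such as `"5 of 10"` is rejected rather than being guessed at.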
Authenticated feeds
The plugin supports HTTP Basic authentication when accessing the URLs used for start URL generation.
The plugin does not currently support other forms of authentication, and does not have access to any authentication configuration for the web crawler (such as form interaction or HTTP Basic/NTLM authentication configured for the crawler).
Usage
Enable the plugin
- Select Plugins from the side navigation pane and click on the Generate start URLs tile.
- From the Location section, select the data source on which you would like to enable this plugin from the Select a data source select list.
The plugin will take effect after the setup steps are complete and an advanced > full update of the data source has completed.
Configuration settings
The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.
The configuration key names below are only needed if you are configuring this plugin manually. The keys are set in the data source configuration. When setting the keys manually, type in (or copy and paste) the key name and value.
Total page API URL
Configuration key |
|
Data type |
string |
Value format |
Allowed values must match the regular expression:
|
Required |
This setting is required |
Set this to the API URL that is used to obtain the total number of pages.
Total page API type
Configuration key |
|
Data type |
string |
Allowed values |
JSON,HTTP header,XML |
Required |
This setting is required |
This option configures the type of source from which the total number of pages is read.
HTTP header field name
Configuration key |
|
Data type |
string |
Value format |
Allowed values must match the regular expression:
|
Required |
This setting is optional |
Defines the HTTP header which contains the total number of pages, when using an HTTP header as the API type.
This field is required if the Total page API type is set to HTTP header.
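As an illustration of reading the count from a response header, here is a minimal Python sketch. It is not the plugin's implementation; the function name `total_pages_from_headers` and the sample header values are assumptions made for the example.

```python
from typing import Mapping


def total_pages_from_headers(headers: Mapping[str, str], field_name: str) -> int:
    """Return the page count stored in the named HTTP response header."""
    # HTTP header names are case-insensitive, so compare in lower case.
    wanted = field_name.lower()
    for name, value in headers.items():
        if name.lower() == wanted:
            return int(value.strip())
    raise KeyError(f"header {field_name!r} not found in response")


# Illustrative response headers (sample values only).
response_headers = {
    "Content-Type": "application/json",
    "X-Total-Page-Count": "5",
}
print(total_pages_from_headers(response_headers, "x-total-page-count"))  # prints 5
```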
JSON total pages field
Configuration key |
|
Data type |
string |
Required |
This setting is optional |
Defines the JSON field (as a JSONPath expression) which contains the total number of pages, when using JSON as the API type.
Supports expressions in dot notation or bracket notation. For example, if the JSON response is {"total_pages": 10}, the corresponding JSONPath expression should be $.total_pages or $['total_pages']. For more information on JSONPath, see RFC 9535.
This field is required if the Total page API type is set to JSON.
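To show how the two notations address the same field, the sketch below resolves simple dot-notation and bracket-notation paths against the sample response. It is a deliberately simplified resolver written for this example, covering only child names and array indexes; it is not a full RFC 9535 implementation and not the plugin's own code. The name `resolve_simple_jsonpath` is hypothetical.

```python
import json
import re

# One path segment: .name, ['name'], ["name"] or [0]
_SEGMENT = re.compile(
    r"\.(\w+)"          # .name
    r"|\['([^']*)'\]"   # ['name']
    r'|\["([^"]*)"\]'   # ["name"]
    r"|\[(\d+)\]"       # [0]
)


def resolve_simple_jsonpath(document, path: str):
    """Resolve a simple JSONPath (child names and indexes only)."""
    if not path.startswith("$"):
        raise ValueError("JSONPath must start with '$'")
    current = document
    pos = 1
    while pos < len(path):
        match = _SEGMENT.match(path, pos)
        if match is None:
            raise ValueError(f"unsupported JSONPath syntax at offset {pos}: {path!r}")
        dot_name, single_quoted, double_quoted, index = match.groups()
        if index is not None:
            current = current[int(index)]
        else:
            # Exactly one of the three name groups matched.
            key = next(
                k for k in (dot_name, single_quoted, double_quoted) if k is not None
            )
            current = current[key]
        pos = match.end()
    return current


response = json.loads('{"total_pages": 10}')
print(resolve_simple_jsonpath(response, "$.total_pages"))     # prints 10
print(resolve_simple_jsonpath(response, "$['total_pages']"))  # prints 10
```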
XML total pages field
Configuration key |
|
Data type |
string |
Required |
This setting is optional |
Defines the XML field (as an XPath expression) which contains the total number of pages, when using XML as the API type.
Supports standard XPath notation. For example, if the XML response is <results><total_pages>10</total_pages></results>, the XPath expression should be //total_pages. XPath 3.0 is supported.
This field is required if the Total page API type is set to XML.
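The lookup for the sample response can be sketched in Python as follows. Note the assumption: Python's `xml.etree.ElementTree` implements only a limited XPath subset (not XPath 3.0) and does not accept absolute paths on an element, so the documented expression `//total_pages` is written in its relative form `.//total_pages` here. This illustrates the idea only and is not how the plugin evaluates XPath.

```python
import xml.etree.ElementTree as ET

xml_response = "<results><total_pages>10</total_pages></results>"
root = ET.fromstring(xml_response)

# Relative form of //total_pages: search all descendants of the root.
node = root.find(".//total_pages")
if node is None or node.text is None:
    raise ValueError("total_pages element not found")

total_pages = int(node.text.strip())
print(total_pages)  # prints 10
```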
Start URL template
Configuration key |
|
Data type |
string |
Value format |
Allowed values must match the regular expression:
|
Required |
This setting is required |
Template used to define the start URLs that will be generated. ${page} in the template will be replaced with the page number when generating the URL list.
For example, if the URL is http://example.com/page/1, the template should be http://example.com/page/${page}.
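The substitution can be sketched as a simple string replacement. This is an illustration rather than the plugin's code: `generate_start_urls` is a hypothetical name, and numbering from 1 up to the total is assumed from the generated lists shown in the examples on this page.

```python
from typing import List


def generate_start_urls(template: str, total_pages: int) -> List[str]:
    """Expand a start URL template into one URL per page.

    Assumes pages are numbered from 1 to total_pages inclusive.
    """
    return [
        template.replace("${page}", str(page))
        for page in range(1, total_pages + 1)
    ]


for url in generate_start_urls("http://example.com/page/${page}", 3):
    print(url)
```

Plain `str.replace` is used instead of `string.Template` so that other characters in the URL, such as a literal `$`, are left untouched.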
Total page API authentication type
Configuration key |
|
Data type |
string |
Default value |
|
Allowed values |
None,HTTP Basic Authentication |
Required |
This setting is required |
This option configures the API authentication type.
Only HTTP Basic authentication is supported.
HTTP basic authentication username
Configuration key |
|
Data type |
string |
Required |
This setting is optional |
Defines the HTTP basic authentication username.
This field is required if the Total page API authentication type is set to HTTP Basic Authentication.
HTTP basic authentication password
Configuration key |
|
Data type |
Encrypted string |
Required |
This setting is optional |
Defines the HTTP basic authentication password.
This field is required if the Total page API authentication type is set to HTTP Basic Authentication.
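For reference, HTTP Basic authentication sends the username and password as a base64-encoded `Authorization` header. The sketch below shows how such a request could be built with Python's standard library; the helper names are hypothetical and this is not the plugin's implementation. No request is actually sent.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


def build_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a request carrying Basic credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


# Sample credentials taken from the example in RFC 7617.
print(basic_auth_header("Aladdin", "open sesame"))
# prints Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Because base64 is an encoding and not encryption, Basic credentials should only be sent over HTTPS.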
Examples
Generate start URLs from an HTTP header
Assume https://www.example.com/tutorial-lists is an API that returns the total number of pages in the HTTP header field X-Total-Page-Count.
Configuration key name | Value |
---|---|
Total page API |
|
Total page API Type |
|
HTTP header field name |
|
Start URL template |
|
If X-Total-Page-Count contains a value of 5, the following list will be generated as start URLs for the data source:
- https://www.example.com/tutorial-list?page=1
- https://www.example.com/tutorial-list?page=2
- https://www.example.com/tutorial-list?page=3
- https://www.example.com/tutorial-list?page=4
- https://www.example.com/tutorial-list?page=5
Generate start URLs from JSON
Assume https://api.example.com/tutorial-lists?format=json is an API that returns the total number of pages in the JSON response field totalPageCount, as in the following JSON response:
{
  "tutorials": [
    {
      "title": "Tutorial 1",
      "url": "https://www.example.com/tutorial-1"
    },
    {
      "title": "Tutorial 2",
      "url": "https://www.example.com/tutorial-2"
    }
  ],
  "totalPageCount": 5,
  "currentPage": 1
}
Configuration key name | Value |
---|---|
Total page API |
|
Total page API Type |
|
JSONPath |
|
Start URL template |
|
The following list will be generated as start URLs for the data source:
- https://api.example.com/tutorial-lists?format=json&page=1
- https://api.example.com/tutorial-lists?format=json&page=2
- https://api.example.com/tutorial-lists?format=json&page=3
- https://api.example.com/tutorial-lists?format=json&page=4
- https://api.example.com/tutorial-lists?format=json&page=5
Generate start URLs from XML
Assume https://api.example.com/tutorial-lists?format=xml is an API that returns the total number of pages in the XML response field totalPage, as in the following XML response:
<?xml version="1.0" encoding="UTF-8"?>
<tutorials>
  <tutorial>
    <title>Tutorial 1</title>
    <url>https://www.example.com/tutorial-1</url>
  </tutorial>
  <tutorial>
    <title>Tutorial 2</title>
    <url>https://www.example.com/tutorial-2</url>
  </tutorial>
  <totalPage>5</totalPage>
  <currentPage>1</currentPage>
</tutorials>
Configuration key name | Value |
---|---|
Total page API |
|
Total page API Type |
|
XPath |
|
Start URL template |
|
The following list will be generated as start URLs for the data source:
- https://api.example.com/tutorial-lists?format=xml&page=1
- https://api.example.com/tutorial-lists?format=xml&page=2
- https://api.example.com/tutorial-lists?format=xml&page=3
- https://api.example.com/tutorial-lists?format=xml&page=4
- https://api.example.com/tutorial-lists?format=xml&page=5