Plugin: Generate start URLs

Purpose

Use this plugin to generate a list of start URLs for a paginated API, where the total number of pages is read from an API response.

The plugin generates the list of start URLs for a data source based on a field that contains the total number of pages. That count is used to generate the page numbers, which are inserted into the URL passed to the API.

The plugin can read this value from a JSON or XML field, or from an HTTP response header.

The total page count is read from the API response and used to generate a set of start URLs based on that count.

The field containing the total count must contain only the count, which corresponds to the total number of pages in the paginated response. Fields containing compound information or multiple values are not supported. The plugin also does not support APIs that require you to supply a record start/end or range where values are not constant for each URL.

For example, an API call like http://example.com/api/v1/events/?start=51&items-per-page=10 won't work because you can't generate this URL by inserting a page value. However, a URL like http://example.com/api/v1/events/?page=5&items-per-page=10 should work because the page parameter increments by 1 for each URL that is generated.

Authenticated feeds

The plugin includes support for using Basic HTTP authentication when accessing the URLs for start URL generation.

The plugin does not currently support other forms of authentication, and does not have access to any authentication configuration for the web crawler (such as form interaction or HTTP Basic/NTLM authentication configured for the crawler).

Usage

Enable the plugin

  1. Select Plugins from the side navigation pane and click on the Generate start URLs tile.

  2. From the Location section, use the Select a data source list to choose the data source on which you would like to enable this plugin.

The plugin will take effect after the setup steps are complete and an advanced > full update of the data source has completed.

Configuration settings

The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.

The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.

Total page API URL

Configuration key

plugin.generate-start-urls.config.total-page-api

Data type

string

Value format

Allowed values must match the regular expression:

((http|https)://)(www.)?[a-zA-Z0-9@:%._+~#?&/=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%._+~#?&/=]*)

Required

This setting is required

Set this to the API URL that is used to obtain the total number of pages.
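A value can be checked against the value format above before it is saved. The following is a minimal Python sketch, assuming the whole value must match the pattern (the documentation does not state whether matching is anchored); the function name is hypothetical.

```python
import re

# Value-format pattern copied from the setting description above.
TOTAL_PAGE_API_PATTERN = re.compile(
    r"((http|https)://)(www.)?[a-zA-Z0-9@:%._+~#?&/=]{2,256}"
    r"\.[a-z]{2,6}\b([-a-zA-Z0-9@:%._+~#?&/=]*)"
)


def looks_like_valid_api_url(url: str) -> bool:
    """Return True if the whole URL matches the documented pattern.

    Assumption: the entire value must match; this is not confirmed
    by the documentation.
    """
    return TOTAL_PAGE_API_PATTERN.fullmatch(url) is not None


print(looks_like_valid_api_url("https://www.example.com/tutorial-lists"))
print(looks_like_valid_api_url("ftp://www.example.com/tutorial-lists"))
```

The first call uses the API URL from the examples in this document; the second uses an unsupported scheme and is rejected.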

Total page API type

Configuration key

plugin.generate-start-urls.config.total-page-api-type

Data type

string

Allowed values

JSON,HTTP header,XML

Required

This setting is required

This option configures the type of source from which the total number of pages is read.

HTTP header field name

Configuration key

plugin.generate-start-urls.config.http-header-field-name

Data type

string

Value format

Allowed values must match the regular expression:

^[\w-]+$

Required

This setting is optional

Defines the HTTP header that contains the total number of pages, when using an HTTP header as the API type.

This field is required if the Total page API type is set to HTTP header.
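To illustrate what reading the count from a header involves, here is a minimal Python sketch. It is not the plugin's implementation: the helper name and the sample headers are hypothetical, and the case-insensitive lookup is an assumption based on HTTP header names being case-insensitive in general.

```python
from typing import Mapping


def total_pages_from_headers(headers: Mapping[str, str], field_name: str) -> int:
    """Read the page count from response headers (hypothetical helper)."""
    # HTTP header names are case-insensitive, so compare in lower case.
    wanted = field_name.lower()
    for name, value in headers.items():
        if name.lower() == wanted:
            # The field must contain only the count, so int() should succeed.
            return int(value.strip())
    raise KeyError(f"header {field_name!r} not found in response")


# Sample headers, invented for illustration.
response_headers = {"Content-Type": "application/json", "X-Total-Page-Count": "5"}
print(total_pages_from_headers(response_headers, "x-total-page-count"))
```

A header holding anything other than a single integer (for example "5 of 20") would raise a ValueError here, which mirrors the restriction that compound values are not supported.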

JSON total pages field

Configuration key

plugin.generate-start-urls.config.json-path

Data type

string

Required

This setting is optional

Defines the JSON field (as a JSONPath expression) which contains the total number of pages, when using JSON as the API type.

Supports expressions in dot notation or bracket notation. For example, if the JSON response is {"total_pages": 10}, the corresponding JSONPath expression should be $.total_pages or $['total_pages']. For more information on JSONPath, see RFC 9535.

This field is required if the Total page API type is set to JSON.
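As a rough illustration of what the JSONPath expression selects, the sketch below reads a top-level field with Python's standard json module. It does not evaluate JSONPath; it only mirrors the simple $.total_pages case, and the function name is hypothetical.

```python
import json


def total_pages_from_json(body: str, field: str) -> int:
    """Read a top-level field, equivalent to the JSONPath $.<field>."""
    document = json.loads(body)
    # The field must contain only the count, so int() should succeed.
    return int(document[field])


print(total_pages_from_json('{"total_pages": 10}', "total_pages"))
```

Nested fields would need a longer path (for example $.meta.total_pages), which this simplified helper does not handle.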

XML total pages field

Configuration key

plugin.generate-start-urls.config.xpath

Data type

string

Required

This setting is optional

Defines the XML field (as an XPath) which contains the total number of pages, when using XML as the API type.

Supports standard XPath notation. For example, if the XML response is <results><total_pages>10</total_pages></results>, the XPath expression should be //total_pages. XPath 3.0 is supported.

This field is required if the Total page API type is set to XML.
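The following Python sketch shows roughly what the XPath lookup does for the example response. It uses the standard library's ElementTree, which supports only a limited XPath subset rather than XPath 3.0, so the expression is written as .//name relative to the root element. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def total_pages_from_xml(body: str, element_name: str) -> int:
    """Find the first descendant element with this name and parse its text."""
    root = ET.fromstring(body)
    # ".//name" searches all descendants of the root element.
    text = root.findtext(f".//{element_name}")
    if text is None:
        raise ValueError(f"element <{element_name}> not found")
    return int(text.strip())


xml_body = "<results><total_pages>10</total_pages></results>"
print(total_pages_from_xml(xml_body, "total_pages"))
```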

Start URL template

Configuration key

plugin.generate-start-urls.config.start-url-template

Data type

string

Value format

Allowed values must match the regular expression:

((http|https)://)(www.)?[a-zA-Z0-9@:%._+~#?&/=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%._+~#?&/=${}]*)

Required

This setting is required

Template used to define the start URLs that will be generated. ${page} in the template will be replaced with the page number when generating the URL list.

For example, if the URL is http://example.com/page/1, the template should be http://example.com/page/${page}
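The substitution can be sketched in a few lines of Python. This is an illustration only, not the plugin's code: it assumes page numbers start at 1 and run up to the total page count, as in the examples in this document, and the function name is hypothetical. Python's string.Template happens to use the same ${page} placeholder syntax.

```python
from string import Template
from typing import List


def generate_start_urls(template: str, total_pages: int) -> List[str]:
    """Expand ${page} in the template for pages 1..total_pages."""
    url_template = Template(template)
    # Assumption: numbering starts at 1 and includes the last page.
    return [
        url_template.substitute(page=page)
        for page in range(1, total_pages + 1)
    ]


for url in generate_start_urls("http://example.com/page/${page}", 3):
    print(url)
```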

Total page API authentication type

Configuration key

plugin.generate-start-urls.config.total-page-api-auth-type

Data type

string

Default value

None

Allowed values

None,HTTP Basic Authentication

Required

This setting is required

This option configures the API authentication type.

Only HTTP Basic authentication is supported.

HTTP basic authentication username

Configuration key

plugin.generate-start-urls.config.http-basic-auth-username

Data type

string

Required

This setting is optional

Defines the HTTP basic authentication username.

This field is required if the Total page API authentication type is set to HTTP Basic Authentication.

HTTP basic authentication password

Configuration key

plugin.generate-start-urls.encrypted.http-basic-auth-password

Data type

Encrypted string

Required

This setting is optional

Defines the HTTP basic authentication password.

This field is required if the Total page API authentication type is set to HTTP Basic Authentication.
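For background, HTTP Basic authentication (RFC 7617) sends the username and password as a Base64-encoded username:password pair in the Authorization request header. The sketch below shows how such a header value is built in Python; it is illustrative only and says nothing about how the plugin stores or sends the credentials. The function name is hypothetical.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an Authorization header for HTTP Basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Sample credentials taken from the example in RFC 7617.
print(basic_auth_header("Aladdin", "open sesame"))
```

Because Base64 is an encoding rather than encryption, Basic authentication should only be used over HTTPS.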

Examples

Generate start URLs from an HTTP header

Assume https://www.example.com/tutorial-lists is an API that returns the total number of pages in the HTTP header field X-Total-Page-Count.

Configuration key name    Value
Total page API            https://www.example.com/tutorial-lists
Total page API Type       HTTP header
HTTP header field name    X-Total-Page-Count
Start URL template        https://www.example.com/tutorial-list?page=${page}

If X-Total-Page-Count contains a value of 5, the following start URLs will be generated for the data source:

  • https://www.example.com/tutorial-list?page=1

  • https://www.example.com/tutorial-list?page=2

  • https://www.example.com/tutorial-list?page=3

  • https://www.example.com/tutorial-list?page=4

  • https://www.example.com/tutorial-list?page=5

Generate start URLs from JSON

Assume https://api.example.com/tutorial-lists?format=json is an API that returns the total number of pages in the JSON response field totalPageCount, as in the following JSON response:

{
  "tutorials": [
    {
      "title": "Tutorial 1",
      "url": "https://www.example.com/tutorial-1"
    },
    {
      "title": "Tutorial 2",
      "url": "https://www.example.com/tutorial-2"
    }
  ],
  "totalPageCount": 5,
  "currentPage": 1
}

Configuration key name    Value
Total page API            https://api.example.com/tutorial-lists?format=json
Total page API Type       JSON
JSONPath                  $.totalPageCount or $['totalPageCount']
Start URL template        https://api.example.com/tutorial-lists?format=json&page=${page}

The following start URLs will be generated for the data source:

  • https://api.example.com/tutorial-lists?format=json&page=1

  • https://api.example.com/tutorial-lists?format=json&page=2

  • https://api.example.com/tutorial-lists?format=json&page=3

  • https://api.example.com/tutorial-lists?format=json&page=4

  • https://api.example.com/tutorial-lists?format=json&page=5

Generate start URLs from XML

Assume https://api.example.com/tutorial-lists?format=xml is an API that returns the total number of pages in the XML response field totalPage, as in the following XML response:

<?xml version="1.0" encoding="UTF-8"?>
<tutorials>
    <tutorial>
        <title>Tutorial 1</title>
        <url>https://www.example.com/tutorial-1</url>
    </tutorial>
    <tutorial>
        <title>Tutorial 2</title>
        <url>https://www.example.com/tutorial-2</url>
    </tutorial>
    <totalPage>5</totalPage>
    <currentPage>1</currentPage>
</tutorials>

Configuration key name    Value
Total page API            https://api.example.com/tutorial-lists?format=xml
Total page API Type       XML
XPath                     //totalPage
Start URL template        https://api.example.com/tutorial-lists?format=xml&page=${page}

The following start URLs will be generated for the data source:

  • https://api.example.com/tutorial-lists?format=xml&page=1

  • https://api.example.com/tutorial-lists?format=xml&page=2

  • https://api.example.com/tutorial-lists?format=xml&page=3

  • https://api.example.com/tutorial-lists?format=xml&page=4

  • https://api.example.com/tutorial-lists?format=xml&page=5

Change log

[1.1.0]

Added

  • Added support for HTTP basic authentication when downloading the start URLs.

Changed

  • Updated WireMock to the latest version (3.9.1).