Web archive (WARC)

This component is currently in the Beta phase of development. We encourage users to submit feedback, and we will prioritize fixes for any issues that are reported.

This is a Premium connector. Only customers with a subscription to the web archiving application will be supplied with the required credentials by the Integrations team.

Credentials

Name

The name of your credential.

URL

The URL of the SQS Queue for this customer’s tenant. This will be provided by the Integrations team when the tenant is deployed.

Access key ID

The ID of the AWS access key for the tenant. This will be provided by the Integrations team when the tenant is deployed. Do not share this with anyone outside of the Integrations and Implementation teams.

Secret access key

The secret access key for the tenant. This will be provided by the Integrations team when they deploy the tenant. Do not share this with anyone outside of the Integrations and Implementation teams.

Region

The AWS region in which the tenant is deployed. This can be derived from the Queue URL. For example: ap-southeast-2
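As a rough illustration of deriving the region from the Queue URL, the following Python sketch assumes the URL uses the standard SQS endpoint form, sqs.<region>.amazonaws.com. The function name, account ID, and queue name are placeholders invented for this example; URLs in other forms (for example, legacy or non-standard endpoints) would need different handling.

```python
from urllib.parse import urlparse


def region_from_queue_url(queue_url: str) -> str:
    """Return the AWS region embedded in a standard SQS queue URL."""
    host = urlparse(queue_url).hostname or ""
    parts = host.split(".")
    # Standard endpoints look like sqs.<region>.amazonaws.com
    if len(parts) < 4 or parts[0] != "sqs" or parts[-2:] != ["amazonaws", "com"]:
        raise ValueError(f"Not a recognised SQS queue URL: {queue_url!r}")
    return parts[1]


# Hypothetical queue URL: the account ID and queue name are placeholders.
example_url = "https://sqs.ap-southeast-2.amazonaws.com/123456789012/example-tenant-queue"
print(region_from_queue_url(example_url))
```

If the URL supplied by the Integrations team does not match this pattern, confirm the region with them rather than guessing.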

Client name

The name of the client. This will be used across the tenant and Connect instance and should be agreed upon with the Integrations team before they deploy the tenant.

Actions

Send message

The only action available in this component. It sends a message to the SQS queue of the customer’s tenant, containing the data required to initialize the archiving process for a single page.

Input fields

An asterisk (*) denotes a required field.

Callback flow ID*

The ID of Connect flow #3 in your customer’s workspace. Each .warc file is sent to this flow, and Connect then forwards it to the customer’s chosen cloud storage solution.

URL to be archived*

The full customer-facing URL of the page that the archive process will visit to generate the .warc file.

Last modified date of the page

This can be configured when setting up the Matrix trigger.

Matrix asset ID of the page

This can be configured when setting up the Matrix trigger.

Additional data

Include any additional data here as a JSON object. This data will be added to the .warc file. Examples of additional data are description or collection.

See the JSON object below for an example:

{"lastmodby": "editor", "description": "employment", "collection": "Archive Test Website", "disableJS": true}

The additional data JSON object is configured by default to accept the disableJS parameter.

If this is set to true, the archive process will disable JavaScript when it visits the page to archive. This can reduce the size of the generated .warc file if the page is JavaScript-heavy.

Avoid setting this value manually. Allow the Connect flow to extract it from the webhook data. Steps 4 and 7 of flow #3 will automatically detect a .warc file that is too large and attempt to re-trigger flow #2 with the disableJS value set to true.

The default value of false will be used if the field is not included, meaning JavaScript will be enabled for the archive.
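The behaviour described above can be sketched in Python as follows. This is only an illustration, not the connector's implementation: the function name and the shape of the webhook data are assumptions made for the example. It copies disableJS through only when the webhook data already carries it, so the default of false applies otherwise.

```python
import json


def build_additional_data(fields: dict, webhook_data: dict) -> str:
    """Build the Additional data JSON string for the Send message action."""
    data = dict(fields)
    # Only pass disableJS through when the webhook data supplies it;
    # leaving it out means JavaScript stays enabled (default of false).
    if "disableJS" in webhook_data:
        data["disableJS"] = bool(webhook_data["disableJS"])
    return json.dumps(data)


# Hypothetical inputs: the webhook payload shape is an assumption.
fields = {"lastmodby": "editor", "description": "employment",
          "collection": "Archive Test Website"}
print(build_additional_data(fields, {}))
print(build_additional_data(fields, {"disableJS": True}))
```

The first call omits disableJS entirely; the second mirrors a re-triggered run in which the flow has set the value to true.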

Output

The action will output a JSON representation of the result of calling the SQS API to add a message to the queue. If successful, it will emit something similar to the following:

{
  "ResponseMetadata": {
    "RequestId": "351785b3-9d45-5639-8b26-ded1ed7ca8b6"
  },
  "MD5OfMessageBody": "f20eee3dbb643b7486c2da434467d7a6",
  "MD5OfMessageAttributes": "7a35f1c8c293a704405f778be2dcbf19",
  "MessageId": "6fbd9555-d8a4-41f8-9af0-f490ddb2c86e"
}

If unsuccessful, it will emit an error message.
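A later step that consumes this output could tell the two cases apart by checking for the MessageId field. The Python sketch below assumes the output has already been parsed into a dictionary; the function name is invented for this example, and because the shape of the error message is not documented here, anything without a MessageId is simply treated as a failure.

```python
def message_was_queued(result) -> bool:
    """Return True when the action output looks like a successful SQS send."""
    # A successful send includes the ID that SQS assigned to the message.
    return isinstance(result, dict) and bool(result.get("MessageId"))


# Sample output copied from the example above.
sample_output = {
    "ResponseMetadata": {"RequestId": "351785b3-9d45-5639-8b26-ded1ed7ca8b6"},
    "MD5OfMessageBody": "f20eee3dbb643b7486c2da434467d7a6",
    "MD5OfMessageAttributes": "7a35f1c8c293a704405f778be2dcbf19",
    "MessageId": "6fbd9555-d8a4-41f8-9af0-f490ddb2c86e",
}
print(message_was_queued(sample_output))
```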

Running the sample data process for this action will send a real message to the SQS Queue and thus trigger a page archive.