predirects.cfg

predirects.cfg declares, ahead of time, redirects that the web crawler should follow during form interaction.

It is most commonly used to progress past terms and conditions and confirmation pages that appear immediately after the crawler has logged into a site using form interaction.

The link behind the confirm button on these kinds of pages is generally unsuitable for storing on the frontier and crawling later, because it may have expired by the time the crawler gets to it.

If the confirmation page you are trying to click through is a regular HTML form, you do not need predirects and should simply declare another rule in form interaction. Predirects are only for links or buttons whose target URL can be scraped from the page source using a regex.

Setting up

To enable:

File format

Comment lines (beginning with #) are ignored.

Configuration is done with pairs of lines, where:

  • the first line of each pair is a regex matched against the URL of the page the crawler is currently on, and

  • the second line of each pair is a regex that captures the URL the crawler should follow immediately. This regex must contain one capture group (parentheses) identifying the part of the page source to use as the URL.

E.g.

predirects.cfg
https://www.example.com/confirm
<a[^>]*id="clickhere"[^>]*href="([^"]*)">

https://www.elsewhere.com/authorize
<a[^>]*class="button"[^>]*href="([^"]*)">

In the first example, if the crawler found itself at https://www.example.com/confirm during form interaction, it would search that page's source for a match to the regex, i.e. an anchor tag with id "clickhere" and an href whose value is enclosed in double quotes. Because the parentheses (capture group) sit immediately inside the quotes, the crawler discards the quotes when reading the URL value. Note that, as written, this regex only matches when the id attribute appears before href and href is the last attribute in the tag.
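The pairing and capture behaviour described above can be sketched in Python. This is only an illustration of the file format, not the crawler's actual implementation: the function names parse_predirects and find_predirect are hypothetical, and the sketch assumes that blank lines are skipped, that both regexes are applied as unanchored searches, and that a relative captured URL is resolved against the current page URL.

```python
import re
from urllib.parse import urljoin


def parse_predirects(text):
    """Read (url_regex, capture_regex) pairs from predirects.cfg-style text.

    Assumption: comment lines (starting with #) and blank lines are skipped.
    """
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line and not line.startswith("#")]
    if len(lines) % 2 != 0:
        raise ValueError("predirects configuration must consist of pairs of lines")
    return [
        (re.compile(lines[i]), re.compile(lines[i + 1]))
        for i in range(0, len(lines), 2)
    ]


def find_predirect(rules, current_url, page_source):
    """Return the URL to follow immediately, or None if no rule applies."""
    for url_re, capture_re in rules:
        # Assumption: the first regex is an unanchored search on the URL.
        if url_re.search(current_url):
            match = capture_re.search(page_source)
            if match:
                # Group 1 is the single capture group holding the target URL.
                return urljoin(current_url, match.group(1))
    return None


CONFIG = '''
# Click through the confirmation page shown after login
https://www.example.com/confirm
<a[^>]*id="clickhere"[^>]*href="([^"]*)">
'''

# Hypothetical page source for the confirmation page.
PAGE = '<p>Terms</p><a id="clickhere" href="/home?token=abc">I agree</a>'

rules = parse_predirects(CONFIG)
target = find_predirect(rules, "https://www.example.com/confirm", PAGE)
print(target)
```

Testing a capture regex against a saved copy of the page source in this way, before adding it to predirects.cfg, is a quick check that the capture group isolates the URL without the surrounding quotes.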

See also