Specifies a URL of the page containing the HTML web form in pre_crawl mode.
Can be set in: collection.cfg
This parameter specifies the URL of a page which contains HTML web forms, not the action URL of the script which processes the form. This can be used to support form-based authentication using cookies, allowing the webcrawler to log in to a secure area in order to crawl it.
In pre_crawl mode, the webcrawler logs in once at the start of the crawl to obtain one or more authentication cookies. These are then used during the crawl to access authenticated content.
The values to pass to the form can be specified using either the crawler.form_interaction.pre_crawl.groupId.cleartext.urlParameterKey or the crawler.form_interaction.pre_crawl.groupId.encrypted.urlParameterKey key.
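As a rough illustration of the two key shapes, the fragment below passes a user name in cleartext and a password in encrypted form for group 1. The field names username and password, the account name, and the group ID are hypothetical; the real field names depend on the form being submitted. The encrypted value is shown only as a placeholder, since how it is produced is not covered here.

```
# Hypothetical form field "username", passed as-is
crawler.form_interaction.pre_crawl.1.cleartext.username=crawler-account
# Hypothetical form field "password"; the value is a placeholder, not a real encrypted string
crawler.form_interaction.pre_crawl.1.encrypted.password=<encrypted value>
```

Both keys share the same groupId (1), which is what ties them to the same form interaction.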
Things to note
- You may need to add an exclude_pattern for any "logout" links so the crawler does not log itself out when it starts crawling the authenticated content.
- You may need to set the value of crawler.user_agent to an appropriate browser user agent string in order to log in to some sites.
- You may need to specify an appropriate interval to renew the authentication cookies by re-running the form interaction process periodically.
Any cookie collected during the authentication process is sent in a header with every request the crawler makes during the crawl. However, the crawler.accept_cookies setting still applies: if you disable it, only the authentication cookie is sent; if you enable it, the crawler collects cookies during the crawl in addition to the authentication cookie.
Note: Depending on the site you are trying to crawl you may need to turn off general cookie processing to get authentication to work. This might be the case if the site being crawled causes the required authentication cookies to be overwritten. You can avoid this by setting crawler.accept_cookies to 'false'.
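A minimal sketch of the setting described in the note, assuming it is placed in collection.cfg alongside the form interaction keys:

```
# Do not collect cookies during the crawl, so the authentication
# cookies obtained from the form interaction are not overwritten
crawler.accept_cookies=false
```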
None. No pre_crawl URLs are configured by default.
To specify a URL with forms in pre_crawl mode, set this key to the address of the page that contains the form.
The crawler groups the pre_crawl authentication configuration for a given URL by matching the group ID portion of the keys.
If you need to specify a form number for the URL https://www.example.com/login, the group ID in both keys must be the same, for example 1.
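One way the grouping could look in collection.cfg, using group ID 1 for both keys. The .url and .form_number key suffixes are assumed here by analogy with the cleartext and encrypted keys named earlier, and the form number 2 is an arbitrary illustrative value; check both against the reference for your version.

```
# Assumed key name for this parameter: the page that contains the login form
crawler.form_interaction.pre_crawl.1.url=https://www.example.com/login
# Assumed key name for the form number; 2 is an illustrative value only
crawler.form_interaction.pre_crawl.1.form_number=2
```

Because both keys carry the same group ID, the crawler would treat them as one pre_crawl configuration for that URL.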