Specifies a URL or URL pattern of the page containing the HTML web form in in_crawl mode.
Can be set in: collection.cfg
This parameter specifies the URL or URL pattern for the form action (the processing end-point), not the URL of the page containing the form.
In in_crawl mode the webcrawler submits form details during a crawl whenever a form with a specific "action" is encountered.
The values which should be passed to the form can be specified using either crawler.form_interaction.in_crawl.groupId.cleartext.urlParameterKey or crawler.form_interaction.in_crawl.groupId.encrypted.urlParameterKey keys.
If, during the crawl, the crawler parses a form whose action resolves to the same absolute URL, it submits the specified values. This simulates the behaviour of a human who browses to password-protected content and is asked to authenticate using a form which submits the form details to "login.jsp". It also handles the situation where there may be a series of login/authentication URLs and redirects: as long as the crawler eventually downloads HTML containing the required form action, it will submit the required credentials.
Assuming the specified credentials are correct and the login (and subsequent redirects) succeed, the authenticated content will be downloaded and stored using the original URL that was requested. Any authentication cookies generated during this process will be cached so that subsequent requests to that site do not require the webcrawler to log in again.
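The matching step described above can be sketched in Python. This is only an illustration of the idea (resolve each form's action against the page URL, then compare it with the configured action), not the crawler's actual implementation. The names FormActionParser and matching_form_actions are hypothetical, and the sketch assumes an exact URL comparison rather than a URL pattern.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActionParser(HTMLParser):
    """Collect the raw action attribute of every <form> element."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A form without an action submits to the page's own URL.
            self.actions.append(dict(attrs).get("action") or "")


def matching_form_actions(page_url, html, configured_action):
    """Return the absolute form actions on the page equal to configured_action.

    Assumption: exact string comparison; a real crawler may match patterns.
    """
    parser = FormActionParser()
    parser.feed(html)
    # Relative actions are resolved against the URL of the page being parsed.
    absolute = [urljoin(page_url, action) for action in parser.actions]
    return [action for action in absolute if action == configured_action]


page_url = "https://www.example.com/secure/index.html"
html = '<form method="post" action="/login.jsp"><input name="user"></form>'
print(matching_form_actions(page_url, html, "https://www.example.com/login.jsp"))
```

Resolving the action against the page URL first is what allows a relative action such as "/login.jsp" to match an absolute URL in the configuration.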
- If you are using in-crawl authentication then the first field in the configuration file must be the absolute URL for the entity processing the form submission.
- A limitation of the current implementation is that only one in-crawl "form action" can be configured, which means that only the last action target found in the form_interaction.cfg file will be used. If you are using the in-crawl mode, you should therefore have only one non-comment line in your form_interaction.cfg file.
All log messages relating to form interaction will be written at the start of the main crawl.log file in the offline or live "log" directory for the collection in question.
You can use the administration interface log viewer to view this file and debug issues with form interaction if required.
In order to debug login issues you may need to look at how a browser logs in to the site in question. You can do this by using a network inspection tool to look at the network requests and responses (including cookie information) that get transmitted when you manually log in to the site in the browser.
You can then compare this network trace with the output in the crawl.log file. Some sample output is shown below:
Requested URL: https://identity.example.com/opensso/UI/Login
POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234
POST Status: Moved Temporarily
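To make the comparison with a browser trace easier, the log excerpt can be turned into structured data. The following is a minimal Python sketch. It assumes each entry of the excerpt sits on its own line in the form shown in the sample; the real layout of crawl.log may differ. The function name parse_form_log is hypothetical.

```python
import re

# Sample excerpt, assuming one log entry per line.
SAMPLE = """\
Requested URL: https://identity.example.com/opensso/UI/Login
POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234
POST Status: Moved Temporarily
"""

# Matches lines of the form "name=<name>, value=<value>".
PARAM_LINE = re.compile(r"name=(?P<name>[^,]+), value=(?P<value>.*)")


def parse_form_log(text):
    """Extract the requested URL, POST parameters and status from an excerpt."""
    result = {"url": None, "params": {}, "status": None}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Requested URL:"):
            result["url"] = line.split(":", 1)[1].strip()
        elif line.startswith("POST Status:"):
            result["status"] = line.split(":", 1)[1].strip()
        else:
            match = PARAM_LINE.fullmatch(line)
            if match:
                result["params"][match["name"]] = match["value"]
    return result


parsed = parse_form_log(SAMPLE)
print(parsed["url"], parsed["status"])
print(parsed["params"])
```

The resulting params dictionary can then be compared with the form fields that the browser sent for the same request.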
Funnelback also provides a crawler debug API call which can display the requests the crawler would send and the responses it receives while crawling a single URL.
- Make sure all parameters are accounted for. Some backends, such as ASPX applications, expect all parameters to be present in the submitted form, including parameters that look irrelevant to the authentication, such as submit button values.
- Make sure the crawler doesn't send extra empty parameters. For example, if your form has two submit inputs, "Login" and "Cancel", the value for "Cancel" should not be sent when the form is submitted. A regular browser will not send the value because the Cancel button is not clicked during login (only the "Login" button is), but the crawler must be specifically told not to send this value by setting the parameter to an empty value in the collection.cfg file (see the instructions for crawler.form_interaction.in_crawl.groupId.cleartext.urlParameterKey).
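Both checks above come down to comparing the set of parameters a browser submits with the set the crawler submits. A minimal Python sketch of that comparison follows. The function name compare_parameters and the sample field names are hypothetical, and the two dictionaries would have to be filled in by hand from a browser network trace and from crawl.log.

```python
def compare_parameters(browser_params, crawler_params):
    """Report parameters the crawler omits, adds, or sends with another value."""
    browser_names = set(browser_params)
    crawler_names = set(crawler_params)
    missing = sorted(browser_names - crawler_names)
    extra = sorted(crawler_names - browser_names)
    different = sorted(
        name
        for name in browser_names & crawler_names
        if browser_params[name] != crawler_params[name]
    )
    return {"missing": missing, "extra": extra, "different": different}


# Hypothetical values copied from a browser trace and from crawl.log.
browser = {"username": "jsmith", "password": "secret", "Login": "Login"}
crawler = {"username": "jsmith", "password": "secret", "Cancel": ""}
print(compare_parameters(browser, crawler))
```

In this made-up case the report lists "Login" as missing and "Cancel" as extra, which corresponds to the two situations described in the list above.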
None. No in_crawl URLs are configured by default.
To specify a url with forms in in_crawl mode
The crawler groups the in_crawl authentication configuration for a given URL by matching the groupId segment of the keys. If you need to specify URL parameters for the URL https://www.example.com/login.jsp, then the groupId in both keys should be the same, which is 1 in the below example.
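The grouping by groupId can be illustrated with a short Python sketch that reads collection.cfg-style lines and collects the settings that share a groupId. The configuration lines are hypothetical. In particular, the key name url_pattern is an assumption, since the text above does not state the name of this parameter, and the parameter names username and Cancel are invented for the example.

```python
from collections import defaultdict

PREFIX = "crawler.form_interaction.in_crawl."

# Hypothetical collection.cfg lines; "url_pattern" is an assumed key name.
CONFIG = """\
# in_crawl form interaction settings for group 1
crawler.form_interaction.in_crawl.1.url_pattern=https://www.example.com/login.jsp
crawler.form_interaction.in_crawl.1.cleartext.username=jsmith
crawler.form_interaction.in_crawl.1.cleartext.Cancel=
"""


def group_in_crawl_settings(text):
    """Group in_crawl settings by the groupId segment of each key."""
    groups = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or not line.startswith(PREFIX):
            continue
        key, _, value = line.partition("=")
        # The first segment after the prefix is the groupId.
        group_id, _, setting = key[len(PREFIX):].partition(".")
        groups[group_id][setting] = value
    return dict(groups)


print(group_in_crawl_settings(CONFIG))
```

All three settings end up under group "1", so the URL and its parameters are treated as one configuration. A parameter key using a different groupId would land in a separate group and would not be associated with this URL.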