Form interaction enables a search administrator to configure how Funnelback will behave when it encounters a specific web form (such as a login form) on a website.
This can be used to support SAML and other form-based authentication methods using cookies, allowing the web crawler to log in to a secure area in order to crawl it.
Two modes are supported: pre-crawl form interaction and in-crawl form interaction.
Form interaction is configured by setting a number of data source configuration settings that can be set from the data source configuration screen.
When pre-crawl form interaction is configured the web crawler logs in once at the start of the crawl in order to obtain a set of authentication cookies. Each configured form is accessed, and the resulting cookies are cached by the web crawler and later used when accessing URLs.
Each pre-crawl form interaction consists of a set of configuration settings that are linked by a groupId.
crawler.form_interaction.pre_crawl.[groupId].url: Specifies the URL of the page that contains the login form (not the form action).
crawler.form_interaction.pre_crawl.[groupId].form_number: Specifies which <form> in the page to target. If there is only a single form in the page then set this to 1 (the first form).
crawler.form_interaction.pre_crawl.[groupId].cleartext.[urlParameterKey]: Used to specify form parameters that will be stored and displayed in the administration interface in clear text.
crawler.form_interaction.pre_crawl.[groupId].encrypted.[urlParameterKey]: Used to specify form parameters that will be stored encrypted and displayed in the administration interface as a set of obfuscated characters.
You do not need to specify every form parameter, only those for which you need to supply specific input values. The web crawler will parse the form and use any default values specified in the form (e.g. read from <input type="hidden"> fields) for the remaining parameters.
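As an illustration, a pre-crawl configuration for a hypothetical login page might look like the following. The groupId (login), the URL, and the parameter names (username, password) are assumptions; use the field names from the actual form:

```properties
# Hypothetical pre-crawl form interaction settings (groupId "login")
crawler.form_interaction.pre_crawl.login.url=https://example.com/login
# Target the first <form> on the page
crawler.form_interaction.pre_crawl.login.form_number=1
# Parameter names must match the form's input fields
crawler.form_interaction.pre_crawl.login.cleartext.username=crawler-account
crawler.form_interaction.pre_crawl.login.encrypted.password=secret-value
```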
Once the forms have been processed, any cookies generated are used by the web crawler to authenticate its requests for content from the site.
You may need to add an exclude_pattern for any logout links so the crawler does not log itself out when it starts crawling the authenticated content.
You may need to set the value of crawler.user_agent to an appropriate browser user agent string in order to log in to some sites.
You may need to specify an appropriate interval to renew the authentication cookies by re-running the form interaction process periodically.
Any cookie collected during the authentication process will be set in a header for every request the crawler makes during the crawl. However, the crawler.accept_cookies setting is still effective: if you disable it, only the authentication cookies will be set; if you enable it, the crawler will collect cookies during the crawl in addition to the authentication cookies.
Depending on the site you are trying to crawl you may need to turn off general cookie processing to get authentication to work. This might be the case if the site being crawled causes the required authentication cookies to be overwritten. You can avoid this by setting crawler.accept_cookies to false.
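The adjustments above can be sketched together as data source configuration settings. The exclude pattern and user agent string below are illustrative only:

```properties
# Stop the crawler from following logout links (pattern is illustrative)
exclude_patterns=/logout

# Present a browser-like user agent if the site requires it (string is illustrative)
crawler.user_agent=Mozilla/5.0 (compatible; Funnelback)

# Disable general cookie processing if the site overwrites the authentication cookies
crawler.accept_cookies=false
```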
In-crawl form interaction configures the web crawler to submit a form whenever it encounters one whose action matches a configured in-crawl form interaction setting.
Configuration for in-crawl form interaction is similar to pre-crawl form interaction. The main difference in configuration is that the URL is for the form action rather than the URL of the page that contains the form.
Each in-crawl form interaction consists of a set of configuration settings that are linked by a groupId.
crawler.form_interaction.in_crawl.[groupId].url_pattern: Specifies the URL of the form action. Any HTML encountered during the crawl that contains this form action will cause the crawler to submit the form details specified by the groupId.
crawler.form_interaction.in_crawl.[groupId].cleartext.[urlParameterKey]: Used to specify form parameters that will be stored and displayed in the administration interface in clear text.
crawler.form_interaction.in_crawl.[groupId].encrypted.[urlParameterKey]: Used to specify form parameters that will be stored encrypted and displayed in the administration interface as a set of obfuscated characters.
For example, consider the following configuration for a groupId of sample:
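A minimal sketch of such a configuration, assuming a groupId of sample and hypothetical parameter names (the actual names depend on the form being submitted):

```properties
# Hypothetical in-crawl form interaction settings (groupId "sample")
crawler.form_interaction.in_crawl.sample.url_pattern=https://sample.com/auth/login.jsp
crawler.form_interaction.in_crawl.sample.cleartext.username=crawler-account
crawler.form_interaction.in_crawl.sample.encrypted.password=secret-value
```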
If the crawler parses a form whose action matches https://sample.com/auth/login.jsp during the crawl, it submits the form with the given values (in this case password). This simulates the behaviour of a human who browses to password-protected content and is asked to authenticate using a form that submits its details to login.jsp. It also handles the situation where there may be a series of login/authentication URLs and redirects: as long as the crawler eventually downloads HTML containing the required form action, it will submit the required credentials.
Assuming the specified credentials are correct and the login (and subsequent redirects) succeed then the authenticated content will be downloaded and stored using the original URL that was requested. Any authentication cookies generated during this process will be cached so that subsequent requests to that site do not require the webcrawler to login again.
All log messages relating to form interaction are written at the start of the main gather.log file in the offline or live log directory for the data source in question.
You can use the administration interface log viewer to view this file and debug issues with form interaction if required.
Funnelback provides a crawler debug API call which can display a trace of the requests the crawler sends and the responses it receives while crawling a single URL.
Please note that because passwords can be revealed in the requests this endpoint requires access to the data source and the
The debug API can be accessed interactively from the administration interface by selecting API UI from the system menu, accessing the admin API then browsing to the crawler debug API call.
In order to debug login issues you may need to look at how a browser logs in to the site in question. You can do this by using your web browser's built-in developer tools to inspect the network requests and responses (including cookie information) that get transmitted when you manually log in to the site in the browser.
You can then compare this network trace with the output from the crawler debug API, or with the contents of gather.log. Some sample output is shown below:
Requested URL: https://identity.example.com/opensso/UI/Login
POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234
POST Status: Moved Temporarily
Make sure all form parameters are accounted for. Some backends, such as ASPX applications, expect all parameters to be present in the submitted form, including parameters that look irrelevant to authentication, such as submit button values.
Make sure the crawler does not send extra empty parameters. For example, if your form has two submit inputs, login and cancel, the value for cancel should not be sent when the form is submitted. A regular browser does not send this value because the cancel button is not clicked during login, but the crawler must be explicitly told not to send it by setting the parameter to an empty value in the form interaction configuration.
The use of form interaction overrides the value of the crawler.request_header setting if this has been specified.