Form interaction

Form interaction enables a search administrator to configure how Funnelback will behave when it encounters a specific web form (such as a login form) on a website.

This supports SAML and other form-based authentication methods that use cookies, allowing the web crawler to log in to a secure area and crawl it.

There are two modes supported in this feature:

  1. Pre-crawl authentication

  2. In-crawl authentication

Form interaction is configured through data source configuration settings, which can be set from the data source configuration screen.

Pre-crawl authentication

When pre-crawl form interaction is configured, the web crawler logs in once at the start of the crawl to obtain a set of authentication cookies. Each configured form is accessed, and the resulting cookies are cached by the web crawler and used later when accessing URLs.

Each pre-crawl form interaction consists of a set of configuration settings that are linked by a groupId.

You may not need to specify all form parameters, only those for which you need to give specific input values. The web crawler will parse the form and use any default values specified in the form (e.g. read from <input type="hidden"> fields) for the other parameters.
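The merging of form defaults with configured values can be illustrated with a minimal Python sketch. This is not Funnelback's implementation: the class FormDefaults, the function build_submission and the sample field names are hypothetical, and the sketch assumes that only the first form on the page is of interest.

```python
from html.parser import HTMLParser


class FormDefaults(HTMLParser):
    """Collect name/value pairs from <input> tags in a page's first form."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.done = False
        self.defaults = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self.done:
            self.in_form = True
        elif tag == "input" and self.in_form and "name" in attrs:
            # Inputs without a value attribute default to an empty string.
            self.defaults[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self.in_form:
            self.in_form = False
            self.done = True  # ignore any later forms on the page


def build_submission(html, configured):
    """Start from the form's own defaults, then apply configured values."""
    parser = FormDefaults()
    parser.feed(html)
    return {**parser.defaults, **configured}
```

Calling build_submission with a login page and the dictionary {"user": "john", "password": "1234"} returns those two values together with any hidden fields found in the form, which is why hidden fields such as session tokens usually do not need to be configured by hand.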

Once the forms have been processed, the web crawler uses any cookies generated to authenticate its requests for content from the site.

Things to note

  • You may need to add an exclude_pattern for any logout links so the crawler does not log itself out when it starts crawling the authenticated content.

  • You may need to manually specify a parameter if it is generated by JavaScript, as the crawler does not interpret JavaScript in forms.

  • You may need to set the value of crawler.user_agent to an appropriate browser user agent string in order to log in to some sites.

  • You may need to specify an appropriate interval to renew the authentication cookies by re-running the form interaction process periodically.

Any cookie collected during the authentication process is sent in a header with every request the crawler makes during the crawl. The crawler.accept_cookies setting still applies: if you disable it, only the authentication cookies are sent; if you enable it, the crawler also collects cookies during the crawl in addition to the authentication cookies.

Depending on the site you are trying to crawl you may need to turn off general cookie processing to get authentication to work. This might be the case if the site being crawled causes the required authentication cookies to be overwritten. You can avoid this by setting crawler.accept_cookies to false.
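The interaction between authentication cookies and crawler.accept_cookies can be pictured with a simplified Python model. This is only a sketch of the behaviour described above, not the crawler's actual cookie handling: the function name cookies_for_request and the plain-dictionary cookie store are assumptions made for illustration.

```python
def cookies_for_request(auth_cookies, crawl_cookies, accept_cookies):
    """Return the cookies sent with a request under this simplified model."""
    # Authentication cookies are always sent.
    sent = dict(auth_cookies)
    if accept_cookies:
        # Cookies collected during the crawl are added. A crawl cookie with
        # the same name replaces the authentication cookie, which is the
        # failure case that disabling crawler.accept_cookies avoids.
        sent.update(crawl_cookies)
    return sent
```

In this model, a site that reissues a cookie named SESSION for anonymous visitors would replace the authenticated SESSION value when accept_cookies is true, while the authenticated value survives when it is false.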

In-crawl authentication

With in-crawl form interaction, the web crawler submits a form whenever it encounters one during the crawl whose action matches a configured in-crawl form interaction.

Configuration for in-crawl form interaction is similar to pre-crawl form interaction. The main difference is that the configured URL is that of the form action, rather than that of the page containing the form.

Each in-crawl form interaction consists of a set of configuration settings that are linked by a groupId.

For example, given the following configuration for the groupId sample:

  • crawler.form_interaction.in_crawl.sample.url_pattern: https://sample.com/auth/login.jsp

  • crawler.form_interaction.in_crawl.sample.cleartext.user: john

  • crawler.form_interaction.in_crawl.sample.encrypted.password: 1234

If, during the crawl, the crawler parses a form whose action matches https://sample.com/auth/login.jsp, it submits the form with the given values (in this case user and password). This simulates the behaviour of a human who browses to password-protected content and is asked to authenticate using a form that submits its details to login.jsp. It also handles the situation where there is a series of login/authentication URLs and redirects: as long as the crawler eventually downloads HTML containing the required form action, it will submit the required credentials.

Assuming the specified credentials are correct and the login (and any subsequent redirects) succeeds, the authenticated content is downloaded and stored using the original URL that was requested. Any authentication cookies generated during this process are cached so that subsequent requests to that site do not require the web crawler to log in again.
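The matching step described above can be sketched in Python. This is an illustration only, not the crawler's code: the names FormActions and matching_interaction are hypothetical, and the sketch assumes that url_pattern is compared for exact equality with the form action after it has been resolved against the page URL. The real matching rules may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActions(HTMLParser):
    """Collect the absolute action URL of every form on a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            action = dict(attrs).get("action") or ""
            # Relative actions are resolved against the page URL.
            self.actions.append(urljoin(self.page_url, action))


def matching_interaction(page_url, html, interactions):
    """Return (group_id, action, parameters) for the first match, else None."""
    parser = FormActions(page_url)
    parser.feed(html)
    for action in parser.actions:
        for group_id, settings in interactions.items():
            # Assumption: exact match between action and url_pattern.
            if action == settings["url_pattern"]:
                return group_id, action, settings["parameters"]
    return None
```

With the sample configuration above, a page at https://sample.com/auth/start containing a form whose action is login.jsp resolves to https://sample.com/auth/login.jsp and so selects the sample group, while a page with no matching form returns None and is processed normally.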

Logging

All log messages relating to form interaction will be written at the start of the main gather.log file in the offline or live log directory for the data source in question.

You can use the administration interface log viewer to view this file and debug issues with form interaction if required.

Debugging

Funnelback crawler debug API

Funnelback provides a crawler debug API call which can display a trace of the requests the crawler sends and the responses it receives while crawling a single URL.

Because passwords can be revealed in the requests, this endpoint requires access to the data source and the sec.administer.system permission.

The debug API can be accessed interactively from the administration interface by selecting API UI from the system menu, accessing the admin API then browsing to the crawler debug API call.

Debug using a web browser

In order to debug login issues you may need to look at how a browser logs in to the site in question. You can do this by using your web browser’s built-in developer tools to look at the network requests and responses (including cookie information) that are transmitted when you manually log in to the site in the browser.

You can then compare this network trace with the output from the crawler debug API or with the contents of gather.log. Some sample output is shown below:

Requested URL: https://identity.example.com/opensso/UI/Login

POST Parameters:
name=goto, value=http://my.example.com/user/
name=IDToken2, value=username
name=IDToken1, value=1234

POST Status: Moved Temporarily

In this example, comparing the POST parameters with those in the browser trace showed that the goto parameter was different. Investigation of the HTML source of the login form showed that this parameter was being generated by JavaScript.

Since the crawler does not interpret JavaScript, this parameter would need to be added explicitly to the form interaction configuration.
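The comparison between the browser trace and the crawler trace can be done by hand, or with a small helper such as the Python sketch below. The function name diff_parameters is hypothetical, and the parameter dictionaries have to be copied manually from the browser's developer tools and from the crawler output.

```python
def diff_parameters(browser, crawler):
    """Return {name: (browser_value, crawler_value)} for every mismatch.

    A value of None means the parameter was missing from that trace.
    """
    names = sorted(set(browser) | set(crawler))
    return {
        name: (browser.get(name), crawler.get(name))
        for name in names
        if browser.get(name) != crawler.get(name)
    }
```

Applied to the sample output above, the crawler dictionary would hold goto, IDToken2 and IDToken1 as logged. Any parameter whose value differs in the browser trace, or that appears in only one of the two traces, shows up in the result and is a candidate for explicit configuration.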

Troubleshooting tips

  • Try to log in to the form with your browser, but with JavaScript disabled. If that doesn’t work, the form relies on JavaScript execution and the crawler won’t be able to process it.

  • Make sure all form parameters are accounted for. Some backends, such as ASPX applications, expect all parameters to be present in the submitted form, including parameters that look irrelevant to authentication, such as submit button values.

  • Make sure the crawler does not send extra empty parameters. For example, if your form has two submit inputs, login and cancel, the value for cancel should not be sent when the form is submitted. A regular browser will not send the value because the cancel button is not clicked during login, but the crawler must be specifically told not to send it by setting the parameter to an empty value in the form interaction configuration.

Caveats

The use of form interaction:

© 2015- Squiz Pty Ltd