Debugging form-based authentication

This article applies to form-based authentication (form interaction) for the web crawler. It should not be confused with the other forms of authentication supported by the web crawler (such as HTTP basic authentication and NTLM authentication).

This article provides guidance on debugging issues associated with form-based authentication.

Funnelback can be configured to perform form-based authentication when connecting to a website by configuring form interaction. Form-based authentication covers situations where a user has to fill in an HTML form to log in to a website. The underlying authentication mechanism can vary and may include types of authentication such as SAML.

Form-based authentication can be tricky to configure and is notoriously difficult to debug. The form often has a JavaScript layer built in, and the authentication may involve several redirects. The implementer needs to account for both when configuring the form interactions.

The best way to troubleshoot form-based authentication is to use Funnelback’s debug API, which will request a specified URL and show all the request and response headers, returned data and redirects that occur when the request is made.

This tutorial assumes that form interaction has been set up for the data source.

  1. Log in to the search dashboard and select View API UI from the menu.

  2. The Admin API calls will be listed. Scroll down then expand the debug section.

  3. The http-request call allows you to debug the set of requests that occur when Funnelback requests a URL listed inside the collection configuration using either a crawler.form_interaction.in_crawl.<GROUP-ID>.url_pattern or crawler.form_interaction.pre_crawl.<GROUP-ID>.url setting.

  4. Click on the GET /crawler/v1/debug/collections/<collection>/http-request heading to expand the API test form. Fill in the form with the data source (collection) ID (this causes the API to load the form interaction configuration from the specified data source) and the URL to test (this should be the URL that you used in your data source configuration), then select the level to debug. The level values provide varying degrees of information. BODY (the default value) provides the most information, including request and response headers as well as the response data. It is often best to start with BASIC and gradually increase the level, as this gives insight into any redirects that occur and the values of any cookies returned in the HTTP headers.
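
The same call can also be made outside the API UI. The following Python sketch builds the request URL for the http-request endpoint named in step 4. It is illustrative only: the base URL, the query parameter names (url and level) and the authentication headers are assumptions that are not confirmed by this article, so check the API UI for the exact names used by your installation.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def build_debug_request_url(base, collection, test_url, level="BODY"):
    """Build the URL for the http-request debug call.

    The endpoint path comes from the article. The query parameter names
    "url" and "level" are assumptions; confirm them in the API UI.
    """
    path = f"/crawler/v1/debug/collections/{quote(collection, safe='')}/http-request"
    query = urlencode({"url": test_url, "level": level})
    return f"{base.rstrip('/')}{path}?{query}"


def fetch_debug_output(request_url, auth_headers):
    """Send the GET request and return the response body as text.

    auth_headers is whatever your Admin API requires (hypothetical here);
    it is passed through unchanged.
    """
    request = Request(request_url, headers=auth_headers, method="GET")
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical base URL, data source ID and login URL, for illustration only.
example_url = build_debug_request_url(
    "https://funnelback.example.com/admin-api",
    "my-source",
    "https://example.com/login",
    level="BASIC",
)
```

Starting with level="BASIC" and repeating the call with a higher level mirrors the approach recommended in step 4.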

The information that is returned by the debug call will provide an insight into what will need to be changed.

Common problems with form interaction include:

  • JavaScript that builds up a request. The implementer will need to work out what request the JavaScript builds and put the resulting request URL into the form interaction configuration.

  • Circular redirects that may occur when requesting the URL. Some of the redirects may also rely on JavaScript, which again must be worked around. The URL that was set for form interaction may also need to be changed to another URL that is produced by the JavaScript processing.
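
When the debug output shows a long chain of redirects, a small script can help spot a loop. The following Python sketch assumes you have copied the status code and Location header of each response from the debug output into a list by hand. The function name and the input format are hypothetical and are not part of Funnelback.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def find_redirect_loop(start_url, hops):
    """Return the repeating part of a redirect chain, or None if no loop.

    hops is a list of (status_code, location_header) pairs in the order
    they appear in the debug output. Relative Location values are resolved
    against the URL that produced them.
    """
    seen = [start_url]
    current = start_url
    for status, location in hops:
        if status not in REDIRECT_STATUSES or not location:
            break  # chain ended with a non-redirect response
        current = urljoin(current, location)
        if current in seen:
            # Return the URLs from the first visit up to the repeat.
            return seen[seen.index(current):] + [current]
        seen.append(current)
    return None


# Hypothetical chain: the login page sends the crawler back to itself.
loop = find_redirect_loop(
    "https://example.com/app",
    [(302, "/login"), (302, "/sso?next=/app"), (302, "/login")],
)
```

A repeated URL is only a hint. Some login flows legitimately return to the same URL once a cookie has been set, so compare the cookies in the HTTP headers before concluding that the redirect is circular.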

See also