Crawling Password Protected Sites

Introduction

Some websites are protected by an authentication scheme that requires a username and password to access the site. For Funnelback to successfully crawl password-protected sites, it must be given a valid username and password to use.

The authentication schemes that Funnelback currently supports are:

  • HTTP Basic Authentication
  • Windows Integrated Authentication (NTLM)

Giving Funnelback a username and password

Funnelback supports multiple username/password pairs per collection. If you have a single account to configure, you can set the values using parameters in the collection's collection.cfg file, as shown in the example below. To allow Funnelback access to the protected website:

  • Set the http_user parameter to a valid username.
  • Set the http_passwd parameter to the username's password.
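
For example, a minimal sketch of these settings in collection.cfg might look like the following (the account name and password shown here are placeholders, not real values):

    http_user=crawluser
    http_passwd=examplePassword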

For a web collection these settings can be accessed by going to the "Administer" tab, clicking on "Edit Collection Settings" and then going to the "HTTP" tab. The fields to edit are called "HTTP username" and "HTTP password".

When specifying a username for crawling an NTLM-protected site, you will probably need to include a domain. The username must be given in the form:

    DOMAIN\username

i.e. user@domain is not supported.
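
For example, a hypothetical collection.cfg entry for a placeholder account jsmith in a placeholder domain MYDOMAIN would be:

    http_user=MYDOMAIN\jsmith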

Specifying Multiple Usernames and Passwords

If you need to specify multiple accounts for different web servers you can configure this using the Site Profiles mechanism.

Selecting an authentication scheme

The correct authentication scheme to use (either HTTP Basic or Windows Integrated (NTLM)) will depend on the site you are crawling. It can be selected in the Administration UI by going to the "Administer" tab, clicking on "Edit Collection Settings", going to the "HTTP" tab and selecting the appropriate radio button under the 'Authentication Type' option.

This has the effect of setting the HTTP library option in the collection's collection.cfg file, since different libraries are used to support the different authentication types:

For HTTP Basic authentication:

    crawler.packages.httplib=HTTPClient

For Windows Integrated (NTLM):

    crawler.packages.httplib=JavaHttp

Note: The JavaHttp library does not currently support the use of cookies; you will need to use the HTTPClient library if cookie support is required.
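
Putting these settings together, a rough sketch of the collection.cfg entries involved in crawling an NTLM-protected site is shown below; the domain, username and password are placeholders:

    crawler.packages.httplib=JavaHttp
    http_user=MYDOMAIN\crawluser
    http_passwd=examplePassword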

NTLM and Windows Server 2008 R2 / Windows 7

Due to changes in NTLM authentication in Windows 7 and Windows Server 2008 R2, the crawler won't be able to authenticate using NTLM when:

  • It's run from the Admin UI as part of a collection update and
  • The Funnelback Windows service runs under a "Local System Account"

In this specific case, to get NTLM authentication working, do one of the following:

  • Run the Funnelback Windows service as a domain user (Start -> Administrative Tools -> Services -> Right-click on "Funnelback-jetty-service" -> Log On tab)
  • Set specific policies (Start -> Administrative Tools -> Local Security Policy -> Local Policies -> Security Options)
    • Set "Network Security: Allow Local System to use computer identity for NTLM" to Disabled
    • Set "Network Security: Allow LocalSystem NULL session fallback" to Enabled
  • Run the update using the command line, as a domain user.
