Crawling and indexing FTP sites

Background

Funnelback includes basic support for crawling FTP sites with the web crawler.

Caveats:

  • Crawling of FTP sites does not support document-level security.

Method

  1. Create a web collection for the FTP site index

    • Configure the start URL to be the FTP site’s root page

    • Configure include/exclude patterns as for a standard web collection

    • Enable the FTP protocol by adding ftp to the crawler_protocols setting in collection.cfg.

  2. Configure authentication

    • Set the FTP username and password configuration options in collection.cfg:

      ftp_user=<FTP-USERNAME>
      ftp_passwd=<FTP-PASSWORD>

  3. Configure file types and download sizes

    • The basic file types supported by web collections are gathered. Additional file types can be added; see: Configure Funnelback to index additional file types

    • Set the maximum download and parse sizes using the crawler.max_download_size and crawler.max_parse_size settings.

  4. Crawl the site.
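
Before starting the crawl, it can help to confirm that the FTP-related settings from the steps above are actually present in collection.cfg. The following is a minimal Python sketch, not part of Funnelback: the helper names parse_cfg and check_ftp_settings are hypothetical, the setting names are taken from this page, and it assumes that collection.cfg uses simple key=value lines, that crawler_protocols holds a comma-separated list, and that lines starting with # are comments.

```python
def parse_cfg(text):
    """Parse key=value lines into a dict (assumed collection.cfg layout)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and lines treated here as comments (assumption).
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def check_ftp_settings(settings):
    """Return a list of problems with the FTP-related settings."""
    problems = []
    # Assumes crawler_protocols is a comma-separated list of protocol names.
    protocols = [
        p.strip()
        for p in settings.get("crawler_protocols", "").split(",")
        if p.strip()
    ]
    if "ftp" not in protocols:
        problems.append("ftp is not listed in crawler_protocols")
    for key in ("ftp_user", "ftp_passwd"):
        if not settings.get(key):
            problems.append(f"{key} is not set")
    return problems


# Placeholder values, as on this page; replace with real credentials.
SAMPLE = """\
crawler_protocols=http,https,ftp
ftp_user=<FTP-USERNAME>
ftp_passwd=<FTP-PASSWORD>
"""

if __name__ == "__main__":
    issues = check_ftp_settings(parse_cfg(SAMPLE))
    print(issues or "FTP settings look complete")
```

To check a real collection, read the contents of its collection.cfg file and pass the text to parse_cfg in place of SAMPLE. The check only looks at whether the settings exist; it does not contact the FTP server or validate the credentials.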