Feeds

Introduction

Funnelback normally operates by 'pulling' information: it initiates data gathering operations itself (e.g. by starting a web crawl). The feeds mechanism provides an interface through which external systems can explicitly 'push' information into Funnelback collections without relying on a user to navigate through the administration interface.

This feature is especially useful to administrators whose new data arrives sporadically but must still reach the search indexes quickly (e.g. web-based marketplaces). It is also useful for developers of applications built on top of Funnelback, or of wrappers around it.

Feeds Tasks

For simplicity, this section refers to ‘resources’ whenever it is discussing web pages, files on a file share, database records, etc., and it refers to collections of resources as ‘sites’ (even though these might be web sites, directories, databases, etc.).

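The feeds feature allows outside programs to perform the following tasks: add new content (by URL or inline), remove existing content, update a site, add a site, and remove a site.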

Interface

The Feeds interface is a web-based interface through which outside programs post XML instructions describing the actions Funnelback should take to update its indexes. Each feed instruction set should be sent to your Funnelback server as a POST request with the query string feed=<xml document>, where <xml document> is the XML document itself (the actual data, not a filename) containing the feed instructions. All feed instructions should be sent to:

https://<your funnelback server>:<your admin UI port>/search/admin/handle-feed.cgi

e.g.

https://search.company.com:8443/search/admin/handle-feed.cgi

Please note that because this interface is usually protected by a login and an encrypted channel, any application using feeds must also be able to log in and use the HTTPS protocol.
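As an illustration of the request described above, here is a minimal Python sketch that builds the POST request for a feed. It makes two assumptions that the text does not confirm: that the server accepts the 'feed' parameter as a form-encoded POST body, and that the login uses HTTP Basic authentication. The function names build_feed_request and send_feed are hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_feed_request(server, port, feed_xml, username, password):
    """Build (but do not send) a POST request carrying a feed document."""
    url = f"https://{server}:{port}/search/admin/handle-feed.cgi"
    # The XML document itself (not a filename) is the value of 'feed'.
    # Assumption: the parameter is accepted as a form-encoded POST body.
    body = urllib.parse.urlencode({"feed": feed_xml}).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    # Assumption: the admin interface uses HTTP Basic authentication.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


def send_feed(request):
    """Send a prepared request over HTTPS and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

A caller would pass the full XML text, for example build_feed_request("search.company.com", 8443, feed_xml, user, password), and then hand the result to send_feed. Keeping the two steps separate makes it possible to inspect the request before anything goes over the network.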

Feed format

The feeds instruction format can be best described by looking at an example. The following example may look complicated, but it will be broken down into its elements further down.

 
 <funnelbackfeed>
  <feed>
    <header>
      <collection>mywebsite</collection>
    </header>

    <group action="addResource">
      <resource url="http://mywebsite.com/apage.html" />
      <resource url="http://mywebsite.com/anotherpage.html" type="inline">
        &lt;html&gt;
          &lt;head/&gt;
          &lt;body&gt;
            Some text
          &lt;/body&gt;
        &lt;/html&gt;
      </resource>
    </group>

    <group action="deleteResource">
      <resource url="http://mywebsite.com/anoldpage.html" />
      <resource url=... />
    </group>

    <group action="crawlSite">
      <site start_url="http://www.domain.com/seed.html" include_patterns="domain.com,secure.com" exclude_patterns="cgi-bin,calendar" />
      <site ... />
    </group>

    <group action="addSite">
      <site start_url="www.domain.com" include_patterns="domain.com,secure.com" exclude_patterns="" />
      <site ... />
    </group>

    <group action="removeSite">
      <site start_url="home.oldsite.com" include_patterns="oldsite.com" exclude_patterns="" />
      <site ... />
    </group>

    <group>
      ...
    </group>
  </feed>

  <feed>
    <header>
      <collection>another-name-here</collection>
    </header>

    <group>
      ...
    </group>
  </feed>

  <feed>
    ...
  </feed>

 </funnelbackfeed>

The first thing to note is that the feed document contains a root element named 'funnelbackfeed' and one or more 'feed' elements.

 <feed>
   <header>
     <collection>mywebsite</collection>
   </header>
   <group action="...">
     ...
   </group>
   <group action="...">
     ...
   </group>
 </feed>

The 'collection' element contained within the 'header' element for each feed tells Funnelback which collection this feed should operate on. Other than the 'header', each feed contains one or more action groups.
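The nesting described above (a 'funnelbackfeed' root, one 'feed' per collection, a 'header' naming the collection, then action groups) can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration; the helper name build_feed_document is hypothetical, and the groups are left empty here.

```python
import xml.etree.ElementTree as ET


def build_feed_document(feeds):
    """Build a feed skeleton.

    feeds: list of (collection_name, [action_name, ...]) pairs.
    Returns the root 'funnelbackfeed' element with empty action groups.
    """
    root = ET.Element("funnelbackfeed")
    for collection, actions in feeds:
        feed = ET.SubElement(root, "feed")
        # The header names the collection this feed operates on.
        header = ET.SubElement(feed, "header")
        ET.SubElement(header, "collection").text = collection
        # One 'group' element per action; children are added by the caller.
        for action in actions:
            ET.SubElement(feed, "group", {"action": action})
    return root


root = build_feed_document([("mywebsite", ["addResource", "deleteResource"])])
feed_xml = ET.tostring(root, encoding="unicode")
```

The resulting feed_xml string is what would be sent as the value of the 'feed' parameter once the groups have been filled in.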

   <group action="addResource">
     <resource url="http://mywebsite.com/apage.html" />
     <resource url="http://mywebsite.com/anotherpage.html" type="inline">
       &lt;html&gt;
         &lt;head/&gt;
         &lt;body&gt;
           Some text
         &lt;/body&gt;
       &lt;/html&gt;
     </resource>
   </group>

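An 'addResource' action group contains a list of resources to add to the collection in this feed. As the example shows, a resource can be specified in two ways: by URL alone, as in the first 'resource' element, or with type="inline", in which case the content of the resource is supplied, with its markup escaped, in the body of the 'resource' element.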
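The example above contains both a resource given only by URL and one marked type="inline" whose HTML content appears with its markup escaped (&lt; and &gt;). A minimal Python sketch of generating such a group follows; the helper name add_resource_group is hypothetical. It relies on ElementTree escaping element text when it serializes, so the inline HTML does not have to be escaped by hand.

```python
import xml.etree.ElementTree as ET


def add_resource_group(resources):
    """Build an 'addResource' group.

    resources: list of (url, inline_content) pairs; inline_content is
    None for a resource that is specified by URL alone.
    """
    group = ET.Element("group", {"action": "addResource"})
    for url, content in resources:
        resource = ET.SubElement(group, "resource", {"url": url})
        if content is not None:
            resource.set("type", "inline")
            # ElementTree escapes <, > and & in text on serialization.
            resource.text = content
    return group


group = add_resource_group([
    ("http://mywebsite.com/apage.html", None),
    ("http://mywebsite.com/anotherpage.html",
     "<html><head/><body>Some text</body></html>"),
])
group_xml = ET.tostring(group, encoding="unicode")
```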

   <group action="deleteResource">
     <resource url="http://mywebsite.com/anoldpage.html" />
   </group>

A 'deleteResource' action group causes resources with the given URLs to be immediately removed from the search indexes.

   <group action="crawlSite">
     <site start_url="http://www.domain.com/seed.html" include_patterns="domain.com,secure.com" exclude_patterns="cgi-bin,calendar" />
   </group>

A 'crawlSite' action group causes Funnelback to immediately crawl the specified website and add its contents to the collection in question. Each 'site' element in this group must have a 'start_url' attribute. Including an 'include_patterns' attribute as well is highly recommended, and an 'exclude_patterns' attribute is optional.

   <group action="addSite">
     <site start_url="www.domain.com" include_patterns="domain.com,secure.com" exclude_patterns="" />
   </group>
   <group action="removeSite">
     <site start_url="home.oldsite.com" include_patterns="oldsite.com" exclude_patterns="" />
   </group>

The 'addSite' and 'removeSite' action groups allow you to control which sites constitute a collection's source data. Sites added or removed using this mechanism only affect the collection's content from the next crawl onwards. Note that this is entirely different from the 'crawlSite' group type, which changes the collection's content immediately, with that change lasting until the next crawl.
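The three site-level groups ('crawlSite', 'addSite' and 'removeSite') share the same 'site' element shape, so one helper can generate all of them. The sketch below is illustrative only: the name site_group is hypothetical, the comma-separated pattern format is taken from the examples above, and requiring 'start_url' for all three actions is an assumption (the text states the requirement only for 'crawlSite').

```python
import xml.etree.ElementTree as ET

SITE_ACTIONS = {"crawlSite", "addSite", "removeSite"}


def site_group(action, start_url, include_patterns=(), exclude_patterns=()):
    """Build a site-level action group containing a single 'site' element."""
    if action not in SITE_ACTIONS:
        raise ValueError(f"unknown site action: {action!r}")
    if not start_url:
        # Assumption: start_url is treated as mandatory for every site action.
        raise ValueError("start_url is required")
    group = ET.Element("group", {"action": action})
    ET.SubElement(group, "site", {
        "start_url": start_url,
        # Patterns are comma-separated, as in the examples above.
        "include_patterns": ",".join(include_patterns),
        "exclude_patterns": ",".join(exclude_patterns),
    })
    return group


crawl = site_group(
    "crawlSite",
    "http://www.domain.com/seed.html",
    include_patterns=["domain.com", "secure.com"],
    exclude_patterns=["cgi-bin", "calendar"],
)
```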

Feeds tasks and collection types

Not all of the feed tasks will make sense for every collection type. The following is a table describing which tasks will be available for each collection type:

Task / Collection type    Web  Filecopy  Database  Local  Meta
Add new content           Y    Y         N         N      N
Add new content (inline)  Y    Y         Y         N      N
Remove existing content   Y    Y         Y         N      N
Update site               Y    Y         N         N      N
Add Site                  Y    Y         N         N      N
Remove Site               Y    Y         N         N      N
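A client program may want to check this table before building a feed. The sketch below simply transcribes the table into a Python lookup; the names SUPPORTED_TASKS and is_supported are hypothetical, and the task and collection type labels are copied from the table rather than from any Funnelback API.

```python
# Transcription of the table above: task -> collection types that support it.
SUPPORTED_TASKS = {
    "Add new content": {"Web", "Filecopy"},
    "Add new content (inline)": {"Web", "Filecopy", "Database"},
    "Remove existing content": {"Web", "Filecopy", "Database"},
    "Update site": {"Web", "Filecopy"},
    "Add Site": {"Web", "Filecopy"},
    "Remove Site": {"Web", "Filecopy"},
}


def is_supported(task, collection_type):
    """Return True if the table marks the task as available (Y)."""
    return collection_type in SUPPORTED_TASKS.get(task, set())
```

For instance, is_supported("Add new content (inline)", "Database") is True, while every task returns False for the Local and Meta collection types.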
