Include binary documents obtained by the web crawler in the search index

By default, Funnelback removes from the index any binary documents that it is unable to filter. This is sometimes undesirable, as you may want a document’s URL to be returned in search results even if no useful text can be extracted from it.

The instructions below show how to index file types that could not be filtered (converted to text). If you want to add a non-default file type, first see Configuring Funnelback to index additional file types, and only attempt the steps below if that does not work.

Process

Ensure that the documents are stored by the crawler

  1. Ensure that the extensions are listed in the crawler.non_html data source configuration option:

    crawler.non_html=doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm,zip
  2. Remove the file extension from the crawler.reject_files data source configuration option, so that the crawler no longer rejects files of that type:

    crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,wav,wmv,wrl,xpm
  3. Ensure that these documents are not filtered by adding the MIME type to the filter.ignore.mimeTypes data source configuration option:

    filter.ignore.mimeTypes=application/zip

  Note: these instructions assume that the file type in question could not be filtered successfully.

Ensure the indexer does not discard binary documents

Add the -ibd indexer option. This tells the indexer to include binary documents in the index. However, when the index is built, the indexer flags each of these documents as binary, and that flag prevents them from being displayed in search results. The query processor must therefore also be configured to return them, as described in the next step.
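
As a sketch, assuming the indexer options are held in the indexer_options data source configuration key (the key name is not stated above, so confirm it against the documentation for your Funnelback version), the setting might look like this:

    # Assumed key name. Append -ibd to any indexer options already set.
    indexer_options=-ibd

The line starting with # is an annotation for the reader and does not need to be copied into the configuration.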

Ensure the query processor is configured to return binary documents

Add the -binary=0 option to the query_processor_options in your results page configuration.
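
For illustration, and assuming the results page has no other query processor options set, the resulting line might look like this:

    # If other options are already present, keep them and append -binary=0.
    query_processor_options=-binary=0

The line starting with # is an annotation for the reader and does not need to be copied into the configuration.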