Include binary documents in the search index

Background

By default, Funnelback removes from the index any binary document that it is unable to filter. This is sometimes undesirable: you may want the document’s URL to be displayed in results even when no useful text can be extracted.

Process

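It is possible to achieve this outcome, although the process is a little involved: the crawler must store the documents, and the indexer must be told not to discard or hide them.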

Ensure that the documents are stored by the crawler

The crawler must store the binary documents before the indexer can include them. Check the following settings:

  1. For a web collection, ensure that the extensions are listed in the crawler.non_html collection.cfg option:

    crawler.non_html=doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm,zip

  2. Remove the extensions from the crawler.reject_files collection.cfg option:

    crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,wav,wmv,wrl,xpm

  3. Ensure that these documents are not filtered by adding their MIME types to the filter.ignore.mimeTypes collection.cfg option:

    filter.ignore.mimeTypes=application/zip
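
The first two settings are easy to get out of step, so a quick check of collection.cfg can help. The following is a minimal sketch, not part of Funnelback: the function names cfg_value, in_list and check_extension are hypothetical, and it assumes each option is written as a single key=value line with a comma-separated value, as in the examples above.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with Funnelback) for checking collection.cfg.

# cfg_value FILE KEY: print the value of KEY from FILE (last definition wins).
cfg_value() {
    awk -F= -v key="$2" '$1 == key { sub(/^[^=]*=/, ""); value = $0 } END { print value }' "$1"
}

# in_list LIST ITEM: succeed if ITEM is an element of the comma-separated LIST.
in_list() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_extension FILE EXT: print a one-line status for extension EXT.
check_extension() {
    non_html=$(cfg_value "$1" crawler.non_html)
    reject=$(cfg_value "$1" crawler.reject_files)
    if ! in_list "$non_html" "$2"; then
        echo "$2: missing from crawler.non_html"
    elif in_list "$reject" "$2"; then
        echo "$2: still listed in crawler.reject_files"
    else
        echo "$2: ok"
    fi
}
```

For example, check_extension "$SEARCH_HOME/conf/$COLLECTION_NAME/collection.cfg" zip prints a single status line for the zip extension. The comparison is case-sensitive, so Z and z are treated as different extensions.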

Ensure the indexer does not discard binary documents

Add the -ibd indexer option, which tells the indexer to include binary documents in the index. However, when the index is built the indexer sets a flag on each of these documents that prevents it from being displayed. This flag needs to be removed.

To do this, first generate a list of the URLs whose flag should be removed. If you want all binary documents in the index to be visible, you can run something like:

$ $SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index -show > $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt

The following command then removes the binary document flag from the index:

$ $SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt -bits 17f AND

This can be done automatically by adding a post_index_command to your collection.cfg:

post_index_command=$SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index -show > $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt && $SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt -bits 17f AND
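
The two padre-fl steps can also be kept in a small shell function, which may be easier to read and test than one long command line. This is only a sketch: the function name unhide_binary_docs and its three arguments are hypothetical, while the paths and padre-fl arguments are copied from the commands above.

```shell
#!/bin/sh
# Hypothetical wrapper around the two padre-fl commands shown above.
# Usage: unhide_binary_docs SEARCH_HOME COLLECTION_NAME CURRENT_VIEW
unhide_binary_docs() {
    padre_fl="$1/bin/padre-fl"
    index_stem="$1/data/$2/$3/idx/index"
    url_list="$1/conf/$2/binaryurls.txt"

    # Step 1: write the list of URLs to the collection's conf directory.
    "$padre_fl" "$index_stem" -show > "$url_list" || return 1

    # Step 2: remove the binary document flag for every URL in the list.
    "$padre_fl" "$index_stem" "$url_list" -bits 17f AND
}
```

It would be called as unhide_binary_docs "$SEARCH_HOME" "$COLLECTION_NAME" "$CURRENT_VIEW". The second step runs only if the first succeeds, matching the && in the single-line form.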