Include binary documents obtained by the web crawler in the search index
By default, Funnelback removes binary documents that it is unable to filter from the index. This is sometimes undesirable, as you may want the document's URL to be displayed even if no useful text can be extracted.
The instructions below show how to include file types in the index that could not be filtered (converted to text). If you want to add a non-default file type, see Configuring Funnelback to index additional file types, and only attempt the steps below if that does not work.
Process
Ensure that the documents are stored by the crawler
- Ensure that the extensions are listed in the crawler.non_html data source configuration option:
  crawler.non_html=doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm,zip
- Remove the type from the crawler.reject_files data source configuration option:
  crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,wav,wmv,wrl,xpm
- Ensure that these documents are not filtered by adding the MIME type to the filter.ignore.mimeTypes data source configuration option:
  filter.ignore.mimeTypes=application/zip

These instructions assume that the file type in question could not be filtered successfully.
Ensure the indexer does not discard binary documents
Add the -ibd indexer option, which tells the indexer to include binary documents in the index. However, when the index is built, the indexer sets a flag in the index for each of these documents that prevents them from being displayed. This flag needs to be removed.
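As an illustration of the first part of this step, indexer options are normally supplied through the indexer_options data source configuration option. The fragment below is a minimal sketch that assumes no other indexer options are already set. If the data source already defines indexer_options, append -ibd to the existing value instead of replacing it.

```
# Assumption: indexer_options is currently unset for this data source.
# -ibd tells the indexer to include binary documents in the index.
indexer_options=-ibd
```

This only stops the indexer from discarding the documents. The flag that prevents them from being displayed still has to be removed separately.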