Include binary documents obtained by the web crawler in the search index
By default, Funnelback removes from the index any binary document that it is unable to filter. This is sometimes undesirable, as you may want the document’s URL to be displayed even if no useful text can be extracted.
The instructions below show how to include in the index file types that could not be filtered (converted to text). If you want to add a non-default file type, first see Configuring Funnelback to index additional file types, and attempt the steps below only if that does not work.
Process
Ensure that the documents are stored by the crawler
- Ensure that the extensions are listed in the crawler.non_html data source configuration option:
  crawler.non_html=doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm,zip
- Remove the type from the crawler.reject_files data source configuration option:
  crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,wav,wmv,wrl,xpm
- Ensure that these documents are not filtered by adding the MIME type to the filter.ignore.mimeTypes data source configuration option:
  filter.ignore.mimeTypes=application/zip
These instructions assume that the file type in question could not be filtered successfully.
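The three crawler checks above can be sketched as a small script. This is only an illustration, not a Funnelback tool: it assumes the data source configuration is available as plain key=value text, and the helper names (parse_config, csv_values, check_binary_storage) are hypothetical. It inspects only values set explicitly in the text it is given, so a setting left at its default is treated as empty.

```python
def parse_config(text):
    """Parse key=value lines into a dict, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def csv_values(settings, key):
    """Return the comma-separated values of a setting as a set (empty if unset)."""
    return {v.strip() for v in settings.get(key, "").split(",") if v.strip()}


def check_binary_storage(settings, extension, mime_type):
    """List the problems that would stop the crawler storing this file type."""
    problems = []
    if extension not in csv_values(settings, "crawler.non_html"):
        problems.append(f"{extension} is missing from crawler.non_html")
    if extension in csv_values(settings, "crawler.reject_files"):
        problems.append(f"{extension} is still listed in crawler.reject_files")
    if mime_type not in csv_values(settings, "filter.ignore.mimeTypes"):
        problems.append(f"{mime_type} is missing from filter.ignore.mimeTypes")
    return problems


# Hypothetical configuration in which zip has not yet been removed
# from crawler.reject_files.
example = """
crawler.non_html=doc,docx,pdf,zip
crawler.reject_files=exe,gif,jpg,zip
filter.ignore.mimeTypes=application/zip
"""

for problem in check_binary_storage(parse_config(example), "zip", "application/zip"):
    print(problem)
```

An empty result means all three settings, as written in the supplied text, allow the file type to be stored.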
Ensure the indexer does not discard binary documents
Add the -ibd indexer option. This tells the indexer to include binary documents in the index. However, when the index is built, the indexer sets a flag in the index for each of these documents that prevents them from being displayed; this flag needs to be removed.
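When other indexer options are already set, -ibd should be appended rather than overwrite them. The sketch below shows one way to do that safely. It assumes the options are held as a single whitespace-separated string; the function name ensure_indexer_flag and the -example_option placeholder are hypothetical and are not Funnelback options.

```python
def ensure_indexer_flag(options, flag="-ibd"):
    """Return the options string with flag appended unless it is already present."""
    tokens = options.split()
    if flag not in tokens:
        tokens.append(flag)
    return " ".join(tokens)


# -example_option is a placeholder for whatever options are already configured.
print(ensure_indexer_flag("-example_option"))
```

Running the helper on a string that already contains -ibd returns it unchanged, so it can be applied repeatedly without duplicating the option.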