Include binary documents in the search index
Background
By default, Funnelback removes from the index any binary documents that it is unable to filter. This is sometimes undesirable, as you may want the document’s URL to be displayed regardless, even if no useful text can be extracted.
Process
This is a little complicated, but it can be achieved with the following steps.
Ensure that the documents are stored by the crawler
The settings below ensure that the crawler downloads and stores the documents. Once the documents are stored, some changes to the indexing process are also required.
- For a web collection, ensure that the extensions are listed in the crawler.non_html collection.cfg option:
  crawler.non_html=doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,xlsm,zip
- Remove the file type from the crawler.reject_files collection.cfg option:
  crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,wav,wmv,wrl,xpm
- Ensure that these documents are not filtered by adding the MIME type to the filter.ignore.mimeTypes collection.cfg option:
  filter.ignore.mimeTypes=application/zip
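Removing one extension from a long comma-separated value by hand is easy to get wrong. As a minimal sketch, assuming a POSIX shell, the hypothetical helper remove_ext below (it is not part of Funnelback) drops a single extension from such a list. The result can then be pasted into collection.cfg.

```shell
# Hypothetical helper, not a Funnelback tool.
# Usage: remove_ext "<comma-separated list>" "<extension to drop>"
remove_ext() {
  # Split the list into one entry per line, drop exact (case-sensitive)
  # matches of the extension, then join the remaining entries with commas.
  printf '%s\n' "$1" | tr ',' '\n' | grep -vxF -- "$2" | paste -sd, -
}
```

For example, remove_ext "gz,zip,tar" zip prints gz,tar. The match is case-sensitive, so an entry such as Z is left alone when zip is removed.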
Ensure the indexer does not discard binary documents
Add the -ibd indexer option. This tells the indexer to include binary documents in the index. However, when the index is built, the indexer sets a flag in the index for each of these documents that prevents them from being displayed, so this flag needs to be removed.
To do this, you first need to generate a list of the URLs to which the removal applies. If you want all binary documents in the index to be visible, you can run something like:
$ $SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index -show > $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt
The following command then removes the binary document flag from the index:
$ $SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt -bits 17f AND
This can be done automatically by adding a post_index_command to your collection.cfg:
post_index_command=$SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index -show > $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt && $SEARCH_HOME/bin/padre-fl $SEARCH_HOME/data/$COLLECTION_NAME/$CURRENT_VIEW/idx/index $SEARCH_HOME/conf/$COLLECTION_NAME/binaryurls.txt -bits 17f AND
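If you would rather keep the redirection and command chaining out of collection.cfg, one option is to move the two padre-fl calls into a small wrapper. The following is only a sketch: the function name unflag_binary_docs, its argument order, and the PADRE_FL override are assumptions made for illustration, while the padre-fl arguments are the same ones used in the commands above.

```shell
# Hypothetical wrapper around the two padre-fl steps described above.
# Arguments: <search home> <collection name> <view>
# PADRE_FL may be set to override the padre-fl location (an assumption
# added here so the function can be tried out against a stub).
unflag_binary_docs() (
  search_home=${1:?search home required}
  collection=${2:?collection name required}
  view=${3:?view required}

  padre_fl=${PADRE_FL:-$search_home/bin/padre-fl}
  index_stem=$search_home/data/$collection/$view/idx/index
  url_list=$search_home/conf/$collection/binaryurls.txt

  # Step 1: write the list of URLs to a file.
  # Step 2: run only if step 1 succeeded; remove the binary document flag.
  "$padre_fl" "$index_stem" -show > "$url_list" &&
    "$padre_fl" "$index_stem" "$url_list" -bits 17f AND
)

# Example call, using the same variables as the commands above:
# unflag_binary_docs "$SEARCH_HOME" "$COLLECTION_NAME" "$CURRENT_VIEW"
```

The function body runs in a subshell, so its variables do not leak into the calling shell, and the second command is skipped if the first one fails.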