Built-in filters - Convert binary documents to HTML (TikaFilterProvider)

This filter converts specific binary file formats to text using Apache Tika.

Supported formats

The search includes default support for the following binary file formats:

  • Portable Document Format (.pdf)

  • Microsoft Word (.doc, .docx)

  • Microsoft Excel (.xls, .xlsx)

  • Microsoft PowerPoint (.ppt, .pptx)

  • Rich text files (.rtf)

Apache Tika supports a wide range of file formats. The default set can be extended by configuring your search to process additional formats that Tika supports.

Caveats

For a document to be indexed, the filter must produce a textual representation of it.

The following documents cannot be indexed:

  • Password-protected or encrypted documents (such as protected Microsoft Office and PDF documents)

  • Scanned PDF documents (indexing a scanned document requires an OCR process to be run over it first)

Configure Tika to index additional supported file types

Before going any further, check the list of supported document types. Make sure you look at the list for the correct version of Tika. You can find the version by locating the Tika jar files in the $SEARCH_HOME/lib/java/all folder.

For formats supported by Tika see: Funnelback - Tika versions.
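The version lookup described above can be sketched as a small script. This is only an illustration: it assumes the Tika jars follow the usual name-version.jar naming (for example a hypothetical tika-core-1.28.jar), and the helper name tika_versions is made up for this example.

```python
import re
from pathlib import Path

# Assumption: jar files are named <artifact>-<version>.jar,
# e.g. tika-core-1.28.jar (the version shown is hypothetical).
TIKA_JAR = re.compile(r"^(tika-[a-z-]+?)-(\d[\w.-]*)\.jar$")


def tika_versions(lib_dir):
    """Return {artifact_name: version} for each Tika jar found in lib_dir."""
    versions = {}
    for jar in sorted(Path(lib_dir).glob("tika-*.jar")):
        match = TIKA_JAR.match(jar.name)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions
```

To use it, pass the folder mentioned above, for example tika_versions(os.path.join(os.environ["SEARCH_HOME"], "lib/java/all")) after importing os, then compare the reported version against the matching Tika format list.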

Add the file extension of each additional file type to the filter.tika.types configuration option.

Example

filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
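When extending an existing value it is easy to introduce duplicates or stray dots. The following is a minimal sketch of a helper that builds the new value. It assumes the option is a plain comma-separated list of extensions written without a leading dot and treated case-insensitively; the function name add_tika_types is hypothetical and not part of the product.

```python
def add_tika_types(current_value, new_extensions):
    """Append extensions to a comma-separated filter.tika.types value.

    Assumption: extensions are stored lowercase, without a leading dot.
    Existing order is preserved and duplicates are skipped.
    """
    types = [t.strip().lower() for t in current_value.split(",") if t.strip()]
    for ext in new_extensions:
        ext = ext.strip().lower().lstrip(".")
        if ext and ext not in types:
            types.append(ext)
    return ",".join(types)


# Build the line to place in the collection configuration.
line = "filter.tika.types=" + add_tika_types("doc,docx,pdf", ["epub", ".PDF", "msg"])
```

Here line holds filter.tika.types=doc,docx,pdf,epub,msg: the duplicate .PDF is ignored and the two new extensions are appended.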

Once the extra type has been added, some additional options may need to be set, depending on the type of collection being indexed.

Collection type   Collection configuration option   Description
web               crawler.reject_files              Ensure the file extension is not listed here
web               crawler.accept_files              If used, ensure the file extension is listed here
web               crawler.non_html                  Ensure the file extension is listed here
filecopy          filecopy.filetypes                Ensure the file extension is listed here
trimpush          trim.extracted_file_types         Ensure the file extension is listed here
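The per-collection checks above can be expressed as a small validation routine. This is a sketch under stated assumptions: the configuration is represented as a Python dict of option name to value, every option is treated as a comma-separated list of extensions, and the function names are hypothetical.

```python
def _listed(config, key, ext):
    """True if ext appears in the comma-separated value of config[key]."""
    value = config.get(key, "")
    return ext in [v.strip().lower() for v in value.split(",") if v.strip()]


def check_extension(collection_type, config, ext):
    """Return a list of problems that would stop ext from being indexed."""
    ext = ext.strip().lower().lstrip(".")
    problems = []
    if not _listed(config, "filter.tika.types", ext):
        problems.append("filter.tika.types does not list " + ext)
    if collection_type == "web":
        if _listed(config, "crawler.reject_files", ext):
            problems.append("crawler.reject_files lists " + ext)
        # accept_files only matters when it has been set.
        if config.get("crawler.accept_files", "").strip() and not _listed(
            config, "crawler.accept_files", ext
        ):
            problems.append("crawler.accept_files does not list " + ext)
        if not _listed(config, "crawler.non_html", ext):
            problems.append("crawler.non_html does not list " + ext)
    elif collection_type == "filecopy":
        if not _listed(config, "filecopy.filetypes", ext):
            problems.append("filecopy.filetypes does not list " + ext)
    elif collection_type == "trimpush":
        if not _listed(config, "trim.extracted_file_types", ext):
            problems.append("trim.extracted_file_types does not list " + ext)
    return problems
```

An empty list means none of the options in the table block the extension; each returned string names the option that still needs to be updated.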

See also: