Built-in filters - Convert binary documents to HTML (TikaFilterProvider)

This filter converts specific binary file formats to text using Apache Tika.

Supported formats

The search includes default support for the following binary file formats:

  • Portable Document Format (.pdf)

  • Microsoft Word (.doc, .docx)

  • Microsoft Excel (.xls, .xlsx)

  • Microsoft PowerPoint (.ppt, .pptx)

  • Rich text files (.rtf)

Apache Tika supports a wide range of file formats, and the search can be configured to process additional formats that Tika supports.


For a document to be indexed successfully, the filter must produce a textual representation of it.

The following documents cannot be indexed:

  • Password-protected or encrypted documents (such as protected Microsoft Office and PDF documents)

  • Scanned PDF documents (indexing a scanned document requires running an OCR process over it)

Configure Tika to index additional supported file types

Before going any further, check the list of supported document types. Make sure you check the list for the correct version of Tika. You can find the version by looking at the Tika jar files in the $SEARCH_HOME/lib/java/all folder.

For the formats supported by Tika, see: Funnelback - Tika versions.
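
The version lookup can be scripted. The following is a minimal Python sketch, not part of the product. It assumes the Tika jar files follow the common tika-<module>-<version>.jar naming pattern (for example tika-core-1.24.1.jar), which may not match every installation. The function names tika_version and installed_tika_versions are hypothetical helpers introduced for illustration.

```python
import os
import re
from pathlib import Path

# Assumption: Tika jars are named tika-<module>-<version>.jar,
# e.g. tika-core-1.24.1.jar. Adjust the pattern if your install differs.
TIKA_JAR_PATTERN = re.compile(r"tika-[a-z-]+-(\d[\w.-]*)\.jar")


def tika_version(jar_name):
    """Return the version embedded in a Tika jar file name, or None."""
    match = TIKA_JAR_PATTERN.fullmatch(jar_name)
    return match.group(1) if match else None


def installed_tika_versions(search_home):
    """Return the sorted, de-duplicated Tika versions found under lib/java/all."""
    lib_dir = Path(search_home) / "lib" / "java" / "all"
    versions = set()
    for jar in lib_dir.glob("tika-*.jar"):
        version = tika_version(jar.name)
        if version is not None:
            versions.add(version)
    return sorted(versions)


if __name__ == "__main__":
    home = os.environ.get("SEARCH_HOME")
    if home:
        print(installed_tika_versions(home))
    else:
        print("SEARCH_HOME is not set")
```

If more than one version is reported, inspect the folder by hand before deciding which list of supported formats applies.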

Add the file extension of each additional file type to the filter.tika.types configuration option.
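
As an illustration of this step, here is a small Python sketch that appends an extension to an existing filter.tika.types value without duplicating it. It assumes the option holds a comma-separated list of extensions written without a leading dot. The helper add_tika_type is hypothetical, and the sample value "doc,docx,pdf" is illustrative only, not the product default.

```python
def add_tika_type(current_value, extension):
    """Return the option value with the extension appended if it is missing.

    Assumption: the value is a comma-separated list of file extensions
    without leading dots, e.g. "doc,docx,pdf".
    """
    # Normalise the new extension: drop surrounding space, a leading dot, and case.
    ext = extension.strip().lstrip(".").lower()
    # Split the existing value, ignoring empty entries and stray whitespace.
    types = [t.strip() for t in current_value.split(",") if t.strip()]
    if ext and ext not in (t.lower() for t in types):
        types.append(ext)
    return ",".join(types)


if __name__ == "__main__":
    # Illustrative starting value only.
    print("filter.tika.types=" + add_tika_type("doc,docx,pdf", ".odt"))
```

The helper only builds the new value. Writing it back to the collection configuration is left to whichever configuration tooling you normally use.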



Once the extra type has been added, some additional options may need to be set, depending on the type of collection being indexed.

Collection type Collection configuration option Description



Ensure the file extension is not listed here



If used, ensure the file extension is listed here



Ensure the file extension is listed here



Ensure the file extension is listed here



Ensure the file extension is listed here
