Built-in filters: TikaFilterProvider
This filter converts specific binary file formats to text using Apache Tika.
Funnelback includes default support for the following binary file formats:
- Portable Document Format (
- Microsoft Word (
- Microsoft Excel (
- Microsoft PowerPoint (
- Rich text files (
Apache Tika supports a wide range of file formats, and Funnelback can be configured to process additional Tika-supported formats beyond the defaults listed above.
For successful indexing, the filter must produce a textual representation of the document.
The following documents will not be indexable by Funnelback:
- Password protected or encrypted documents (such as protected Microsoft Office and PDF documents)
- Scanned PDF documents (indexing a scanned document requires an OCR process to be run over it first)
Configure Tika to index additional supported file types
Before going any further, check the list of supported document types. When checking, ensure you look at the correct version of Tika. You can find the version by locating the Tika jar files that sit within the
For formats supported by Tika see: Funnelback - Tika versions.
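The version check described above can be sketched in a few lines of Python. This is a minimal sketch, not a Funnelback tool: it assumes the jars follow the usual Apache naming pattern (for example `tika-core-<version>.jar`), and the helper names `tika_versions` and `tika_versions_in` are hypothetical. The directory to search is passed in by the caller, since it depends on your installation.

```python
import re
from pathlib import Path

# Assumption: jar names look like tika-core-1.24.jar or tika-parsers-1.24.1.jar.
TIKA_JAR = re.compile(r"^tika-[a-z-]+?-(\d+(?:\.\d+)*)\.jar$")


def tika_versions(jar_names):
    """Return the sorted, de-duplicated version strings found in Tika jar names."""
    versions = set()
    for name in jar_names:
        match = TIKA_JAR.match(name)
        if match:
            versions.add(match.group(1))
    return sorted(versions)


def tika_versions_in(directory):
    """Search a directory tree (supplied by the caller) for Tika jars."""
    return tika_versions(p.name for p in Path(directory).rglob("tika-*.jar"))
```

If more than one version is returned, the installation contains mixed Tika jars and the supported-format list should be checked with extra care.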
Add the file extensions of each additional file type to the filter.tika.types configuration option.
Once the extra type has been added, some additional options may need to be set depending on the type of collection being indexed.
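As an illustration of this step, the following sketch builds an updated value for filter.tika.types. It assumes the option holds a comma-separated list of extensions without leading dots. The starting value `doc,docx,pdf` is a made-up example rather than the product default, and `add_extension` is a hypothetical helper.

```python
def add_extension(current_value, extension):
    """Return a comma-separated type list that includes `extension` exactly once."""
    types = [t.strip() for t in current_value.split(",") if t.strip()]
    ext = extension.strip().lstrip(".").lower()
    # Compare case-insensitively so "PDF" and "pdf" are treated as the same type.
    if ext and ext not in [t.lower() for t in types]:
        types.append(ext)
    return ",".join(types)


# Illustrative starting value only; check your own collection configuration.
print("filter.tika.types=" + add_extension("doc,docx,pdf", ".odt"))
```

The helper leaves the existing entries and their order untouched, so the result can be compared against the current setting before it is applied.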
|Collection type||Collection configuration option||Description|
|web||crawler.reject_files||Ensure the file extension is not listed here|
|web||crawler.accept_files||If this option is used, ensure the file extension is listed here|
|web||crawler.non_html||Ensure the file extension is listed here|
|filecopy||filecopy.filetypes||Ensure the file extension is listed here|
|trimpush||trim.extracted_file_types||Ensure the file extension is listed here|
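The rules in the table can be expressed as a small checker. This is a sketch under stated assumptions: each option is treated as a comma-separated list of extensions, the collection configuration has already been loaded into a plain dictionary, and `check_extension` is a hypothetical helper rather than part of Funnelback.

```python
def parse_list(value):
    """Split a comma-separated option value into a set of lowercase extensions."""
    return {
        item.strip().lstrip(".").lower()
        for item in (value or "").split(",")
        if item.strip()
    }


# (option, rule) pairs per collection type, mirroring the table above.
RULES = {
    "web": [
        ("crawler.reject_files", "absent"),
        ("crawler.accept_files", "present_if_set"),
        ("crawler.non_html", "present"),
    ],
    "filecopy": [("filecopy.filetypes", "present")],
    "trimpush": [("trim.extracted_file_types", "present")],
}


def check_extension(collection_type, config, extension):
    """Return a list of changes needed before `extension` can be indexed."""
    ext = extension.strip().lstrip(".").lower()
    problems = []
    for option, rule in RULES.get(collection_type, []):
        raw = config.get(option)
        listed = ext in parse_list(raw)
        if rule == "absent" and listed:
            problems.append(f"remove '{ext}' from {option}")
        elif rule == "present" and not listed:
            problems.append(f"add '{ext}' to {option}")
        elif rule == "present_if_set" and raw and not listed:
            problems.append(f"add '{ext}' to {option}")
    return problems
```

For example, calling `check_extension("web", {"crawler.reject_files": "exe,odt,zip", "crawler.non_html": "doc,pdf"}, "odt")` reports that `odt` should be removed from crawler.reject_files and added to crawler.non_html. An empty list means no further changes are needed for that collection type.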