Built-in filters - Convert binary documents to HTML (TikaFilterProvider)
This filter converts specific binary file formats to text using Apache Tika.
Supported formats
The search includes default support for the following binary file formats:
- Portable Document Format (.pdf)
- Microsoft Word (.doc, .docx)
- Microsoft Excel (.xls, .xlsx)
- Microsoft PowerPoint (.ppt, .pptx)
- Rich text files (.rtf)
Apache Tika supports a wide range of file formats. The default set can be extended by configuring your search to process additional formats that Tika supports.
Caveats
For a document to be indexed successfully, the filter must be able to produce a textual representation of it.
The following documents cannot be indexed:
- Password protected or encrypted documents (such as protected Microsoft Office and PDF documents)
- Scanned PDF documents (indexing a scanned document requires an OCR process to be run over the document first)
Configure Tika to index additional supported file types
Before going any further, check the list of supported document types. Make sure you look at the list for the correct version of Tika: you can find the version by locating the Tika jar files in the $SEARCH_HOME/lib/java/all folder.
For formats supported by Tika see: Funnelback - Tika versions.
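The version check described above can be sketched in Python. This is a minimal illustration, not a Funnelback tool: it assumes the jar files follow a naming pattern such as `tika-core-1.28.jar`, and the function names `tika_versions` and `installed_tika_versions` are hypothetical.

```python
import os
import re
from pathlib import Path

# Assumption: Tika jars are named "<artifact>-<version>.jar",
# e.g. "tika-core-1.28.jar". Adjust the pattern if your files differ.
TIKA_JAR = re.compile(r"^(tika-[a-z-]+?)-(\d[\w.-]*)\.jar$")


def tika_versions(jar_names):
    """Return {artifact: version} for file names that look like Tika jars."""
    found = {}
    for name in jar_names:
        match = TIKA_JAR.match(name)
        if match:
            found[match.group(1)] = match.group(2)
    return found


def installed_tika_versions(search_home=None):
    """Inspect $SEARCH_HOME/lib/java/all and report the Tika jars found there."""
    search_home = search_home or os.environ.get("SEARCH_HOME", "")
    lib_dir = Path(search_home) / "lib" / "java" / "all"
    if not lib_dir.is_dir():
        return {}
    return tika_versions(p.name for p in lib_dir.glob("tika*.jar"))
```

Calling `installed_tika_versions()` on the search server returns a dictionary of the Tika artifacts found, or an empty dictionary if the folder does not exist.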
Add the file extensions of the additional file type to the filter.tika.types configuration option.
Example
filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
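When extending an existing value it is easy to introduce duplicates or stray dots. The following sketch shows one way to build the new value; the helper `add_tika_types` is hypothetical, and normalising extensions to lower case is an assumption rather than a documented requirement.

```python
def add_tika_types(current_value, new_extensions):
    """Append extensions to a comma-separated filter.tika.types value.

    Existing entries keep their order; duplicates are skipped.
    Assumption: extensions are written without a leading dot, in lower case.
    """
    types = [t.strip() for t in current_value.split(",") if t.strip()]
    seen = {t.lower() for t in types}
    for ext in new_extensions:
        cleaned = ext.strip().lstrip(".").lower()
        if cleaned and cleaned not in seen:
            types.append(cleaned)
            seen.add(cleaned)
    return ",".join(types)


line = "filter.tika.types=" + add_tika_types("doc,docx,pdf", [".epub", "PDF", "odt"])
print(line)  # filter.tika.types=doc,docx,pdf,epub,odt
```

The resulting line can then be pasted into the collection configuration in place of the existing filter.tika.types setting.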
Once the extra type has been added, some additional options may need to be set, depending on the type of collection being indexed.
Collection type | Collection configuration option | Description |
---|---|---|
web | | Ensure the file extension is not listed here |
web | | If used ensure the file extension is listed here |
web | | Ensure the file extension is listed here |
filecopy | | Ensure the file extension is listed here |
trimpush | | Ensure the file extension is listed here |