Configuring Funnelback to index additional file types
Background
This article outlines the steps required to configure Funnelback to index binary filetypes that are not included as part of the standard set of binary documents.
Details
Out of the box, Funnelback supports the indexing of HTML, Microsoft Office (Word/Excel/PowerPoint), RTF and text documents.
The binary formats are converted to text using Apache Tika, which supports a large number of document formats.
These formats can easily be added to the filetypes indexed by Funnelback as long as Tika can process the file format.
When considering additional formats to index, remember that Funnelback can only index text that is extracted from the file, which limits the set of file formats worth adding to the search. For many file formats, document metadata will be the only useful text that can be extracted.
Tika’s list of supported formats changes regularly, so be sure to check the list for the Tika version that corresponds to the version of Funnelback that will be installed.
The first approach discussed below does not convert the documents to text but uses metadata to describe the binary documents. The other approaches use Tika and external filtering to extract text from the binary documents.
Indexing non-textual files
For Funnelback to successfully index a document it needs to have a textual representation of the document. For text-based documents such as PDFs or Microsoft Word documents filtering is used to extract the text contained within the document and this is what Funnelback indexes.
For other file types such as multimedia (e.g. images, movies, sound files), filtering will only extract metadata that has been embedded within the files. This is rarely useful for search because the embedded metadata usually describes attributes of the file, such as the bit rate, duration or the camera used to take a photo.
The best approach for indexing non-textual files is to index text that has been written to describe the files and associate it with the file’s URL. For a sound file or movie, index a transcript or write descriptive metadata such as a title, description and keywords, which can then be used as the text that describes the file.
When you use this approach to index non-textual documents the actual files themselves do not need to be downloaded by Funnelback (e.g. if you’re doing a web crawl these can be in the exclude list). This is because the files themselves are not indexed - Funnelback is using the XML record to index the file and then attaching the file’s URL to the search result.
Tutorial: Index non-textual files using an XML file listing
This tutorial shows how to use a simple XML file to describe a number of non-text files to add them to the search index.
Consider a site that has 3 non-text files:
-
An image (shakespeare.jpg) containing a picture of William Shakespeare
-
A sound file (hamlet.mp3) containing a radio performance of Hamlet
-
A video file (lear.mov) of a performance of King Lear
Steps
-
Produce an XML file containing all the useful fielded information describing your files. This could be produced manually, or automatically generated from metadata/database information. e.g.
<?xml version="1.0" encoding="UTF-8" ?>
<files>
  <file>
    <title><![CDATA[Chandos portrait]]></title>
    <uri>http://shakespeare.example.com/images/shakespeare.jpg</uri>
    <description><![CDATA[The Chandos portrait is the most famous of the portraits that may depict William Shakespeare. Painted between 1600 and 1610, it may have served as the basis for the engraved portrait of Shakespeare used in the First Folio in 1623. It is named after the Dukes of Chandos, who formerly owned the painting.]]></description>
    <author><![CDATA[John Taylor]]></author>
    <date>1610</date>
    <location><![CDATA[National Portrait Gallery, London]]></location>
    <keywords>
      <keyword><![CDATA[John Taylor]]></keyword>
      <keyword><![CDATA[William Shakespeare]]></keyword>
      <keyword><![CDATA[painting]]></keyword>
    </keywords>
    <filetype>Image</filetype>
    <filesize>820kB</filesize>
    <format>jpg</format>
  </file>
  <file>
    <title><![CDATA[Hamlet]]></title>
    <uri>http://shakespeare.example.com/radio/hamlet.mp3</uri>
    <description><![CDATA[A full-text radio production of the play, co-produced by the BBC and the Renaissance Theatre Company. Features Kenneth Branagh as Hamlet, Derek Jacobi as Claudius, Judi Dench as Gertrude, and John Gielgud as the Ghost.]]></description>
    <author><![CDATA[William Shakespeare]]></author>
    <author><![CDATA[Kenneth Branagh]]></author>
    <author><![CDATA[Derek Jacobi]]></author>
    <author><![CDATA[Judi Dench]]></author>
    <author><![CDATA[Renaissance Theatre Company]]></author>
    <author><![CDATA[British Broadcasting Corporation]]></author>
    <date>1992</date>
    <keywords>
      <keyword><![CDATA[William Shakespeare]]></keyword>
      <keyword><![CDATA[audio]]></keyword>
      <keyword><![CDATA[radio]]></keyword>
      <keyword><![CDATA[BBC Radio 3]]></keyword>
    </keywords>
    <duration>235</duration>
    <duration_units>min</duration_units>
    <filetype>Sound recording</filetype>
    <filesize>399.7MB</filesize>
    <format>mp3</format>
  </file>
  <file>
    <title><![CDATA[King Lear]]></title>
    <uri>http://shakespeare.example.com/video/lear.mov</uri>
    <description><![CDATA[King Lear is a 2018 British-American television film directed by Richard Eyre. An adaptation of the play of the same name by William Shakespeare, cut to just 115 minutes, was broadcast on BBC Two on 28 May 2018.]]></description>
    <author><![CDATA[William Shakespeare]]></author>
    <author><![CDATA[Richard Eyre]]></author>
    <author><![CDATA[Jim Broadbent]]></author>
    <author><![CDATA[Jim Carter]]></author>
    <author><![CDATA[Tobias Menzies]]></author>
    <author><![CDATA[Emily Watson]]></author>
    <author><![CDATA[John Macmillan]]></author>
    <author><![CDATA[Florence Pugh]]></author>
    <author><![CDATA[Emma Thompson]]></author>
    <author><![CDATA[Anthony Calf]]></author>
    <author><![CDATA[Anthony Hopkins]]></author>
    <author><![CDATA[Simon Manyonda]]></author>
    <author><![CDATA[Chukwudi Iwuji]]></author>
    <author><![CDATA[Karl Johnson]]></author>
    <author><![CDATA[Samuel Valentine]]></author>
    <author><![CDATA[Andrew Scott]]></author>
    <author><![CDATA[Christopher Eccleston]]></author>
    <date>2018</date>
    <keywords>
      <keyword><![CDATA[William Shakespeare]]></keyword>
    </keywords>
    <duration>115</duration>
    <duration_units>min</duration_units>
    <filetype>Video recording</filetype>
    <filesize>4.8GB</filesize>
    <format>mov</format>
  </file>
</files>
-
Make the XML available at a web-accessible address (e.g. http://shakespeare.example.com/files.xml).
-
Ensure that the XML file is included in your search. e.g. for a web collection you could add the XML’s URL to your start URLs.
-
Update your search collection.
-
Set the following XML processing options (note that the paths here are specific to the example XML above). This splits the XML document into multiple records, and assigns each record's URL and filetype from the contents of the specified fields in the XML.
-
XML document splitting: /files/file
-
Document URL: /files/file/uri
-
Document filetype: /files/file/format
-
Create metadata mappings for all of the fields that you wish to include in the index. e.g.
-
t: //title
-
author: //author
-
etc.
-
Re-index the live view to incorporate the metadata.
-
At this point you should see the additional results appearing in your search results. You will need to modify your template to display the result appropriately.
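The XML listing in step 1 can be generated automatically rather than written by hand. The following is a minimal Python sketch, assuming the metadata is already available as a list of dictionaries; the RECORDS data and the build_listing and split_records helpers are hypothetical names used only for illustration and are not part of Funnelback. It uses the standard library's ElementTree, which escapes special characters instead of emitting CDATA sections; the two are equivalent to an XML parser. The second helper only mimics what the splitting settings above do, so the output can be checked before it is crawled.

```python
import xml.etree.ElementTree as ET

# Hypothetical records; in practice these might come from a database or CMS export.
RECORDS = [
    {
        "title": "Chandos portrait",
        "uri": "http://shakespeare.example.com/images/shakespeare.jpg",
        "authors": ["John Taylor"],
        "keywords": ["John Taylor", "William Shakespeare", "painting"],
        "filetype": "Image",
        "format": "jpg",
    },
    {
        "title": "Hamlet",
        "uri": "http://shakespeare.example.com/radio/hamlet.mp3",
        "authors": ["William Shakespeare", "Kenneth Branagh"],
        "keywords": ["William Shakespeare", "audio", "radio"],
        "filetype": "Sound recording",
        "format": "mp3",
    },
]


def build_listing(records):
    """Build a <files> element with one <file> child per record."""
    root = ET.Element("files")
    for record in records:
        file_el = ET.SubElement(root, "file")
        ET.SubElement(file_el, "title").text = record["title"]
        ET.SubElement(file_el, "uri").text = record["uri"]
        for author in record["authors"]:
            ET.SubElement(file_el, "author").text = author
        keywords_el = ET.SubElement(file_el, "keywords")
        for keyword in record["keywords"]:
            ET.SubElement(keywords_el, "keyword").text = keyword
        ET.SubElement(file_el, "filetype").text = record["filetype"]
        ET.SubElement(file_el, "format").text = record["format"]
    return root


def split_records(xml_text):
    """Mimic the splitting settings: one record per /files/file, with its uri and format."""
    root = ET.fromstring(xml_text)
    return [(f.findtext("uri"), f.findtext("format")) for f in root.findall("file")]


xml_text = ET.tostring(build_listing(RECORDS), encoding="unicode")
for uri, fmt in split_records(xml_text):
    print(fmt, uri)
```

Running the split check before publishing the file is a cheap way to confirm that every record has a uri and a format element, since those are the fields the XML processing options rely on.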
Add an additional filetype that is supported by Tika
The steps for adding additional filetypes vary depending on the collection type being used.
Example: Add additional Tika-supported filetypes to web collections
-
Ensure the filetype extension is not present in the crawler.reject_files list. The default value in collection.cfg is:
crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip
-
Set the parser mime types. If you wish links to be extracted (for crawl purposes) from the document then ensure the mime type is listed in the crawler.parser.mimeTypes list. Note: only text documents should be listed here. The default value in collection.cfg is:
crawler.parser.mimeTypes=text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml
-
Configure the non-HTML files list. Add the file extension of the new filetype to the crawler.non_html list. The default value in collection.cfg is:
crawler.non_html=pdf,doc,ps,ppt,xls,rtf
-
Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value in collection.cfg is:
filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
-
Run a full update of the collection.
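The checks in the steps above can be scripted. The following Python sketch assumes collection.cfg consists of key=value lines with comma-separated list values (as in the defaults quoted above) and that lines starting with # are comments; parse_cfg, as_list and web_checklist are hypothetical helper names, not Funnelback tools. It reports which settings still need changing before a web collection will index a given extension. It does not cover the optional crawler.parser.mimeTypes step, which works on mime types rather than extensions.

```python
# Default values as quoted in this article; confirm them against your Funnelback version.
DEFAULT_CFG = """\
crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip
crawler.non_html=pdf,doc,ps,ppt,xls,rtf
filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
"""


def parse_cfg(text):
    """Parse key=value lines into a dict, skipping blank lines and # comments (assumed syntax)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def as_list(settings, key):
    """Return the comma-separated value of key as a list (empty if the key is unset)."""
    return [item.strip() for item in settings.get(key, "").split(",") if item.strip()]


def web_checklist(settings, ext):
    """Return the changes still needed before a web collection can index ext."""
    todo = []
    if ext in as_list(settings, "crawler.reject_files"):
        todo.append(f"remove {ext} from crawler.reject_files")
    if ext not in as_list(settings, "crawler.non_html"):
        todo.append(f"add {ext} to crawler.non_html")
    if ext not in as_list(settings, "filter.tika.types"):
        todo.append(f"add {ext} to filter.tika.types")
    return todo


settings = parse_cfg(DEFAULT_CFG)
for ext in ("pdf", "epub", "jpg"):
    print(ext, web_checklist(settings, ext))
```

With the defaults above, pdf needs no changes, epub only needs adding to crawler.non_html, and jpg must also be removed from crawler.reject_files.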
Example: Add additional Tika-supported filetypes to filecopy collections
-
Add the file extension of the new filetype to the filecopy.filetypes list. The default value in collection.cfg is:
filecopy.filetypes=doc,docx,rtf,pdf,html,xls,xlsx,txt,htm,ppt,pptx
-
Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value in collection.cfg is:
filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
-
Run a full update of the collection.
Example: Add additional Tika-supported filetypes to trimpush collections
-
Add the file extension of the new filetype to the trim.extracted_file_types list. The default value in collection.cfg is:
trim.extracted_file_types=*,doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,txt,htm,html,jpg,gif,tif,vmbx
-
Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value in collection.cfg is:
filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
-
Run a full update of the collection.
Example: Add additional Tika-supported filetypes to other collection types
This applies to all other collection types except local collections, which do not filter binary documents.
-
Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value in collection.cfg is:
filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
-
Run a full update of the collection.
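The per-collection-type examples above follow one pattern: add the extension to the setting that lists accepted files (where the collection type has one), then make sure it is in filter.tika.types. The Python sketch below illustrates that pattern under two assumptions: that collection.cfg is a set of key=value lines, and that setting a key replaces its default, so the default list has to be restated when a key is set for the first time. The add_to_list and enable_extension helpers and the xyz extension are hypothetical and used only for illustration.

```python
# Default values as quoted in this article; confirm them against your Funnelback version.
DEFAULTS = {
    "crawler.non_html": "pdf,doc,ps,ppt,xls,rtf",
    "filecopy.filetypes": "doc,docx,rtf,pdf,html,xls,xlsx,txt,htm,ppt,pptx",
    "trim.extracted_file_types": "*,doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,txt,htm,html,jpg,gif,tif,vmbx",
    "filter.tika.types": "doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm",
}

# Setting that lists accepted extensions for each collection type (from the examples above).
ACCEPT_KEY = {
    "web": "crawler.non_html",
    "filecopy": "filecopy.filetypes",
    "trimpush": "trim.extracted_file_types",
}


def add_to_list(cfg_text, key, ext):
    """Return cfg_text with ext present in the comma-separated value of key."""
    lines = cfg_text.splitlines()
    for i, line in enumerate(lines):
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            items = [v.strip() for v in value.split(",") if v.strip()]
            if ext not in items:
                lines[i] = key + "=" + ",".join(items + [ext])
            return "\n".join(lines)
    # Key not set: the default applies, so only override it if ext is missing from it.
    items = DEFAULTS[key].split(",")
    if ext not in items:
        lines.append(key + "=" + ",".join(items + [ext]))
    return "\n".join(lines)


def enable_extension(cfg_text, collection_type, ext):
    """Apply the accepted-files step (if any) and the filter.tika.types step."""
    keys = []
    if collection_type in ACCEPT_KEY:
        keys.append(ACCEPT_KEY[collection_type])
    keys.append("filter.tika.types")
    for key in keys:
        cfg_text = add_to_list(cfg_text, key, ext)
    return cfg_text


print(enable_extension("filecopy.filetypes=doc,pdf", "filecopy", "odt"))
print(enable_extension("", "trimpush", "xyz"))
```

For web collections the crawler.reject_files check from the web example still applies and is not handled here. A full update of the collection is needed after any of these changes.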
Example: Add an additional filetype using an external converter
The use of external converters is generally discouraged as there may be a significant impact on performance, because a separate system process is run for each document that is being filtered.
-
Install binaries. Ensure any extra binaries are installed onto the Funnelback server and made executable by the search user (or relevant Windows user account used to run updates).
-
Add any new binaries to executables.cfg and create a textify.cfg containing extension-to-command mappings.
-
Ensure that ExternalFilterProvider is included in the filter chain for the collection. The default value in collection.cfg is:
filter.classes=TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
-
Ensure that the filetype is added to the acceptable files for the collection using the Tika instructions above (web: crawler.non_html and optionally crawler.parser.mimeTypes; filecopy: filecopy.filetypes; TRIM/HP RM: trim.extracted_file_types).
-
If the external filter is overriding Tika then ensure that the file extension is removed from filter.tika.types.
-
Run a full update of the collection.