Configuring Funnelback to index additional file types

The web crawler is pre-configured to crawl and index a small set of document types that are commonly found on websites. This guide outlines how to index binary file types that are not included in that standard set.

Details

By default, Funnelback supports the indexing of HTML, Microsoft Office (Word/Excel/PowerPoint), RTF and text documents.

The binary formats are converted to text using Apache Tika, which supports a large number of document formats.

These formats can easily be added to the filetypes indexed by Funnelback as long as Tika can process the file format.

When considering additional formats to index, remember that Funnelback can only index text that is extracted from the file, which limits the set of file formats that are worth adding to the search. For many file formats, document metadata will be the only useful text that can be extracted.

Tika’s list of supported formats changes regularly, so be sure to check the list for the Tika version used by the version of Funnelback that will be installed.

The first approach discussed below does not convert the documents to text but uses metadata to describe the binary documents. The other approaches use Tika and external filtering in order to extract text from the binary documents.

Indexing non-textual files

For Funnelback to successfully index a document it needs to have a textual representation of the document. For text-based documents such as PDFs or Microsoft Word documents, filtering is used to extract the text contained within the document, and this extracted text is what Funnelback indexes.

For other file types such as multimedia (e.g. images, movies, sound files), filtering will only extract metadata that has been embedded within the files. This is normally not useful for search because the embedded metadata usually describes attributes of the file, such as the bit rate, the duration or the camera used to take a photo.

The best approach for indexing non-textual files is to index text that has been written to describe the files and associate it with each file’s URL. For a sound file or movie, index a transcript or write descriptive metadata such as a title, description and keywords, which can then be used as the text that describes the file.

When you use this approach to index non-textual documents, the actual files themselves do not need to be downloaded by Funnelback (e.g. if you’re doing a web crawl these can be in the exclude list). This is because the files themselves are not indexed: Funnelback indexes the XML record that describes each file and attaches the file’s URL to the search result.

Tutorial: Index non-textual files using an XML file listing

This tutorial shows how to use a simple XML file to describe a number of non-text files to add them to the search index.

Consider a site that has 3 non-text files:

  • An image (shakespeare.jpg) containing a picture of William Shakespeare

  • A sound file (hamlet.mp3) containing a radio performance of Hamlet

  • A video file (lear.mov) of a performance of King Lear

Steps
  1. Produce an XML file containing all the useful fielded information describing your files. This could be produced manually, or automatically generated from metadata/database information. e.g.

    <?xml version="1.0" encoding="UTF-8" ?>
    <files>
    	<file>
    		<title><![CDATA[Chandos portrait]]></title>
    		<uri>http://shakespeare.example.com/images/shakespeare.jpg</uri>
    		<description><![CDATA[The Chandos portrait is the most famous of the portraits that may depict William Shakespeare. Painted between 1600 and 1610, it may have
    served as the basis for the engraved portrait of Shakespeare used in the First Folio in 1623. It is named after the Dukes of Chandos, who formerly owned the
    painting.]]></description>
    		<author><![CDATA[John Taylor]]></author>
    		<date>1610</date>
    		<location><![CDATA[National Portrait Gallery, London]]></location>
    		<keywords>
    			<keyword><![CDATA[John Taylor]]></keyword>
    			<keyword><![CDATA[William Shakespeare]]></keyword>
    			<keyword><![CDATA[painting]]></keyword>
    		</keywords>
    		<filetype>Image</filetype>
    		<filesize>820kB</filesize>
    		<format>jpg</format>
    	</file>
    	<file>
    		<title><![CDATA[Hamlet]]></title>
    		<uri>http://shakespeare.example.com/radio/hamlet.mp3</uri>
    		<description><![CDATA[A full-text radio production of the play, co-produced by the BBC and the Renaissance Theatre Company. Features Kenneth Branagh as Hamlet,
    Derek Jacobi as Claudius, Judi Dench as Gertrude, and John Gielgud as the Ghost.]]></description>
    		<author><![CDATA[William Shakespeare]]></author>
    		<author><![CDATA[Kenneth Branagh]]></author>
    		<author><![CDATA[Derek Jacobi]]></author>
    		<author><![CDATA[Judi Dench]]></author>
    		<author><![CDATA[Renaissance Theatre Company]]></author>
    		<author><![CDATA[British Broadcasting Corporation]]></author>
    		<date>1992</date>
    		<keywords>
    			<keyword><![CDATA[William Shakespeare]]></keyword>
    			<keyword><![CDATA[audio]]></keyword>
    			<keyword><![CDATA[radio]]></keyword>
    			<keyword><![CDATA[BBC Radio 3]]></keyword>
    		</keywords>
    		<duration>235</duration>
    		<duration_units>min</duration_units>
    		<filetype>Sound recording</filetype>
    		<filesize>399.7MB</filesize>
    		<format>mp3</format>
    	</file>
    	<file>
    		<title><![CDATA[King Lear]]></title>
    		<uri>http://shakespeare.example.com/video/lear.mov</uri>
    		<description><![CDATA[King Lear is a 2018 British-American television film directed by Richard Eyre. An adaptation of the play of the same name by William
    Shakespeare, cut to just 115 minutes, was broadcast on BBC Two on 28 May 2018.]]></description>
    		<author><![CDATA[William Shakespeare]]></author>
    		<author><![CDATA[Richard Eyre]]></author>
    		<author><![CDATA[Jim Broadbent]]></author>
    		<author><![CDATA[Jim Carter]]></author>
    		<author><![CDATA[Tobias Menzies]]></author>
    		<author><![CDATA[Emily Watson]]></author>
    		<author><![CDATA[John Macmillan]]></author>
    		<author><![CDATA[Florence Pugh]]></author>
    		<author><![CDATA[Emma Thompson]]></author>
    		<author><![CDATA[Anthony Calf]]></author>
    		<author><![CDATA[Anthony Hopkins]]></author>
    		<author><![CDATA[Simon Manyonda]]></author>
    		<author><![CDATA[Chukwudi Iwuji]]></author>
    		<author><![CDATA[Karl Johnson]]></author>
    		<author><![CDATA[Samuel Valentine]]></author>
    		<author><![CDATA[Andrew Scott]]></author>
    		<author><![CDATA[Christopher Eccleston]]></author>
    		<date>2018</date>
    		<keywords>
    			<keyword><![CDATA[William Shakespeare]]></keyword>
    		</keywords>
    		<duration>115</duration>
    		<duration_units>min</duration_units>
    		<filetype>Video recording</filetype>
    		<filesize>4.8GB</filesize>
    		<format>mov</format>
    	</file>
    </files>
  2. Make the XML available at a web accessible address. (e.g. http://shakespeare.example.com/files.xml)

  3. Ensure that the XML file is included in your search. e.g. for a web data source you could add the XML’s URL to your start URLs.

  4. Update your search data source.

  5. Set the following XML processing options (Note the paths here are specific to the example XML above). This will split the XML document into multiple records, and assign the URL and filetype based on the contents of specified fields in the XML.

    1. XML document splitting: /files/file

    2. Document URL: /files/file/uri

    3. Document filetype: /files/file/format

  6. Create metadata mappings for all of the fields that you wish to include in the index. e.g.

    1. t: //title

    2. author: //author

    3. etc.

  7. Re-index the live view to incorporate the metadata.

  8. At this point you should see the additional results appearing in your search results. You will need to modify your template to display the result appropriately.
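
The XML listing from step 1 can also be generated automatically. The following is a minimal Python sketch, assuming the descriptive metadata is already available as a list of dictionaries. The names RECORDS, build_listing and split_records are hypothetical, and only a subset of the fields from the example is included. ElementTree escapes special characters instead of writing CDATA sections, which an XML parser treats the same way. The split_records function only imitates what the /files/file, uri and format paths from step 5 select; it is not how Funnelback itself processes the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical input records; in practice these might come from a database
# or a CMS export.
RECORDS = [
    {
        "title": "Chandos portrait",
        "uri": "http://shakespeare.example.com/images/shakespeare.jpg",
        "author": ["John Taylor"],
        "keywords": ["John Taylor", "William Shakespeare", "painting"],
        "filetype": "Image",
        "format": "jpg",
    },
    {
        "title": "Hamlet",
        "uri": "http://shakespeare.example.com/radio/hamlet.mp3",
        "author": ["William Shakespeare", "Kenneth Branagh"],
        "keywords": ["William Shakespeare", "audio", "radio"],
        "filetype": "Sound recording",
        "format": "mp3",
    },
]


def build_listing(records):
    """Build a <files> element containing one <file> child per record."""
    root = ET.Element("files")
    for record in records:
        file_el = ET.SubElement(root, "file")
        ET.SubElement(file_el, "title").text = record["title"]
        ET.SubElement(file_el, "uri").text = record["uri"]
        for author in record["author"]:
            ET.SubElement(file_el, "author").text = author
        keywords_el = ET.SubElement(file_el, "keywords")
        for keyword in record["keywords"]:
            ET.SubElement(keywords_el, "keyword").text = keyword
        ET.SubElement(file_el, "filetype").text = record["filetype"]
        ET.SubElement(file_el, "format").text = record["format"]
    return root


def split_records(xml_text):
    """Return one (url, format, title) tuple per /files/file element."""
    root = ET.fromstring(xml_text)
    return [
        (f.findtext("uri"), f.findtext("format"), f.findtext("title"))
        for f in root.findall("file")
    ]


# Serialise the listing, then read it back record by record.
xml_text = ET.tostring(build_listing(RECORDS), encoding="unicode")
for url, fmt, title in split_records(xml_text):
    print(url, fmt, title)
```

Reading the generated file back in this way is a quick check that every record has a URL and a format before the file is published for crawling.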

Add an additional filetype that is supported by Tika

The steps for adding additional filetypes vary depending on the data source type being used.

Example: Add additional Tika-supported filetypes to web data sources

  1. Ensure the filetype extension is not present in the crawler.reject_files list. The default value in the data source configuration is:

    crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip
  2. Set the parser MIME types. If you want links to be extracted (for crawl purposes) from the document, ensure that its MIME type is listed in the crawler.parser.mimeTypes list. Note: only text documents should be listed here. The default value in the data source configuration is:

    crawler.parser.mimeTypes=text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml
  3. Configure the non-HTML files list. Add the file extension of the new filetype to the crawler.non_html list. The default value in the data source configuration is:

    crawler.non_html=pdf,doc,ps,ppt,xls,rtf
  4. Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value in the data source configuration is:

    filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
  5. Run a full update of the data source.

Example: Add additional Tika-supported filetypes to filecopy data sources

  1. Add the file extension of the new filetype to the filecopy.filetypes list. The default value in the data source configuration is:

    filecopy.filetypes=doc,docx,rtf,pdf,html,xls,xlsx,txt,htm,ppt,pptx
  2. Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value in the data source configuration is:

    filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
  3. Run a full update of the data source.

Example: Add additional Tika-supported filetypes to trimpush data sources

  1. Add the file extension of the new filetype to the trim.extracted_file_types list. The default value in the data source configuration is:

    trim.extracted_file_types=*,doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,txt,htm,html,jpg,gif,tif,vmbx
  2. Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value in the data source configuration is:

    filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
  3. Run a full update of the data source.

Example: Add additional Tika-supported filetypes to other data source types

  1. Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value in the data source configuration is:

    filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
  2. Run a full update of the data source.
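
The per-data-source steps above follow one pattern: make sure the extension is accepted by the data source, then make sure Tika is asked to process it. The following Python sketch works out the setting values to apply for a given data source type. It is illustrative only: the function names and the dictionary-based configuration are assumptions, and it does not read or write real Funnelback configuration files. The crawler.reject_files value is shortened for the example, and the optional crawler.parser.mimeTypes step is left out because it depends on the MIME type rather than the extension. Always confirm separately that the Tika version in use can process the format.

```python
def csv_add(value, item):
    """Return the comma-separated list with item appended if it is missing."""
    items = [v for v in value.split(",") if v]
    if item not in items:
        items.append(item)
    return ",".join(items)


def csv_remove(value, item):
    """Return the comma-separated list with every occurrence of item removed."""
    return ",".join(v for v in value.split(",") if v and v != item)


# Setting that lists the accepted extensions for each data source type,
# taken from the examples above. Other types only need filter.tika.types.
ACCEPT_KEY = {
    "web": "crawler.non_html",
    "filecopy": "filecopy.filetypes",
    "trimpush": "trim.extracted_file_types",
}


def settings_for_new_filetype(source_type, ext, config):
    """Return the settings (key -> new value) needed to index ext."""
    updated = {}
    # Web data sources: the extension must not be in the reject list.
    if source_type == "web" and "crawler.reject_files" in config:
        updated["crawler.reject_files"] = csv_remove(
            config["crawler.reject_files"], ext
        )
    # Add the extension to the accepted file types for this data source type.
    accept_key = ACCEPT_KEY.get(source_type)
    if accept_key:
        updated[accept_key] = csv_add(config.get(accept_key, ""), ext)
    # All data source types: Tika must be configured to process the extension.
    updated["filter.tika.types"] = csv_add(
        config.get("filter.tika.types", ""), ext
    )
    return updated


config = {
    "crawler.reject_files": "jpg,mov,mp3,zip",  # shortened sample, not the full default
    "crawler.non_html": "pdf,doc,ps,ppt,xls,rtf",
    "filter.tika.types": "doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,"
    "jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm",
}

# odt is used purely as an illustration of an extension to add.
for key, value in settings_for_new_filetype("web", "odt", config).items():
    print(f"{key}={value}")
```

With the default values, odt is already present in filter.tika.types, so for a web data source only crawler.non_html changes. A full update of the data source is still required afterwards.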

Example: Add an additional filetype using an external converter

This feature is not available in the Squiz DXP.
The use of external converters is generally discouraged because a separate system process is run for each document that is filtered, which can have a significant impact on performance.
  1. Install binaries. Ensure any extra binaries are installed onto the Funnelback server and made executable by the search user (or relevant Windows user account used to run updates).

  2. Add any new binaries to executables.cfg and create a textify.cfg containing extension to command mappings.

  3. Ensure that ExternalFilterProvider is included in the filter chain for the data source. The default value in the data source configuration is:

    filter.classes=TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
  4. Ensure that the filetype is added to the acceptable files for the data source using the Tika instructions above (web: crawler.non_html and optionally crawler.parser.mimeTypes; filecopy: filecopy.filetypes; TRIM/HP RM: trim.extracted_file_types).

  5. If the external filter is overriding Tika then ensure that the file extension is removed from filter.tika.types.

  6. Run a full update of the data source.
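
Steps 3 to 5 above can be checked before running the update. The following Python sketch reports any of those conditions that a set of configuration values does not meet. The function name, the dictionary-based configuration and the use of ps as the example extension are assumptions made for illustration. Splitting filter.classes on both commas and colons is only a simple way to list the class names and says nothing about how the filter chain is evaluated.

```python
import re


def check_external_converter(config, ext, accept_key, overrides_tika=True):
    """Return a list of problems found for steps 3 to 5 (empty if none)."""
    problems = []
    # Step 3: ExternalFilterProvider must appear somewhere in the filter chain.
    classes = re.split(r"[,:]", config.get("filter.classes", ""))
    if "ExternalFilterProvider" not in classes:
        problems.append("ExternalFilterProvider is missing from filter.classes")
    # Step 4: the extension must be accepted by the data source.
    if ext not in config.get(accept_key, "").split(","):
        problems.append(f"{ext} is missing from {accept_key}")
    # Step 5: when the converter replaces Tika, Tika must not claim the extension.
    if overrides_tika and ext in config.get("filter.tika.types", "").split(","):
        problems.append(f"{ext} is still listed in filter.tika.types")
    return problems


# Default values quoted earlier in this guide, for a web data source.
config = {
    "filter.classes": "TikaFilterProvider,ExternalFilterProvider:"
    "JSoupProcessingFilterProvider:DocumentFixerFilterProvider",
    "crawler.non_html": "pdf,doc,ps,ppt,xls,rtf",
    "filter.tika.types": "doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,"
    "jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm",
}

print(check_external_converter(config, "ps", "crawler.non_html"))
print(check_external_converter(config, "pdf", "crawler.non_html"))
```

The check covers configuration values only. It does not verify that the converter binary is installed or that executables.cfg and textify.cfg are set up as described in steps 1 and 2.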