Funnelback can index XML documents, and some additional configuration files apply when indexing XML:
- You can map metadata classes to elements in the XML structure via xml.cfg.
- You can display cached copies of the document via XSLT Processing.
Example: Electronic books
Suppose you have a number of XML files representing electronic books, similar to:
    <book>
      <info>
        <title> The Adventures Of Sherlock Holmes </title>
        <author> Arthur Conan Doyle </author>
      </info>
      <contents>
        <chapter>A Scandal in Bohemia</chapter>
        <chapter>The Red-headed League</chapter>
        ...
        <chapter>The Adventure of the Copper Beeches</chapter>
      </contents>
    </book>
Because the data is plain XML, it needs no text conversion (unlike PDF files, for example), so you could use a local collection.
To map this XML structure to metadata classes for the author (a), title (t) and chapters (x), create the xml.cfg file containing:
    a,1,,//author
    t,1,,//title
    x,1,,/book/contents/chapter
When this data is indexed, the text from these elements is assigned to the specified metadata classes.
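To check which text each mapping would select before indexing, a rough preview can be sketched in Python. This is not how Funnelback processes xml.cfg; it only evaluates the XPath in the last field of each line against a sample document using the standard library. The function name preview_mappings is hypothetical, the second and third comma-separated fields are treated as opaque, and the sample includes only the chapters shown above.

```python
import xml.etree.ElementTree as ET

# Sample document from the example above (elided chapters are omitted).
BOOK_XML = """<book>
  <info>
    <title> The Adventures Of Sherlock Holmes </title>
    <author> Arthur Conan Doyle </author>
  </info>
  <contents>
    <chapter>A Scandal in Bohemia</chapter>
    <chapter>The Red-headed League</chapter>
    <chapter>The Adventure of the Copper Beeches</chapter>
  </contents>
</book>"""

# The xml.cfg content from the example above, one mapping per line.
XML_CFG = """a,1,,//author
t,1,,//title
x,1,,/book/contents/chapter"""


def preview_mappings(xml_text, cfg_text):
    """Return {metadata_class: [text, ...]} for each mapping line.

    Hypothetical helper for illustration only. ElementTree supports a
    limited XPath subset and rejects absolute paths on elements, so the
    document root is placed under a wrapper element and each path is
    evaluated relative to that wrapper by prefixing it with ".".
    """
    wrapper = ET.Element("wrapper")
    wrapper.append(ET.fromstring(xml_text))

    result = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: the first field is the metadata class and the last
        # is the XPath; the two middle fields are not interpreted here.
        metadata_class, _field2, _field3, xpath = line.split(",", 3)
        matches = wrapper.findall("." + xpath)
        result[metadata_class] = [(m.text or "").strip() for m in matches]
    return result


mapped = preview_mappings(BOOK_XML, XML_CFG)
for metadata_class, values in mapped.items():
    print(metadata_class, values)
```

With this sample, class a receives the author, t the title, and x one entry per chapter element. Only simple paths like the ones in this example are covered by ElementTree's XPath subset, so more complex expressions may need a fuller XPath library.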
Because this is a local collection, two configuration changes will help present the XML:
- Create the template.xsl script to convert the XML into HTML.
- Change the collection's search forms to use the cache_url instead of the live_url.
Crawling XML Files
To crawl XML files, ensure that the crawler.parser.mimeTypes parameter includes text/xml among the MIME types the web crawler will accept.