Delimited text, XML and JSON data sources

Background

This article describes how a web collection can be used to create an index of delimited text, XML and JSON data sources.

Data contained within delimited text files, XML and JSON can be indexed by Funnelback as additional data sources within and enterprise search.

The CSV filter includes support for CSV, TSV, Microsoft Excel and SQL files.

Delimited text, XML and JSON data sources that are available via HTTP should normally be indexed using a web collection and the filter framework used to convert the data to XML which is natively indexed by Funnelback.

The fields (once converted to XML) are then mapped to Funnelback metadata classes.

Method

  1. Create a web collection

    • Ensure that the URLs for the CSV/XML/JSON file(s) are added to the start URLs list and include patterns.

    • If any of the files being downloaded are larger than 10MB in size add the following options to collection.cfg to ensure the file(s) are all stored. Note: <SIZE-IN-MB> should be set to a value greater than the size of the largest file being downloaded.

      crawler.max_download_size=<SIZE-IN-MB>
      crawler.max_parse_size=<SIZE-IN-MB>
  2. Add filters and format specific settings

    • XML: No filters are required for XML as this is natively supported by Funnelback, however if the XML file is returned with the incorrect content type headers or is missing the XML declaration it may be detected as HTML and have the HTML parser apply. This can be seen in the Step.Index.log with the entry for the file being marked as HTML rather than XML. There is a ForceXMLMime that can be added to the filter chain to force all downloaded documents to be indentified as XML (similar to ForceJSONMime above).

    • CSV, TSV, SQL, MS Excel: Add the following to the filter_classes in collection.cfg:

      filter.classes=CSVToXml

      It may also be necessary to chain this filter with the corresponding filter to force the mime type if the web server is returning the files using an incorrect mime type:

      filter.classes=ForceCSVMime:CSVToXml

      For CSV there may be some additional settings to configure (such as if the file has a header row or the type of CSV file as this filter supports some other delimited file formats).

    • JSON: Add the following to the filter_classes in collection.cfg:

      filter.classes=JSONToXml

      It may also be necessary to chain this filter with the corresponding filter to force the mime type if the web server is returning the files using an incorrect mime type. E.g. or

      filter.classes=ForceJSONMime:JSONToXml
  3. Run a crawl

  4. Edit the metadata mappings. Note that CSV and JSON files will be converted to XML so the metadata mappings need to use the XPaths that correspond to the converted format. See the documentation for the CSVToXML and JSONToXML filters for more information on the mappings.

  5. Edit the XML processing options to configure a split path and other XML-related settings. Note: if additional filtering of the XML is required the see the article on splitting XML which provides details on how a filter can be used to split the XML.