Delimited text, XML and JSON data sources

Background

This article describes how a web data source can be used to create an index of delimited text, XML and JSON data sources.

Data contained within delimited text files, XML and JSON can be indexed by Funnelback as additional data sources within an enterprise search.

The CSV filter (CSVToXml) includes support for CSV, TSV, Microsoft Excel and SQL files.

Delimited text, XML and JSON data sources that are available via HTTP should normally be indexed using a web data source, with the filter framework converting the data to XML, which Funnelback indexes natively.

The fields (once converted to XML) are then mapped to Funnelback metadata classes.
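
As an illustration, a small JSON record such as the following (hypothetical data):

  {"id": 1, "title": "Annual report", "author": "Jane Smith"}

might be converted by the filter framework into XML along these lines (the exact structure and element names depend on the filter used; consult the JSONToXml filter documentation):

  <root>
    <id>1</id>
    <title>Annual report</title>
    <author>Jane Smith</author>
  </root>

The <title> and <author> elements could then be mapped to Funnelback metadata classes.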

Method

  1. Create a web data source

    • Ensure that the URLs for the CSV/XML/JSON file(s) are added to the start URLs list and the include patterns.

    • If any of the files being downloaded are larger than 10MB, add the following options to the data source configuration to ensure the file(s) are all stored. Note: SIZE_IN_MB should be set to a value greater than the size of the largest file being downloaded.

      crawler.max_download_size=SIZE_IN_MB
      crawler.max_parse_size=SIZE_IN_MB
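
      For example, if the largest file being downloaded is around 40MB, setting both options to 50 leaves some headroom (the values are illustrative):

      crawler.max_download_size=50
      crawler.max_parse_size=50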
  2. Enable the crawler's additional link extraction:

    In addition to setting the link extraction rules, you also need to ensure that the parser is run across the document. This is controlled by the crawler.parser.mimeTypes setting.

    crawler.use_additional_link_extraction=true
    crawler.parser.mimeTypes=application/json
    If you have large JSON or XML documents, you might need to increase crawler.max_parse_size for links to be extracted successfully. This should be set to a size large enough to process your largest document.
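
    For example, for an XML data source containing files up to about 30MB, the combined step 2 settings might look like the following (the MIME type list and size are illustrative; list the MIME types your server actually returns):

    crawler.use_additional_link_extraction=true
    crawler.parser.mimeTypes=application/xml,text/xml
    crawler.max_parse_size=40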
  3. Add filters and format specific settings

    • XML: No filters are required for XML as it is natively supported by Funnelback. However, if the XML file is returned with incorrect content-type headers or is missing the XML declaration, it may be detected as HTML and have the HTML parser applied. This can be seen in the Step.Index.log, where the entry for the file is marked as HTML rather than XML. A ForceXMLMime filter can be added to the filter chain to force all downloaded documents to be identified as XML (similar to the ForceJSONMime filter described below).
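
      For example, to force all downloaded documents to be identified as XML (assuming no other filters are needed in the chain):

      filter.classes=ForceXMLMime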

    • CSV, TSV, SQL, MS Excel: Add the following to filter.classes in the data source configuration:

      filter.classes=CSVToXml

      It may also be necessary to chain this filter with the corresponding filter to force the mime type if the web server is returning the files using an incorrect mime type:

      filter.classes=ForceCSVMime:CSVToXml

      For CSV there may be additional settings to configure, such as whether the file has a header row, or the specific delimited format to parse, as this filter also supports other delimited file formats.
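
      As an illustration, a small CSV file with a header row (hypothetical data):

      name,location
      Alice,Sydney

      is typically converted into XML with one element per column. The exact structure and element names are determined by the CSVToXml filter, so check its documentation before writing metadata mappings; an illustrative result might look like:

      <root>
        <row>
          <name>Alice</name>
          <location>Sydney</location>
        </row>
      </root>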

    • JSON: Add the following to filter.classes in the data source configuration:

      filter.classes=JSONToXml

      It may also be necessary to chain this filter with the corresponding filter to force the mime type if the web server is returning the files using an incorrect mime type:

      filter.classes=ForceJSONMime:JSONToXml
      The modify JSON data (modify-json-data) plugin can be enabled to transform the JSON data before it is indexed. This filter must run before the JSONToXml filter, and after the ForceJSONMime filter (if required). For example:
      filter.classes=ForceJSONMime:com.funnelback.plugin.modifyjsondata.JsonataStringFilter:JSONToXml
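
      For example, a JSONata expression along these lines could reshape a record before conversion (the field names are hypothetical, and the expression is supplied via the plugin's configuration; see the plugin documentation for details):

      {
        "title": productName,
        "description": summary
      }

      Applied to {"productName": "Widget", "summary": "A small widget"} this produces {"title": "Widget", "description": "A small widget"}, which the JSONToXml filter then converts to XML.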

  4. Run a crawl

  5. Edit the metadata mappings. Note that CSV and JSON files will be converted to XML, so the metadata mappings need to use XPaths that correspond to the converted format. See the documentation for the CSVToXml and JSONToXml filters for more information on the mappings.
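
    For example, if the converted XML follows the illustrative <root>/<row> structure sketched above, the metadata mappings might use XPaths such as (the element names are hypothetical; check the actual converted output):

      /root/row/name
      /root/row/location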

  6. Edit the XML processing options to configure a split path and other XML-related settings.

    If additional filtering of the XML is required, a plugin can be used to split the XML instead of configuring an indexer split path.
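
    For example, if each record in the converted XML is a <row> element as in the earlier sketch, the split path might be set as follows (illustrative; match it to your actual XML structure) so that the indexer treats each <row> as a separate document:

      /root/row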