Built-in filters - Convert JSON documents to XML (JSONToXML)

The JSONToXML filter converts JSON documents into XML documents.

Enabling

To enable the filter, add JSONToXML to the filter chain. A document is only filtered if its MIME type is application/json.

If all documents being gathered are JSON documents, you can place the ForceJSONMime filter before this filter (e.g. ForceJSONMime:JSONToXML). This sets the MIME type of every document to JSON so that all documents are converted.

As JSON is a data format and XML is a document format, the filter has to make minor modifications in order to produce valid XML. In general, JSON keys that are turned into XML element names may be modified as follows:

  • characters that cannot appear in an XML element name are replaced with an underscore, e.g. "a b" would become <a_b>.

  • if the element name does not start with a letter or underscore, an underscore is prepended, e.g. "123" would become <_123>.

Element content is also modified: characters that are not permitted in XML are removed. For example, "foo": "count\u0000down" would become <foo>countdown</foo>. In general, the filter attempts to produce valid XML 1.0.
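The rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the filter's actual implementation: the function names are hypothetical, and the set of valid name characters is deliberately simplified to ASCII letters, digits, underscore, hyphen and full stop.

```python
import re

# Assumption: a simplified, ASCII-only view of valid XML name characters.
_INVALID_NAME_CHARS = re.compile(r"[^A-Za-z0-9_.\-]")

# Characters permitted by the XML 1.0 specification; anything else is dropped.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_element_name(key: str) -> str:
    """Turn a JSON key into something usable as an XML element name."""
    name = _INVALID_NAME_CHARS.sub("_", key)
    # Element names must start with a letter or underscore.
    if not re.match(r"[A-Za-z_]", name):
        name = "_" + name
    return name


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that cannot appear in an XML 1.0 document."""
    return _INVALID_XML_CHARS.sub("", text)


print(sanitize_element_name("a b"))                  # a_b
print(sanitize_element_name("123"))                  # _123
print(strip_invalid_xml_chars("count\u0000down"))    # countdown
```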

Examples

To add the JSON to XML conversion to an existing filter chain:

filter.classes=<default_filter_chain>:JSONToXML

where <default_filter_chain> is the default value for filter.classes.

Sometimes the remote server sends an incorrect MIME type and JSON documents are not correctly identified. Most JSON-based data sources index only JSON, so an additional filter (ForceJSONMime) is provided to force the JSON MIME type on all files that are crawled.

To force the gatherer to treat all documents as JSON and convert all entries to XML:

filter.classes=<default_filter_chain>:ForceJSONMime:JSONToXML

Downloading JSON on web collections

To allow the web crawler to download JSON documents, you may need to add json to the crawler.non_html data source configuration option.

JSON to XML conversion example

The JSONToXML filter uses the field names from the JSON file when generating the XML for indexing.

For example the JSON file:

{
  "items": [
    {
      "title":"value",
      "subject":"value2",
      "url":"http://mysite/item45745.html"
    },
    {
      "title":"value3",
      "subject":"value4",
      "url":"http://mysite/item12544.html"
    }
  ]
}

is converted to:

<json>
  <items>
    <title>value</title>
    <subject>value2</subject>
    <url>http://mysite/item45745.html</url>
  </items>
  <items>
    <title>value3</title>
    <subject>value4</subject>
    <url>http://mysite/item12544.html</url>
  </items>
</json>

Example: Each object in an unnamed top-level array is placed inside its own <array> element.

[
  {
    "title":"value",
    "subject":"value2",
    "url":"http://mysite/item45745.html"
  },
  {
    "title":"value3",
    "subject":"value4",
    "url":"http://mysite/item12544.html"
  }
]

is converted to:

<json>
  <array>
    <title>value</title>
    <subject>value2</subject>
    <url>http://mysite/item45745.html</url>
  </array>
  <array>
    <title>value3</title>
    <subject>value4</subject>
    <url>http://mysite/item12544.html</url>
  </array>
</json>
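The mapping shown in the two examples above can be approximated with the following Python sketch. It is not the filter's implementation: the function names are hypothetical, it assumes keys are already valid element names, and it simply converts scalar values with str(). It is only meant to show how arrays repeat their parent element name and how an unnamed top-level array falls back to <array>.

```python
import json
from xml.sax.saxutils import escape


def to_xml(name, value, indent=1):
    """Return a list of XML lines for one JSON value stored under `name`."""
    pad = "  " * indent
    if isinstance(value, dict):
        lines = [f"{pad}<{name}>"]
        for key, child in value.items():
            lines.extend(to_xml(key, child, indent + 1))
        lines.append(f"{pad}</{name}>")
        return lines
    if isinstance(value, list):
        # Each array entry becomes its own element named after the array.
        lines = []
        for item in value:
            lines.extend(to_xml(name, item, indent))
        return lines
    # Scalars: escape &, < and > so the output stays well formed.
    return [f"{pad}<{name}>{escape(str(value))}</{name}>"]


def json_to_xml(text):
    """Wrap the converted document in a <json> root element."""
    data = json.loads(text)
    body = []
    if isinstance(data, dict):
        for key, child in data.items():
            body.extend(to_xml(key, child))
    else:
        # Assumption: an unnamed top-level value is emitted as <array>.
        body.extend(to_xml("array", data))
    return "\n".join(["<json>", *body, "</json>"])


print(json_to_xml('{"items": [{"title": "value"}, {"title": "value3"}]}'))
print(json_to_xml('[{"title": "value"}]'))
```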

The fields can be mapped to metadata using the normal rules for XML field mapping.

Metadata class configuration:

Metadata class name   Metadata class type   Source fields   Source type
itemTitle             text                  //title         XML
itemDescriptors       text                  //subject       XML
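To see which elements the source fields //title and //subject select, the following Python sketch evaluates equivalent paths against the converted XML using the standard library. It is only an illustration of the XPath selection, not of how the indexer applies metadata mappings; the extract_metadata helper is hypothetical, and ElementTree requires the relative form .//title rather than //title.

```python
import xml.etree.ElementTree as ET

XML_DOC = """<json>
  <items>
    <title>value</title>
    <subject>value2</subject>
    <url>http://mysite/item45745.html</url>
  </items>
  <items>
    <title>value3</title>
    <subject>value4</subject>
    <url>http://mysite/item12544.html</url>
  </items>
</json>"""

# Metadata class name -> source field, written in ElementTree's path syntax.
MAPPINGS = {
    "itemTitle": ".//title",
    "itemDescriptors": ".//subject",
}


def extract_metadata(xml_text, mappings):
    """Collect the text of every element matched by each source field."""
    root = ET.fromstring(xml_text)
    return {
        name: [element.text for element in root.findall(path)]
        for name, path in mappings.items()
    }


print(extract_metadata(XML_DOC, MAPPINGS))
```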

The following additional XML special configuration can optionally be set if one of the fields contains a URL that should be used as the target URL when the result for that row is clicked.