Built-in filters - Convert JSON documents to XML (JSONToXML)
The JSONToXML filter converts JSON documents into XML documents.
Enabling
To enable the filter, add JSONToXML to the filter chain. Documents will only be filtered if the document has the MIME type application/json.
If all documents being gathered are JSON documents, you can place the ForceJSONMime filter before this filter (e.g. ForceJSONMime:JSONToXML) to set the MIME type to JSON so that every document is filtered as JSON.
As JSON is a data format and XML is a document format, the filter has to make minor modifications in an attempt to create valid XML. In general, keys in JSON that are turned into XML elements may be modified such that:
- characters that cannot appear in an XML element name are replaced with an underscore, e.g. "a b" would become <a_b>.
- if the element name does not start with a letter or underscore, an underscore is prepended, e.g. "123" would become <_123>.
Element content is also modified so that characters that cannot appear in XML are removed, for example foo: "count\u0000down" would become <foo>countdown</foo>. In general, the filter tries to produce valid XML version 1.0.
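The renaming and clean-up rules above can be sketched in Python. This is an illustration under stated assumptions, not the filter's actual implementation: the function names are hypothetical, element names are restricted to an ASCII subset of the XML 1.0 name rules, and the set of permitted content characters is taken from the XML 1.0 specification.

```python
import re

# Assumption: only ASCII letters, digits, underscore, hyphen and full stop
# are kept in element names (a simplified subset of the XML 1.0 name rules).
_INVALID_NAME_CHAR = re.compile(r"[^A-Za-z0-9_.\-]")

# Characters permitted in XML 1.0 content; anything else is removed.
_INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitise_element_name(key: str) -> str:
    """Turn a JSON key into a usable XML element name (hypothetical helper)."""
    name = _INVALID_NAME_CHAR.sub("_", key)
    # Prepend an underscore when the name does not start with a letter or underscore.
    if not re.match(r"[A-Za-z_]", name):
        name = "_" + name
    return name


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that cannot appear in an XML 1.0 document."""
    return _INVALID_XML_CHAR.sub("", text)


print(sanitise_element_name("a b"))                # a_b
print(sanitise_element_name("123"))                # _123
print(strip_invalid_xml_chars("count\u0000down"))  # countdown
```

The real filter may accept a wider range of name characters (for example non-ASCII letters), so treat the exact character classes here as assumptions.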
Examples
To add the JSON to XML conversion to an existing filter chain:
filter.classes=<default_filter_chain>:JSONToXML
where <default_filter_chain> is the default value for filter.classes.
Sometimes the remote server sends an incorrect MIME type and JSON documents are not correctly identified. Most JSON-based data sources index only JSON, so the ForceJSONMime filter is provided to force the JSON MIME type on all files that are crawled.
To force the gatherer to treat all documents as JSON and convert all entries to XML:
filter.classes=<default_filter_chain>:ForceJSONMime:JSONToXML
Downloading JSON on web collections
To allow the web crawler to download JSON documents you may need to add json to the crawler.non_html data source configuration option.
JSON to XML conversion example
The JSONToXML filter uses the field names from the JSON file when generating the XML for indexing.
For example the JSON file:
{
  "items": [
    {
      "title":"value",
      "subject":"value2",
      "url":"http://mysite/item45745.html"
    },
    {
      "title":"value3",
      "subject":"value4",
      "url":"http://mysite/item12544.html"
    }
  ]
}
is converted to:
<json>
  <items>
    <title>value</title>
    <subject>value2</subject>
    <url>http://mysite/item45745.html</url>
  </items>
  <items>
    <title>value3</title>
    <subject>value4</subject>
    <url>http://mysite/item12544.html</url>
  </items>
</json>
Example: An un-named array of objects is placed inside an <array> element.
[
  {
    "title":"value",
    "subject":"value2",
    "url":"http://mysite/item45745.html"
  },
  {
    "title":"value3",
    "subject":"value4",
    "url":"http://mysite/item12544.html"
  }
]
is converted to:
<json>
  <array>
    <title>value</title>
    <subject>value2</subject>
    <url>http://mysite/item45745.html</url>
  </array>
  <array>
    <title>value3</title>
    <subject>value4</subject>
    <url>http://mysite/item12544.html</url>
  </array>
</json>
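The mapping shown in the two examples above can be approximated with a short Python sketch. It is not the filter's code: the function names are hypothetical, keys are used as element names without any renaming, and scalar values are rendered with str(), which may differ from how the real filter treats numbers, booleans and null.

```python
import json
from xml.sax.saxutils import escape


def to_xml(name, value):
    """Render one JSON value as XML under the element name `name`."""
    if isinstance(value, dict):
        inner = "".join(to_xml(k, v) for k, v in value.items())
        return f"<{name}>{inner}</{name}>"
    if isinstance(value, list):
        # Each array entry becomes its own element with the same name.
        return "".join(to_xml(name, item) for item in value)
    # Assumption: scalars are rendered with str() and XML-escaped.
    return f"<{name}>{escape(str(value))}</{name}>"


def json_to_xml(text):
    """Wrap the converted document in a <json> root element."""
    data = json.loads(text)
    if isinstance(data, dict):
        body = "".join(to_xml(k, v) for k, v in data.items())
    elif isinstance(data, list):
        # An un-named top-level array uses <array> as the element name.
        body = to_xml("array", data)
    else:
        raise ValueError("top-level scalars are not covered by this sketch")
    return f"<json>{body}</json>"


print(json_to_xml('{"items": [{"title": "value"}, {"title": "value3"}]}'))
# <json><items><title>value</title></items><items><title>value3</title></items></json>
```

The output is produced on a single line; the indentation in the examples above is for readability only.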
The fields can be mapped to metadata using the normal rules for XML field mapping.
Metadata class configuration:
Metadata class name | Metadata class type | Source fields | Source type |
---|---|---|---|
itemTitle | text | | XML |
itemDescriptors | text | | XML |
The following additional XML special configuration can optionally be set if one of the fields contains the URL that should be used as the target URL when the result for that item is clicked.
- XML document splitting: /json/items
- Document URL: /json/items/url
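To illustrate what these two settings mean, the following Python sketch splits a converted document on /json/items and reads each record's URL from /json/items/url. It only models the idea with the standard library's ElementTree; the function name and the way the paths are interpreted are assumptions, and the indexer's own splitting logic may behave differently.

```python
import xml.etree.ElementTree as ET


def split_records(xml_text, split_path="/json/items", url_path="/json/items/url"):
    """Return (url, record_xml) pairs, one per element matched by split_path."""
    root = ET.fromstring(xml_text)
    # ElementTree paths are relative to the root element, so peel off "json".
    root_name, _, record_path = split_path.strip("/").partition("/")
    if root.tag != root_name:
        return []
    # Path of the URL element relative to each record, e.g. "url".
    url_rel = url_path[len(split_path):].strip("/")
    results = []
    for record in root.findall(record_path):
        url = record.findtext(url_rel)
        results.append((url, ET.tostring(record, encoding="unicode")))
    return results


converted = (
    "<json>"
    "<items><title>value</title><url>http://mysite/item45745.html</url></items>"
    "<items><title>value3</title><url>http://mysite/item12544.html</url></items>"
    "</json>"
)
for url, record in split_records(converted):
    print(url)
```

Each <items> element is treated as a separate record, and the value of its <url> child is used as the address for that record.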