Crawling paginated XML or JSON

The web crawler extracts links from the content it downloads. It does this by parsing each downloaded document and applying a regular expression that, by default, is configured to extract links from HTML documents (i.e. from href and src attributes).

If you need to crawl a paginated XML response, it is unlikely that the default match rules will apply.

Fortunately, you can either enable the crawler's additional link extraction, or alter the regular expression that the crawler's parser uses to extract links, on a per-data-source basis.

Consider the following XML.

<?xml version="1.0" encoding="utf-8"?>
<feed>
	<item>
		<title>Item title 34</title>
		<id>34</id>
	</item>
	<item>
		<title>Item title 35</title>
		<id>35</id>
	</item>
	<item>
		<title>Item title 36</title>
		<id>36</id>
	</item>
	<pagination type="prev" url="http://server.com/api/items?page=6" />
	<pagination type="next" url="http://server.com/api/items?page=7" />
</feed>

To crawl this, we need to extract the link from the <pagination type="next" url="http://server.com/api/items?page=7" /> element.

The easiest way to crawl these links is to enable the crawler’s additional link extraction in the data source configuration.

crawler.use_additional_link_extraction=true

Alternatively, set the following to control the link extraction regular expression that the crawler's parser applies:

crawler.link_extraction_regular_expression=<pagination type="next" url="(.+?)" />
crawler.link_extraction_group=1

This tells the crawler that links will be extracted from anything that matches <pagination type="next" url="(.+?)" /> and that the first capture group will be used as the link (in this case we only capture a single group). Note: this assumes that the XML files are being crawled in their own separate data source - the altered link extraction rule will now only match the next page link in the XML above.
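If you want to sanity-check the pattern and capture group before running an update, a quick test outside of Funnelback can help. The following is a minimal Python sketch (not part of the product) that applies the same regular expression to the sample feed above and prints what the first capture group returns:

import re

# A fragment of the sample feed shown above.
xml = '''<pagination type="prev" url="http://server.com/api/items?page=6" />
<pagination type="next" url="http://server.com/api/items?page=7" />'''

# The same pattern as crawler.link_extraction_regular_expression above.
pattern = r'<pagination type="next" url="(.+?)" />'

for match in re.finditer(pattern, xml):
    # Group 1 is what crawler.link_extraction_group=1 selects.
    print(match.group(1))  # prints http://server.com/api/items?page=7

The crawler applies its own regular expression engine when it parses each page, so treat this only as a quick check that the pattern and capture group behave as intended.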

For the XML file to be parsed, the mime type returned by the web server must match one of the mime types in the crawler's parser mime types list. This defaults to:

crawler.parser.mimeTypes=text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml

Update this value if the XML is served with a non-standard mime type.
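If you are unsure what mime type the server is returning, you can inspect the Content-Type header directly. This is a minimal Python sketch, using the example feed URL from the sample above rather than a real endpoint:

import urllib.request

# Example feed URL from the sample above - substitute your own endpoint.
url = "http://server.com/api/items?page=1"

with urllib.request.urlopen(url) as response:
    # This is the mime type the crawler will see; compare it against crawler.parser.mimeTypes.
    print(response.headers.get("Content-Type"))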

If the files being parsed are very large, you may also need to increase the parse size, which controls how far into a document the crawler will parse for links.

e.g. increase the parse size to 15MB

crawler.max_parse_size=15
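To get a rough idea of whether this is necessary, check how large the paginated responses are. A minimal Python sketch, assuming the server reports a Content-Length header and again using the example URL from the sample above:

import urllib.request

# Example feed URL from the sample above - substitute your own endpoint.
url = "http://server.com/api/items?page=1"

request = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(request) as response:
    size_bytes = int(response.headers.get("Content-Length", "0"))
    # Compare against crawler.max_parse_size, which the setting above expresses in MB.
    print(f"{size_bytes / (1024 * 1024):.2f} MB")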

After adding the appropriate data source configuration settings, save and run an update.

This enables the crawler to follow and store the next page links listed in the XML. The pages will all be indexed by Funnelback as usual (so you can split and map the XML fields as normal).
