Crawling paginated XML or JSON

Introduction

The web crawler is designed to extract links from the content that it downloads. It does this by parsing the downloaded content and applying a regular expression that is, by default, configured to extract links from HTML documents (i.e. from href and src attributes).
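
By way of illustration, the Python sketch below applies a simplified expression of this kind to an HTML fragment. The pattern is an assumption for demonstration purposes only, not the crawler's actual default expression.

import re

# Illustrative only: a simplified stand-in for the kind of href/src
# expression the crawler applies by default. This is NOT Funnelback's
# actual default pattern.
html = '<a href="/about.html">About</a> <img src="/logo.png" alt="logo">'

link_pattern = re.compile(r'(?:href|src)\s*=\s*"([^"]+)"', re.IGNORECASE)

print(link_pattern.findall(html))
# ['/about.html', '/logo.png']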

If you need to crawl a paginated XML or JSON response, it's unlikely that the default match rules will apply. Fortunately, the regular expression used by the crawler's parser to extract links can be altered on a per-collection basis.

Consider the following XML response:

<?xml version="1.0" encoding="utf-8"?>
<feed>
	<item>
		<title>Item title 34</title>
		<id>34</id>
	</item>
	<item>
		<title>Item title 35</title>
		<id>35</id>
	</item>
	<item>
		<title>Item title 36</title>
		<id>36</id>
	</item>
	<pagination type="prev" url="http://server.com/api/items?page=6" />
	<pagination type="next" url="http://server.com/api/items?page=7" />
</feed>

To crawl this, we need to extract the link from <pagination type="next" url="http://server.com/api/items?page=7" />.

In collection.cfg, set the following options to control the link extraction that will be applied by the crawler's parser:

crawler.link_extraction_regular_expression=<pagination type="next" url="(.+?)" />
crawler.link_extraction_group=1

This tells the crawler that links will be extracted from anything that matches <pagination type="next" url="(.+?)" /> and that the first capture group will be used as the link (in this case we are only capturing a single group). Note: this assumes that the XML files are being crawled in their own separate collection, since the altered link extraction rule will now only match the next page link in the above XML.
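
Before running an update it can be worth sanity checking the expression against a sample of the response. The following Python sketch is illustrative only (Funnelback evaluates the expression with its own regular expression engine), but it confirms that the pattern and capture group pick out the next page URL:

import re

# Sample of the paginated XML response shown above.
xml = '''<feed>
	<pagination type="prev" url="http://server.com/api/items?page=6" />
	<pagination type="next" url="http://server.com/api/items?page=7" />
</feed>'''

# The same expression and capture group as configured in collection.cfg.
pattern = re.compile(r'<pagination type="next" url="(.+?)" />')

match = pattern.search(xml)
if match:
    print(match.group(1))  # http://server.com/api/items?page=7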

In order for the XML file to be parsed, the mime type returned by the web server must match one of the mime types in the crawler's parser mime types list. This defaults to:

crawler.parser.mimeTypes=text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml

Update this value if the XML is served with a non-standard mime type.
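
For example, if the server returned the feed with a hypothetical application/vnd.example+xml content type, that type could be appended to the list:

crawler.parser.mimeTypes=text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml,application/vnd.example+xml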

If the files being parsed are very large you may also need to increase the parse size, which controls how far into a document the crawler will parse for links.

e.g. to increase the parse size to 15MB:

crawler.max_parse_size=15

After adding the appropriate collection.cfg settings, save and run an update.

This enables the crawler to follow and store the next page links listed in the XML. The stored pages are then indexed by Funnelback as normal (so you can split and map the XML fields in the usual way).
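
The same approach applies to paginated JSON. For example, given a hypothetical response such as:

{
	"items": [
		{"title": "Item title 34", "id": 34},
		{"title": "Item title 35", "id": 35}
	],
	"prev": "http://server.com/api/items?page=6",
	"next": "http://server.com/api/items?page=7"
}

the next page link could be extracted with settings along these lines (the exact pattern will depend on how your API formats its responses, and on any escaping required by your configuration file):

crawler.link_extraction_regular_expression="next"\s*:\s*"(.+?)"
crawler.link_extraction_group=1

As application/json is already present in the default crawler.parser.mimeTypes list, no mime type change should be needed in this case.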