Web data source


A web data source is a set of documents gathered from one or more websites. Web data sources contain HTML, PDF, and Microsoft Office files collected by a web crawler, which discovers content by following the links it finds.

To avoid crawling the entire Internet, the crawler uses a number of data source configuration options to determine which links it will follow and which websites or domains it should limit its crawl to.

Web data source basics

The web crawler starts by accessing the URLs defined in the seed list and extracting the links from those pages. A number of checks are performed on each link as it is extracted to determine whether it should be crawled: the link is compared against a set of include/exclude rules and a number of other settings (such as acceptable file types) to determine whether it is suitable for inclusion in the index. A link that is deemed suitable is added to a list of un-crawled URLs called the crawl frontier.
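The suitability checks described above can be sketched as a simple filter function. This is a minimal illustration, not Funnelback's actual implementation; the pattern-matching semantics, the example patterns, and the accepted extensions are all assumptions made for the sketch (here patterns are treated as plain substrings).

```python
from urllib.parse import urlparse

# Hypothetical rules for illustration; a real data source defines its own.
INCLUDE_PATTERNS = ["example.com"]              # link must match at least one
EXCLUDE_PATTERNS = ["/login", ".cgi"]           # link must match none
ACCEPTED_EXTENSIONS = {"", ".html", ".htm", ".pdf", ".doc", ".docx"}

def is_crawlable(url: str) -> bool:
    """Return True if a link passes the include/exclude and file-type checks."""
    # The link must match an include pattern...
    if not any(p in url for p in INCLUDE_PATTERNS):
        return False
    # ...and must not match any exclude pattern.
    if any(p in url for p in EXCLUDE_PATTERNS):
        return False
    # Finally, check the apparent file type against the accepted extensions.
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    ext = last_segment[last_segment.rfind("."):] if "." in last_segment else ""
    return ext.lower() in ACCEPTED_EXTENSIONS
```

A link such as `http://example.com/docs/guide.html` passes all three checks, while `http://example.com/login` is rejected by the exclude rules and `http://other.org/page.html` by the include rules.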

The web crawler continues to run, taking URLs off the crawl frontier, extracting the links in each page, and checking them against the include/exclude rules, until the crawl frontier is empty or an overall timeout is reached.
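The overall crawl loop can be sketched as follows. This is a simplified illustration of the frontier behaviour described above, not the product's implementation; the `fetch_links` and `is_crawlable` callables and the timeout value are placeholders supplied by the caller.

```python
import time
from collections import deque

def crawl(seed_urls, fetch_links, is_crawlable, timeout_seconds=3600):
    """Sketch of the frontier loop: pull a URL off the frontier, extract its
    links, queue the suitable ones, and stop when the frontier is empty or
    the overall timeout is reached."""
    frontier = deque(seed_urls)   # the list of un-crawled URLs
    seen = set(seed_urls)         # avoid queueing the same URL twice
    crawled = []
    deadline = time.monotonic() + timeout_seconds
    while frontier and time.monotonic() < deadline:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):           # links extracted from the page
            if link not in seen and is_crawlable(link):
                seen.add(link)
                frontier.append(link)
    return crawled
```

In practice `fetch_links` would download and parse each page; here it is left abstract so the loop's termination conditions (empty frontier or timeout) stay visible.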

Creating a web data source

A web data source requires, at a minimum, a set of seed URLs and include/exclude patterns to be defined when setting up the data source.

To create a new web data source:

  1. Follow the steps to create a data source, choosing web as the data source type.

See: Include and exclude patterns for information about defining what should be included (and excluded) from a web crawl.

Web crawler logs

The web crawler writes a number of log files detailing various aspects of the crawl. See: web crawler logs

© 2015- Squiz Pty Ltd