Data reports

Introduction

Data reports provide statistics on the data gathered during the update process. They are available only for web collections.

Reports

File sizes

The "file sizes" report displays statistics on content sizes found.

Content_size.png

File sizes by document type

The "file sizes by document type" report displays statistics on file sizes found, divided by content type.

Content_size_by_type.png

File types by MIME

The "file types by MIME" report displays statistics on document types as reported by the web server. A significant difference between the document types reported here and those in the "file types by suffix" report may indicate that a web server is serving documents with an incorrect content type.

Types_by_mime.png

File types by suffix

The "file types by suffix" report displays statistics on document types as identified by the file suffix.

Types_by_suffix.png
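
The comparison between the two file type reports can be illustrated with a short sketch. It is not part of Funnelback: the suffix table, the function names and the sample records are all assumptions made for illustration, and the crawler's own type detection may differ.

```python
from os.path import splitext
from urllib.parse import urlparse

# Illustrative suffix-to-type table (an assumption, not Funnelback's own).
SUFFIX_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".pdf": "application/pdf",
    ".css": "text/css",
    ".png": "image/png",
}


def suffix_type(url):
    """Guess a content type from the suffix of the URL path."""
    suffix = splitext(urlparse(url).path)[1].lower()
    return SUFFIX_TYPES.get(suffix, "unknown")


def mismatches(records):
    """Return (url, served_type, suffix_guess) where the two disagree."""
    result = []
    for url, content_type in records:
        # Drop parameters such as "; charset=utf-8" before comparing.
        served = content_type.split(";")[0].strip().lower()
        guess = suffix_type(url)
        if guess != "unknown" and guess != served:
            result.append((url, served, guess))
    return result


# Hypothetical (URL, Content-Type header) pairs.
records = [
    ("http://example.com/index.html", "text/html; charset=utf-8"),
    ("http://example.com/guide.pdf", "text/html"),
    ("http://example.com/style.css", "text/css"),
]

for url, served, guess in mismatches(records):
    print(f"{url}: served as {served}, suffix suggests {guess}")
```

In this sample only guide.pdf is flagged, because the server reports it as text/html while its suffix suggests a PDF.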

Files per site

The "files per site" report displays statistics on the number of pages seen per site.

Pages_per_site.png

Referenced file types

The "referenced file types" report displays statistics on the types of files referenced by crawled documents. For example, one HTML page may reference 5 other HTML pages, 3 images, 1 CSS file and 4 directories. Directories are displayed as "unknown" referenced types. This report applies only to HTML documents.

Referenced_types.png
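
As a rough illustration of the counting described above, the following sketch tallies the references in one HTML page by URL suffix. It is not Funnelback's implementation: the class name, the function name and the sample markup are hypothetical, and it assumes that references appear only in href and src attributes.

```python
from collections import Counter
from html.parser import HTMLParser
from os.path import splitext
from urllib.parse import urlparse


class ReferenceCounter(HTMLParser):
    """Counts href/src references in a page, grouped by URL suffix."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                suffix = splitext(urlparse(value).path)[1].lower()
                # References without a suffix (e.g. directories) are "unknown".
                self.counts[suffix.lstrip(".") or "unknown"] += 1


def count_references(html):
    parser = ReferenceCounter()
    parser.feed(html)
    return parser.counts


# Hypothetical page fragment.
sample = (
    '<a href="/about.html">About</a>'
    '<img src="logo.png">'
    '<link rel="stylesheet" href="site.css">'
    '<a href="/docs/">Docs</a>'
)

print(count_references(sample))
```

The link to /docs/ has no suffix, so it is counted under "unknown", matching how the report treats directories.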

URL lengths

The "URL lengths" report displays statistics on the lengths of URLs encountered by the crawler. Documents with short URLs may be upweighted by the indexer and tend to be preferred by users.

Url_lengths.png
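
A minimal sketch of how URL lengths might be grouped into ranges for a report like this one. The bucket size of 10 characters, the function name and the sample URLs are assumptions chosen for illustration; the actual report may group lengths differently.

```python
from collections import Counter


def url_length_histogram(urls, bucket_size=10):
    """Group URLs into (low, high) length ranges and count each range."""
    counts = Counter()
    for url in urls:
        low = (len(url) // bucket_size) * bucket_size
        counts[(low, low + bucket_size - 1)] += 1
    # Sort by range so shorter URLs come first.
    return dict(sorted(counts.items()))


# Hypothetical URLs.
urls = [
    "http://example.com/",
    "http://example.com/about",
    "http://example.com/docs/guide/install.html",
]

for (low, high), count in url_length_histogram(urls).items():
    print(f"{low}-{high} characters: {count}")
```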

Web server types

The "web server types" report displays statistics on web servers that have served crawled documents.

Server_technologies.png

Broken link reports

These reports show which URLs in the crawled dataset contain broken links. Click the "site" link to drill down into a report. CSV exports of the data are also available.

Generating data reports

Data reports are enabled by default for web collections. They are based on statistics files (.stat) that the web crawler generates in the log directory. A script called make_report.pl processes these files and outputs a set of static HTML reports in $INSTALL_ROOT/admin/data_report/<coll-id>.
