Data reports


The data reports provide statistics on the data gathered during the update process. These reports are available only for web collections.


File sizes

The "file sizes" report displays statistics on the sizes of the files found.


File sizes by document type

The "file sizes by document type" report displays statistics on file sizes found, divided by content type.


File types by MIME

The "file types by MIME" report displays statistics on document types as reported by the web server. A significant difference between the document types reported here and those reported by the "file types by suffix" report may indicate a web server that is serving documents with an incorrect content type.
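The comparison described above can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the function names and the sample records are hypothetical, and the suffix lookup uses Python's standard mimetypes table rather than whatever mapping the crawler applies.

```python
import mimetypes
from urllib.parse import urlparse


def suffix_type(url):
    """Guess a MIME type from the suffix of the URL path; None if unknown."""
    guessed, _encoding = mimetypes.guess_type(urlparse(url).path)
    return guessed


def find_mismatches(records):
    """Return (url, served, guessed) for records whose two types disagree.

    `records` is an iterable of (url, Content-Type header value) pairs.
    URLs with no recognisable suffix are skipped.
    """
    mismatches = []
    for url, content_type in records:
        # Drop parameters such as "; charset=utf-8" before comparing.
        served = content_type.split(";", 1)[0].strip().lower()
        guessed = suffix_type(url)
        if guessed is not None and guessed != served:
            mismatches.append((url, served, guessed))
    return mismatches


# Hypothetical sample data, not taken from a real crawl.
sample = [
    ("http://example.com/report.pdf", "application/pdf"),
    ("http://example.com/manual.pdf", "text/html; charset=utf-8"),
    ("http://example.com/about/", "text/html"),
]
mismatches = find_mismatches(sample)
```

Here only manual.pdf is flagged, because the server reports it as text/html while its suffix suggests application/pdf.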


File types by suffix

The "file types by suffix" report displays statistics on document types as identified from the file suffix.


Files per site

The "files per site" report displays statistics on the number of pages seen per site.
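As a rough illustration of the kind of count this report presents, the following Python sketch groups a list of URLs by host name. It assumes that a "site" corresponds to the host part of the URL; the function name and sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def files_per_site(urls):
    """Count URLs per host name (assumed here to identify a site)."""
    return Counter(urlparse(url).netloc.lower() for url in urls)


# Hypothetical sample data, not taken from a real crawl.
crawled = [
    "http://example.com/",
    "http://example.com/about.html",
    "http://docs.example.com/guide.html",
]
per_site = files_per_site(crawled)
```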


Referenced file types

The "referenced file types" report displays statistics on the types of file referenced by crawled documents: for example, one HTML page may reference 5 other HTML pages, 3 images, 1 CSS file and 4 directories. Directories are displayed as "unknown" referenced types. This report applies only to HTML documents.
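The idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler's own logic: it assumes references are taken from href and src attributes and typed by suffix using Python's standard mimetypes table, with anything that has no recognisable suffix (such as a directory) counted as "unknown". The class and function names are hypothetical.

```python
import mimetypes
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes in an HTML page."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.targets.append(value)


def referenced_types(html_text):
    """Count referenced targets by suffix-guessed type ("unknown" if none)."""
    collector = LinkCollector()
    collector.feed(html_text)
    counts = Counter()
    for target in collector.targets:
        guessed, _encoding = mimetypes.guess_type(urlparse(target).path)
        counts[guessed or "unknown"] += 1
    return counts


# Hypothetical page fragment used only for illustration.
page = (
    '<a href="/docs/">Docs</a>'
    '<a href="page.html">Page</a>'
    '<img src="logo.png">'
    '<link rel="stylesheet" href="style.css">'
)
counts = referenced_types(page)
```

In this fragment the directory link /docs/ has no suffix, so it is counted under "unknown".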


URL lengths

The "URL lengths" report displays statistics on the lengths of URLs encountered by the crawler. Documents with short URLs may be upweighted by the indexer and are also preferred by users.
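For illustration, summary statistics of this kind could be computed as in the Python sketch below. The function name, the choice of statistics and the sample URLs are assumptions; the actual report may present different figures.

```python
import statistics


def url_length_stats(urls):
    """Return simple summary statistics for the character lengths of URLs."""
    lengths = [len(url) for url in urls]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


# Hypothetical sample data, not taken from a real crawl.
urls = [
    "http://example.com/",
    "http://example.com/a/b",
    "http://example.com/a/b/c/d",
]
stats = url_length_stats(urls)
```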


Web server types

The "web server types" report displays statistics on web servers that have served crawled documents.


Broken link reports

These reports show which URLs in the crawled dataset contain broken links. To drill down into a report, click the "site" link. CSV exports of the data are also available.

Generating data reports

Data reports are enabled by default for web collections and are based on the statistics files (.stat) that the web crawler generates in the log directory. These files are processed by a script, which outputs a set of static HTML reports in $SEARCH_HOME/admin/data_report/<collection>/

See also