The data reports provide statistics on data gathered during the update process. These reports are only available for web collections.
File sizes
The "file sizes" report displays statistics on the sizes of the content found.
File sizes by document type
The "file sizes by document type" report displays statistics on file sizes found, divided by content type.
File types by MIME
The "file types by MIME" report displays statistics on document types as reported by the web server. A significant difference between the document types reported here and those reported by the "file types by suffix" report may indicate that a web server is serving documents with an incorrect content type.
File types by suffix
The "file types by suffix" report displays statistics on document types as identified by the suffix of each document's URL.
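The comparison between the MIME and suffix reports can be illustrated with a small sketch. This is not how the reports themselves are produced; it only shows the idea of flagging documents whose server-reported content type disagrees with the type implied by the URL suffix. The `SUFFIX_TYPES` table and the `find_type_mismatches` function are hypothetical names introduced for this example.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Illustrative suffix-to-type table (an assumption for this sketch);
# a real crawler would use a much fuller mapping.
SUFFIX_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".pdf": "application/pdf",
    ".css": "text/css",
    ".png": "image/png",
}


def find_type_mismatches(documents):
    """Return (url, reported, expected) for documents whose types disagree.

    `documents` is an iterable of (url, content_type_header) pairs.
    """
    mismatches = []
    for url, content_type in documents:
        suffix = splitext(urlparse(url).path)[1].lower()
        expected = SUFFIX_TYPES.get(suffix)
        # Drop parameters such as "; charset=utf-8" before comparing.
        reported = content_type.split(";")[0].strip().lower()
        # URLs with an unknown or missing suffix are skipped, not flagged.
        if expected is not None and reported != expected:
            mismatches.append((url, reported, expected))
    return mismatches
```

A PDF served as text/html would be flagged by this check, which is the kind of misconfiguration the two reports help to reveal.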
Files per site
The "files per site" report displays statistics on the number of pages seen per site.
Referenced file types
The "referenced file types" report displays statistics on the types of files referenced by crawled documents. For example, one HTML page may reference 5 other HTML pages, 3 images, 1 CSS file and 4 directories. Directories are displayed as "unknown" referenced types. This report applies only to HTML documents.
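As a rough illustration of the counting described above, the following sketch tallies the suffixes of the targets referenced by a single HTML page, recording directories and other suffix-less targets as "unknown". It is an assumption about how such a count could be made, not the crawler's actual implementation; `ReferenceCounter` and `count_referenced_types` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser
from posixpath import splitext
from urllib.parse import urlparse


class ReferenceCounter(HTMLParser):
    """Tally the suffixes of everything a page links to or embeds."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                path = urlparse(value).path
                suffix = splitext(path)[1].lower().lstrip(".")
                # Directories and other suffix-less targets have no
                # type to report, so they are counted as "unknown".
                self.counts[suffix or "unknown"] += 1


def count_referenced_types(html):
    """Return a dict mapping referenced type to number of references."""
    parser = ReferenceCounter()
    parser.feed(html)
    return dict(parser.counts)
```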
URL lengths
The "URL lengths" report displays statistics on the lengths of URLs encountered by the crawler. Documents with short URLs may be upweighted by the indexer, and short URLs also tend to be preferred by users.
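The kind of figures such a report might summarise can be sketched as follows. The function name `url_length_summary` and the choice of statistics are assumptions made for illustration; the actual report may present different measures.

```python
from statistics import mean, median


def url_length_summary(urls):
    """Summarise the character lengths of a non-empty list of URLs."""
    lengths = [len(u) for u in urls]
    return {
        "count": len(lengths),
        "shortest": min(lengths),
        "longest": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }
```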
Web server types
The "web server types" report displays statistics on web servers that have served crawled documents.
Broken Link Reports
These reports give a breakdown of which URLs contain broken links in the crawled dataset. It is possible to drill down into the report by clicking on the "site" link. CSV exports of the data are also available.
Generating Data Reports
Data reports are enabled by default for web collections and are based on statistics files (.stat) generated by the web crawler in the log directory. A script called make_report.pl processes these files and outputs a set of static HTML reports in $SEARCH_HOME/admin/data_report/<collection>/.
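When checking whether reports have been generated, it can help to resolve the paths involved. The sketch below builds the report output directory from the pattern given above and lists the .stat files in a log directory. The helper names are hypothetical, and the log directory is passed in explicitly because its location is not stated here.

```python
import os
from pathlib import Path


def report_dir(collection, search_home=None):
    """Return the directory where the static HTML reports are written.

    Falls back to the SEARCH_HOME environment variable when
    `search_home` is not supplied.
    """
    home = search_home or os.environ["SEARCH_HOME"]
    return Path(home) / "admin" / "data_report" / collection


def list_stat_files(log_dir):
    """Return the .stat files in a crawler log directory, sorted by name."""
    return sorted(Path(log_dir).glob("*.stat"))
```

If `list_stat_files` returns an empty list for a web collection, the crawler has not written any statistics files to that directory, so there is nothing for make_report.pl to process.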