Logs reference

Background

This article describes most of the log files produced by Funnelback.

The following placeholders are used in file names throughout this document:

  • SEARCH-PACKAGE-ID - Search package ID, as displayed on the search package manage screen.

  • DATA-SOURCE-ID - Data source ID, as displayed on the data source manage screen.

  • RESULTS-PAGE-ID - Results page ID, as displayed on the results page manage screen.

Search package logs

User interface logs

modernui.Admin.log

  • Location: Search package - collection logs

  • Purpose: Linux only. Contains messages logged by the admin search endpoints, e.g. template error logging.

modernui.Public.log

  • Location: Search package - collection logs

  • Purpose: Linux only. Contains messages logged by the public search endpoints, e.g. template error logging.

Analytics update logs

pattern_analyser.log

  • Location: Search package - collection logs

  • Purpose: Logs the output from outliers-log-processing.pl. Logs messages for the pattern analyser reports build.

update_reports_launch.log

  • Location: Search package - collection logs

  • Purpose: Logs messages for the analytics report build.

update_reports.log

  • Location: Search package - collection logs

  • Purpose: Logs the output from reports-load-queries.pl. Logs messages for the analytics reports build.

update_reports_previous.log

  • Location: Search package - collection logs

  • Purpose: Messages from the previous reports build.

Results page logs

Step-BuildAutoCompletion.RESULTS-PAGE-ID.log

  • Location: Search package - live/offline logs

  • Purpose: Logs the output from build_autoc. Logging for each auto-completion index. An auto-completion index is built for each profile.

Data source logs (non-push data source)

Update logs - all data source types

DATA-SOURCE-ID.lock

  • Location: Data source - collection logs

  • Purpose: These files are used to prevent multiple updates running simultaneously (by taking an OS lock on the file). They are generally empty and so contain no useful information.
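
The locking idea described above can be sketched in Python as follows. This is a minimal illustration only, assuming a Linux/Unix system (it relies on the standard fcntl module); it is not how Funnelback itself takes the lock. The function name try_lock and the file name example-data-source.lock are hypothetical.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking OS lock on path.

    Returns the open file object on success (keep it open to hold the lock),
    or None if the lock is already held elsewhere.
    """
    handle = open(path, "a")  # creates the (empty) lock file if needed
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


# Hypothetical lock file, created in a temporary directory for the demo.
lock_path = os.path.join(tempfile.mkdtemp(), "example-data-source.lock")
first = try_lock(lock_path)   # first "update" takes the lock
second = try_lock(lock_path)  # second "update" is refused while it is held
```

Because the lock lives on the open file rather than in its contents, the file stays empty, which matches the note above that these files contain nothing useful to read.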

DATA-SOURCE-ID.pre-update.log

  • Location: Data source - collection logs

  • Purpose: Logs the Java command that was run for the update.

update-DATA-SOURCE-ID.log

  • Location: Data source - collection logs

  • Purpose: Logs the output from update.pl. Logs messages for the data source update process.

update-DATA-SOURCE-ID.previous.log

  • Location: Data source - collection logs

  • Purpose: Messages from the previous data source update.

update.log

  • Location: Data source - live/offline logs

  • Purpose: Top level log for the update pipeline.

Gather logs - web and matrix data sources

binaries.log

  • Location: Data source - live/offline logs

  • Purpose: Lists the binary files stored (e.g. PDF, DOC). Appended to on restart from checkpoint.

copied_urls.log

  • Location: Data source - live/offline logs

  • Purpose: Lists all URLs whose content was copied from the previous crawl because it had not changed and so was not downloaded again. Output if crawler.incremental_logging=true.

crawler.log

  • Location: Data source - live/offline logs

  • Purpose: Contains filter messages.

crawler_logs_checkpoint_sizes.dat

  • Location: Data source - live/offline logs

  • Purpose: Records the sizes of crawler logs at the time a checkpoint occurred. This allows truncating them back to that size if the crawler is restarted from a checkpoint to avoid inconsistency.
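
The record-then-truncate approach described above can be sketched as follows. This is an illustration under stated assumptions: the format of the .dat file is not described here, so the sketch keeps the sizes in a plain dictionary, and the helper names record_sizes and truncate_to_checkpoint are hypothetical.

```python
import os
import tempfile


def record_sizes(log_paths):
    """Return a mapping of log path -> size in bytes at checkpoint time."""
    return {path: os.path.getsize(path) for path in log_paths}


def truncate_to_checkpoint(sizes):
    """Cut each log back to the size recorded at the checkpoint."""
    for path, size in sizes.items():
        with open(path, "r+b") as handle:
            handle.truncate(size)


# Demo with a throwaway log file in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "stored.log")
with open(log_path, "w") as handle:
    handle.write("http://example.com/a\n")

sizes = record_sizes([log_path])  # checkpoint taken here

with open(log_path, "a") as handle:
    handle.write("http://example.com/b\n")  # written after the checkpoint

truncate_to_checkpoint(sizes)  # simulate a restart from the checkpoint

with open(log_path) as handle:
    restored = handle.read()
```

After truncation the log contains only what had been written when the checkpoint was taken, so entries appended after a restart do not duplicate or contradict earlier ones.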

gather.log

  • Location: Data source - live/offline logs

  • Purpose: Top level log for the web crawler.

crawl.<N>.log

  • Location: Data source - live/offline logs

  • Purpose: Logs individual messages for each running crawler thread. One log per thread.

domains.log

  • Location: Data source - live/offline logs

  • Purpose: Frontier and stored document counts for each domain encountered during the crawl.

frontier_dump.log

  • Location: Data source - live/offline logs

  • Purpose: Contains a dump of the crawl frontier (set of known, uncrawled URLs).

headers.log

  • Location: Data source - live/offline logs

  • Purpose: Captures HTTP headers recorded during the crawl. Output if crawler.header_logging=true.

manifest.txt

  • Location: Data source - live/offline logs

  • Purpose: Records the order in which bundles were created.

monitor.log

  • Location: Data source - live/offline logs

  • Purpose: Records various crawl statistics.

new_urls.log

  • Location: Data source - live/offline logs

  • Purpose: Lists new URLs, where a new URL is one that was not stored in the previous crawl. Output if crawler.incremental_logging=true.
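
The definition of a new URL given above amounts to a set difference between two crawls. A minimal sketch, assuming each crawl's stored URLs are available as a simple list (the function name and sample URLs are hypothetical):

```python
def new_urls(previous_stored, current_stored):
    """Return URLs stored in the current crawl but not in the previous one.

    The order of the current crawl is preserved.
    """
    previous = set(previous_stored)
    return [url for url in current_stored if url not in previous]


previous_crawl = ["http://example.com/", "http://example.com/about"]
current_crawl = [
    "http://example.com/",
    "http://example.com/news",
    "http://example.com/about",
]
added = new_urls(previous_crawl, current_crawl)
```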

redirects.txt

  • Location: Data source - live/offline logs

  • Purpose: Captures redirects. Appended to on restart from checkpoint.

servers.log

  • Location: Data source - live/offline logs

  • Purpose: Frontier and stored document counts for each server encountered during the crawl.

stored.log

  • Location: Data source - live/offline logs

  • Purpose: All URLs stored during a crawl (also used for refresh updates), in chronological order. Appended to on restart from checkpoint.

store-messages.log

  • Location: Data source - live/offline logs

  • Purpose: Records URLs stored into a WARC/Mirror store as well as edit distance calculation logs and certain error/warning states of MirrorStore.

url_errors.log

  • Location: Data source - live/offline logs

  • Purpose: Records errors while processing URLs.

url_no_content.log

  • Location: Data source - live/offline logs

  • Purpose: Contains URLs which are stored despite having no content. Documents with no content are usually the result of a filter returning an empty document. This occurs when crawler.store_empty_content_urls=true.

url_titles.log

  • Location: Data source - live/offline logs

  • Purpose: Lists URLs and their titles.

BroadMIMETypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: The file types by MIME report displays statistics on document types as reported by the web server. A significant difference between the document types reported here and the document types reported by the types by suffix report may indicate a web server serving documents with an incorrect content type.

BroadWebServerTypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: Records web server types in general categories (anything after a / or ( is truncated), e.g. Apache is captured without the version.
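
The truncation rule described above can be illustrated with a short sketch. It assumes the input is the value of an HTTP Server header; the function name broad_server_type and the sample values are hypothetical, and the crawler's own implementation may differ.

```python
import re
from collections import Counter


def broad_server_type(server_header):
    """Keep only the text before the first '/' or '('."""
    return re.split(r"[/(]", server_header, maxsplit=1)[0].strip()


headers = [
    "Apache/2.4.57 (Unix)",
    "Apache/2.2.3",
    "nginx/1.25.3",
    "Microsoft-IIS/10.0",
]
counts = Counter(broad_server_type(header) for header in headers)
```

Both Apache versions fall into the single category Apache, which is the difference between this statistic and the versioned WebServerTypeStatistic.stat.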

CrawlSizeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: Records the total URLs stored by a web crawl.

FileSizeByDocumentTypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: The file sizes by document type report displays statistics on file sizes found, divided by content type.

FileSizeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: The file sizes report displays statistics on content sizes found.

MIMETypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: The file types by MIME report displays statistics on document types as reported by the web server. A significant difference between the document types reported here and the document types reported by the types by suffix report may indicate a web server serving documents with an incorrect content type.
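
The comparison between the MIME and suffix reports can be sketched as follows. This is an illustration only: the suffix-to-type table is a small hypothetical sample, the input records are invented, and the .stat file format is not parsed here.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical sample mapping; a real check would need a fuller table.
SUFFIX_TO_MIME = {
    ".pdf": "application/pdf",
    ".html": "text/html",
    ".doc": "application/msword",
}


def mismatches(records):
    """Return (url, reported, expected) where the reported type differs
    from the type suggested by the URL suffix."""
    found = []
    for url, reported in records:
        suffix = posixpath.splitext(urlparse(url).path)[1].lower()
        expected = SUFFIX_TO_MIME.get(suffix)
        # Ignore parameters such as "; charset=UTF-8" when comparing.
        reported_type = reported.split(";")[0].strip().lower()
        if expected is not None and reported_type != expected:
            found.append((url, reported, expected))
    return found


records = [
    ("http://example.com/a.pdf", "application/pdf"),
    ("http://example.com/b.pdf", "text/html; charset=UTF-8"),
    ("http://example.com/c.html", "text/html"),
]
suspect = mismatches(records)
```

A PDF served as text/html, as in the second record, is the kind of difference that may point to a misconfigured web server.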

ReferencedFileTypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: Records the file extensions of URLs seen in href/src HTML attributes even if those URLs would not be crawled.
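
Collecting extensions from href and src attributes can be sketched with the Python standard library. This is a rough illustration of the idea, not the crawler's implementation; the class name ReferencedTypes and the sample HTML are hypothetical.

```python
import posixpath
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class ReferencedTypes(HTMLParser):
    """Count file extensions of URLs found in href and src attributes."""

    def __init__(self):
        super().__init__()
        self.extensions = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                path = urlparse(value).path
                extension = posixpath.splitext(path)[1].lower()
                if extension:
                    self.extensions[extension] += 1


parser = ReferencedTypes()
parser.feed(
    '<a href="/docs/report.PDF">Report</a>'
    '<img src="logo.png">'
    '<a href="/about">About</a>'
)
```

The sketch counts every referenced URL regardless of whether it would pass any include or exclude rules, mirroring the note above that uncrawled URLs are still recorded.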

SuffixTypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: The file types by suffix report displays statistics on document types as identified by checking of the suffix.

URLlengthStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: Records the number of URLs of different lengths seen during a crawl.

WebServerTypeStatistic.stat

  • Location: Data source - live/offline logs

  • Purpose: Records statistics on the web server types (including versions etc.) seen during the crawl.

Gather logs - database data source

gather.log

  • Location: Data source - live/offline logs

  • Purpose: Logs messages from the database gatherer.

gather_executable.log

  • Location: Data source - live/offline logs

  • Purpose: Wrapper log for the gather process.

Gather logs - directory data source

gather.log

  • Location: Data source - live/offline logs

  • Purpose: Logs messages from the directory gatherer.

Gather logs - filecopy data source

gather_executable.log

  • Location: Data source - live/offline logs

  • Purpose: Wrapper log for the gather process.

gather.log

  • Location: Data source - live/offline logs

  • Purpose: Logs messages from file copier data source updates.

stored.log

  • Location: Data source - live/offline logs

  • Purpose: Lists the documents stored by the file copier.

monitor.log

  • Location: Data source - live/offline logs

  • Purpose: Records various statistics about the filecopier update.

url_errors.log

  • Location: Data source - live/offline logs

  • Purpose: Records errors while processing URLs.

Gather logs - custom data source

gather_executable.log

  • Location: Data source - live/offline logs

  • Purpose: Logs output from the custom gatherer.

Gather logs - TRIM (trimpush) data source

trim.log

  • Location: Data source - live/offline logs

  • Purpose: Logs output from the TRIM gatherer.

trim-details.log

  • Location: Data source - live/offline logs

  • Purpose: Logs output from the TRIM gatherer.

trim-combine-attachments.log

  • Location: Data source - live/offline logs

  • Purpose: Logs messages from the combine attachments process.

filter.log

  • Location: Data source - live/offline logs

  • Purpose: Logs filter messages from the TRIM update (filtering via the Funnelback daemon filter service).

monitor.log

  • Location: Data source - live/offline logs

  • Purpose: Provides various statistics about the TRIM gather process.

error.log

  • Location: Data source - live/offline logs

  • Purpose: Logs TRIM IDs of documents that recorded a failure.

Gather logs - Slack (slackpush) data source

gather_executable.log

  • Location: Data source - live/offline logs

  • Purpose: Logs output from the Slack gatherer.

Gather logs - Facebook data source

gather_executable.log

  • Location: Data source - live/offline logs

  • Purpose: Wrapper log for the gather process.

gather.log

  • Location: Data source - live/offline logs

  • Purpose: Contains detail for the Facebook gather process.

Gather logs - Flickr data source

gather_executable.log

  • Location: Data source - live/offline logs

  • Purpose: Wrapper log for the gather process.

gather.log

  • Location: Data source - live/offline logs

  • Purpose: Contains detail for the Flickr gather process.

Gather logs - Twitter data source

gather_executable.log

  • Location: Data source - live/offline logs

  • Purpose: Wrapper log for the gather process.

gather.log

  • Location: Data source - live/offline logs

  • Purpose: Contains detail for the Twitter gather process.

Gather logs - YouTube data source

gather_executable.log

  • Location: Data source - live/offline logs

  • Purpose: Wrapper log for the gather process.

gather.log

  • Location: Data source - live/offline logs

  • Purpose: Contains detail for the YouTube gather process.

Indexer logs - all data source types

Step-AnnieAPrimaryCollection.log

  • Location: Data source - live/offline logs

  • Purpose: Logs the output from annie-a. Logging for the build of the annotation index.

Step-BuildAutoCompletion.log

  • Location: Data source - live/offline logs

  • Purpose: Logs the output from build_autoc. Logging for the auto-completion index build.

Step-BuildCollapsingSignatures.log

  • Location: Data source - live/offline logs

  • Purpose: Logs the output from padre-cc. Logging for the collapsing index build process.

Step-BuildSpelling.log

  • Location: Data source - live/offline logs

  • Purpose: Logs the output from build_spelling_index. Logging for the spelling index build process.

Step-Index.log

  • Location: Data source - live/offline logs

  • Purpose: Logs the output from padre-iw. Logging for the data source index build process.

Step-QIEUpdate.log

  • Location: Data source - live/offline logs

  • Purpose: Logs the output from padre-qi. Logging for the query independent evidence index build process. (Previously called Step-QueryIndependentEvidenceCollectionLevel.log.)

Step-SetGscopes.log

  • Location: Data source - live/offline logs

  • Purpose: Logs the output from padre-gs. Logging for application of data source level gscopes.

Step-FacetBasedGscopes.log

  • Location: Data source - live/offline logs

  • Purpose: Logs output for setting up query facets.

Step-MoveTmpIndexIntoPlace.log

  • Location: Data source - live/offline logs

  • Purpose: Logs output for index move step.

Step-ClearGscopes.log

  • Location: Data source - live/offline logs

  • Purpose: Logs the output from padre-gs when clearing any previously set gscopes.

Step-ExactMatchKill.log

  • Location: Data source - live/offline logs

  • Purpose: Logs output from padre-fl for killing documents specified in kill_exact.cfg.

Step-FacetedNavigation.log

  • Location: Data source - live/offline logs

  • Purpose: Logs messages relating to the building of query-based faceted navigation.

Step-PartialMatchKill.log

  • Location: Data source - live/offline logs

  • Purpose: Logs output from padre-fl for killing documents specified in kill_partial.cfg.

Step-SecondaryIndex.log

  • Location: Data source - live/offline logs

  • Purpose: Logs messages relating to the creation of secondary indexes for an instant update.

Step-SetExtraRegexGscopes.log

  • Location: Data source - live/offline logs

  • Purpose: Logs messages relating to the setting of gscopes from other sources.

Push data source logs

Step-BuildCollapsingSignatures.log

  • Location: Data source - push live logs

  • Purpose: Logs the output from padre-cc. Logging for the collapsing index build process.

Step-BuildMatchOnlyIndexForAutoc.log

  • Location: Data source - push live logs

  • Purpose: Logs messages relating to the creation of a match-only index, which is used to build profile-scoped auto-completion indexes.

Step-ClearGscopes.log

  • Location: Data source - push live logs

  • Purpose: Logs the output from padre-gs when clearing any previously set gscopes.

Step-Index.log

  • Location: Data source - push live logs

  • Purpose: Logs the output from padre-iw. Logging for the index build process.

Step-KillDocuments.log

  • Location: Data source - push live logs

  • Purpose: Logs messages relating to the killing of documents as part of maintaining the integrity of the search index. Note: push data sources do not support the use of kill configuration files - the API should be used for this.

Step-QueryIndependentEvidenceCollectionLevel.log

  • Location: Data source - push live logs

  • Purpose: Logs the output from padre-qi. Logging for the query independent evidence index build process.

Step-SetExtraRegexGscopes.log

  • Location: Data source - push live logs

  • Purpose: Logs messages relating to the setting of gscopes from other sources.

Step-SortDocumentUrls.log

  • Location: Data source - push live logs

  • Purpose: Logs messages relating to the sorting of index.urls in order to speed up document kill operations.
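
One general reason a sorted URL list speeds up lookups is that it allows binary search instead of a full scan. The sketch below illustrates that general technique only; how the kill operation actually uses the sorted index.urls is not described here, and the function name find_url and the sample URLs are hypothetical.

```python
from bisect import bisect_left


def find_url(sorted_urls, target):
    """Return the position of target in sorted_urls, or -1 if absent.

    Uses binary search, so the list must already be sorted.
    """
    position = bisect_left(sorted_urls, target)
    if position < len(sorted_urls) and sorted_urls[position] == target:
        return position
    return -1


urls = sorted([
    "http://example.com/c",
    "http://example.com/a",
    "http://example.com/b",
])
hit = find_url(urls, "http://example.com/b")
miss = find_url(urls, "http://example.com/z")
```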

Step-annie-a.log

  • Location: Data source - push live logs

  • Purpose: Logs messages relating to the building of annotation indexes.

Step-BuildAutoCompletion.RESULTS-PAGE-ID.log

  • Location: Data source - push live logs

  • Purpose: Logs the output from build_autoc. Logging for each auto-completion index. An auto-completion index is built for each profile.

Step-BuildAutoCompletion.log

  • Location: Data source - push live logs

  • Purpose: Logs the output from build_autoc. Logging for the auto-completion index build.

Step-BuildSpelling.log

  • Location: Data source - push live logs

  • Purpose: Logs the output from build_spelling_index. Logging for the spelling index build process.