Logs reference
Background
This article describes most of the log files produced by Funnelback.
The following abbreviations are used in this document:
- SEARCH-PACKAGE-ID: Search package ID, as displayed on the search package manage screen.
- DATA-SOURCE-ID: Data source ID, as displayed on the data source manage screen.
- RESULTS-PAGE-ID: Results page ID, as displayed on the results page manage screen.
Search package logs
User interface logs
modernui.Admin.log
-
Location: Search package - collection logs
-
Purpose: Linux only. Contains messages logged by the admin search endpoints, e.g. template error logging.
modernui.Public.log
-
Location: Search package - collection logs
-
Purpose: Linux only. Contains messages logged by the public search endpoints, e.g. template error logging.
Analytics update logs
pattern_analyser.log
-
Location: Search package - collection logs
-
Purpose: Logs the output from outliers-log-processing.pl. Logs messages for the trend alerts reports build.
update_reports_launch.log
-
Location: Search package - collection logs
-
Purpose: Logs messages for the analytics report build.
update_reports.log
-
Location: Search package - collection logs
-
Purpose: Logs the output from reports-load-queries.pl. Logs messages for the analytics reports build.
update_reports_previous.log
-
Location: Search package - collection logs
-
Purpose: Messages from the previous reports build.
Results page logs
Step-BuildAutoCompletion.RESULTS-PAGE-ID.log
-
Location: Search package - live/offline logs
-
Purpose: Logs the output from build_autoc. Logging for each auto-completion index. An auto-completion index is built for each profile.
Data source logs (non-push data source)
Update logs - all data source types
DATA-SOURCE-ID.lock
-
Location: Data source - collection logs
-
Purpose: These files are used to prevent multiple updates running simultaneously (by taking an OS lock on the file). They are generally empty and so contain no useful information.
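The locking idea can be illustrated with a minimal Python sketch. This is not Funnelback's implementation; it only shows how an advisory OS lock on an otherwise empty file can stop a second update from starting. The function names and the lock file path are hypothetical, and `fcntl` is available on Linux/Unix only.

```python
import fcntl


def try_lock(path):
    """Return an open handle holding an exclusive lock, or None if the file is already locked."""
    # Append mode creates the file if it is missing and never writes to it,
    # so the lock file stays empty.
    handle = open(path, "a")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def release(handle):
    """Release the lock and close the file."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()
```

A caller that receives `None` would treat that as "another update is already running" and exit without doing any work.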
DATA-SOURCE-ID.pre-update.log
-
Location: Data source - collection logs
-
Purpose: Logs the java command that was run for the update.
update-DATA-SOURCE-ID.log
-
Location: Data source - collection logs
-
Purpose: Logs the output from update.pl. Logs messages for the data source update process.
update-DATA-SOURCE-ID.previous.log
-
Location: Data source - collection logs
-
Purpose: Messages from the previous data source update.
update.log
-
Location: Data source - live/offline logs
-
Purpose: Top level log for the update pipeline.
Gather logs - web and matrix data sources
binaries.log
-
Location: Data source - live/offline logs
-
Purpose: Lists the binary files stored (e.g. PDF, DOC). Appended to on restart from checkpoint.
copied_urls.log
-
Location: Data source - live/offline logs
-
Purpose: Lists all URLs whose content was copied from the previous crawl because they had not changed and so were not downloaded again. Output only if crawler.incremental_logging=true.
crawler.log
-
Location: Data source - live/offline logs
-
Purpose: Contains filter messages.
crawler_logs_checkpoint_sizes.dat
-
Location: Data source - live/offline logs
-
Purpose: Records the sizes of crawler logs at the time a checkpoint occurred. This allows truncating them back to that size if the crawler is restarted from a checkpoint to avoid inconsistency.
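The record-then-truncate idea can be sketched as follows. This is an illustration only: the on-disk format of crawler_logs_checkpoint_sizes.dat is not described here, so the sketch keeps the sizes in an in-memory dictionary, and both function names are hypothetical.

```python
import os


def record_sizes(log_paths):
    """At checkpoint time, remember the current size in bytes of each log."""
    return {path: os.path.getsize(path) for path in log_paths}


def truncate_to_checkpoint(sizes):
    """On restart, cut each log back to the size it had at the checkpoint."""
    for path, size in sizes.items():
        if os.path.exists(path) and os.path.getsize(path) > size:
            with open(path, "r+b") as handle:
                # Drops everything written after the checkpoint.
                handle.truncate(size)
```

Truncating removes entries written between the checkpoint and the interruption, so the resumed crawl does not log the same URLs twice.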
gather.log
-
Location: Data source - live/offline logs
-
Purpose: Top level log for the web crawler.
crawl.<N>.log
-
Location: Data source - live/offline logs
-
Purpose: Logs individual messages for each running crawler thread. One log per thread.
domains.log
-
Location: Data source - live/offline logs
-
Purpose: Frontier and stored document counts for each domain encountered during the crawl.
frontier_dump.log
-
Location: Data source - live/offline logs
-
Purpose: Contains a dump of the crawl frontier (set of known, uncrawled URLs).
headers.log
-
Location: Data source - live/offline logs
-
Purpose: Captures HTTP headers recorded during the crawl. Output only if crawler.header_logging=true.
manifest.txt
-
Location: Data source - live/offline logs
-
Purpose: Records the order in which bundles were created.
monitor.log
-
Location: Data source - live/offline logs
-
Purpose: Records various crawl statistics.
new_urls.log
-
Location: Data source - live/offline logs
-
Purpose: Lists new URLs. A new URL is defined as one which was not stored in the previous crawl. Output only if crawler.incremental_logging=true.
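The definition of a new URL amounts to a set difference between two crawls. The sketch below assumes the stored URLs of each crawl are already available as plain lists of strings; it does not parse any Funnelback log format, and the function name is hypothetical.

```python
def new_urls(current_stored, previous_stored):
    """Return URLs stored in the current crawl that were not stored in the previous one."""
    previous = set(previous_stored)
    # Keep the order in which the current crawl stored the URLs.
    return [url for url in current_stored if url not in previous]
```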
redirects.txt
-
Location: Data source - live/offline logs
-
Purpose: Captures redirects. Appended to on restart from checkpoint.
servers.log
-
Location: Data source - live/offline logs
-
Purpose: Frontier and stored document counts for each server encountered during the crawl.
stored.log
-
Location: Data source - live/offline logs
-
Purpose: All URLs stored during a crawl (also used for refresh updates), in chronological order. Appended to on restart from checkpoint.
store-messages.log
-
Location: Data source - live/offline logs
-
Purpose: Records URLs stored into a WARC/Mirror store as well as edit distance calculation logs and certain error/warning states of MirrorStore.
url_errors.log
-
Location: Data source - live/offline logs
-
Purpose: Records errors while processing URLs.
url_no_content.log
-
Location: Data source - live/offline logs
-
Purpose: Contains URLs which are stored despite having no content. Documents with no content are usually the result of a filter returning an empty document. These URLs are stored only when crawler.store_empty_content_urls=true.
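Several of the web crawler logs in this section depend on a configuration setting. As a rough aid, the sketch below reads key=value settings from a string and reports which optional logs to expect. The mapping is taken from the descriptions in this section; the parsing rules (blank lines and lines starting with # are skipped) and the function names are assumptions, not Funnelback behaviour.

```python
# Setting -> optional logs, as described in this section.
OPTIONAL_LOGS = {
    "crawler.incremental_logging": ["copied_urls.log", "new_urls.log"],
    "crawler.header_logging": ["headers.log"],
    "crawler.store_empty_content_urls": ["url_no_content.log"],
}


def parse_settings(text):
    """Parse key=value lines into a dictionary (assumed format)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def expected_optional_logs(settings):
    """List the optional logs whose controlling setting is set to true."""
    logs = []
    for key, names in OPTIONAL_LOGS.items():
        if settings.get(key, "").lower() == "true":
            logs.extend(names)
    return logs
```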
url_titles.log
-
Location: Data source - live/offline logs
-
Purpose: Lists URLs and their titles.
BroadMIMETypeStatistic.stat
-
Location: Data source - live/offline logs
-
Purpose: The file types by MIME report displays statistics on document types as reported by the web server. A significant difference between the document types reported here and the document types reported by the types by suffix report may indicate a web server serving documents with an incorrect content type.
BroadWebServerTypeStatistic.stat
-
Location: Data source - live/offline logs
-
Purpose: Records web server types in general categories. Anything after a / or ( is truncated, e.g. Apache is captured without the version.
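The truncation rule can be illustrated with a short sketch. It assumes the input is the value of an HTTP Server response header; the function name is hypothetical and the crawler's actual code may differ.

```python
import re
from collections import Counter


def broad_server_type(server_header):
    """Keep only the text before the first "/" or "(" in a Server header value."""
    return re.split(r"[/(]", server_header, maxsplit=1)[0].strip()


# Tally broad categories for a few sample header values.
samples = ["Apache/2.4.41 (Ubuntu)", "Apache (Unix)", "Microsoft-IIS/10.0", "nginx"]
counts = Counter(broad_server_type(value) for value in samples)
```

With these sample values, both Apache variants fall into a single "Apache" category, which is the difference between this file and WebServerTypeStatistic.stat.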
CrawlSizeStatistic.stat
-
Location: Data source - live/offline logs
-
Purpose: Records the total URLs stored by a web crawl.
FileSizeByDocumentTypeStatistic.stat
-
Location: Data source - live/offline logs
-
Purpose: The file sizes by document type report displays statistics on file sizes found, divided by content type.
FileSizeStatistic.stat
-
Location: Data source - live/offline logs
-
Purpose: The file sizes report displays statistics on content sizes found.
MIMETypeStatistic.stat
-
Location: Data source - live/offline logs
-
Purpose: The file types by MIME report displays statistics on document types as reported by the web server. A significant difference between the document types reported here and the document types reported by the types by suffix report may indicate a web server serving documents with an incorrect content type.
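The comparison between the MIME and suffix reports can also be made per URL. The sketch below flags a URL whose reported Content-Type disagrees with the type suggested by its suffix. It uses Python's standard mimetypes table as the suffix mapping, which is an assumption and may not match the mapping Funnelback uses; the function name is hypothetical.

```python
import mimetypes
from urllib.parse import urlparse


def content_type_mismatch(url, reported_content_type):
    """Return True if the URL suffix suggests a different type than the server reported."""
    guessed, _ = mimetypes.guess_type(urlparse(url).path)
    # Ignore parameters such as "; charset=UTF-8".
    reported = reported_content_type.split(";", 1)[0].strip().lower()
    # URLs without a recognised suffix cannot be checked this way.
    return guessed is not None and guessed != reported
```

For example, a URL ending in .pdf that is served as text/html would be flagged, which is the kind of misconfiguration the two reports help to reveal.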
ReferencedFileTypeStatistic.stat
-
Location: Data source - live/offline logs
-
Purpose: Records the file extensions of URLs seen in href/src HTML attributes even if those URLs would not be crawled.
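A minimal sketch of this kind of count, using only the Python standard library, is shown below. It is an illustration of the idea rather than the crawler's implementation; the class name and the "(none)" bucket for URLs without an extension are assumptions.

```python
import posixpath
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class ReferenceCollector(HTMLParser):
    """Count file extensions of URLs found in href and src attributes."""

    def __init__(self):
        super().__init__()
        self.extensions = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Use only the path, so query strings do not affect the suffix.
                path = urlparse(value).path
                extension = posixpath.splitext(path)[1].lower()
                self.extensions[extension or "(none)"] += 1


collector = ReferenceCollector()
collector.feed('<a href="/docs/report.PDF">Report</a><img src="logo.png">')
```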
SuffixTypeStatistic.stat
-
Location: Data source - live/offline logs
-
Purpose: The file types by suffix report displays statistics on document types as identified by checking of the suffix.
URLlengthStatistic.stat
-
Location: Data source - live/offline logs
-
Purpose: Records the number of URLs of different lengths seen during a crawl.
WebServerTypeStatistic.stat
-
Location: Data source - live/offline logs
-
Purpose: Records statistics on the web server types (including versions etc.) seen during the crawl.
Gather logs - database data source
gather.log
-
Location: Data source - live/offline logs
-
Purpose: Logs messages from the database gatherer.
gather_executable.log
-
Location: Data source - live/offline logs
-
Purpose: Wrapper log for the gather process.
Gather logs - directory data source
gather.log
-
Location: Data source - live/offline logs
-
Purpose: Logs messages from the directory gatherer.
Gather logs - filecopy data source
gather_executable.log
-
Location: Data source - live/offline logs
-
Purpose: Wrapper log for the gather process.
gather.log
-
Location: Data source - live/offline logs
-
Purpose: Logs messages from file copier data source updates.
stored.log
-
Location: Data source - live/offline logs
-
Purpose: Lists the documents stored by the file copier.
monitor.log
-
Location: Data source - live/offline logs
-
Purpose: Records various statistics about the filecopier update.
url_errors.log
-
Location: Data source - live/offline logs
-
Purpose: Records errors while processing URLs.
Gather logs - custom data source
gather_executable.log
-
Location: Data source - live/offline logs
-
Purpose: Logs output from the custom gatherer.
Gather logs - TRIM (trimpush) data source
trim.log
-
Location: Data source - live/offline logs
-
Purpose: Logs output from the TRIM gatherer.
trim-details.log
-
Location: Data source - live/offline logs
-
Purpose: Logs output from the TRIM gatherer.
trim-combine-attachments.log
-
Location: Data source - live/offline logs
-
Purpose: Logs messages from the combine attachments process.
filter.log
-
Location: Data source - live/offline logs
-
Purpose: Logs filter messages from the TRIM update (filtering via the Funnelback daemon filter service).
monitor.log
-
Location: Data source - live/offline logs
-
Purpose: Provides various statistics about the TRIM gather process.
error.log
-
Location: Data source - live/offline logs
-
Purpose: Logs TRIM IDs of documents that recorded a failure.
Gather logs - Slack (slackpush) data source
gather_executable.log
-
Location: Data source - live/offline logs
-
Purpose: Logs output from the Slack gatherer.
Gather logs - Facebook data source
gather_executable.log
-
Location: Data source - live/offline logs
-
Purpose: Wrapper log for the gather process.
gather.log
-
Location: Data source - live/offline logs
-
Purpose: Contains detail for the Facebook gather process.
Gather logs - Flickr data source
gather_executable.log
-
Location: Data source - live/offline logs
-
Purpose: Wrapper log for the gather process.
gather.log
-
Location: Data source - live/offline logs
-
Purpose: Contains detail for the Flickr gather process.
Gather logs - Twitter data source
gather_executable.log
-
Location: Data source - live/offline logs
-
Purpose: Wrapper log for the gather process.
gather.log
-
Location: Data source - live/offline logs
-
Purpose: Contains detail for the Twitter gather process.
Gather logs - YouTube data source
gather_executable.log
-
Location: Data source - live/offline logs
-
Purpose: Wrapper log for the gather process.
gather.log
-
Location: Data source - live/offline logs
-
Purpose: Contains detail for the YouTube gather process.
Indexer logs - all data source types
Step-AnnieAPrimaryCollection.log
-
Location: Data source - live/offline logs
-
Purpose: Logs the output from annie-a. Logging for the build of the annotation index.
Step-BuildAutoCompletion.log
-
Location: Data source - live/offline logs
-
Purpose: Logs the output from build_autoc. Logging for the auto-completion index build.
Step-BuildCollapsingSignatures.log
-
Location: Data source - live/offline logs
-
Purpose: Logs the output from padre-cc. Logging for the collapsing index build process.
Step-BuildSpelling.log
-
Location: Data source - live/offline logs
-
Purpose: Logs the output from build_spelling_index. Logging for the spelling index build process.
Step-Index.log
-
Location: Data source - live/offline logs
-
Purpose: Logs the output from padre-iw. Logging for the data source index build process.
Step-QIEUpdate.log
-
Location: Data source - live/offline logs
-
Purpose: Logs the output from padre-qi. Logging for the query independent evidence index build process. (Previously called Step-QueryIndependentEvidenceCollectionLevel.log.)
Step-SetGscopes.log
-
Location: Data source - live/offline logs
-
Purpose: Logs the output from padre-gs. Logging for application of data source level gscopes.
Step-FacetBasedGscopes.log
-
Location: Data source - live/offline logs
-
Purpose: Logs output for setting up query facets.
Step-MoveTmpIndexIntoPlace.log
-
Location: Data source - live/offline logs
-
Purpose: Logs output for index move step.
Step-ClearGscopes.log
-
Location: Data source - live/offline logs
-
Purpose: Logs the output from padre-gs when clearing any previously set gscopes.
Step-ExactMatchKill.log
-
Location: Data source - live/offline logs
-
Purpose: Logs output from padre-fl for killing documents specified in kill_exact.cfg.
Step-FacetedNavigation.log
-
Location: Data source - live/offline logs
-
Purpose: Logs messages relating to the building of query-based faceted navigation.
Step-PartialMatchKill.log
-
Location: Data source - live/offline logs
-
Purpose: Logs output from padre-fl for killing documents specified in kill_partial.cfg.
Step-SecondaryIndex.log
-
Location: Data source - live/offline logs
-
Purpose: Logs messages relating to the creation of secondary indexes for an instant update.
Step-SetExtraRegexGscopes.log
-
Location: Data source - live/offline logs
-
Purpose: Logs messages relating to the setting of gscopes from other sources.
Push data source logs
Step-BuildCollapsingSignatures.log
-
Location: Data source - push live logs
-
Purpose: Logs the output from padre-cc. Logging for the collapsing index build process.
Step-BuildMatchOnlyIndexForAutoc.log
-
Location: Data source - push live logs
-
Purpose: Logs messages relating to the creation of a match-only index, which is used to build profile-scoped auto-completion indexes.
Step-ClearGscopes.log
-
Location: Data source - push live logs
-
Purpose: Logs the output from padre-gs when clearing any previously set gscopes.
Step-Index.log
-
Location: Data source - push live logs
-
Purpose: Logs the output from padre-iw. Logging for the index build process.
Step-KillDocuments.log
-
Location: Data source - push live logs
-
Purpose: Logs messages relating to the killing of documents as part of maintaining the integrity of the search index. Note: push data sources do not support kill configuration files. Use the API to remove documents instead.
Step-QueryIndependentEvidenceCollectionLevel.log
-
Location: Data source - push live logs
-
Purpose: Logs the output from padre-qi. Logging for the query independent evidence index build process.
Step-SetExtraRegexGscopes.log
-
Location: Data source - push live logs
-
Purpose: Logs messages relating to the setting of gscopes from other sources.
Step-SortDocumentUrls.log
-
Location: Data source - push live logs
-
Purpose: Logs messages relating to the sorting of index.urls in order to speed up document kill operations.
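One general reason a sorted URL list speeds up lookups is that it allows binary search instead of a linear scan. The sketch below shows that idea only; it makes no claim about how Funnelback performs kill operations or about the format of index.urls, and the function name is hypothetical.

```python
from bisect import bisect_left


def contains_sorted(sorted_urls, url):
    """Check membership in a sorted list using binary search."""
    position = bisect_left(sorted_urls, url)
    return position < len(sorted_urls) and sorted_urls[position] == url


urls = sorted([
    "https://example.com/c",
    "https://example.com/a",
    "https://example.com/b",
])
```

Each lookup then takes a number of comparisons proportional to the logarithm of the list length rather than to the length itself.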
Step-annie-a.log
-
Location: Data source - push live logs
-
Purpose: Logs messages relating to the building of annotation indexes.
Step-BuildAutoCompletion.RESULTS-PAGE-ID.log
-
Location: Data source - push live logs
-
Purpose: Logs the output from build_autoc. Logging for each auto-completion index. An auto-completion index is built for each profile.
Step-BuildAutoCompletion.log
-
Location: Data source - push live logs
-
Purpose: Logs the output from build_autoc. Logging for the auto-completion index build.
Step-BuildSpelling.log
-
Location: Data source - push live logs
-
Purpose: Logs the output from build_spelling_index. Logging for the spelling index build process.