Diagnosing Collection Update Failures
Each time a collection is updated, an update log containing status information about each stage of the update process is created. These update logs are named
update-<collection>.log and can be viewed in the log viewer.
The update log contains several lines relating to the completion status of each stage in the update process. Depending on the collection type, the update stages include:
- Crawl pages from the Web, extract content from a database, or copy files from a filesystem
- Filter binary documents such as Word and PDF
- Eliminate duplicate documents
- Index the documents
- Swap the live and offline indexes
If any stage fails, the update log will show a non-zero exit code for that stage and a message stating that the stage has failed.
Sometimes it is possible to immediately diagnose the problem from the error message in the update log. Otherwise it will be necessary to examine the detailed log file of the stage that failed. These detailed logs can be viewed by clicking the links under the "Offline log files" folder in the log viewer.
Crawler halted early or failed to start
If the update log indicates a crawl failure then examine the
crawl.log. Common problems include:
- Licence key failure: Check that a valid licence key is installed and that the search machine has been configured with a fully qualified domain name. Sometimes adding an entry containing the full hostname to the system hosts file will correct this problem.
- Java not installed correctly: Check that Java is installed on the search machine and that the "java" setting in SEARCH_HOME/conf/executables.cfg is set correctly.
- Crawler can't access the seed page: The crawler may be blocked from the seed page as a result of network authentication, network problems, or an incorrect network protocol (e.g. specifying http instead of https).
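Two of the checks above can be scripted. The sketch below assumes a Unix host; on a real Funnelback server the JVM to test is whichever binary the "java" setting in SEARCH_HOME/conf/executables.cfg points to.

```shell
# Check that the machine reports a fully qualified domain name;
# licence key validation can fail when it does not.
hostname -f 2>/dev/null || hostname

# Check that a working JVM is available; on a Funnelback server, test the
# binary named by the "java" setting in SEARCH_HOME/conf/executables.cfg.
if command -v java >/dev/null 2>&1; then
  java -version 2>&1
else
  echo "java not found on PATH"
fi
```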
Can't stop or start a collection update (stale lockfile)
If a server loses power or is abnormally interrupted during an update, a lock file indicating that the update is still in progress may be left behind. This prevents new updates from beginning, and causes attempts to stop the update to fail with the message "The collection 'x' is not currently being updated".
Such stale lock files must be removed manually, either from the command line or the Windows GUI.
Note: Before removing lock files, please ensure that the update has been terminated by checking the server's process list for any perl processes running the update.pl script.
The lock files to remove in this situation are located under $SEARCH_HOME. On Windows, $SEARCH_HOME is C:\funnelback by default; on Unix systems it is /opt/funnelback.
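On Unix, the process check described in the note above can be done with ps and grep, as sketched below (on Windows, look for perl.exe running update.pl in Task Manager instead).

```shell
# List any perl processes still running update.pl; the bracketed first
# character in the pattern stops grep from matching its own process entry.
ps -ef | grep '[u]pdate\.pl' || echo "No update.pl processes running"
```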
No response from server when querying
- On the command line, you can check network access for your Funnelback host with ping. If you get a response, the network is probably OK and the fault is with the Funnelback server. If there is no response, the problem could be either a network fault or a Funnelback server fault.
- Go to the Funnelback server and check that it is powered on. Then check the network connections.
- Try to log in at the Funnelback console and look for error messages which indicate hardware failure.
- Check that the Web server is running. IIS users: if manual configuration of the IIS settings has caused the server to stop responding, running the SEARCH_HOME\bin\setup\configure-iis-for-funnelback.pl script will reconfigure the Funnelback-related IIS configuration.
- Check that there is no firewall blocking access to the search interface (normally port 80 or 8080) or the administration interface (normally port 8443 or 443).
- If there is no obvious hardware fault, halt the system (if you can).
- Power off the system for about 20 seconds and then power it on again. It should reboot. Rebooting may take quite a long time if it is necessary to check filesystems which didn't unmount cleanly.
- If there is a hardware fault, call the appropriate service number.
- If a disk drive has failed, you may need to restore from backup. Otherwise you should be able to power up and continue normal operation.
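The connectivity checks in the steps above can be run from any client machine. The helper below is an illustrative sketch using bash's /dev/tcp redirection, and funnelback.example.com is a placeholder for your actual Funnelback host.

```shell
# Report whether a TCP port on a host accepts connections (bash-specific).
check_port() {
  if (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; then
    echo "$1:$2 open"
  else
    echo "$1:$2 closed or unreachable"
  fi
}

# Typical checks for a Funnelback host (replace the hostname):
#   check_port funnelback.example.com 80     # public search interface
#   check_port funnelback.example.com 8443   # administration interface
```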
Pages missing from the index
Is the page really missing? For a URL http://www.missing.com/subdir/page
you can try the Funnelback query u:www.missing.com v:subdir/page. Assuming it is a plain query (no query scopes are in effect) this should show whether the page is indexed.
Check the collection's include and exclude lists, to make sure the page is meant to be included. These are visible in the Edit collection page, or in collection.cfg as include_patterns and exclude_patterns.
Check the web site itself, to see whether it allows its pages to be indexed. Web sites can request that certain pages be excluded, using robots.txt or robot meta tags, and Funnelback will obey such directives. If the missing page is
www.missing.com/subdir/page then check
www.missing.com/robots.txt to see if Funnelback is allowed to index the page. Similarly check whether page contains a NOINDEX or NOFOLLOW directive in a meta tag.
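The meta-tag side of this check can be done from the command line. The helper name below is ours, not a Funnelback tool; in practice you would pipe in the page fetched with, e.g., curl -s http://www.missing.com/subdir/page.

```shell
# Return success if the HTML on stdin contains a NOINDEX or NOFOLLOW
# robots directive (case-insensitive).
has_robots_directive() {
  grep -iqE 'noindex|nofollow'
}

# Example with an inline snippet; normally you would fetch the real page:
#   curl -s http://www.missing.com/subdir/page | has_robots_directive && echo "excluded by meta tag"
printf '%s\n' '<meta name="robots" content="NOINDEX, NOFOLLOW">' | has_robots_directive \
  && echo "excluded by meta tag"
```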
Check whether the pages are linked in. A common reason for a site or page to be missing is that nobody links to it. The ideal solution is to create links in pages which are already indexed, but another option is to simply add the page to the list of start URLs. One way of checking whether the page is linked to is to type the query h:"part.of.url/xyz".
Finally, you can log into the search box and change to the collection's log file directory (normally /opt/funnelback/data/_collection_/live/log). You can then search for any error messages relating to the URL to get more information, e.g. zgrep URL crawl.log.*
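The zgrep check can be exercised end to end. This demo fabricates a throwaway compressed log rather than assuming a real collection exists, and the "Rejected URL" message is an invented example line, not actual crawler output.

```shell
# In practice, run zgrep inside the collection's live/log directory.
# Here we create a rotated, compressed crawl log just to demonstrate.
logdir=$(mktemp -d)
printf 'Rejected URL http://www.missing.com/subdir/page (robots.txt)\n' \
  > "$logdir/crawl.log.0"
gzip "$logdir/crawl.log.0"

# zgrep searches compressed and uncompressed logs alike:
zgrep 'www.missing.com/subdir/page' "$logdir"/crawl.log.*
```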
Visual Studio Just-In-Time Debugger pop-up windows
Some external programs used by Funnelback for converting from one file type to another may not correctly handle all files, especially if the source file is corrupt. If such a program crashes, any server which has Visual Studio also installed may present a "Visual Studio Just-In-Time Debugger" window requiring user interaction before the conversion process can continue.
To avoid the need for this type of user interaction, the just-in-time debugger can be disabled by following the steps described by Microsoft at http://msdn.microsoft.com/en-us/library/5hs4b7a6.aspx.
Public search interface warnings
Some situations during Funnelback operation may cause failures during query processing which need to be addressed by a search administrator. Rather than display a warning message to the end user, these situations are logged to a file located at
$SEARCH_HOME/log/public-ui.warnings, which should always be writeable by all users (to ensure that the public search interface can write to this file).
When administrator action is required, the administration interface will display one of the messages below on the administration home page to alert the search administrator.
Warnings from the search interface are unavailable
This message will be displayed if the
$SEARCH_HOME/log/public-ui.warnings file does not appear to be writeable by all users. Since the public search interface may run as a different user to the administration interface, this file needs to be writeable by all users to ensure that the public interface may log warnings to it.
If this warning is displayed, please change the permissions on the
$SEARCH_HOME/log/public-ui.warnings file to allow all users to write to it. On Linux systems, this can be achieved by running the following command.
chmod 666 $SEARCH_HOME/log/public-ui.warnings
Warnings have been logged by the search interface
This message is displayed if the
$SEARCH_HOME/log/public-ui.warnings file contains any content, and is intended to alert the administrator to warning messages in this file which may require action. For example, the file may contain warning messages indicating that the query log files are not writeable by the public interface, which could then be addressed by the search administrator.
After addressing any warnings noted in the file, the administrator should remove all lines from the file to remove the warning from the administration interface.
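One way to clear the file is to truncate it rather than delete it, which preserves the world-writeable permissions the public interface needs. The SEARCH_HOME below is a scratch directory for demonstration only; on a real server it is the Funnelback install root.

```shell
# Demonstration setup: a scratch SEARCH_HOME with a populated warnings file.
SEARCH_HOME=$(mktemp -d)
mkdir -p "$SEARCH_HOME/log"
echo "query log files not writeable" > "$SEARCH_HOME/log/public-ui.warnings"

# Truncate to zero length (keeps the file and its permissions intact):
: > "$SEARCH_HOME/log/public-ui.warnings"
```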
Redis memory usage
The Redis memory storage system is used to store information for a number of Funnelback features. In very low memory situations, Redis may be unable to allocate enough memory to save a snapshot of the data it stores, resulting in an error message such as the following.
redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
Linux Funnelback servers can overcome this by turning on the kernel's overcommit_memory setting.
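On Linux this follows the standard Redis recommendation of setting vm.overcommit_memory to 1. The commands below must be run as root, so they are shown as a sketch rather than something to paste blindly.

```shell
# Apply immediately (run as root):
sysctl vm.overcommit_memory=1

# Persist the setting across reboots:
echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf
```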
Groovy command line scripts
The Groovy programming language is used by Funnelback to support customisation of a number of features. On machines with a pre-existing installation of Groovy it may be necessary to set the GROOVY_HOME environment variable to the version included in Funnelback (i.e. GROOVY_HOME=$SEARCH_HOME/tools/groovy-#.#.#/), otherwise the other (possibly incompatible) version of Groovy will be used.
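For example, before invoking one of Funnelback's Groovy scripts from a shell where another Groovy is installed, export the variable as below (the #.#.# placeholder stands for the bundled version number, as in the text above).

```shell
# Point GROOVY_HOME at the Groovy shipped with Funnelback so its
# command-line scripts use the bundled (compatible) version.
export GROOVY_HOME="$SEARCH_HOME/tools/groovy-#.#.#/"
```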