Supporting multiple query processors

Overview

Supporting multiple query processors is usually motivated by two requirements:

Architecture

To address these two requirements you can configure Funnelback in a live-live query processor configuration:

All the query processor machines are active and able to serve queries at the same time. Incoming search requests need to be distributed across the query processor servers by an external system, usually a load balancer. In the same fashion, taking a query processor server offline for maintenance is done at the load balancer level: Funnelback does not provide a facility to balance search requests among the query processor machines.

Workflow

This scenario is supported by configuring additional workflow steps to:

Support facilities

The additional workflow steps make use of two facilities to transfer files and trigger actions remotely on the query processors: WebDAV and custom web services. Those facilities are available over standard protocols and can be interacted with for other purposes if needed, either using Funnelback-provided command line tools or programmatically.

WebDAV

File transfers between two Funnelback servers rely on the WebDAV protocol. WebDAV is an extension of HTTP that adds methods for writing to a remote server (uploading a file, creating a folder, etc.), so existing tools (WebDAV clients) and libraries for your preferred language can interact with it.

Funnelback runs a WebDAV server whose root is the Funnelback installation folder ($SEARCH_HOME). It runs over HTTPS, on port 8076 by default.

Any change to this configuration requires a restart of the Funnelback Daemon service to take effect.

The WebDAV service requires authentication and uses the same user database as the Admin UI. Only administrator users can access the WebDAV service.

Since WebDAV is an extension of HTTP, you can easily access it with a web browser: simply point it at https://funnelback-server.com:8076/webdav/ and enter administrator credentials.
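Because it is plain HTTP(S), generic WebDAV-capable tools work as well. As a sketch, you could upload a configuration file with curl; the remote path below is hypothetical, and the -k flag (which skips certificate verification) is only appropriate if the server uses a self-signed certificate:

 # Upload (PUT) a file to a hypothetical path under the WebDAV root
 curl -k -u admin -T collection.cfg https://funnelback-server.com:8076/webdav/conf/intranet/collection.cfg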

Web services

Funnelback offers a set of web services that can trigger local or remote actions, such as transferring a complete collection to a remote host, or remotely swapping the views of a collection.

Those web services can be invoked using the $SEARCH_HOME/bin/mediator.pl command line tool, or via REST over HTTPS. The REST web services are deployed on Jetty, under the Admin UI context: /search/admin/mediator/.

To access a list of available web services, either:

For example, to push the intranet collection to a remote machine you could use:

mediator.pl PushCollection collection=intranet host=qp01.domain.com

Similarly, to remotely swap the views on a query processor server:

mediator.pl SwapViews@qp01.domain.com collection=intranet

The specific commands that are used to support multiple query processors are: PullLogs, SwapViews, PushConfigFile, PushView and PushCollection.
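For example, assuming PullLogs accepts the same collection and host parameters as PushCollection above (the exact arguments here are an assumption), pulling the query logs of the intranet collection from a query processor might look like:

 mediator.pl PullLogs collection=intranet host=qp01.domain.com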

For more information, please consult the mediator.pl documentation page.

Configuration

List of query processors

The list of query processor machines is configured either in each collection's collection.cfg, or globally in $SEARCH_HOME/conf/global.cfg, using the query-processors=... setting. This setting contains a comma-separated list of fully qualified host names for the query processors, and should be set on the admin server:

query-processors=qp01.domain.com,qp02.domain.com

The admin server and the query processors must all share the same server secret; see server secret (global.cfg) for details on how to check and change the server_secret.
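For instance, a quick way to compare the secrets is to inspect the server_secret line of each machine's global.cfg (the change procedure itself is covered on the server secret page):

 grep server_secret $SEARCH_HOME/conf/global.cfg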

The Admin UI and WebDAV service should also use the same ports on the admin server and the query processors, in global.cfg(.default):

 urls.admin_port=8443
 webdav-service.port=8076

It is recommended to disable WebDAV on the administration server, as this prevents any of the query processors from pushing to the admin server. In global.cfg on the admin server:

 daemon.services=GenerateCertificates,FilterService

The Admin UI should be set to read only mode for the query processors, in global.cfg(.default) on each query processor:

 admin.read-only-mode=true

Initial publication

Collections created on the administration server must initially be transferred manually to the remote query processors. To do so, run the following command from $SEARCH_HOME/bin/:

 $SEARCH_HOME/linbin/ActivePerl/bin/perl mediator.pl PushCollection collection=my_collection host=qp01.domain.com

...and repeat for each query processor. Note that this operation might take a while depending on the size of your index, but it will eventually complete.
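With several query processors, a small shell loop saves some typing. A sketch, reusing the host names from the earlier examples:

 # Run from $SEARCH_HOME/bin/, pushing the collection to each query processor in turn
 for qp in qp01.domain.com qp02.domain.com; do
   $SEARCH_HOME/linbin/ActivePerl/bin/perl mediator.pl PushCollection collection=my_collection host=$qp
 done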

Publication of configuration files

In order to publish modified configuration files to the remote query processors you need to set workflow.publish_hook=$SEARCH_HOME/bin/publish_hook.pl. It can be added on a per-collection basis in the collection's collection.cfg, or for all collections in $SEARCH_HOME/conf/collection.cfg.

This setting causes the hook script to be called whenever a file is published from a preview profile to a live profile using the Publish button in the file manager. The hook script iterates over the query processor list and pushes the updated configuration file to each one.
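Conceptually, the hook performs a loop along these lines. This is a sketch only: publish_hook.pl ships with Funnelback, and the PushConfigFile arguments shown are assumptions made for illustration; consult the mediator.pl documentation for the real parameter names:

 # Illustrative only -- push an updated file to every query processor
 for qp in qp01.domain.com qp02.domain.com; do
   mediator.pl PushConfigFile collection=intranet host=$qp file=conf/intranet/collection.cfg
 done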

Non-profile files cannot be published by default. To enable their publication, edit $SEARCH_HOME/conf/file-manager.ini and add a new line publish-to = REMOTE under the [file-manager::home] section:

[file-manager::home]
name = Config
path = $home
rules = main-rules
deletable = false
# Enable this in multi-server setups
publish-to = REMOTE

REMOTE is a special publication target that allows non-profile configuration files to be published. It enables a Publish button in the file manager for each non-profile file.

Note: Funnelback doesn't track the publication status of non-profile files. The Publish button will always be displayed, regardless of whether the file has changed or not since the last publication.

Collection update workflow

New indexes and data must be transferred to the query processors when a collection is updated. To do so, edit collection.cfg and set a post_archive_command:

post_archive_command=$SEARCH_HOME/linbin/ActivePerl/bin/perl $SEARCH_HOME/bin/publish_index.pl $COLLECTION_NAME

This script will iterate over the query processors and will:

Query reports and Analytics

Analytics updates are run on the Admin server. In order for the Admin server to present accurate reports, the click and query logs must be pulled from the remote query processors before the reports are updated. To do so, add a pre_reporting_command in collection.cfg:

pre_reporting_command=$SEARCH_HOME/linbin/ActivePerl/bin/perl $SEARCH_HOME/bin/pull_logs.pl $COLLECTION_NAME

This script will transfer the logs from the query processors into the archive folder of the collection on the Admin server. Only log files that are not already present will be transferred.

Meta collections

All of the above steps need to be configured for meta collections, except the collection update workflow (index pushing). Because meta collections are not updated like other collections, setting a post_archive_command will have no effect. Moreover, there is no index to transfer for a meta collection. The only thing that needs to be transferred to the query processors is the list of component collections. To do so, define workflow.publish_hook.meta in the meta collection configuration:

workflow.publish_hook.meta=$SEARCH_HOME/bin/publish_index.pl $COLLECTION_NAME

This script will be called whenever the meta collection is edited, such as when an administrator changes the list of component collections. It pushes the live/idx folder, which contains the list of component collections in the index.sdinfo file.
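To verify that the push worked, you can compare index.sdinfo on the admin server and on a query processor. The path below assumes the standard layout of a collection's data folder:

 cat $SEARCH_HOME/data/meta-collection-name/live/idx/index.sdinfo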

When query completion or spelling suggestions are enabled on the meta collection, the suggestion indexes are rebuilt whenever a component collection is updated. To make sure that the new suggestion indexes are published to the query processors, you also need to set the post_meta_dependencies_command on each component collection to publish the index of the meta collection:

post_meta_dependencies_command=$SEARCH_HOME/linbin/ActivePerl/bin/perl $SEARCH_HOME/bin/publish_index.pl meta-collection-name

Note: If the component collections interact with their parent meta collection in post_update_command (or other workflow commands) then you must update the workflow accordingly to push the meta collection after it has been modified.

The pre_reporting_command also needs to be set up in the meta collection configuration file (as described in the previous section) if you want to build Analytics reports for the meta collection.

Note: For the pull_logs.pl script to retrieve all relevant logs for meta collections, manual log archiving needs to be set up on the remote machines. This is because meta collection logs are normally not rotated or archived, as there is no update event.

Troubleshooting

The following logs might be helpful in diagnosing issues:

If connection refused errors are logged, check that the server_secret is set correctly on all machines, and that the WebDAV and Admin UI HTTPS ports are allowed by any firewalls between the admin server and the query processor machines.
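A quick way to check that both ports are reachable from the admin server is to request the WebDAV root and the Admin UI context with curl (-k skips certificate verification and is only appropriate with self-signed certificates):

 curl -k -u admin https://qp01.domain.com:8076/webdav/
 curl -k https://qp01.domain.com:8443/search/admin/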

Going further

Various alternatives are available if you need to perform actions or transfer files between two Funnelback servers that are not in the scope of this scenario:

In some cases, you may want to share the same Redis dataset between the query processors and the admin server. To do so, read configuring Redis to work with multiple servers.

See also