Assess an existing search for upgrade

This guide covers areas that you need to focus on when assessing an existing search for upgrade.

Server checks

Check these when beginning an audit on a new server:

Is any non-standard functionality set up to wrap Funnelback?

This check only applies to Squiz Cloud and on-premises instances.

Check for anything set at the server level or in the Funnelback global configuration that might be non-standard. This includes things like:

  • Firewall configurations

  • OpenResty or similar

Is OpenResty or similar used to perform any URL rewrites?

If OpenResty is being used to rewrite URLs then this will affect the ability to provide a like-for-like upgrade.

There is no facility for URL rewriting within the DXP, Funnelback multi-tenanted or dedicated hosted environments.

OpenResty is commonly used to map admin requests on HTTPS port 443 to the admin context in Jetty, and other requests to the HTTP public context in Jetty. This is not relevant in the DXP, Funnelback multi-tenanted or dedicated environments as the admin service has a dedicated administration URL.
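
A quick way to check for this on the server (a rough sketch; it assumes shell access and that OpenResty or nginx runs under one of its usual names and default configuration paths, which will vary between installs):

  # Look for a running OpenResty/nginx process
  ps -ef | grep -iE 'openresty|nginx' | grep -v grep
  # Look for rewrite rules in any OpenResty/nginx configuration that is found
  grep -ri 'rewrite' /usr/local/openresty/nginx/conf /etc/nginx 2>/dev/null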

Is the server OS Linux? (self-hosted instances only)

If the OS is Windows then the upgrade is a lot more complicated and may not be possible at all, depending on what features are in use. There is no upgrade path to v16 if the following functionality is used:

  • DLS using Windows AD

  • trimpush repositories

  • Windows fileshares (filecopy repository) with DLS configured.

Is there any server-level configuration for this customer?

Check the global configuration folder $SEARCH_HOME/conf for the following files (a quick check is sketched after this list):

  • redirects.cfg

  • dns_aliases.cfg
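
A minimal check, assuming shell access and the standard $SEARCH_HOME layout:

  # List any server-level redirect or DNS alias configuration
  ls -l $SEARCH_HOME/conf/redirects.cfg $SEARCH_HOME/conf/dns_aliases.cfg 2>/dev/null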

Is there any multi-server configuration? (Squiz Cloud or on-premises only)

Check the global configuration for multi-server configuration (such as the query_processors keys).

Check to see if any of the collections are using multi-server mediator calls such as push-collection or pull-logs.

This configuration will need to be removed when moving into the DXP or multi-tenanted/dedicated hosting, which include their own standardized multi-server implementations.
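
A rough way to surface this configuration (a sketch; it assumes shell access and that the global configuration lives directly under $SEARCH_HOME/conf):

  # Look for multi-server keys in the global configuration
  grep -n 'query_processors' $SEARCH_HOME/conf/*.cfg 2>/dev/null
  # Look for mediator calls (e.g. push-collection, pull-logs) referenced from collections
  grep -n 'mediator' $SEARCH_HOME/conf/*/collection.cfg 2>/dev/null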

Are there non-standard cron jobs or scheduled tasks? (Squiz Cloud or on-premises only)

Check the crontab for any jobs that don’t relate to standard collection or analytics updates.

If there are other jobs configured, assess what these do and how they are used in the search solution.
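
A sketch of this check (it assumes updates are scheduled under a dedicated search user; the account name will vary, and you need sufficient privileges to read another user's crontab):

  # Review scheduled jobs for the search user and any system-wide cron entries
  crontab -l -u search
  ls /etc/cron.d/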

Is the server hosting any other non-standard services that are used by Funnelback?

Check for things like:

  • FTP/SFTP services that might be used to allow customers to upload content.

  • Things such as databases or websites also hosted on the server (this is never recommended for a Funnelback server, but you do see this sometimes).

If there are other services configured, assess what these do and how they are used in the search solution.

General checks

Check these when you start auditing a new customer. Note that some checks may apply multiple times for a customer (e.g. a customer may have multiple custom domain names associated with a single collection, or with different collections that they manage that correspond to different searches).

Does the search on the customer site use a custom domain name (like search.example.com)?

  • If yes, this will need to be transferred to the new environment.

Are the search results protected with a login? (if this is provided by Funnelback then a dedicated service will be required)

If yes, then you need to carefully check what type of authentication is being used.

  • If the customer search is wrapped in a CMS that handles the authentication then it should continue to work in the same manner in the DXP.

  • The DXP supports token-based authentication using the access restriction to search results plugin.

  • If HTTP basic authentication is being used via a servlet filter hook script then this will need to be updated to the plugin mentioned above, or it will require dedicated hosting.

  • If customer SAML is being used then the search will need to be hosted within the Funnelback dedicated hosting environment.

Is the search admin using customer SAML for logins?

If yes, the customer can only be hosted in the Funnelback dedicated hosting environment.

Collection-level checks

This is the main focus of the audit.

Non-standard configuration files

Check for non-standard configuration files in the collection’s conf folder, and for @groovy and @workflow folders that contain files that are called from the configuration.

Non-standard configuration files are probably used by workflow or by a Groovy filter, though they may also be old and unused. When checking an unknown configuration file, make sure you also look at old Funnelback documentation, as it might be a deprecated configuration file that no longer exists in v16.
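
One way to get an overview (a sketch; COLLECTION is a placeholder for the collection name, and suspect-file.cfg is a placeholder for the file you are investigating):

  # List everything in the collection's conf folder, including @groovy and @workflow
  ls -la $SEARCH_HOME/conf/COLLECTION/
  # Find where a suspect file is referenced from configuration, workflow or filters
  grep -rn 'suspect-file.cfg' $SEARCH_HOME/conf/COLLECTION/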

Check for non-profile folders

Profile folders are sub-folders in the collection's conf folder and are named PROFILE and PROFILE_preview. Any non-profile folders should be prefixed with an @ (you will commonly see @groovy and @workflow); the @ causes Funnelback to ignore the folder and not treat it as a profile. If you see a folder such as workflow (without the @) it can cause problems for the upgrade, and it should be removed before the upgrade is commenced.

Legacy best bets

Check for the existence of best_bets.cfg (at either the collection or profile level).

If this file exists, any entries will need to be re-entered as best bets on the upgraded collection. Note: a conversion script exists that can convert this, though you may need to track it down.
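
To locate any legacy best bets configuration (a sketch; COLLECTION is a placeholder for the collection name):

  # Find best_bets.cfg at the collection or profile level
  find $SEARCH_HOME/conf/COLLECTION/ -name best_bets.cfg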

collection.cfg checks

Open collection.cfg and check the file for the following, noting anything you find.
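
A quick way to pull out the keys covered below (a sketch; COLLECTION is a placeholder for the collection name):

  # Extract the settings most relevant to the upgrade assessment
  grep -E '^(collection_type|filter\.classes|filter\.jsoup\.classes|indexer_options|query_processor_options|pre_|post_)' \
    $SEARCH_HOME/conf/COLLECTION/collection.cfg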

collection_type

If local then this will need to be converted to a web or custom data source.

The decision will depend on what the collection is doing. Generally speaking most will be converted to a web collection, and this will apply to anything that just downloads a file (or maybe a few files). If it's a repository for which a plugin is available then you'll want to use that plugin (note: it's possible that the plugin might need extending if it doesn't support all the required functionality). If weird logic that the crawler can't support is required to get the files, then you'll need a plugin.

If custom then you will need to either convert the custom_gather.groovy to a plugin, or convert the collection to one of the following built-in types.

Most likely built-in types are:

  • facebook, youtube, twitter, flickr: early implementations of these used a custom collection

  • web: if the custom gatherer is just fetching some things via HTTP calls that can be set as start URLs or crawled, e.g. with a custom link extraction pattern.

If database, filecopy or directory then it's unlikely these will work when moved into the hosted environment. These are enterprise collection types that rely on direct access to the repositories, and this tends to be allowed only when the Funnelback server is on an internal customer network, as firewalls and similar restrictions will block access to these services.

In the DXP and Funnelback multi-tenanted/dedicated environments you can't install database drivers, so a database collection that doesn't use either the PostgreSQL or SQLite driver won't work in the hosted environment. It may be possible to get it working in Funnelback dedicated hosting, but you will need to negotiate with the hosting team.

For these collection types you will usually need to look at a strategy to rebuild these collections as either a web or custom data source that accesses the data via an exported feed or API, or by using Squiz Connect (if you are moving into the DXP).

If trim then you can’t move as TRIM only works within a Windows environment and has a lot of dependencies that must be configured within the customer’s network to facilitate the authentication.

If matrix then this will need to be converted to a web data source.

If slackpush then this can’t be migrated as it is not supported in v16.

workflow commands

These are any scripts or shell commands run from pre_* and post_* configuration keys. Functionality performed by these will need to be converted into plugins or built-in functionality, or, if it makes more sense, an extension to product functionality (such as a new curator trigger or action).

The hard part here is understanding what each workflow script is doing and then deciding how it might be converted into current functionality.

The only exceptions are:

  • Calls to post_update_copy.sh - this is a script used in the Funnelback hosted environment and can be removed when moving into the DXP.

  • Calls to mediator.pl - used for handling multi-server functionality. These will need to be removed when moving into the DXP or Funnelback multi-tenanted/dedicated hosting.

Configuration keys with out-of-range values, non-standard defaults or restricted access

The DXP, Funnelback multi-tenanted and dedicated environments have some restrictions applied to configuration keys.

These environments have some server-level overrides applied as part of the server management.

These environments prevent access to some configuration keys that can compromise the system security. This is stricter in the DXP with the dxp~developer role preventing access to a set of non-safe keys, and also imposing various range restrictions on the values that can be set.

These range restrictions are applied to certain keys that might allow the setting of dangerous values (things like memory heap settings, thread counts and timeouts).

custom configuration keys

This applies to the DXP only.

These are often prefixed with custom., but this isn't a rule. Custom keys with a ui.custom. prefix are allowed in the DXP, but these are really only for use with Freemarker templates in v16. Plugins use custom keys with a prefix matching the plugin's ID, and aside from templates there shouldn't be any other need for custom keys in v16. Keys with a stencils.* prefix are also supported in the DXP.
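
A rough check for keys using the common custom prefixes (a sketch; it won't catch custom keys that use other prefixes, so a manual read-through of collection.cfg is still needed; COLLECTION is a placeholder):

  # Surface keys using the common custom prefixes
  grep -En '^(custom\.|ui\.custom\.|stencils\.)' $SEARCH_HOME/conf/COLLECTION/collection.cfg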

filter.classes

Identify any custom filters (there is a list of built-in filters in the documentation). Anything outside this must be converted to other functionality or a plugin.

Filters that force types such as ForceXMLMime can usually be removed and are usually only required if the source data provides an incorrect MIME type. This should be fixed at the source where possible.

filter.jsoup.classes

Identify any custom filters (there is a list of built-in filters in the documentation). Anything outside this must be converted to other functionality or a plugin.

indexer_options

Check for any weird/non-standard options. Some things to look out for (not exhaustive; a before/after sketch follows this list):

  • -ifb is redundant and can be removed

  • -future_dates_ok is redundant and can be removed

  • -GSB is redundant and can be removed

  • Look out for very large -mdsfml values

  • Look out for large values of -CHAMB

  • Settings such as -forcexml and -utf8input can usually be removed, and are usually only required if the source data is malformed (e.g. sets an incorrect MIME type) or is not well formed (such as XML that is missing a declaration line). The source data should be fixed where possible.
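
As an illustration only (a hypothetical indexer_options line, not taken from a real collection), a cleanup might look like this:

  # Hypothetical before: includes the redundant flags called out above
  indexer_options=-ifb -future_dates_ok -GSB -forcexml
  # After review these can all go: -ifb, -future_dates_ok and -GSB are redundant, and
  # -forcexml should be removed once the source MIME types are corrected. If nothing
  # remains, the key itself can usually be removed.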

query_processor_options

Check the query processor options for any weird/non-standard options. Some things to look out for (not exhaustive):

  • -SF=[.*]: (or similar) is not best practice and should be converted to the set of metadata fields used by the search (see the sketch after this list). Using wildcards can cause the response packet to be very large and affect performance.

  • -DAAT=0, -service_volume=low: likely to cause problems with feature compatibility (this is an old mode of processing queries which is not compatible with a lot of newer features). Note: this mode is required if you use -events (another thing that should be noted if it is found).

  • -events: Old events mode. This feature is a bit experimental and doesn’t work well with other Funnelback functionality. If found you should look at options for converting this.

  • Look out for large -SBL/-MBL values.
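
For example, a wildcard summary-fields setting could be tightened to the fields the search actually uses (hypothetical option values; the -SF field names are placeholders for the metadata fields used by your templates and integrations):

  # Hypothetical before: a wildcard returns every metadata field and bloats the response
  query_processor_options=-stem=2 -SF=[.*]
  # Hypothetical after: only the fields actually consumed by the search are returned
  query_processor_options=-stem=2 -SF=[title,author,publishedDate]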

Custom filters

Custom filters can be found by looking at the filter.classes or filter.jsoup.classes keys in collection.cfg. If neither key is present then you can skip this step.

If present, they can include a mixture of built-in and custom filters.

Each custom filter will need to be checked and understood. You only need to upgrade filters that are listed in either the filter.classes or filter.jsoup.classes collection.cfg keys (you will sometimes find unused filters hanging around from old/discontinued functionality). Built-in filters do not need to be upgraded.

Filters provided by stencils are OK (access to these is provided by the legacy stencils plugin), but ideally these should be converted to either product functionality or plugins.

Check each filter to determine what functions are being performed. For each function you’ll need to upgrade to one of the following:

  • change to product-standard functionality or a built-in filter

  • change to a plugin

  • write (or extend) a plugin

  • Remove the filter and adjust the configuration of your collection to work without it

Many old implementations will use a custom metadata scraper filter - if the filter class is not a Jsoup MetadataScraper then a custom (probably early prototype) version is in use. These will need to be replaced with the built-in filter, but upgrading to use it should be fairly straightforward.

Hook scripts

You only need to worry about hook scripts on collections that you are querying directly. Hook scripts on non-meta collections that are being queried will need to be replaced with functionality on the search package/results page(s) that will be serving the equivalent search in the DXP.

Hook scripts are found in the collection's conf folder and are named hook_*.groovy. If these files don't exist, or are empty, then you can skip this step.

Check each hook script and note what functions each performs.

Each function will need to be upgraded to one of the following:

  • change to product-standard functionality (such as curator rules)

  • change to a plugin

  • write a plugin

  • Remove the hook script and adjust the configuration of your collection to work without it

Workflow commands/scripts

Workflow commands and scripts are found by looking at collection.cfg pre_* and post_* commands. These can be inline shell commands, or run a script. If these keys are not set then you can skip this step.

Check each workflow command/script and assess what functions are implemented. Each function will need to be upgraded to one of the following:

  • change to product-standard functionality (such as curator rules, or a web collection if the script just curls content)

  • change to a plugin

  • write a plugin

  • Remove the workflow script and adjust the configuration of your collection to work without it

Any workflow used to generate auto-completion (build_autoc commands, or Groovy/shell/Perl scripts that generate auto-completion) must be replaced with the auto-completion plugin. The build_autoc program cannot be called from the command line in v16, even in the non-DXP environments that have limited support for legacy functionality.

Calls to padre-fl, padre-qi or padre-gs can probably be removed (as these are now run automatically), but you should carefully check what each command is doing to ensure you don't need to make any other changes (e.g. when the built-in command runs, it expects the entries to be present inside the preset configuration file names, and some extended command line options are not supported). These commands are also sometimes used to apply gscopes or kills that are generated from pages in the index, and this will need to be converted to use the query-gscopes/kill/qie functionality supported in the DXP.

Calls to post_update_copy.sh can be ignored and removed - this is a Funnelback multi-tenanted/dedicated extension that is still used in the hosted environment, but is now applied outside of collection.cfg. Mediator-based multi-server commands also need to be removed.

Note down any workflow that is in use, what it does (roughly) and how it might be upgraded.

Custom curator triggers and actions

Check for custom curator triggers and actions.

These can be found by checking the curator.json file in each profile folder for triggers of type Groovy, or actions of type GroovyAction. A quick check is sketched below.
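
A rough way to find affected profiles (a sketch; it simply looks for the Groovy trigger/action types in each profile's curator.json, so confirm any matches by reading the file; COLLECTION is a placeholder):

  # Profiles whose curator rules reference Groovy triggers or actions
  grep -l 'Groovy' $SEARCH_HOME/conf/COLLECTION/*/curator.json 2>/dev/null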

If you find any custom curator triggers or actions you need to contact the R&D team to discuss options as these are not currently supported in the DXP (there is a task to extend the plugin framework to support these, but they are very rarely used, so it hasn’t been prioritized).

If they implement something that’s reusable the functionality should be replaced with a supported new curator trigger or action.

Groovy facet sort comparators

These can be found by checking the faceted_navigation.cfg for each profile for any facets that set a CUSTOM_COMPARATOR, as sketched below.
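
A quick check (a sketch; COLLECTION is a placeholder for the collection name):

  # Profiles with facets that set a custom comparator
  grep -l 'CUSTOM_COMPARATOR' $SEARCH_HOME/conf/COLLECTION/*/faceted_navigation.cfg 2>/dev/null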

If you find any custom sort comparators, assess what they are doing. It is likely that the existing facet-custom-sort plugin will provide the functionality that you require (it enables you to set a list of categories to be placed at the start or end of a sort). If this plugin doesn't provide the required functionality you will need to either extend the existing plugin or create a new plugin.

Figure out replacements for customizations

Once you have figured out what customizations are applied, remove the old Groovy code and replace it with a set of plugins. You might need to create some new plugins as part of this process, or request or submit improvements to existing ones. Please note that some customizations can also be replaced with built-in filters such as the metadata scraper or metadata normalizer filters.

When assessing custom functionality, break it down into atomic, reusable bits of functionality (so a single filter or hook script might be replaced with a set of plugins that can be mixed and matched and reused by other customers).

A customer-specific plugin should only be needed if the functionality is so custom that it will only ever apply to a single customer.

Common replacements:

Each entry below lists the current function, where it is implemented, and the recommended replacement.

  • Modifications to the result title - hook script (post-process or post-datafetch): use the clean-title plugin.

  • Modifications to the result title - custom filter: use the clean-title plugin.

  • Updates the result URL - hook script (post-datafetch): use the alter-live-url plugin.

  • Extracts some document content and adds it as metadata - custom filter: use the metadata-scraper built-in filter.

  • Clones or combines metadata into a new metadata field - custom filter: use the combine-metadata-filter plugin.

  • Adds some additional metadata based on matches to the URL - custom filter: use the add-metadata-to-url plugin.

  • Alters metadata/XML/JSON field values - custom filter: use the metadata-normalizer built-in filter.

  • Generates auto-completion - workflow commands (post_index_command): use the auto-completion plugin.

  • Enables empty/null queries - hook script (pre-process or pre-datafetch): use the query-set-default plugin.

  • Enables wildcard queries - hook script (pre-process or pre-datafetch): use the query-wildcard-support plugin.

  • Sets a default sort (including for specific tabs) - hook script (pre-process or pre-datafetch): use curator.

  • Generates QIE from the index - workflow commands (post_index_command or later workflow): use the built-in query-qie.cfg configuration file.

  • Generates gscopes from the index - workflow commands (post_index_command or later workflow): use the built-in query-gscopes.cfg configuration file.

  • Generates kill configuration from the index - workflow commands (post_index_command or later workflow): use the built-in query-kill.cfg configuration file.

  • Downloads auto-completion or query completion CSV - workflow commands (pre_gather_command, post_gather_command or pre_index_command): use the built-in configuration option.

  • Downloads files using a basic curl or wget command - workflow commands (pre_gather_command): use a web collection.

  • Downloads external metadata - workflow commands (pre_gather_command, post_gather_command or pre_index_command): use the external-metadata-fetcher plugin. If Matrix is generating the external metadata, use a paginated metadata setup using this process.

  • Sets input parameters - hook script (pre-process or pre-datafetch): use curator.

  • Validates external metadata - workflow commands (pre_gather_command, post_gather_command or pre_index_command): use built-in external metadata validation. Note: concatenate mode is not supported in v16; however it only applies when there are external metadata configuration files stored within profile folders of a collection. If there are no profile-level files then this can safely be removed.

  • Adds stop words to the data model - hook script (post-process): use the auto-completion plugin. Note: the hook script that adds stop words to the data model is part of the legacy auto-completion code.
