Assess an existing search for upgrade
This guide covers areas that you need to focus on when assessing an existing search for upgrade.
Server checks
Check these when beginning an audit on a new server:
Is any non-standard functionality set up to wrap Funnelback?
This check only applies to Squiz Cloud and on-premises instances.
Check for anything set at a server level or in the Funnelback global configuration that might be non-standard. This includes things like:
- Firewall configurations
- OpenResty or similar
Is OpenResty or similar used to perform any URL rewrites?
If OpenResty is being used to rewrite URLs, this will impact the ability to provide a like-for-like upgrade.
There is no facility for URL rewriting within the DXP, Funnelback multi-tenanted or dedicated hosted environments.
OpenResty is commonly used to map admin requests on HTTPS port 443 to the admin context in Jetty, and other requests to the HTTP public context in Jetty. This is not relevant in the DXP, Funnelback multi-tenanted or dedicated environments as the admin service has a dedicated administration URL.
Is the server OS Linux? (self-hosted instances only)
If the OS is Windows then the upgrade is a lot more complicated and may not be possible at all, depending on what features are in use. There is no upgrade path to v16 if the following functionality is used:
- DLS using Windows AD
- trimpush repositories
- Windows fileshares (filecopy repository) with DLS configured.
Is there any server-level configuration for this customer?
Check the global configuration folder ($SEARCH_HOME/conf) for the following files (see the sketch below):
- redirects.cfg
- dns_aliases.cfg
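A minimal check sketch, assuming shell access to the server and that $SEARCH_HOME points at the Funnelback install:

```bash
# List the global configuration folder and flag the two files of interest.
ls -la "$SEARCH_HOME/conf/"
for f in redirects.cfg dns_aliases.cfg; do
  [ -f "$SEARCH_HOME/conf/$f" ] && echo "Found global $f - review before upgrade"
done
```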
Is there any multi-server configuration? (Squiz Cloud or on-premises only)
Check the global configuration for multi-server settings (such as the query_processors key).
Check whether any of the collections use multi-server mediator calls such as push-collection or pull-logs.
This configuration will need to be removed when moving into the DXP or multi-tenanted/dedicated hosting, which include their own standardized multi-server implementations.
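A quick way to surface multi-server configuration, as a sketch that assumes the global configuration is $SEARCH_HOME/conf/global.cfg and that collection configurations live at $SEARCH_HOME/conf/&lt;collection&gt;/collection.cfg:

```bash
# Multi-server keys in the global configuration (file name is an assumption).
grep -Hn "query_processors" "$SEARCH_HOME/conf/global.cfg" 2>/dev/null
# Mediator-based multi-server calls referenced from collection configuration.
grep -RHn -E "mediator|push-collection|pull-logs" \
  --include="collection.cfg" "$SEARCH_HOME/conf/" 2>/dev/null
```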
Are there non-standard cron jobs or scheduled tasks? (Squiz Cloud or on-premises only)
Check the crontab for any jobs that don’t relate to standard collection or analytics updates.
If there are other jobs configured, assess what these do and how they are used in the search.
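For example, a sketch that assumes the Funnelback services run under a user named search (adjust to suit your environment):

```bash
# Dump the crontab for the Funnelback user and review anything that is not
# a standard collection update or analytics job.
crontab -l -u search
```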
Is the server hosting any other non-standard services that are used by Funnelback?
Check for things like:
- FTP/SFTP services that might be used to allow customers to upload content.
- Things such as databases or websites also hosted on the server (this is never recommended for a Funnelback server, but you do see this sometimes).
If there are other services configured, assess what these do and how they are used in the search solution.
General checks
Check these when you start auditing a new customer. Note that some items might apply multiple times for a customer (e.g. a customer may have multiple custom domain names associated with a single collection, or with different collections that they manage that correspond to different searches).
Does the search on the customer site use a custom domain name (like search.example.com)?
- If yes, this will need to be transferred to the new environment.
Are the search results protected with a login? (if this is provided by Funnelback then a dedicated service will be required)
If yes, then you need to carefully check what type of authentication is being used.
- If the customer search is wrapped in a CMS that handles the authentication then it should continue to work in the same manner in the DXP.
- The DXP supports token-based authentication using the access restriction to search results plugin.
- If HTTP basic authentication is being used via a servlet filter hook script then this will need to be updated to the plugin mentioned above, or it will require dedicated hosting.
- If customer SAML is being used then the search will need to be hosted within the Funnelback dedicated hosting environment.
Collection-level checks
This is the main focus when auditing.
Non-standard configuration files
Check for non-standard configuration files in the collection's conf folder, and for @groovy and @workflow folders that contain files that are called from the configuration.
Non-standard configuration files will probably be used by workflow or by a groovy filter. It is also possible that they are old and unused. When checking an unknown configuration file make sure you also look at old Funnelback documentation, as it might be a deprecated configuration file that doesn't exist in v16.
Check for non-profile folders
Profile folders are sub-folders in the collection's conf folder and are named PROFILE and PROFILE_preview. Any non-profile folders should be prefixed with an @ (you will commonly see @groovy and @workflow) - the @ causes Funnelback to ignore the folder and not treat it as a profile. If you see a folder like workflow in there it can cause problems for the upgrade, and these folders should be removed before the upgrade is commenced.
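A listing sketch, assuming collection configuration lives under $SEARCH_HOME/conf/&lt;collection&gt;/ and using a hypothetical collection id; anything returned that is not a known profile name needs review:

```bash
COLLECTION=example-collection   # hypothetical collection id
# Sub-folders of the collection's conf folder that are neither @-prefixed
# support folders nor *_preview profile folders - review anything that is not
# a known live profile name.
find "$SEARCH_HOME/conf/$COLLECTION" -mindepth 1 -maxdepth 1 -type d \
  ! -name '@*' ! -name '*_preview' -print
```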
Legacy best bets
Check for the existence of best_bets.cfg (either at the collection or profile level).
If this exists, any entries will need to be re-entered as best bets on the upgraded collection. Note: there is a conversion script floating around somewhere that can convert this.
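To locate any legacy best bets files across all collections and profiles, a sketch assuming the standard $SEARCH_HOME/conf layout:

```bash
# Search all collection and profile folders for legacy best bets files.
find "$SEARCH_HOME/conf" -name "best_bets.cfg" -print
```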
collection.cfg checks
Open collection.cfg and check the file for the following, noting anything you find.
collection_type
If local then this will need to be converted to a web or custom data source.
The decision will depend on what the collection is doing. Generally speaking most will be converted to a web collection, and this will apply to anything that just downloads a file (or maybe a few files). If it's a repository where there is a plugin available then you'll want to use that plugin (note - it's possible that the plugin might need extending if it doesn't support all the required functionality). If there is weird logic required to get the file that the crawler can't support, then you'd need a plugin.
If custom then you will need to either convert the custom_gather.groovy to a plugin, or convert the collection to one of the following built-in types.
Most likely built-in types are:
- facebook, youtube, twitter, flickr: early implementations of these used a custom collection
- web: if the custom gatherer is just fetching some things via HTTP calls that can be set as start URLs or crawled, e.g. with a custom link extraction pattern.
If database, filecopy or directory then it's unlikely these will work when moved into the hosted environment. These are enterprise collection types that rely on direct access to the repositories, and this tends to only be allowed when the Funnelback server is on an internal customer network as firewalling etc. will block access to these services.
In the DXP and Funnelback multi-tenanted/dedicated environments you can't install database drivers, so if you have a database collection that doesn't use either the postgres or sqlite driver then this won't work in the hosted environment. There may be a chance of getting it working in Funnelback dedicated, but you will need to negotiate with the hosting team.
For these collection types you will usually need to look at a strategy to rebuild these collections as either a web or custom data source that accesses the data via an exported feed or API, or by using Squiz Connect (if you are moving into the DXP).
If trim then this can't be moved, as TRIM only works within a Windows environment and has a lot of dependencies that must be configured within the customer's network to facilitate the authentication.
If matrix then this will need to be converted to a web data source.
If slackpush then this can't be migrated as it is not supported in v16.
workflow commands
These are any scripts/shell commands run from pre_* and post_* config keys. Functionality performed by these will need to be converted into plugins or built-in functionality, or, if it makes more sense, an extension to product functionality (like a new curator trigger or action).
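A grep sketch for locating these keys in a single collection's configuration (example-collection is a hypothetical id and the $SEARCH_HOME/conf/&lt;collection&gt;/ path is assumed):

```bash
COLLECTION=example-collection   # hypothetical collection id
# Any pre_* / post_* workflow keys set in the collection configuration.
grep -En "^(pre|post)_" "$SEARCH_HOME/conf/$COLLECTION/collection.cfg"
```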
The hard part here is understanding what each workflow script is doing and then deciding how it might be converted into current functionality.
The only exceptions are:
- Calls to post_update_copy.sh - this is a function that is used in the Funnelback hosted environment and can be removed when moved into the DXP.
- Calls to mediator.pl - for handling of multi-server functionality. This will need to be removed when moving into the DXP or Funnelback multi-tenanted/dedicated.
Configuration keys with out-of-range values, non-standard defaults or restricted access
The DXP, Funnelback multi-tenanted and dedicated environments have some restrictions applied to configuration keys.
These environments have some server-level overrides applied as part of the server management.
These environments prevent access to some configuration keys that can compromise the system security. This is stricter in the DXP with the dxp~developer role preventing access to a set of non-safe keys, and also imposing various range restrictions on the values that can be set.
These range restrictions are applied to certain keys that might allow the setting of dangerous values (things like memory heap settings, thread counts and timeouts).
custom configuration keys
This applies to the DXP only.
These are often prefixed with custom. but this isn't a rule. Custom keys with a ui.custom. prefix are allowed in the DXP, but these are really only for use with Freemarker templates in v16 - plugins use custom keys with a prefix matching the plugin's ID and there shouldn't be any other need for custom keys in v16 aside from in templates. Keys with a stencils.* prefix are also supported in the DXP.
filter.classes
Identify any custom filters (there is a list of built-in filters in the documentation). Anything outside this must be converted to other functionality or a plugin.
Filters that force types such as ForceXMLMime can usually be removed and are usually only required if the source data provides an incorrect MIME type. This should be fixed at the source where possible.
filter.jsoup.classes
Identify any custom filters (there is a list of built-in filters in the documentation). Anything outside this must be converted to other functionality or a plugin.
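A sketch for pulling out both configured filter chains so each class can be checked against the built-in filter list (example-collection is a hypothetical id and the conf path is assumed):

```bash
COLLECTION=example-collection   # hypothetical collection id
# Both filter chain keys; check each class that is not a documented built-in filter.
grep -En "^filter\.(jsoup\.)?classes" "$SEARCH_HOME/conf/$COLLECTION/collection.cfg"
```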
indexer_options
Check for any weird/non-standard options (see the sketch after this list). Some things to look out for (not exhaustive):
- -ifb is redundant and can be removed
- -future_dates_ok is redundant and can be removed
- -GSB is redundant and can be removed
- Look out for very large -mdsfml values
- Look out for large values of -CHAMB
- Settings such as -forcexml and -utf8input can usually be removed, and are usually only required if the source data is malformed (e.g. sets an incorrect MIME type) or is not well formed (such as XML that is missing a declaration line). The source data should be fixed where possible.
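A sketch that splits the configured indexer options onto separate lines for review (example-collection is a hypothetical id and the conf path is assumed):

```bash
COLLECTION=example-collection   # hypothetical collection id
# One option per line, taken from the indexer_options key.
grep -E "^indexer_options=" "$SEARCH_HOME/conf/$COLLECTION/collection.cfg" \
  | grep -oE -- '-[A-Za-z0-9_]+(=[^ ]*)?'
```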
query_processor_options
Check the query processor options for any weird/non-standard options (see the sketch after this list). Some things to look out for (not exhaustive):
- -SF=[.*] (or similar) is not best practice and should be converted to the set of metadata fields used by the search. Using wildcards can cause the response packet to be very large and affect performance.
- -DAAT=0, -service_volume=low: likely to cause problems with compatibility of features (this is an old mode of processing queries which is not compatible with a lot of newer features). Note: it's required if you use -events though (another thing that should be noted if it is found).
- -events: old events mode. This feature is a bit experimental and doesn't work well with other Funnelback functionality. If found you should look at options for converting this.
- Large SBL/MBL values.
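A similar sketch for the query processor options, flagging the ones called out above (example-collection is a hypothetical id and the conf path is assumed):

```bash
COLLECTION=example-collection   # hypothetical collection id
# Flag the options called out above for closer inspection.
grep -E "^query_processor_options=" "$SEARCH_HOME/conf/$COLLECTION/collection.cfg" \
  | grep -oE -- '-(SF|DAAT|service_volume|events|SBL|MBL)[^ ]*'
```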
Custom filters
Custom filters can be found by looking at the filter.classes or filter.jsoup.classes keys in collection.cfg. If neither key is present then you can skip this step.
If present, then they can include a mixture of built-in and custom filters.
Each custom filter will need to be checked and understood. You only need to upgrade filters that are listed in either the filter.classes or filter.jsoup.classes collection.cfg keys (you will sometimes find unused filters hanging around from old/discontinued functionality). Built-in filters do not need to be upgraded.
Filters provided by Stencils are OK (access to these is provided by the legacy stencils plugin), but ideally these should be converted to either product functionality or plugins.
Check each filter to determine what functions are being performed. For each function you’ll need to upgrade to one of the following:
- change to product-standard functionality or a built-in filter
- change to a plugin
- write (or extend) a plugin
- remove the filter and adjust the configuration of your collection to work without it
Many old implementations will use a custom metadata scraper filter - if the filter class is not a Jsoup MetadataScraper then a custom (probably early prototype) version is in use. These will need to be replaced with the built-in filter, but upgrading to use this should be fairly straightforward.
Hook scripts
You only need to worry about hook scripts that are on collections that you are querying directly. Hook scripts on non-meta collections that are being queried will need to be replaced with functionality on the search package/results page(s) that will be serving the equivalent search in the DXP.
Hook scripts are found in the collection's conf folder. The scripts are named hook_*.groovy. If these files don't exist, or are empty, then you can skip this step.
Check each hook script and note what functions each perform.
Each function will need to be upgraded to one of the following:
- change to product-standard functionality (such as curator rules)
- change to a plugin
- write a plugin
- remove the hook script and adjust the configuration of your collection to work without it
Workflow commands/scripts
Workflow commands and scripts are found by looking at the collection.cfg pre_* and post_* commands. These can be inline shell commands, or run a script. If these keys are not set then you can skip this step.
Check each workflow command/script and assess what functions are implemented. Each function will need to be upgraded to one of the following:
- change to product-standard functionality (such as curator rules, or a web collection if the script just curls content)
- change to a plugin
- write a plugin
- remove the workflow script and adjust the configuration of your collection to work without it
Any workflow used to generate auto-completion (build_autoc commands or groovy/shell/perl scripts that generate auto-completion) must be replaced with the auto-completion plugin. The build-autoc program cannot be called from the command line in v16, even in the non-DXP environments that have limited support for legacy functionality.
Calls to padre-fl, padre-qi or padre-gs can probably be removed (as these are now run automatically), but you should carefully check what the command is doing to ensure you don't need to make any other changes (e.g. when the built-in command runs, it expects the entries to be present inside the preset configuration file names, and some extended command line options are not supported). Also, these are sometimes used to apply gscopes or kills that are generated from pages in the index, and this will need to be converted to use the query-gscopes/kill/qie functionality supported in the DXP.
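A grep sketch for flagging the calls discussed above inside workflow scripts (assumes workflow scripts are kept in each collection's @workflow folder):

```bash
# The 'post.update.copy' pattern matches both hyphen and underscore spellings.
grep -RHn -E "padre-fl|padre-qi|padre-gs|build[-_]autoc|mediator|post.update.copy" \
  "$SEARCH_HOME"/conf/*/@workflow/ 2>/dev/null
```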
post-update-copy.sh commands can be ignored and removed - this is a Funnelback multi-tenanted/dedicated extension in the older hosted environment that is still used, but is now applied outside of collection.cfg. Mediator-based multi-server commands also need to be removed.
Note down any workflow that is in use, what it does (roughly) and how it might be upgraded.
Custom curator triggers and actions
Check for custom curator triggers and actions
These can be found by checking the curator.json file in each profile folder for triggers of type: Groovy, or actions of type: GroovyAction.
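A sketch that lists curator files mentioning Groovy rules; the exact JSON layout can vary between versions, so treat any match as a candidate for manual review:

```bash
# Profiles whose curator.json mentions Groovy triggers or actions.
grep -Rl "Groovy" --include="curator.json" "$SEARCH_HOME/conf/" 2>/dev/null
```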
If you find any custom curator triggers or actions you need to contact the R&D team to discuss options as these are not currently supported in the DXP (there is a task to extend the plugin framework to support these, but they are very rarely used, so it hasn’t been prioritized).
If they implement something that's reusable, the functionality should be replaced with a supported new curator trigger or action.
Groovy facet sort comparators
These can be found by checking the faceted_navigation.cfg for each profile for any facets that set a CUSTOM_COMPARATOR.
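A grep sketch across all profiles (assumes faceted_navigation.cfg sits inside each profile folder under $SEARCH_HOME/conf):

```bash
# Facet configurations across all profiles that set a custom comparator.
grep -RHn "CUSTOM_COMPARATOR" --include="faceted_navigation.cfg" "$SEARCH_HOME/conf/" 2>/dev/null
```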
If you find any custom sort comparators, assess what they are doing. It is likely that the existing facet-custom-sort plugin will provide the functionality that you require (it enables you to set a list of categories to be placed at the start or end of a sort). If this plugin doesn't provide the required functionality you will either need to extend the existing plugin, or create a new plugin.
Figure out replacements for customizations
Once you have figured out what customizations are applied, remove the old groovy code and replace it with a set of plugins. You might need to create some new plugins as part of this process, or request or submit improvements to existing ones. Please note that some customizations can also be replaced with built-in filters such as the metadata scraper or metadata normalizer filters.
When assessing custom functionality break it down into atomic reusable bits of functionality (so a single filter or hook script might be replaced with a set of plugins that are mixed and matched and reusable by other customers).
Only require a customer-specific plugin if the functionality is so custom that it will only ever apply to a single customer.
Common replacements:
| Current function | Location | Replacement |
|---|---|---|
| Modifications to the result title | hook script (post-process or post-datafetch) | Use the clean-title plugin |
| Modifications to the result title | custom filter | Use the clean-title plugin |
| Updates the result URL | hook script (post-datafetch) | Use the alter-live-url plugin |
| Extracts some document content and adds it as metadata | custom filter | Use the metadata-scraper built-in filter |
| Clones or combines metadata into a new metadata field | custom filter | Use the combine-metadata-filter plugin |
| Adds some additional metadata based on matches to the URL | custom filter | Use the add-metadata-to-url plugin |
| Alters metadata/XML/JSON field values | custom filter | Use the metadata-normalizer built-in filter |
| Generates auto-completion | workflow commands (post_index_command) | Use the auto-completion plugin |
| Enables empty/null queries | hook script (pre-process or pre-datafetch) | Use the query-set-default plugin |
| Enables wildcard queries | hook script (pre-process or pre-datafetch) | Use the query-wildcard-support plugin |
| Sets a default sort (including for specific tabs) | hook script (pre-process or pre-datafetch) | Use curator |
| Generates QIE from the index | workflow commands (post_index_command or later workflow) | Use the built-in query-qie.cfg configuration file |
| Generates gscopes from the index | workflow commands (post_index_command or later workflow) | Use the built-in query-gscopes.cfg configuration file |
| Generates kill configuration from the index | workflow commands (post_index_command or later workflow) | Use the built-in query-kill.cfg configuration file |
| Downloads auto-completion or query completion CSV | workflow commands (pre_gather_command, post_gather_command or pre_index_command) | Use the built-in configuration option |
| Downloads files using basic curl or wget command | workflow commands (pre_gather_command) | Use a web collection |
| Downloads external metadata | workflow commands (pre_gather_command, post_gather_command or pre_index_command) | Use the external-metadata-fetcher plugin. If Matrix is generating the external metadata use paginated metadata setup using this process. |
| Sets input parameters | hook scripts (pre-process or pre-datafetch) | Use curator |
| Validates external metadata. Note: concatenate mode is not supported in v16, however it only applies when there are external metadata configuration files stored within profile folders of a collection. If there are no profile-level files then this can safely be removed. | workflow commands (pre_gather_command, post_gather_command or pre_index_command) | |
| Adds stop words to the data model. Note: the hook script to add stop words to the data model is part of the legacy auto-completion code. | hook scripts (post-process) | Use the auto-completion plugin |
| Normalize. Note: the hook script to add stop words to the data model is part of the legacy auto-completion code. | hook scripts (post-process) | Use the auto-completion plugin |