Funnelback 15.14.0 release notes

Released: 22 March 2018

Supported until: 22 March 2021 (Long Term Support Version)

15.14.0 - New features

  • Administration of metadata mapping and XML indexing has been overhauled.

    • Adds more structure when defining metadata classes and sources for HTML and XML.

    • Provides suggestion of common metadata sources within the target index.

    • Adds clear inline help when configuring XML indexing options.

  • Underlying APIs for metadata mapping and XML indexing have been introduced.

  • Replaced web crawler HTTP processing, adding support for a range of modern HTTP protocol options, including gzip compression.

15.14.0 - Selected improvements and bug fixes

  • Accessibility Auditor now fetches documents for auditing via the same HTTP mechanism as the web crawler, allowing for form interaction, SSL and proxy settings to be applied.

  • Display of metadata fields in search results no longer requires the 'both' summary mode. Requested 'SF' summary field metadata is now returned regardless of the 'SM' summary mode.

  • Allow for content type and additional headers to be set per profile.

  • Fixed a number of errors in calculations for date facets.

  • Reduced memory usage in a number of components (e.g. metaspace when groovy scripts are reloaded, large padre packet compression in-memory).

  • Removed the accessibility auditor G4 check, which produced unhelpful noise on all documents.

  • Introduced a configurable limit on user-requested num_ranks - ui.modern.external_num_ranks_limit.

  • Unauthenticated users can now be blocked from preview profiles using the restrict_preview_to_authenticated_users option.

  • Facets are now accessible by name in Groovy and Freemarker templates by calling the getFacetByName method on the SearchResponse object, for example transaction.response.getFacetByName("facet-name").

  • Social media custom templates are now configured via collection.cfg options, and have new metadata mappings.

  • The live URLs of documents from database and directory collections no longer refer to a legacy cgi script. The document’s cache URL is now used.

  • Tracking of user click actions no longer relies on HTTP referrer headers.

  • Fixed problems in handling of default ports when canonicalizing URLs.

  • Mediator PullLogs command now transfers sub-directories of logs recursively.

  • Removed some cases where XML documents were incorrectly treated as HTML.

  • Added support for Server Name Indication (SNI) when fetching SAML metadata.

  • Improved SAML configuration to allow independent configuration between admin and search.

  • Improved document title fixer to eliminate some additional undesirable titles.

  • Introduced post_collection_create_command setting to simplify initial collection creation in multi-server environments.

  • Introduced debugging API call for investigating crawler errors in form interaction.

  • Improved performance of query processing when very large numbers of metadata mappings are used.

  • Introduced an experimental option to prevent concurrent in-crawl form interaction - crawler.allow_concurrent_in_crawl_form_interaction.

  • Upgraded Freemarker library to 2.3.27 (from 2.3.25) which provides some new template syntax - See http://freemarker.org/docs/versions_2_3_27.html.

  • Upgraded to Tika 1.17 (from 1.10) which provides improved file filtering support - See https://github.com/apache/tika/blob/1.17/CHANGES.txt.

  • The JSONToXML filter no longer treats the JSON key content as a special value.
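To illustrate the default-port canonicalization item in the list above, here is a minimal Python sketch of the general technique. It is not Funnelback's implementation: the function name `strip_default_port` and the port table are assumptions made for this example. The idea is that a port which matches the scheme's default is dropped, so both spellings of a URL canonicalize to the same string.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports for the two schemes a web crawler normally handles.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url):
    """Return url with an explicit default port removed (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is returned lower-cased
    if ":" in host:
        host = f"[{host}]"  # restore brackets around IPv6 literals
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{port}"  # keep ports that are not the scheme default
    # Note: any user:password@ prefix is dropped in this simplified sketch.
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

With this sketch, `http://Example.com:80/a?b=1` and `http://example.com/a?b=1` map to the same canonical form, while a non-default port such as `:8443` is preserved.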

15.14.0 - Configuration Upgrade Steps

The following changes will be automatically performed on all configurations during the upgrade process. Configurations migrated from older versions after the upgrade will need to have update-configs.pl manually run to apply these changes.

  • Users with access to metamap.cfg and xml.cfg in Funnelback’s file manager will be granted the new sec.metadata-mapping and sec.xml-index permissions.

  • Content Auditor metadata configuration will be set in collections containing an existing metamap.cfg file, to avoid them inheriting the new default configuration based on long metadata names.

  • metamap.cfg files will be migrated to the new metadata-mapping.cfg format.

  • xml.cfg files will be migrated to the new metadata-mapping.cfg and xml-index.cfg formats.

  • Invalid lines in profile.cfg files (possibly from padre-arg-sw) will be removed.

15.14.0 - Upgrade Issues

  • Web crawl NTLM authentication no longer uses the same config settings as HTTP basic authentication. Web collections relying on NTLM authentication must be reconfigured to use the new crawler.ntlm.domain, crawler.ntlm.username and crawler.ntlm.password configuration settings.

  • Since the default summary field metadata value (i.e. the -SF query processor option’s default of [a,c,p,s]) is now visible without -SM being set, search frontends which set no SF value will return metadata for the a, c, p and s metadata classes.

  • The search interface no longer allows unnecessary activation of J2EE sessions and ui.modern.session.set_userid_cookie is no longer used - Cookies are automatically used as required if search session tracking is enabled.

  • Crawler JavaScript link extraction is now off by default. Any collections relying on the old 'on' default must have it set explicitly during upgrade.

  • Support for instant-update style feeds (handle-feed.cgi) has been removed - Push collections are recommended as a replacement.

  • The workflow publish hook is no longer called with both the preview and live profile files. Instead, it is always called with a single file path each time it runs. The hook now runs each time a profile file is edited in the classic administration dashboard and each time any config file is edited via the admin API, which includes the marketing UI and implementer UI.

  • The JSONToXML filter has been updated to make a better attempt at producing valid XML. This can result in XML element names being modified, as well as some characters being stripped from the content.
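The kind of change the JSONToXML item above describes can be sketched in Python as follows. The actual rules used by the filter are not given in these notes, so both helpers (`to_xml_name`, `strip_invalid_xml_chars`) and their replacement rules are assumptions for illustration only. The sketch restricts names to ASCII for simplicity, although XML also permits many non-ASCII name characters.

```python
import re


def to_xml_name(key):
    """Turn an arbitrary JSON key into a valid (ASCII-only) XML element name.

    Hypothetical rule: unsupported characters become '_' and a leading
    character that cannot start a name is prefixed with '_'.
    """
    name = re.sub(r"[^A-Za-z0-9_.-]", "_", key)
    if not re.match(r"[A-Za-z_]", name):
        name = "_" + name
    return name


def strip_invalid_xml_chars(text):
    """Remove characters that are not allowed in XML 1.0 documents."""
    return re.sub(
        r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]", "", text
    )
```

Under these assumed rules a JSON key such as `2nd item` would become the element name `_2nd_item`, so templates or metadata mappings that refer to the old element name would need checking after the upgrade.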

15.14.0 - Upcoming changes

  • A future version of Funnelback will remove the $SEARCH_HOME/lib/java/all directory in favour of using a new layout. If you are currently accessing this directory with workflow scripts or similar, you are encouraged to transition away from doing so, and to contact us to discuss any cases where transitioning is technically difficult.

Patches

Type Release version Description

3 Bug fixes

Upgrades log4j2 to version 2.16 to fix the security vulnerability where log4j2 JNDI features do not protect against attacker-controlled LDAP and other JNDI related endpoints.

3 Bug fixes

Fixes an issue where sessions are not terminated on logout events triggered by perl pages.

3 Bug fixes

Fixes an issue where deleting the collection property from collection.cfg and then deleting the collection itself would delete other configuration data.

3 Bug fixes

Removes the screens for file-manager rule editing which could create security issues.

3 Bug fixes

Fixes an issue where support packages could contain unintended files.

3 Bug fixes

Fixes an issue where the running Funnelback jetty web server could retain permissions via supplemental groups after startup.

3 Bug fixes

Limits an administration CGI script to redirect only within the Funnelback administration interface as intended.

3 Bug fixes

Removes the unused administration debug.cgi script which reflected input parameters without proper escaping.

3 Bug fixes

Improves support for running faceted navigation on extra searches.

3 Bug fixes

Adds method 'getEffectiveExtraSearchName()' to the search transaction which gets the name of the extra search this search should be considered to be under. The result of this should be used when modifying a particular extra search. As Funnelback may create extra searches under an existing search, for example for faceted navigation, this could be used to work out if the search transaction should be modified.

3 Bug fixes

Prevent XSS AngularJS sandbox bypassing injection in Freemarker templates escaped using output formats by inserting zero-width whitespace between consecutive open-curly-brackets.

3 Bug fixes

Prevent XSS AngularJS sandbox bypassing injection in Freemarker templates by inserting zero-width whitespace between consecutive open-curly-brackets.

3 Bug fixes

Prevent XSS AngularJS sandbox bypassing injection in Freemarker templates by inserting zero-width whitespace between consecutive open-curly-brackets.
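The mitigation described in the entries above can be sketched in Python. This is an illustration of the technique only, not the code used in Funnelback's Freemarker integration. The function name is hypothetical, and the choice of U+200B (zero-width space) as the inserted character is an assumption, since the notes only say "zero-width whitespace".

```python
import re

# Assumption: U+200B is the zero-width character being inserted.
ZERO_WIDTH_SPACE = "\u200b"


def break_consecutive_open_braces(text):
    """Insert a zero-width space after every '{' that is directly followed by '{'.

    AngularJS only treats '{{' as the start of an expression, so separating
    the brackets stops user-supplied text from being evaluated.
    """
    return re.sub(r"\{(?=\{)", "{" + ZERO_WIDTH_SPACE, text)
```

The lookahead matters: it leaves the second bracket unconsumed, so runs of three or more open brackets are separated at every position rather than only at the first pair.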

3 Bug fixes

Improve the performance of the Accessibility Auditor interface by requesting less data.

3 Bug fixes

Fixes an issue where some of the text on the Accessibility Auditor dashboard was showing out of date information.

3 Bug fixes

Fixes an issue where the Accessibility Auditor dashboard would not generate the thumbnail screenshots for each domain.

3 Bug fixes

Improves the query response time when sorting.

3 Bug fixes

Fixes an issue where large (>2GB) index.dt files would cause padre-gs to fail when setting gscopes.

3 Bug fixes

Fixes an issue where jetty would terminate on invalid 'index.autoc' (query completion) files.

3 Bug fixes

Fixes an issue where recording Accessibility Auditor details would fail during the swap views step if the server is in read-only mode.

3 Bug fixes

Fixes an issue where swap-views.pl did not clear the redis state before running the pipeline.

3 Bug fixes

Fixes a bug where query processing may fail when given dates in the query. Failure would always occur when the date was invalid.

3 Bug fixes

Improves the Accessibility Auditor historical data storage. The data takes less space and is significantly faster to store and retrieve. The Accessibility Auditor historical data APIs are also improved to use less memory, reducing the chance of 'OutOfMemoryError' exceptions being thrown. The Accessibility Auditor historical data will be automatically moved to the new storage format when Jetty is restarted (one collection at a time) or on the first Accessibility Auditor historical data API request.

3 Bug fixes

The default timeout for 'push.scheduler.delay-between-meta-dependencies-runs' has been increased to '1200' (20 minutes). This has been increased to reduce the frequency at which Accessibility Auditor historical data is recorded. This option will need to be overridden if meta collections containing push collections need a smaller delay in updating the spelling index and auto completion.

3 Bug fixes

Fixes a bug where the query processor may seg fault under some zero result cases, often seen when faceted navigation is used.

3 Bug fixes

Fixes a bug where a 'NullPointerException' may be thrown when a faceted navigation extra search fails or does not run.

3 Bug fixes

Fixes a bug where 'facetScope' URL parameter may not be correctly decoded.

3 Bug fixes

Fixes a bug where queries may not return when instant updates include URLs that contain ampersands.

3 Bug fixes

Prevents creation of objects within Freemarker template files to ensure that template editors can not cause external code to be executed.

3 Bug fixes

Fixes a bug where 'FineTune' may crash when 'query_processor_options' is longer than '1000' bytes.

3 Bug fixes

Push slaves will now actively pull down merged/vacuumed generations, rather than waiting for commits to trigger this. This can help solve problems where the slaves do not reduce the number of generations, or where re-indexes are not pulled down by the slaves.

3 Bug fixes

Fixes a bug where users that did not have access to all collections would not be given access to the collection they just created.

3 Bug fixes

Ensure the legacy jquery.funnelback-completion.js is included with Funnelback for backwards compatibility. When performing an upgrade, it is still recommended that you upgrade to concierge. See here for more information: https://docs.funnelback.com/15.14/customise/standard-options/auto-completion/upgrading-to-concierge.html

3 Bug fixes

Fixes security issues where:

  • The default form-not-found template reflected the given form id without proper escaping.

  • The default configuration of URL previewing could be used to expose local log file content.

Please ensure any custom form-not-found.ftl templates in collections are updated to perform correct escaping if they were derived from the previously vulnerable form-not-found.default.ftl.

Please ensure that any customised value for the global default_url_renderer.permitted_url_pattern setting in global.cfg prevents access to file:// URLs.

3 Bug fixes

Fixes a bug where the Content Auditor display-metadata config key ui.modern.content-auditor.display-metadata.<metadata class> was not being read correctly.

3 Bug fixes

Fixes a bug where the crawler would not correctly decode links in HTML, XML and plain text documents.

3 Bug fixes

Improves the performance of the directory gatherer.

3 Bug fixes

Fixes support for sort mode '3' in query completion, allowing 'alpha' to be respected.

3 Bug fixes

The parent_group Facebook events field has been removed since it requires escalated permissions. On some Facebook collections, this caused crawling of events to fail.

3 Bug fixes

Provides additional metadata for Twitter records specifying whether a tweet is a reply and what it is a reply to. This is made available in the XML under 'isReply', 'inReplyToScreenName', 'inReplyToStatusId', 'inReplyToUserId' and 'inReplyToUrl'.

3 Bug fixes

Adds an option to padre-iw to allow control of how the lock string should be modified. The option is "-lock_string_mod_mode=[legacy|raw]". By default it is set to "legacy", which keeps the current behaviour of modifying the lock string. The "raw" option results in padre-iw not altering the lock string, leaving in white space, new lines and commas.

3 Bug fixes

Adds an experimental DLS plugin called "secBoolExpr" which is able to evaluate lockstrings which are boolean expressions. For example, a lockstring where the user must have either A, or both B and C, can be represented as A,(B AND C) or A | (B&C). The lockstring is evaluated as a boolean expression in which the user's keys are the values that are true. For example, a user with keys "B" and "C" would be able to view that document. To enable the plugin, set "security.earlybinding.locks-keys-matcher.name=secBoolExpr" in collection.cfg and set "-lock_string_mod_mode=raw" on the indexer.
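The evaluation described above can be sketched as a small recursive-descent parser in Python. This is not the secBoolExpr plugin's code. The grammar is inferred from the two example lockstrings, and the function name, the exact token set (',', '|' and 'OR' for OR; '&' and 'AND' for AND) and the precedence (AND binds tighter than OR) are all assumptions.

```python
import re

# Assumed tokens: parentheses, OR operators (',', '|', 'OR'),
# AND operators ('&', 'AND'); anything else is treated as a key name.
TOKEN_RE = re.compile(r"\(|\)|,|\||&|[^\s(),|&]+")
OR_TOKENS = {",", "|", "OR"}
AND_TOKENS = {"&", "AND"}
RESERVED = OR_TOKENS | AND_TOKENS | {"(", ")"}


def user_may_view(lock_string, user_keys):
    """Evaluate lock_string as a boolean expression; held keys are true."""
    tokens = TOKEN_RE.findall(lock_string)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of lock string")
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() in OR_TOKENS:
            take()
            right = parse_and()  # always parse the operand, then combine
            value = value or right
        return value

    def parse_and():
        value = parse_atom()
        while peek() in AND_TOKENS:
            take()
            right = parse_atom()
            value = value and right
        return value

    def parse_atom():
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if token in RESERVED:
            raise ValueError(f"unexpected token {token!r}")
        return token in user_keys

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result
```

For example, `user_may_view("A,(B AND C)", {"B", "C"})` grants access, while a user holding only "B" is refused. A real matcher would also need a rule for keys that happen to be named "AND" or "OR", which this sketch treats as operators.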

3 Bug fixes

Upgrades the version of our internal libraries to account for recent breaking changes in the Facebook Graph API. This will fix issues that caused Facebook collections to fail to update on certain user accounts, when crawling more than 200 posts in an hour, and when crawling events posted by a page. To update existing Facebook collections that may be failing, the changes noted in deployment instructions below will need to be made on each groovy script.

3 Bug fixes

Fixes an issue with SAML authentication where users who did not have access to the funnelback_documentation collection were unable to log out.

3 Bug fixes

Fixes an issue with SAML authentication where some browsers would fail to redirect users completely after login.

3 Bug fixes

Stops the same sitemap.xml file from being processed multiple times on different threads.

3 Bug fixes

Fixes an issue where incremental updates would not copy forward documents.

3 Bug fixes

Fixes an issue where the admin API crawler debug utilities did not respect the 'crawler.form_interaction_file' collection.cfg option. It always assumed a default value of 'form_interaction.cfg'.

3 Bug fixes

Fixes an issue where the crawler did not respect the 'crawler.form_interaction_file' collection.cfg config option. It always assumed a default value of 'form_interaction.cfg'.

3 Bug fixes

Improves the internal re-filter program.

3 Bug fixes

Fixes an issue where configuring a site profile without http basic credentials would cause empty username/password authorization headers to be sent by the web crawler.

3 Bug fixes

Removes a complexity limit in contextual navigation which applied to the entire query rather than only to the terms relevant for contextual navigation. This allows contextual navigation to work on complex queries, such as those generated by facets.

3 Bug fixes

Fixes an issue where the web crawler parser would time out when parsing large (10MB+) HTML pages.

3 Bug fixes

Fixes an issue where two jars provided conflicting classes with the same package name; this results in errors if a filter or groovy script used the 'org.json' package. This patch fixes the issue by forcing the use of 'json-20160810.jar' (org.json:json), which provides everything 'json-1.8.jar' (com.tdunning:json) did and more. See https://issues.apache.org/jira/browse/TIKA-2556 for further details about the change.

3 Bug fixes

Updates the search sessions click history to no longer record all metadata into the DB. Search sessions will only record the metadata classes listed in profile.cfg option 'ui.modern.session.search_history.metadata'. By default this is empty, but can be set with a comma separated list of wanted metadata classes for example:

ui.modern.session.search_history.metadata=a,b,c
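As a rough illustration of the effect of this setting, the following Python sketch filters a result's metadata down to the configured classes before it would be stored. The function names and the dictionary shape of the metadata are assumptions made for the example and do not reflect Funnelback's internal data model.

```python
def parse_metadata_classes(setting):
    """Split a comma-separated setting value into metadata class names."""
    return [name.strip() for name in setting.split(",") if name.strip()]


def metadata_to_record(result_metadata, setting):
    """Keep only the metadata classes listed in the setting (hypothetical helper)."""
    wanted = set(parse_metadata_classes(setting))
    return {name: value for name, value in result_metadata.items() if name in wanted}
```

With the example value `a,b,c`, a result carrying classes a, c and d would have only a and c recorded. With the default empty value, no metadata is recorded.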

3 Bug fixes

Provides a collection.cfg setting (crawler.send-http-basic-credentials-without-challenge) to restore older web-crawler behaviour of sending HTTP Basic authentication details without waiting for a 401 challenge from the web server.

By default, this patch restores the old behaviour.
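The difference between the two behaviours can be sketched in Python. This only illustrates challenge-based versus pre-emptive HTTP Basic authentication in general. The function names and the boolean parameters are hypothetical and are not the crawler's API.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def request_headers(username, password, send_without_challenge, challenged=False):
    """Return the headers for a request.

    send_without_challenge stands in for the setting described above;
    challenged is True once the server has answered with a 401 challenge.
    """
    headers = {}
    if send_without_challenge or challenged:
        headers["Authorization"] = basic_auth_header(username, password)
    return headers
```

With `send_without_challenge` off, the first request carries no credentials and they are added only after a 401 response. With it on, credentials are sent on the first request, which suits servers that never issue a challenge.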

3 Bug fixes

Fixes a bug where the ratio used to choose between full and incremental updates was not being applied, so only full updates were triggered.

3 Bug fixes

Fixes a bug for scheduled updates where the 'schedule.incremental_crawl_ratio' parameter was not being respected.

3 Bug fixes

Fixes a bug in Accessibility Auditor which caused the document audit view to fail when a document contained escaped or unicode characters in its class names.

3 Bug fixes

Makes the web crawler pre-crawl form interaction accept login forms which are served with an HTTP 401 or 403 code.

3 Bug fixes

Fixes a potential indexer crash introduced in 15.14.0.1, and some additional cases where multiple dots could be shown in summaries.

3 Bug fixes

Fixes query biased summaries so that they no longer show multiple dots when the original content contains non-breaking spaces as the only value within "p" tags.