Search design and optimization

General principles

Avoid customization

Customization beyond the result templates should be avoided where possible, as it can affect the ability to upgrade easily and can have other unintended side effects on the correct functioning of the search.

Where possible avoid solutions that:

  • require custom filters, especially if there is a dependency on the structure of the markup of the content.

  • require manipulation of the query or response data via hook scripts.

  • require customization of the faceted navigation or query completion behaviour.

Avoiding customization has the following benefits:

  • Implementation time is reduced, so it is cheaper for the customer to make use of standard functionality.

  • Upgrades are simplified as standard functionality will be upgraded automatically by Funnelback when an upgrade occurs. This reduces the cost of upgrading to a newer version of Funnelback.

  • Bugs within standard functionality can be reported to the developers. Bugs in custom functionality require custom development work to fix and are not covered by support.

Exclude unnecessary content

Use whatever processes are available to prevent unnecessary content from ever entering the search index.

This includes:

  • For web collections: using robots.txt, robots meta tags and Funnelback noindex tags.

  • Using the gatherer’s include/exclude mechanisms.

  • Gathering content as a user that only has access to the relevant documents.
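As a rough illustration of the web-layer options above, exclusion might look like the following. The URL patterns are hypothetical, and the collection.cfg key names (`include_patterns`, `exclude_patterns`) should be verified against the documentation for your Funnelback version:

```text
# robots.txt - keep crawlers (including Funnelback) out of whole site areas
User-agent: *
Disallow: /staging/
Disallow: /internal/

# collection.cfg - gatherer include/exclude patterns (hypothetical values)
include_patterns=https://www.example.com/
exclude_patterns=/print/,/search/,.pdf
```

Content excluded at this layer never enters the index at all, so no downstream cleaning is needed.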

Data cleansing

Where possible clean any data at the source. This includes:

  • Adding noindex directives around any header/footer includes

  • Adding/extracting metadata

  • Transformation of database fields
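A sketch of what source-side cleansing can look like in a page template, assuming the standard Funnelback noindex comment syntax (`<!--noindex-->` / `<!--endnoindex-->`); verify the exact syntax and the metadata names against your configuration, as the meta fields here are illustrative:

```html
<head>
  <!-- Metadata added at the source so it can be mapped to metadata classes -->
  <meta name="author" content="Jane Citizen">
  <meta name="dc.subject" content="admissions">
</head>
<body>
  <!--noindex-->
  <nav><!-- header navigation include: excluded from the index --></nav>
  <!--endnoindex-->

  <main>Page content that should be indexed.</main>

  <!--noindex-->
  <footer><!-- footer include: excluded from the index --></footer>
  <!--endnoindex-->
</body>
```

Because the directives live in the shared header/footer includes, every page that uses them is cleaned in one place.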

Should a filter or a results page plugin be used?

Data cleansing efforts should be applied as close to the source as possible. The order of priority for cleaning should be:

  • Source: Can you arrange for the data to be as close as possible to the expected format? Can you gather only what is needed (include / exclude patterns, noindex tags)?

  • Data source plugin: changes the content before it is indexed.

  • Results page plugin: changes the content in the data model as it is returned from the index.

  • Server-side template (Freemarker): changes the content as it is read from the data model as it is being displayed to the end user.

  • Client-side scripting (JavaScript): changes the content within the user’s browser.

Cleaning the data closer to the source has a number of benefits:

  • It is easier to understand what is going on, because there are fewer downstream places to inspect. For example, having JavaScript code correct something in the data for display requires an implementer to inspect the JavaScript, then the Freemarker template, then the hook scripts, then the filters and finally the source data to understand what the JavaScript is doing. It is also confusing for content owners, as the cleaning is often not visible to them: what is in their source content is not reflected in what they see in the search results.

  • The clean has a wider effect. For example, cleaning code in the Freemarker template does not affect the JSON and XML output, and index-related functionality such as sorting is not affected by the change. Cleaning done in a hook script does not affect the cached copy of the document. Cleaning at the source affects everything (including other search engines that might index the content).

    Example: cleaning a title. If you change the start of a title in a results page plugin, sorting by title might give strange results, because sorting is applied by the query processor before the results are returned to the UI.

  • Preventing unwanted or uncleaned data from entering the index will improve ranking quality as there is less noise in the index.
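To make the title example concrete, here is a minimal sketch of display-layer cleaning in a Freemarker template. The data model paths follow the usual Funnelback response structure, but the "DRAFT: " prefix is purely illustrative. Because the prefix is removed only at render time, title sorting, the JSON/XML output and the cached copy all still see the original value:

```ftl
<#-- Hypothetical fragment of a Funnelback results page template. -->
<#list response.resultPacket.results as result>
  <#-- remove_beginning strips the prefix for display only; the index is unchanged -->
  <h3>${result.title?remove_beginning("DRAFT: ")}</h3>
</#list>
```

If the prefix should never be seen anywhere, the cleaning belongs at the source or in a data source plugin, per the priority order above.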