Tutorial: Designing a website search

This tutorial looks at what is involved in designing an effective website search solution and provides a methodology for gathering requirements and translating these into a functional design.

There are a number of factors that should be considered when planning and designing a website search; each is covered below.

Understand the purpose of the search

It is critically important to understand the purpose of the search and the problems it aims to solve. Consider this before anything else, because the design of an effective search depends entirely on it.

As part of understanding the purpose of the search:

  • Consider the audience that you are targeting with the search. For example:

    • Is the audience technical or non-technical? (this will impact decisions such as the inclusion of an advanced search form)

    • Is the audience highly familiar with the internal workings of your organization/site? (e.g. do they understand internal language and how things are organized, or are they the general public?)

    • Is the search about answering very specific questions, or about returning relevant answers to more general queries?

  • Consider how the content may need to be scoped.

Understand the current pain points

If the search is replacing an existing search function, understanding the current pain points is an extremely important factor in delivering an effective search.

Think about the business goals that the search is trying to achieve and focus on the pain points in the current solution: determine what currently works well and what causes frustration.

Mock up search results screens

It is always worth undertaking an exercise to create some basic mock-ups of the search results screens.

This is a really important step that helps you understand the customer’s requirements and exposes underlying data requirements. It also helps the search owner to understand more abstract requirements (such as the need for metadata to power features that are often requested in the brief).

High level design goals

When working through the requirements always try to deliver a solution that makes use of standard functionality within Funnelback.

Always keep in mind the purpose of the search and the problems that it should be solving.

Question requirements which do not make sense

This may sound quite obvious, but it is one area where many solution designers fail the search owner.

It is important to remember that your expertise as the solution designer is a valuable service that the search owner is paying for, and that often the search owner does not really know what they want (even if they think they do), or is unable to clearly articulate it.

So, don’t treat the requirements as gospel and blindly agree to all the requests. While you must understand the perceived requirements, do not be afraid to provide advice when requirements don’t contribute to achieving the purpose of the search, or if there is a better way to achieve the desired outcome. Explain why you feel a requirement will be detrimental to the search; if the customer still insists on including it, at least they have made an informed decision.

If designing a search for yourself, carefully consider the requirements and assess each for the benefit it will provide compared with any increase in complexity to the search design. This will help to determine which of the requirements are really important, which are just nice to have, and which might actually diminish the effectiveness of the search.

Keep it simple

There are many reasons why keeping it simple will contribute to the effectiveness of a search. These include:

  • The solution focuses on solving the core problems. This makes the search more effective and easier to use as users are not presented with features that provide little benefit.

  • The solution is easier to understand, maintain and support.

Avoid customization

Customization, beyond the result templates and use of plugins, should be avoided where possible as it can affect the ability to upgrade easily and can also have other unintended side effects on the correct functioning of the search.

Where possible avoid solutions that:

  • Require custom plugins to be developed, especially if there is a dependency on the structure of the markup of the content.

  • Require customization of the faceted navigation or auto-completion behavior.

  • Require complex manipulation of the search question and response using plugins or curator rules.

Avoiding customization has a number of advantages:

  • Implementation time is reduced, so delivering the search is faster and cheaper.

  • Upgrades are simplified as standard functionality will be upgraded automatically by Funnelback when an upgrade occurs. This reduces the cost and time to upgrade to a newer version of Funnelback.

  • Bugs within standard functionality can be reported to the developers and patched at no cost to the search owner (depending on the nature of the bug). Bugs in custom functionality require custom development and are unsupported.

Initial investigation

There will usually be some very high-level requirements or goals that are known when starting on a search solution design project.

It is often useful to spend a bit of time using this information to conduct an initial investigation of the site or sites that are to be searched, and to put together a rapidly built prototype that can be used to assist both you and the customer in understanding what can be achieved with the search and the available data.

Creating a prototype

The first step to creating a useful prototype is to spend some time familiarizing yourself with any freely available content and building up a picture of what is possible given the available data, and what could be used to provide relevant functionality for a search given any known design goals.

For example, if the broad goal is to replace an existing site’s search, start by looking at the following:

  • Have a look at any existing search to see what features it has - this will provide a good starting point for any prototype. Evaluate each feature in terms of its current and potential effectiveness.

  • Have a look at the available content and evaluate this for potentially useful search features.

    • This could include getting access to database content that populates parts of an organization’s website, or investigating what social media channels the organization uses. YouTube in particular is often a very useful addition to a site search as it provides another information channel that is often not hosted on the organization’s site.

  • Investigate what metadata is available and what might contribute to enhancement of the search (a small sketch follows this list).

  • If the information is not available, or not directly available to the search, it is sometimes worth mocking up some content to prototype how a feature might work. Having a working prototype will provide a useful reference for the customer and also for the requirements gathering session.
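
To support this investigation, it can be useful to script a quick survey of the metadata a site already exposes. The sketch below is a minimal example in Java using the open-source Jsoup library; the URL is a placeholder and the output is simply a starting point for deciding what metadata could enhance the search.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Quick metadata survey for the initial investigation: fetch one page and
    // list the metadata it exposes, to judge what could power facets,
    // summaries or thumbnails.
    public class MetadataSurvey {
        public static void main(String[] args) throws Exception {
            // Placeholder URL - replace with a page from the site being investigated.
            String url = "https://www.example.org/about/";

            Document doc = Jsoup.connect(url).get();
            System.out.println("Title: " + doc.title());

            // Standard and custom <meta name="..."> tags.
            for (Element meta : doc.select("meta[name]")) {
                System.out.println(meta.attr("name") + " = " + meta.attr("content"));
            }

            // Open Graph style <meta property="..."> tags (often supply thumbnails).
            for (Element meta : doc.select("meta[property]")) {
                System.out.println(meta.attr("property") + " = " + meta.attr("content"));
            }
        }
    }

Repeating this across a handful of representative pages (news articles, courses, people profiles etc.) builds up a picture of what is consistently available.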

Requirements gathering

Requirements gathering is the process of determining all the details that are required for the search. The requirements will inform how the search is built. The requirements gathering should be undertaken with the owner and representatives of other important users of the search.

Consider what repositories should be covered

Determine all the different content repositories that should be covered by the search. Think in terms of the individual result items: what is expected to appear as a possible search result?

Are there any useful social media repositories that would enhance the search?

Is there any database/XML/CSV/JSON record data that would be useful to add to the search (such as a people directory)?

For each repository:

  • Think about what should be included and excluded from the repository. The search will provide more accurate results if ephemeral content is not included, as this produces noise in the search index. E.g. consider excluding pages that just list items and focus on indexing the items themselves. Educate the customer on effective use of robots.txt, robots meta tags and Funnelback noindex tags (see the example after this list).

  • Does the site have an appropriate sitemap file that could be used?

  • Are there any calendars or existing searches/directory browse functions that are likely to be crawler traps and should be excluded?

  • Are there any binary file types that should be included?

  • Is 10MB an appropriate maximum file size to accept for the index?
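
The sort of guidance to give content owners can be illustrated with a short sketch. The example below shows a robots.txt that excludes likely crawler traps and advertises a sitemap, a robots meta tag that keeps a listing page out of the index while still allowing its links to be followed, and Funnelback noindex comment tags wrapped around repeated navigation. All URLs and paths are placeholders, and the exact noindex tag syntax should be confirmed against the documentation for the Funnelback version in use.

    # robots.txt - exclude crawler traps and advertise the sitemap
    User-agent: *
    Disallow: /calendar/
    Disallow: /search/
    Sitemap: https://www.example.org/sitemap.xml

    <!-- Listing page: keep it out of the index but still follow its links -->
    <meta name="robots" content="noindex, follow">

    <!-- Page template: exclude repeated navigation from the search index -->
    <!--noindex-->
      <nav>... header and navigation links ...</nav>
    <!--endnoindex-->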

The search results screens

When investigating the requirements for a search always start with an exercise to determine all the different search results screens that will be required.

For most searches this will just be a single search results screen, but for more complicated searches Funnelback may be required to power a number of search results pages.

A search results page is any listing of search results that are returned by Funnelback and includes:

  • Standard search results listings (either as templated HTML or as JSON/XML). This is any interactive search results page and includes site search as well as any specific searches that may be delivered (such as a course finder or people directory).

  • Search-powered content (listings or content pages where the Funnelback search index is used as a database to populate content). This could include:

    • browse listings

    • CMS content pages that are populated from data provided by an embedded query to Funnelback.

    • GeoJSON data provided via search used to power a map.

    • Data export options that are populated with search results (e.g. Download as CSV)

The following questions should be considered for each of the search results screens identified.

What is the purpose of this search interface?

Understanding the purpose of the search is the key to providing an effective solution.

  • Where does this interface fit in the overall solution to the stated purpose of the overall search?

What is the intended audience of this search interface?

Understanding the audience of the search interface will inform some of the key design decisions and the features that will be suitable.

  • If the audience is highly technical then features such as advanced search forms may be appropriate.

  • If the audience is made up of general users then a simpler interface that guides the user with features such as faceted navigation will be more appropriate.

If the search is targeting too many different audience groups it may be worth providing different searches for different audiences. This could be achieved via different audience-specific results pages that contain different content or ranking settings.

What content should be covered by this search interface?

Determine what the overall content set should be for the search interface. The set of information appropriate for the search interface doesn’t have to be the entire dataset and may include a subset of the data sources indexed, and even a subset of items within specific data sets.

  • What repositories/data sources are included?

  • Are binary files expected to be included? If so what file formats?

  • Is the default (10MB) file size limit acceptable? Does this need increasing?

Do filters need to be provided on the search results?

Use this as an opportunity to demonstrate how standard faceted navigation works and determine if standard functionality will deliver the required functionality.

Filtering of the search results covers more than just faceted navigation. It may include additional controls, such as slider widgets or date range pickers, that are used to filter the search results.

Concentrate on filters that provide benefit to search users (e.g. a file type filter is quite easy to provide, but often provides little benefit to users, who are generally more interested in finding content based on the topic with little care for the format).

  • What filters are required?

  • From where will the data be sourced? (e.g. metadata requirements for facets, numeric metadata for any ranges)
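
Facets and range controls are normally powered by metadata in the source content that is then mapped to metadata classes in Funnelback. The snippet below is a hypothetical illustration of the kind of page-level metadata that could back a category facet, a fee range slider and a date range picker; the field names are invented for this example and the mapping itself is configured in Funnelback.

    <!-- Category metadata that could back a "Study level" or "Faculty" facet -->
    <meta name="courseLevel" content="Undergraduate">
    <meta name="faculty" content="Science">

    <!-- Numeric metadata that could back a fee range slider -->
    <meta name="courseFee" content="12500">

    <!-- Date metadata that could back a date range picker -->
    <meta name="publishedDate" content="2023-08-14">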

Auto-completion

Think about how auto-completion works and determine whether standard simple (organic) suggestions are sufficient, or whether there is a requirement for structured (rich) or concierge (multi-channel) auto-completion.

If structured suggestions are required, what are the data sources?

If concierge is required, what are the different autocompletion datasets? Wireframe the concierge completion and ensure all the desired metadata exists in the source content.

Is an advanced/custom search form required?

Unless the audience is a group that is familiar with advanced search (such as librarians or legal practitioners), the use of an advanced or custom search form is not recommended. Research shows that typical users get better results from a simple search form.

Filters (faceted navigation) provide advanced search functionality, but without the confusion or traps that result in users specifying queries that return no results. Facets 'guide' the user to create an advanced query without them needing to understand and navigate an advanced search screen.

Wireframe the results screen

Draw wireframes of the results screen.

Is the design to be responsive, or is a separate mobile template required? (A yes to either of these will require additional wireframes.) A separate template may also require a separate profile.

Is the customer providing the design? (HTML markup patterns, CSS, JavaScript). The customer should be responsible for providing these unless specific arrangements are made with Funnelback.

Ensure you capture:

  • High level page layout

  • Result formats (i.e. a mini wireframe for each type of result)

  • Consider where each 'field' is going to be sourced

  • Don’t forget about thumbnail URLs. If a thumbnail is required, where does it come from (e.g. a metadata field), or what are the rules to construct the image URL from information available to a search result? (A sketch of one such rule follows this list.)

  • Don’t forget about the no-results page format.

  • Consider appropriate stencils that can be used and articulate how these will work.

  • Non-standard filetypes that are required (we only index the following by default: HTML, MS Office (Word/Excel/PowerPoint), RTF, PDF and text documents). Make sure these can be filtered successfully before agreeing to include them.

  • Included features? Try to understand the goals of the search and use this information to advise the customer on features that would enhance their search. When talking to the customer don’t get bogged down in Funnelback terminology. Explain in plain English what a feature does, as a widget required in the results template might be implemented using a combination of features that work together. E.g. you might discuss a 'filter panel' that comprises faceted navigation and some additional filter controls (such as a date picker or range slider). Things to consider include:

    • Spelling suggestion message location

    • Query blending message location

    • Filter (facets) interaction

    • Related searches (contextual navigation)

    • Query completion

    • Quick links? Quick links search box?

    • Best bets

    • Saved searches/search history (sessions/shopping cart)

    • Extra search panels
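
As an example of the kind of thumbnail rule worth capturing during wireframing (see the thumbnail bullet above), the sketch below shows one hypothetical convention: prefer an image metadata value when the result has one, otherwise derive a thumbnail path from the result’s live URL. Both the metadata field and the path pattern are invented for illustration; the real rule must come from the customer’s content.

    // Hypothetical thumbnail rule: prefer an "image" metadata value,
    // otherwise derive a thumbnail path from the live URL.
    public class ThumbnailRule {

        public static String thumbnailFor(String imageMetadata, String liveUrl) {
            if (imageMetadata != null && !imageMetadata.isBlank()) {
                return imageMetadata; // e.g. supplied by an og:image or image metadata field
            }
            // Invented convention: .../pages/about/index.html -> .../thumbnails/pages/about/index.png
            return liveUrl
                    .replaceFirst("^https?://[^/]+", "https://www.example.org/thumbnails")
                    .replaceFirst("\\.html?$", ".png");
        }

        public static void main(String[] args) {
            // Prints https://www.example.org/thumbnails/pages/about/index.png
            System.out.println(thumbnailFor(null, "https://www.example.org/pages/about/index.html"));
        }
    }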

Synonyms / thesauri

Is there a requirement for synonyms or thesauri?

Does the search interface have any geospatial or mapping components?

Some questions to consider:

  • Is the data geocoded? If so, is it in the correct format? (Postcode or location to geocode mapping requires additional work.)

  • Does the search need to be aware of the user’s current location?

  • Is there a relevance component that is relative to a location?

  • Is there a requirement to plot the search results on a map?
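
If the answer to the last question is yes, the search needs geocoded metadata in the content and a map-friendly representation of the results. The snippet below is an illustrative sketch only: the latitude/longitude meta fields are hypothetical (confirm the exact format Funnelback expects for geospatial metadata against the documentation), and the GeoJSON feature simply shows the shape of data a typical map widget consumes.

    <!-- Hypothetical geocoded metadata embedded in the source content -->
    <meta name="latitude" content="-33.8688">
    <meta name="longitude" content="151.2093">

    A single result expressed as a GeoJSON feature for a map widget
    (GeoJSON coordinates are ordered longitude, latitude):
    {
      "type": "Feature",
      "geometry": { "type": "Point", "coordinates": [151.2093, -33.8688] },
      "properties": {
        "title": "Sydney campus",
        "url": "https://www.example.org/campuses/sydney"
      }
    }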

Is any business logic required for the search interface?

This includes capturing any rules required for predictive segmentation or results curation as well as other business logic that will need to be implemented using plugins.

Be wary of long wish lists and don’t be afraid to educate the site owner on why what they want might not be a good thing.

Identify if there is any custom functionality that needs to be developed

This isn’t a question for the customer, but something that needs to be noted for anything that is going to require custom implementation - this will be invaluable when planning the implementation.

You don’t need to consider basic template customization.

For anything that is identified as custom, if there is an alternative approach that uses standard functionality to provide similar functionality then discuss this with the customer highlighting the benefits of using the standard functionality. Benefits include cost, maintenance overheads, ease of upgrade etc. Custom functionality may need to be re-implemented on upgrade or for other reasons (such as custom functionality that integrates with a 3rd party API that may change with little notice).

Capture as much detail as possible.

Can it be done vs should it be done

It is really important to consider the question of can vs should.

Can the custom functionality be implemented in Funnelback?

Should the custom functionality be implemented? Consider if there are reasons why it would be better to remove from the requirements or implement the custom functionality outside Funnelback. When thinking about this it is often necessary to think of the bigger picture and how the search fits into the overall system. The custom functionality may be better delivered using a different tool or not at all. There may be significant ongoing maintenance costs for a piece of functionality and this needs to be considered. Just because it can be implemented doesn’t mean that it should be implemented.

Integration method

Discuss integration options and determine how the search interface will be integrated with the end system.

  • standalone search

  • standalone search with remote included headers/footers

  • integrated search (i.e. a partial template that returns an HTML code chunk)

  • custom JSON or XML

  • direct access to JSON/XML (clearly explain the disadvantages of this approach)

The disadvantages of integrating with the standard JSON/XML endpoints include:

  • Any Funnelback features that rely on front-end implementation (such as faceted navigation, sessions, pagination controls) must be implemented by the code that interprets the JSON or XML.

  • New features added to Funnelback, or features that were not implemented initially, will not be available until further work is done to interpret the new data model elements and format the data appropriately.
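
To make this trade-off concrete, the sketch below (Java 11+, standard library only) calls a search JSON endpoint directly and prints the raw response. The host, collection/profile parameter names and the comments about the data model are assumptions based on typical Funnelback deployments and should be verified against the target instance; everything beyond this raw call - rendering results, facets, pagination, sessions - becomes the integrator’s responsibility, which is exactly the disadvantage described above.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    // Minimal direct call to a search JSON endpoint. Everything the templated
    // UI normally provides (facet rendering, pagination controls, sessions)
    // must be rebuilt by whatever consumes this response.
    public class DirectJsonSearch {
        public static void main(String[] args) throws Exception {
            // Assumed host and parameter names - verify against the actual instance.
            String endpoint = "https://search.example.org/s/search.json";
            String query = URLEncoder.encode("enrolment dates", StandardCharsets.UTF_8);
            String url = endpoint + "?collection=example-search&profile=_default&query=" + query;

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            // The JSON body must then be parsed and interpreted by the integrating
            // application (results, facet values, pagination data and so on).
            System.out.println("HTTP " + response.statusCode());
            System.out.println(response.body());
        }
    }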

HTTPS access

Do the search results need to be available via HTTPS on a non-default domain name?

  • If the search is being delivered on a domain other than squiz.net or funnelback.com, the site owner will need to supply a certificate.

Repositories

  • For each repository/data source, where is the data coming from? If it’s not web content (e.g. a database query, XML), is this a supported data source?

  • Do any of the repositories require authentication in order to access the content?

  • If more than one data source requires authentication, are the usernames common across them?

  • Are there any subsets of these that should be excluded?

Capture the different repositories that will be included in the overall search. This information should be derivable from the search interface discussion that has occurred.

The following should be completed for each repository.

What is the data source?

Is it a supported collection type or will custom gather code be required? (e.g. to download and process XML)

For websites, check that they are crawlable (e.g. content is not obscured by JavaScript or hidden behind search boxes where the only way to reach content pages is via a search, and not blocked by a firewall or excluded using robots.txt).

How large is the data source? For web data sources get approximate document counts. For XML/database style data sources find out the number of records. For enterprise data source types (e.g. fileshares, TRIM, Sharepoint) determine the number of documents and the size of the repository.

Data source access

How do we access the data? (e.g. standard web crawl, via a SOAP API, database crawl, XML export etc)

Make special note of:

  • Any non-standard requirements.

  • Any authentication involved in accessing the data source.

External metadata

If the source is a CMS, is external metadata required to ensure that Word and PDF files have the correct titles, descriptions etc.?

Sessions and history

Is there a requirement for sessions and history?

Sessions and history is a front-end feature that closely integrates with the default Freemarker template. When using sessions and history Funnelback should be used to return the full results page. It is possible to import the JavaScript into an existing template, but it is quite complicated and not supported.

Security

Is any authentication required to get to the documents? If so, what type of authentication is used? Is it single sign-on? Do we need to perform document-level security? This requires more detailed analysis to ensure that the content is crawlable.

Other requirements

Are there any other special requirements?

Reporting requirements

Analytics

What analytics reports are required? It is typically useful to have separate analytics for each (interactive) search interface.

Understanding this will feed into the results pages that are required as part of the design.

Content auditing

Are content audit reports required for any results pages?

Are there any custom metadata fields that should be reported on (in summary graphs and also in the search results tables)? Note: specific requirements for content auditor customization may require additional time to implement.

Accessibility auditing

Are WCAG reports required for any web content?

Do the WCAG reports need to be segmented so that different groups of pages are reported on separately?

Administration requirements

  • What level of administration access is desired? (e.g. ability to edit templates, just view the reports)

  • Required users (beyond delivering two users with report and edit/report roles)

  • Level of administration access. Note the admin user limitations in the hosted environment (if applicable)

Hosting requirements

Does the search have any requirements that mean dedicated hosting is required?

Update frequency

Are there any specific requirements for update frequency?

Is there any requirement for instant updating (web and DB collections only)? Or push collections?

Search solution design

Design of the search solution requires a high level of understanding of what Funnelback features are available and how the functionality available within Funnelback works.

The design process involves taking the requirements and turning them into an internally facing functional design that specifies a solution that can be built using Funnelback.

Where the requirements should focus on functionality and the user interfaces from an end user perspective, the functional design needs to convert this into something that can be built and must be specified from an implementer’s perspective. This means there will be a focus on search packages, results pages and data sources, and specific Funnelback features used to deliver the functionality outlined in the requirements.

This section provides guidance on designing a web search solution once the requirements have been specified.

High level design

Analyze the requirements

Start by analyzing the different search interfaces noting:

  • The data sources required to power this interface

  • A list of metadata fields that will need to be extracted from each data source to be able to populate all the different fields in the output.

  • Any data sources that need to be excluded from the search results listing.

  • Any data sources that are used to provide an extra search.

  • Any data sources that are used to generate auto-completion

  • Any other sources of auto-completion.

Determine the solution structure

The data sources identified when analyzing the requirements will suggest a suitable structure for the solution.

The structure will be determined by the following factors:

  • The type of repository - different repository types must be set up as separate collections. E.g. web collections must be separated from social media collections

  • Web data sources can be grouped into a single collection or split into several web collections depending on requirements. In general, it is best to group all the web collections together if feasible as the ranking will be improved by any cross-site links. Splitting is required if:

    • Update frequency needs to be varied (e.g. a section needs to be updated very frequently compared to the rest of the content).

    • Authentication needs to be varied across sites (though site profiles may allow you to do what you need in a single collection).

  • For most searches a single search package with multiple results pages will suffice, but in some circumstances you may require multiple search packages (e.g. if there is a requirement for collection level security and different searches have different requirements for this).

Identify results pages

Separate results pages are required for each separate search interface as defined in the requirements, and also for any generated auto-completion. This allows separate best bets, synonyms, curator rules and ranking to be applied. It is also required to provide separate analytics, content auditing and accessibility reporting.

Analytics

Be mindful of what will end up in the search analytics.

Make use of the system query field to hold values that shouldn’t show in the analytics (like metadata constraints or queries that are designed to return everything) or use a separate results page to serve 'canned' searches.

Custom functionality

For each of the pieces of custom functionality identified during requirements gathering:

  • Think about how the feature might be constructed and how it will integrate with Funnelback.

  • Try to maximize the use of standard functionality

  • Custom functionality will need to be implemented as one or more plugins, or by defining curator rules.

Custom gatherers

  • For data sources that are web accessible, use a web data source if feasible. This includes single-file XML/CSV/JSON data sources where the data file is available via a web server.

  • For data sources that are accessed via an API a custom data source that uses a custom gatherer plugin is recommended.

Document filters

  • For each data source consider if there are any requirements for filters.

  • Filters can be used to analyze, extract information from or transform the documents after they are gathered, but before they are indexed.

  • Try to make use of existing filters (such as the metadata scraper, metadata normalizer) and plugins where possible.

  • When filtering HTML documents implement Jsoup filter plugins if possible (see the sketch after this list).

  • Restrict each filter to performing a single function.
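
As an illustration of keeping each filter to a single function, the sketch below shows the kind of Jsoup-based transformation such a filter might perform: removing a hypothetical promotional banner from the parsed HTML before it is indexed. The Funnelback plugin scaffolding is deliberately omitted and the CSS selector is invented for this example; a real implementation would follow the plugin framework documentation.

    import org.jsoup.nodes.Document;

    // Single-purpose transformation: strip a promotional banner from the parsed
    // HTML so it does not add noise to the search index. The plugin wrapper
    // that supplies and returns the Document is omitted from this sketch.
    public class RemovePromoBanner {

        public Document apply(Document doc) {
            // Hypothetical selector - adjust to the markup of the target site.
            doc.select("div.promo-banner").remove();
            return doc;
        }
    }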

Query-time transformations

Query time modifications can be made using:

  • search lifecycle plugins

  • curator rules

Functional design

The functional design (or functional specification) is a document that encapsulates the design in a format that maps the requirements into something that can be built with Funnelback.

Produce an overview or high-level design to show how the search solution is composed.

Work through each of the search results pages capturing the Funnelback-specific detail required to satisfy the requirements.

Data sources

For each data source:

  • Define a suitable data source type (e.g. web, YouTube)

  • Define the seed URLs (or equivalent)

  • Define include and exclude rules

  • Also consider included file types

  • For web data sources:

    • consider if accessibility auditing should be enabled

    • consider if Sitemap.xml support should be enabled

    • For multi-site crawls consider if there are any site profile settings that should be defined

    • For multi-site crawls consider if quick links should be enabled

  • Define the metadata that must be extracted

    • Consider which fields should be treated as indexable content, and if any fields contain special information (geospatial or numeric metadata).

  • Define any indexer options that should be applied

Search packages

  • Define the results pages that will be used

  • Define the set of data sources that should be included

  • Define the methods of integration and search endpoints that will be accessed

  • Define any access restrictions that need to be applied

For each results page:

  • Define any scoping of the data sources that should be applied - this includes a subset of the data source collections, and even subsets of the data within a specific data source.

  • Define any ranking or display options that should be applied

  • Define any faceted navigation and the corresponding source of the facet categories

  • Define what query completion is required, and what the sources are

  • Consider which features are required:

    • Query blending

    • Stemming

    • Spelling suggestions

    • Curator rules

    • Best bets

    • Synonyms

    • Quick links

    • Search sessions/history

Consider additional search requirements:

  • Define the data sources where the results will be sourced (to determine if/how the results page needs to be scoped)

  • Define any query processor options to apply

  • Define the no-results screen

  • Consider any specific content auditing requirements

  • Specify the integration method and template wireframe (if applicable)

  • Specify the metadata fields that are required to render the template or satisfy the frontend requirements.

  • Specify any query-time hook scripts that are required and outline the purpose and provide pseudo-code for the algorithm.

Content change requirements

There are often changes that must be made at the content source to contribute to the search solution.

  • Define any external metadata feeds that are required (to associate metadata with binary content such as PDFs) and specify where these will be sourced. They will probably require workflow to fetch. External metadata should ideally be supplied in the correct format otherwise additional workflow will be required to transform it. External metadata should also be validated as any errors in the file will cause the indexing to fail.

  • Define any API keys, user accounts etc. that are required for the search solution.

  • Consider if there are any requirements to create a page from which to source headers and footers.

  • Specify any recommendations for updates to robots.txt or robots meta tags in source web content.

  • Funnelback noindex tags should be added to source HTML templates to suppress headers, footers and navigation from being included in the search index, if feasible. This will improve result quality by reducing noise within the index.