Designing and building a web search

Background

This article covers things to consider when designing and building a web search using Funnelback, including collection design (for example, whether to use one or more collections or profiles).

Before you start

There are a number of things you should do before you start building a website search.

  • Spend a little bit of time familiarising yourself with the sites that you will be indexing.

  • Think a bit about what should be included in the search.

    • Are there file paths that should be excluded? Don’t just index everything because it’s there. You get a better end result if you don’t fill the index up with useless items.

    • Are there file types that should be included or excluded?

    • Should the crawl include subdomains?

    • Should the crawl be restricted to only https?

  • Check whether the site has a robots.txt and a sitemap.xml linked from it.

  • Inspect some of the pages on the website and evaluate if there is any useful metadata that would be desirable to use in the search.
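The robots.txt check above can be done by hand, but a small script makes it repeatable. The following is a minimal sketch using Python's standard library, not a Funnelback tool. The robots.txt content, the URLs and the user agent name ExampleCrawler are all hypothetical; in practice you would fetch the real file from the site you plan to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
"""


def inspect_robots(robots_txt, urls, user_agent="ExampleCrawler"):
    """Return (allowed, sitemaps) for the given robots.txt text.

    allowed maps each URL to True/False; sitemaps lists any Sitemap: entries.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = {url: parser.can_fetch(user_agent, url) for url in urls}
    sitemaps = parser.site_maps() or []  # site_maps() needs Python 3.8+
    return allowed, sitemaps


allowed, sitemaps = inspect_robots(
    ROBOTS_TXT,
    [
        "https://www.example.com/about/",
        "https://www.example.com/private/report.html",
    ],
)
for url, ok in allowed.items():
    print("allowed " if ok else "blocked ", url)
print("sitemaps:", sitemaps)
```

This only reports what the site's robots.txt declares. Whether those paths belong in the search is still a design decision.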

There are a number of things that can be done within the website to provide a better quality search experience. This is not always possible, but it should be considered as part of any search design because small changes here can have a big impact on the overall quality of the search results.

  • Use robots.txt and robots meta tags to control which parts of your site are indexed by the web crawler, and modify your site templates to include noindex tags. These are a very effective way of improving ranking quality and provide a way of hiding parts of the page (such as headers, footers and navigation) from Funnelback. See: web data sources - controlling what is included.

  • Create a sitemap.xml containing links to orphan pages that should be included in search results. Ensure this is linked from your robots.txt.

  • Consider tagging pages with appropriate metadata so that you can provide effective faceted navigation and auto-completion as part of your search.

  • Publish (unique) canonical URLs for your pages to ensure that crawlers index your content using preferred URLs.
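As an illustration of the sitemap.xml suggestion, the sketch below builds a minimal sitemap in the standard sitemaps.org format using Python's standard library. The orphan page URLs are hypothetical, and the function name build_sitemap is an assumption for this example only.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap.xml document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical orphan pages that are not linked from anywhere on the site.
orphans = [
    "https://www.example.com/orphan-page.html",
    "https://www.example.com/campaigns/2019-archive.html",
]
sitemap_xml = build_sitemap(orphans)
print(sitemap_xml)
```

To link the resulting file from robots.txt, add a line such as "Sitemap: https://www.example.com/sitemap.xml" (using your own URL) to that file.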

Collection design

Once you have a handle on what should be crawled and indexed the next step is to think about how to structure your Funnelback collections that will support the search.

Should you have a single collection with multiple profiles configured on it, or a collection for each domain, combined into a single search using a meta collection? Deciding whether to use a single collection or to break a web crawl into multiple collections can be confusing.

One or more collections?

As a general rule you should use as few web collections as possible.

A single web collection allows Funnelback to take into account both on-site and off-site links when ranking a document. Using a single collection also reduces the chances of duplicate results, is easier for an administrator to maintain, and uses server resources such as memory far more efficiently.

It is generally not a good idea to have a separate collection for each domain in a multi-site crawl or to break a single website up into multiple sub collections e.g. based on the folder structure.

Multiple collections are hard to maintain: you need to ensure there is no overlap of content between collections (which makes managing the include/exclude rules complicated), otherwise there will be duplicates in the search index.

Ranking is also poorer with multiple collections. When a large web collection is broken down into multiple collections you lose cross-site (and even same-site) link information, unless you perform extra configuration to instruct the indexer to include link information from other specified collections.

Having multiple collections could also mean that more load is placed onto your web server when crawling (if you happen to update more than one collection at the same time). More load is also placed on the Funnelback server if you crawl multiple collections simultaneously as each collection update takes a chunk of resources on the server.

Meta collections also have a maximum number of components: the limit is 64 collections, and it is lower if the meta collection includes any push collections.

Separate collections are appropriate:

  • if a subset of content needs to be updated on a different cycle. Example: a media organisation crawls their web content on a 24 hour cycle but the news section of the site is crawled every 30 minutes.

  • if a subset of content requires NTLM authentication in order to access the content.

  • if different web collections need to be administered by different people.

  • if the amount of data being crawled is too large for a single collection. Sometimes splitting up a large collection will be required in order to complete an update within an acceptable time. However, crawl optimisation options should be considered before doing this.

If you have to break up a web crawl into multiple collections:

  • ensure that there is no overlap between the different web collections (using your include/exclude patterns). If you crawl the same page in two collections it will appear as a duplicate in any search results where the two collections are part of a meta collection being searched. Duplicates can also appear under variants of a URL, so this needs to be considered when defining include/exclude patterns. For example, http://example.com/, http://www.example.com/ and http://example.com/index.html may all be the same page. The URL that is stored will depend on how the URL is first discovered during a crawl (unless you have defined a canonical URL for the page).
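One way to look for overlap is to compare the URLs stored by each collection after reducing them to a common form. The sketch below is a rough heuristic, not a Funnelback feature. It assumes that a leading "www." and a trailing "index.html" do not change which page is served, which is only true on some sites, and the collection names and URL lists are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def variant_key(url):
    """Reduce a URL to a key shared by its likely variants (assumed rules)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return host + path


def find_possible_duplicates(collections):
    """Map each variant key seen in more than one collection to its hits.

    collections: dict of collection name -> iterable of stored URLs.
    """
    seen = defaultdict(set)
    for name, urls in collections.items():
        for url in urls:
            seen[variant_key(url)].add((name, url))
    return {
        key: sorted(hits)
        for key, hits in seen.items()
        if len({name for name, _ in hits}) > 1
    }


# Hypothetical URL lists for two web collections.
duplicates = find_possible_duplicates(
    {
        "main": ["http://example.com/", "http://example.com/about.html"],
        "news": ["http://www.example.com/index.html", "http://example.com/news/"],
    }
)
for key, hits in duplicates.items():
    print(key, hits)
```

Any key reported here is only a candidate. Confirm that the variants really serve the same page before tightening the include/exclude patterns.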

One or more profiles?

Multiple profiles can be set up on top of a meta collection allowing many different searches to share the same underlying index.

When you define a profile you can configure it to search over only a subset of pages in the index. For example, if you crawled all of the pages at a university in your web collection, you could then provide profiles for individual departments within the university.

Each profile works like a standalone search, with dedicated templates and features like faceted navigation, synonyms and best bets and also dedicated analytics and reporting. Profiles can also have dedicated ranking. The only real difference is that the underlying index is the same for all of the profiles.

Profiles are a great way to re-use your indexes in different contexts and can avoid the need to set up multiple collections.

Optimisation

After you’ve run your first crawl you should spend a bit of time optimising the collection(s). Have a look at the crawler log files, in particular checking out:

  • url_errors.log: This logs errors that occurred during the crawl. You will see things like 404 (not found) errors, which can be addressed by a site owner, as well as other issues like 400 or 500 errors and warnings such as files that were skipped because they were larger than the configured download size. If you see lots of errors due to timeouts you may need to increase the crawler’s request delay. If you see lots of repeated 404 errors, the cause could be a bad link in a template, which you could add an exclude pattern for.

  • stored.log: This log shows what documents were stored by the crawler. Check the end of the log to see what was being stored before the update completed. If the update still had uncrawled pages (and reached a crawl timeout) then this will provide an indicator of the slow sites and can also show crawler traps. This information can be used to add additional exclude rules if appropriate.
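When the error log is long, grouping the failing URLs can make a bad template link easier to spot. The sketch below is one hypothetical way to do this. It works on (status code, URL) pairs that you have already extracted, because the line format of url_errors.log is not described here and is not assumed. Grouping by host and first path segment is an arbitrary heuristic, and the error list is made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def cluster_errors(errors, status=404):
    """Count errors with the given status by host and first path segment.

    errors: iterable of (status_code, url) pairs, already parsed from a log.
    Returns a list of (group, count) pairs, most frequent first.
    """
    counts = Counter()
    for code, url in errors:
        if code != status:
            continue
        parts = urlsplit(url)
        segments = [segment for segment in parts.path.split("/") if segment]
        first = segments[0] if segments else ""
        counts[f"{parts.netloc}/{first}"] += 1
    return counts.most_common()


# Hypothetical errors extracted from a crawl.
errors = [
    (404, "https://www.example.com/old/a.html"),
    (404, "https://www.example.com/old/b.html"),
    (500, "https://www.example.com/api/x"),
    (404, "https://www.example.com/misc.html"),
]
clusters = cluster_errors(errors)
for group, count in clusters:
    print(count, group)
```

A group with a high count is a candidate for an exclude pattern, or for a fix by the site owner.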

Include/exclude rules, sitemaps

Make use of include/exclude rules, robots.txt, metadata directives and sitemaps to control what gets stored and indexed by Funnelback. Reducing noise in the search index is the most effective way of improving your search quality.

Timeouts and crawler threads

Adjusting the crawler timeouts and number of crawler threads can have a large impact on the overall time taken to crawl a site. By default the web crawler will only ever have one active request for a site. Increasing the number of crawler threads can speed things up (but you need to ensure that the web server is capable of handling the extra crawler load).
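A back-of-the-envelope estimate shows why these settings matter. The sketch below uses a deliberately simple model that is an assumption, not a description of how the crawler schedules requests: each request to a site costs the request delay plus the average response time, and concurrent requests divide the total evenly. The page count and timings are hypothetical.

```python
def estimate_crawl_seconds(pages, request_delay_s, avg_response_s,
                           concurrent_requests=1):
    """Estimate the time to crawl one site under a simplified model.

    Assumes every page costs (delay + response time) and that concurrent
    requests share the work evenly. Real crawls will vary.
    """
    if concurrent_requests < 1:
        raise ValueError("concurrent_requests must be at least 1")
    return pages * (request_delay_s + avg_response_s) / concurrent_requests


# Hypothetical site: 50,000 pages, 0.25 s delay, 0.5 s average response.
single = estimate_crawl_seconds(50_000, 0.25, 0.5)
parallel = estimate_crawl_seconds(50_000, 0.25, 0.5, concurrent_requests=4)
print(f"one request at a time: {single / 3600:.1f} hours")
print(f"four concurrent requests: {parallel / 3600:.1f} hours")
```

Under these assumed numbers the crawl drops from roughly 10.4 hours to roughly 2.6 hours, at the cost of four times the request rate on the web server.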

Site profiles

Site profiles can be used to set a number of crawler parameters that apply to specific domains. This includes being able to set the request delay and number of crawling threads as well as HTTP basic authentication.