Best practices - 2.2 collections

Background

This article details implementation best practices related to collections.

Collection naming conventions

  • Use lowercase characters and dashes only

  • Keep the names succinct and meaningful, concentrating on the purpose (e.g. public-website, intranet).

  • Avoid having the collection type as part of the name.

  • If you have multiple collections for the same client, prefix with the client (e.g. client-public-website, client-intranet).

    • For government organisations, which tend to change names regularly, prefix using their purpose instead (e.g. health instead of dept-health-and-wellbeing or dhw).

  • Use well-known abbreviations, like nsw for New South Wales.

  • Don’t use shortened names, for example by dropping letters from words, as they are hard to understand.

  • Avoid using complete domain names (e.g. www-client-com-au).

Web collections

One web collection or many?

As a general rule, use as few collections as possible. This has a number of benefits:

  • Maximises ranking opportunities (as cross-site links will be included in the ranking).

  • Fewer updates to schedule, which lowers the resource footprint on the server.

  • Allows more repositories to be added to a meta collection. Meta collections have a limit on the number of component collections that can be included, and any push collections that are included also eat into this limit as they are structured in a similar way to meta collections.

Separate out a collection:

  • If the collections need to be updated on different update cycles (e.g. a news section of a site updated every 30m, the rest of the site every 24h).

  • If the collections contain internal and external content that should be separated because they are used in different meta collections (e.g. a public meta collection including only the internet-facing web collection, and an internal meta collection containing the intranet sites as well as the public sites).

  • If the site has NTLM authentication (only a single NTLM username and password can be specified per collection). Note: sites with basic HTTP authentication can specify a username and password per domain using site profiles (site_profiles.cfg), and sites using form-based authentication can specify the settings using form interaction.

Control what is included and excluded

Inclusion and exclusion can be controlled using:

  • robots.txt, robots meta tags, sitemap.xml and Funnelback noindex tags.

  • include/exclude URL patterns (see the sketch after this list).

  • crawler non-html extensions.

  • parser mime types.

  • download and parser file size limits.

  • crawl depth limits.
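For example, URL patterns are controlled from collection.cfg using the include_patterns and exclude_patterns settings. A minimal sketch, assuming a hypothetical site and patterns:

# hypothetical patterns: crawl only the client site, skip calendar and print pages
include_patterns=www.client.com.au
exclude_patterns=/calendar/,print_view=true

Each setting takes a comma-separated list of patterns that are matched against candidate URLs; check the collection.cfg documentation for the exact matching rules in your version.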

Use site profiles to limit multi-site crawls

Site profiles can be used to set various limits and also specify HTTP basic authentication to use when crawling specific domains.

Web collection optimisation

Perform a post-crawl analysis

Perform a post-crawl analysis that investigates various crawl logs while a web collection is under development, and periodically after the collection is live.

  • Look at the url_errors.log to see if there are common patterns of URLs that should be excluded from the crawl.

  • Look at the url_errors.log to see if there are files larger than the crawler.max_download_size limit that should have been stored (and increase the limit if so).

  • Look at the tail of the stored.log to see if the crawler is getting into any crawler traps that can be avoided by defining exclude patterns, or if there are other things being stored that shouldn’t be.

  • Look at the crawl.log.* files to see if any crawler trap prevention has been triggered.

Optimise settings

There are many settings that can be used to optimise the web crawler by adjusting various timeouts and limits.

If the site has a large number of domains then crawl speed can be increased by raising the number of crawler threads, as this sets an upper limit on the number of sites that can be crawled concurrently (the default is 20). Note: be sure to monitor how this affects the server performance as more threads will use more CPU. It may be necessary to increase the number of CPUs available to your Funnelback server.
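A minimal sketch of raising the thread count in collection.cfg (the value shown is illustrative):

# default is 20; more threads allow more sites to be crawled concurrently
crawler.num_crawlers=40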

If a site that is being crawled is sufficiently resourced then the site profiles can be used to specify multiple concurrent requests against the site (but be sure to get permission from the site owner first).

Database collections

Request a Funnelback-specific database view

The database owner should create a view that contains all the data required for the Funnelback index in a de-normalised form. This allows Funnelback to run a simple SQL query (select * from VIEWNAME). Using a view also makes responsibilities clearer to the content owner, as they are in control of the view and can alter the fields as required.

Transform data in the SQL query

Where possible, data transformations such as combining fields into a new field (like creating a name field from firstname and lastname, or creating a sort field) and removing or renaming values (like NULL) should be performed in the query that sets up the view, as shown in the sketch below.
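A minimal sketch of such a view, assuming a hypothetical people table with firstname, lastname and phone columns (the || concatenation operator is standard SQL but syntax varies between databases):

-- de-normalised view created by the database owner for Funnelback
CREATE VIEW funnelback_people AS
SELECT
    id,
    -- combine fields into a new field in the view, not in Funnelback
    firstname || ' ' || lastname AS name,
    -- replace NULL values so they are not indexed as literal text
    COALESCE(phone, '') AS phone
FROM people;

Funnelback’s query then remains a simple select * from funnelback_people.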

Custom collections

Custom gatherers should only implement gather logic

When writing a custom gatherer ensure that it only implements the logic required to gather the data from the data source.

Use the filter framework to perform all filtering tasks on custom data - this ensures that all filtering is performed in the same way with the same interchangeable filters, regardless of the collection type. The filter framework has been available for custom collections since Funnelback 15.12.
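As a sketch, filters are configured with the filter.classes setting in collection.cfg, where colons chain filters together. The chain below is illustrative only: com.example.CleanTitles stands in for a hypothetical custom Groovy filter, and the built-in filter names may differ between versions:

# illustrative chain; com.example.CleanTitles is a hypothetical custom filter
filter.classes=TikaFilterProvider:JSoupProcessingFilterProvider:com.example.CleanTitles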

Meta collections

Shared configuration files

Sharing templates between profiles

Templates can be shared between profiles by referencing the master template using a Freemarker <#include> directive. See: sharing templates across profiles

Symbolic linking of configuration files

Do not use symbolic links to share configuration files between collections or profiles.

Symbolic linking should be avoided because it is unclear to the end user that the settings they are changing may impact on other collections that also use the same configuration file. Use of symbolic linking may also break newer product functionality (such as configuration that is published per-entry rather than per file as was previously done).

Symbolic linking of configuration files was previously advised as a best practice with meta collections. This is no longer the case as the way Funnelback handles configuration has changed.

Prevent access to component collections

For collections that are not supposed to be queried directly, prevent access with:

access_restriction=127.0.0.1 (and possibly any other internal ranges to permit monitoring)
access_alternate=<meta-collection-id>

This prevents query access to the collection and redirects any accesses to the specified meta collection.

Always allow access from 127.0.0.1 (or localhost) as this is required for some internal features to work such as the accessibility and content auditor tools.

Disable analytics updates on meta components

Ensure analytics.scheduled_database_update=false is set on component collections that don’t get accessed directly. That will prevent Funnelback from building analytics reports for a collection that has no data.

Use a dedicated profile for system generated queries

Meta collections commonly require system generated queries (e.g. extra searches, Ajax searches, generation of configuration such as structured auto-completion CSV). These system generated queries often require a set of specific query processor options.

Use a dedicated profile for such queries. A dedicated profile has the following benefits:

  • A padre_opts.cfg file can be set up to apply specific query processor options, rather than having to dynamically inject them with a hook script (simpler, less maintenance). See the sketch after this list.

  • The query processor options can include the -log=false option to prevent queries to the profile from being logged.

  • If the -log=false option cannot be used, the system generated queries will at least not be logged against the profile that contains the real usage analytics for the search.
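For example, the padre_opts.cfg for a hypothetical auto-completion profile could contain a single line of query processor options such as (all values other than -log=false are illustrative):

-num_ranks=1000 -SM=meta -SF=[name,title] -log=false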

Update workflow for meta collections

While meta collections have a workflow, they are not updated like other collections. They get updated when their configuration is saved in the administration interface, or when a component collection updates (as part of the meta-dependencies workflow). Because of this there are some caveats to their workflow.

Meta dependencies

Consider disabling the meta dependencies phase on all but one component collection in a meta collection. The meta dependencies phase is responsible for generating spelling and query completion for the meta collection. This can take a long time for meta collections with large indexes and can also lock indexes on component collections while the phase is running (which can cause problems with updates on other components).

It is probably acceptable for the meta collection’s spelling/query completion to be updated once a day rather than on every component collection update, so linking it to the update of the most important component can improve the overall update performance. The collection.cfg setting to disable the meta dependencies phase is:

meta_dependencies=false

Push collections

Handle updates and deletions

The system that submits content to a push collection is responsible for adding, updating and deleting content:

  • The push collection will not automatically expire any content - if an item needs to be removed from the index then the corresponding API call must be made to remove the item.

  • The system that submits content must assign an appropriate URI to each document and use the same URI for any future update/delete operations.

Ensure errors are appropriately handled

The system that submits content to a push collection must handle a number of error cases that may be returned via the push API:

  • Push API is unavailable

  • Push API returns an error

In the event of an error the submitting system must decide what to do. This should be a combination of retrying the failed API call and a queuing mechanism so that none of the operations are lost, as in the sketch below.
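A minimal sketch of such a submitting system, written in Python. The push API base URL, endpoint path and collection name are assumptions made for illustration; consult the push API documentation for your Funnelback version for the exact calls:

import time
from collections import deque

import requests

PUSH_API = "https://funnelback.example.com/push-api"  # hypothetical base URL
COLLECTION = "client-news"                            # hypothetical collection

pending = deque()  # operations queued after repeated failures, so none are lost

def submit(method, uri, content=None, retries=3):
    """PUT (add/update) or DELETE a document, retrying with backoff on failure."""
    url = f"{PUSH_API}/collections/{COLLECTION}/documents"  # assumed endpoint
    for attempt in range(retries):
        try:
            response = requests.request(method, url, params={"key": uri},
                                        data=content, timeout=30)
            if response.ok:
                return True
            # The push API returned an error: fall through and retry.
        except requests.RequestException:
            pass  # The push API is unavailable: fall through and retry.
        time.sleep(2 ** attempt)  # exponential backoff between attempts
    pending.append((method, uri, content))  # queue the failed operation
    return False

# The same URI must be used for the initial add and any later update/delete.
submit("PUT", "https://www.client.com.au/news/1", "<html>...</html>")
submit("DELETE", "https://www.client.com.au/news/1")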

Instant updates

Instant updates provide a way of updating a collection between scheduled updates.

Instant update support is limited to the following collection types: web, matrix, filecopy, trim and database.

Ensure instant update workflow is set correctly

Instant updates have special update phases which mimic some of the update phases of a standard update. Collections that have standard workflow defined require extra configuration to ensure that any required workflow is also run when an instant update runs. For example, a post-index command that modifies the index requires an equivalent post instant-index command that modifies the instant index, as in the sketch below.
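A minimal sketch in collection.cfg, assuming a hypothetical modify-index.sh workflow script. post_index_command is the standard workflow setting; the name of its instant update counterpart is an assumption here, so verify it against the instant updates documentation for your version:

# modify-index.sh is a hypothetical workflow script
post_index_command=$SEARCH_HOME/conf/$COLLECTION_NAME/modify-index.sh
# assumed counterpart setting for the instant update workflow
post_instant-index_command=$SEARCH_HOME/conf/$COLLECTION_NAME/modify-index.sh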