Best practices - 2.2 collections
This article details implementation best practices related to collections.
Use lowercase characters and dashes only
Keep the names succinct and meaningful, concentrating on the purpose (e.g.
Avoid having the collection type as part of the name.
If you have multiple collections for the same client, prefix with the client (e.g.
For Government organisations that tend to change name regularly try to prefix using their purpose. (e.g.
Use well-known abbreviations like nsw for New South Wales.
Don’t use shortened names, for example by dropping a few letrs cos its hrd 2 undrstnd.
Avoid using complete domain names (e.g.
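The naming rules above can be checked mechanically. The following is a minimal sketch, not part of Funnelback: it assumes "lowercase characters and dashes only" means the letters a-z separated by single dashes, and the list of collection-type words is a hypothetical one assembled from the types named in this article.

```python
import re

# Assumption: "lowercase characters and dashes only" is read strictly as
# a-z words separated by single dashes (no digits, dots or underscores).
NAME_PATTERN = re.compile(r"[a-z]+(?:-[a-z]+)*")

# Hypothetical list of collection-type words that should not appear in a name.
TYPE_WORDS = {"web", "meta", "push", "database", "filecopy", "trim", "custom"}


def check_collection_name(name: str) -> list[str]:
    """Return a list of naming problems; an empty list means the name passes."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        # Also rejects complete domain names, because they contain dots.
        problems.append("use lowercase letters and single dashes only")
    if TYPE_WORDS & set(name.split("-")):
        problems.append("avoid the collection type in the name")
    return problems
```

For example, `check_collection_name("nsw-health-news")` returns an empty list, while `check_collection_name("www.example.com")` reports the character rule.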
As a general rule use as few collections as possible as this has a number of benefits:
Maximises ranking opportunities (as cross-site links will be included in the ranking).
Fewer updates to schedule which has a lower resource footprint on the server.
Allows more repositories to be added to a meta collection. Meta collections have a limit on the number of collections that can be included, and any push collections that are included also eat into this limit because they are structured in a similar way to meta collections.
Separate out a collection:
If the collections need to be updated on different update cycles (e.g. a news section of a site updated every 30m, the rest of the site every 24h).
If the collections contain internal and external content and these should be separated because they are used in different meta collections (e.g. a public meta collection including an internet web collection and an internal meta collection containing intranet sites and the public sites).
If the site has NTLM authentication (only a single NTLM username and password can be specified on a collection). Note: sites with basic HTTP authentication can specify a username and password per domain using site profiles (site_profiles.cfg), and sites using form-based authentication can specify the settings using form interaction.
robots.txt, robots meta tags, sitemap.xml and Funnelback noindex tags.
include/exclude URL patterns.
crawler non-html extensions.
parser mime types.
download and parser file size limit.
crawl depth limits.
Site profiles can be used to set various limits and also specify HTTP basic authentication to use when crawling specific domains.
Perform a post-crawl analysis that investigates various crawl logs while a web collection is under development, and periodically after the collection is live.
Look at the url_errors.log to see if there are common patterns of URLs that should be excluded from the crawl.
Look at the url_errors.log to see if there are files larger than the crawler.max_download_size that should be stored.
Look at the tail of the stored.log to see if the crawler is getting into any crawler traps that can be avoided by defining exclude patterns, or if there are other things being stored that shouldn’t be.
Look at the crawl.log.* files to see if any crawler trap prevention has been triggered.
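Spotting common URL patterns in a large log is easier with a quick tally. The sketch below groups logged URLs by host and first path segment so that repeated offenders stand out. It assumes only that each relevant log line contains one http(s) URL somewhere in it; the exact layout of url_errors.log is not described here, so treat the parsing as illustrative.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: a log line of interest contains a URL; the rest of the line is ignored.
URL_RE = re.compile(r"https?://\S+")


def url_prefix(url: str) -> str:
    """Reduce a URL to 'host/first-path-segment' for grouping."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    first = segments[0] if segments else ""
    return f"{parts.netloc}/{first}"


def common_prefixes(lines, top=10):
    """Count URL prefixes across log lines and return the most frequent ones."""
    counts = Counter()
    for line in lines:
        match = URL_RE.search(line)
        if match:
            counts[url_prefix(match.group(0))] += 1
    return counts.most_common(top)
```

Calling `common_prefixes(open("url_errors.log"))` (with the path adjusted to wherever the log lives) gives a ranked list of candidates for exclude patterns.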
Many settings that control timeouts and limits can be used to optimise the web crawler.
If the site has a large number of domains then crawl speed can be increased if the number of crawler threads is increased as this sets an upper limit on the number of sites that can be crawled concurrently (the default is 20). Note: be sure to monitor how this affects the server performance as more threads will use more CPU. It may be necessary to increase the number of CPUs available to your Funnelback server.
If a site that is being crawled is sufficiently resourced then the site profiles can be used to specify multiple concurrent requests against the site (but be sure to get permission from the site owner first).
The database owner should create a view that contains all the data required for the Funnelback index in a de-normalised form. This allows Funnelback to run a simple SQL query (select * from VIEWNAME). Using a view also makes things clearer for the content owner, as they are in control of the view and can alter the fields as required.
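To illustrate the idea, here is a small self-contained sketch using an in-memory SQLite database. The tables `staff` and `dept` and the view `search_view` are hypothetical names chosen for the example; a real deployment would use whatever database and schema the content owner maintains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical normalised source tables.
    CREATE TABLE dept (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO dept VALUES (1, 'Finance');
    INSERT INTO staff VALUES (10, 'Ada', 1);

    -- De-normalised view maintained by the database owner:
    -- one row per record to index, with the joins already resolved.
    CREATE VIEW search_view AS
        SELECT staff.id AS id, staff.name AS name, dept.title AS department
        FROM staff JOIN dept ON dept.id = staff.dept_id;
""")

# The gathering side then only needs the simple query.
rows = conn.execute("select * from search_view").fetchall()
print(rows)  # [(10, 'Ada', 'Finance')]
```

If the content owner later adds or renames a field, only the view definition changes; the query stays the same.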
When writing a custom gatherer ensure that it only implements the logic required to gather the data from the data source.
Use the filter framework to perform all filtering tasks on custom data - this ensures that all filtering is performed in the same way with the same interchangeable filters, regardless of the collection type. The filter framework has been available for custom collections since Funnelback 15.12.
Templates can be shared between profiles by referencing the master template using a Freemarker <#include> directive. See: sharing templates across profiles
Do not use symbolic links to share configuration files between collections or profiles.
Symbolic linking should be avoided because it is unclear to the end user that the settings they are changing may impact on other collections that also use the same configuration file. Use of symbolic linking may also break newer product functionality (such as configuration that is published per-entry rather than per file as was previously done).
Note: Symbolic linking of configuration files was previously advised as a best practice with meta collections. This is no longer the case as the way Funnelback handles configuration has changed.
For collections that are not supposed to be queried directly, prevent access with:
access_restriction=127.0.0.1
access_alternate=<meta-collection-id>
(access_restriction can also list any other internal ranges needed to permit monitoring.)
This prevents query access to the collection and redirects any accesses to the specified meta collection.
Always allow access from 127.0.0.1 (or localhost) as this is required for some internal features to work such as the accessibility and content auditor tools.
Ensure that analytics.scheduled_database_update=false is set on component collections that are not accessed directly. This prevents Funnelback from building analytics reports for a collection that has no data.
Meta-collections commonly require system generated queries (e.g. extra searches, Ajax searches, generation of configuration such as structured auto-completion CSV). Those system generated queries often require a set of specific query processor options.
Use a dedicated profile for such queries. A dedicated profile has the following benefits:
A padre_opts.cfg file can be set up to apply specific query processor options, rather than having to inject them dynamically with a hook script (simpler, less maintenance).
The query processor options can include the -log=false option to prevent queries to the profile from being logged.
If the -log=false option cannot be used, the system-generated queries will still not be logged against the profile that contains the real usage analytics for the search.
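As a rough illustration of routing system-generated queries to their own profile, the sketch below builds a search URL that names the profile explicitly. The host, the `/s/search.json` path, the collection ID `example-meta` and the profile name `system-queries` are all assumptions made for the example, not values taken from this article.

```python
from urllib.parse import urlencode


def system_query_url(base_url, collection, query, profile="system-queries"):
    """Build a search URL that targets a dedicated profile.

    `profile` defaults to a hypothetical profile name reserved for
    system-generated queries, keeping them apart from real user traffic.
    """
    params = {"collection": collection, "profile": profile, "query": query}
    return f"{base_url}?{urlencode(params)}"


url = system_query_url(
    "https://search.example.com/s/search.json",  # hypothetical endpoint
    "example-meta",                               # hypothetical meta collection ID
    "annual report",
)
```

Keeping the profile name in one helper like this means extra searches, Ajax calls and CSV generators all target the same profile without each having to inject its own query processor options.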
While meta collections have a workflow, they are not updated like other collections. They get updated when their configuration is saved in the administration interface, or when a component collection updates (as part of the meta-dependencies workflow). Because of this there are some caveats to their workflow.
Consider disabling the meta dependencies phase on all but one component collection in a meta collection. Meta dependencies is responsible for generating spelling and query completion for the meta collection. This can take a long time for meta collections with large indexes and can also lock indexes on component collections while the meta dependencies phase is running (which can cause problems with updates on other components).
It is probably acceptable for the meta collection’s spelling/query completion to be updated once a day rather than on every component collection update, so linking it to the update of the most important component can improve the efficiency of the overall update. The collection.cfg setting to disable the meta dependencies phase is:
The system that submits content to a push collection is responsible for adding, updating and deleting content:
The push collection will not automatically expire any content - if an item needs to be removed from the index then the corresponding API call must be made to remove the item.
The system that submits content must use appropriate URIs for the documents and use this URI for any future update/delete operations.
The system that submits content to a push collection must handle a number of error cases that may be returned via the push API:
Push API is unavailable
Push API returns an error
In the event of an error, the submitting system must decide what to do. This should be a combination of retrying the failed API call and a queuing mechanism so that none of the operations are lost.
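One possible shape for the retry-plus-queue approach is sketched below. It is deliberately independent of the push API itself: `send` is a hypothetical callable supplied by the submitting system that performs one API call and raises an exception when the API is unavailable or returns an error. The attempt count and delay are arbitrary illustrative values.

```python
import time
from collections import deque


class PushQueue:
    """Keeps operations in order and retries failures so none are lost."""

    def __init__(self, send, max_attempts=3, delay=1.0, sleep=time.sleep):
        self.send = send                  # hypothetical: performs one push API call
        self.max_attempts = max_attempts  # attempts per operation per flush
        self.delay = delay                # base delay in seconds between attempts
        self.sleep = sleep                # injectable so tests need not wait
        self.pending = deque()

    def submit(self, operation):
        """Queue an add, update or delete operation for delivery."""
        self.pending.append(operation)

    def flush(self):
        """Send queued operations in order; stop at the first that keeps failing."""
        while self.pending:
            operation = self.pending[0]
            for attempt in range(1, self.max_attempts + 1):
                try:
                    self.send(operation)
                    break
                except Exception:  # real code would catch specific error types
                    if attempt < self.max_attempts:
                        self.sleep(self.delay * attempt)  # linear back-off
            else:
                return False  # still failing: leave it queued for a later flush
            self.pending.popleft()
        return True
```

Processing strictly in order matters here because an update followed by a delete for the same URI must not be applied the other way round. In a production system the queue would normally be persisted so that operations survive a restart.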
Instant updates provide a way of updating a collection between scheduled updates.
Instant updates have limited support for the following collection types: web, matrix, filecopy, trim and database.
Instant updates have special update phases which mimic some of the update phases of a standard update. Instant updates require extra configuration for collections that have standard workflow defined to ensure that any required workflow is also run when the instant update runs. E.g. a post-index command that modifies the index requires an equivalent post instant index command that modifies the instant index.