Best practices - 2.2 collections
Background
This article details implementation best practices related to collections.
Collection naming conventions
- Use lowercase characters and dashes only.
- Keep names succinct and meaningful, concentrating on the purpose (e.g. public-website, intranet).
- Avoid including the collection type as part of the name.
- If you have multiple collections for the same client, prefix them with the client name (e.g. client-public-website, client-intranet).
- For government organisations that tend to change name regularly, prefix using their purpose (e.g. health instead of dept-health-and-wellbeing or dhw).
- Use well-known abbreviations such as nsw for New South Wales.
- Don't use shortened names created by dropping letters, as these are hard to understand.
- Avoid using complete domain names (e.g. www-client-com-au).
Web collections
One web collection or many?
As a general rule, use as few collections as possible. This has a number of benefits:
- Maximises ranking opportunities (cross-site links will be included in the ranking).
- Fewer updates to schedule, which means a lower resource footprint on the server.
- Allows more repositories to be added to a meta collection (meta collections have a limit on the number of collections that can be included, and any push collections that are included also count against this limit as they are structured in a similar way to meta collections).
Separate content into its own collection:
- If the collections need to be updated on different update cycles (e.g. a news section of a site updated every 30 minutes, the rest of the site every 24 hours).
- If the collections contain internal and external content that should be separated because they are used in different meta collections (e.g. a public meta collection including an internet web collection, and an internal meta collection containing intranet sites as well as the public sites).
- If the site uses NTLM authentication (only a single NTLM username and password can be specified on a collection). Note: sites with HTTP basic authentication can specify a username and password per domain using site profiles (site_profiles.cfg), and sites using form-based authentication can specify the settings using form interaction.
Control what is included and excluded
Crawled content can be included or excluded using the following mechanisms (an example configuration is sketched after this list):
- robots.txt, robots meta tags, sitemap.xml and Funnelback noindex tags.
- Include/exclude URL patterns.
- Crawler non-HTML extensions.
- Parser MIME types.
- Download and parser file size limits.
- Crawl depth limits.
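As an illustration, several of these controls are set in collection.cfg. The patterns and values below are placeholders only, and the option names and their exact behaviour should be checked against the documentation for your Funnelback version:

include_patterns=www.client.com.au
exclude_patterns=/calendar/,/login,sessionid=
crawler.max_download_size=10
crawler.max_link_distance=15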
Use site profiles to limit multi-site crawls
Site profiles can be used to set various limits and also specify HTTP basic authentication to use when crawling specific domains.
Web collection optimisation
Perform a post-crawl analysis
Investigate the various crawl logs while a web collection is under development, and periodically after the collection is live (a few example commands are shown after this list):
- Look at url_errors.log to see if there are common patterns of URLs that should be excluded from the crawl.
- Look at url_errors.log to see if there are files larger than crawler.max_download_size that should be stored.
- Look at the tail of stored.log to see if the crawler is getting into any crawler traps that can be avoided by defining exclude patterns, or if there are other things being stored that shouldn't be.
- Look at the crawl.log.* files to see if any crawler trap prevention has been triggered.
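For example, the logs can be summarised quickly from the command line. The path below assumes the default data layout under $SEARCH_HOME (use the live view for the most recently completed update, or offline while an update is in progress), and the "trap" search term is only a starting point as the exact log wording varies between versions:

cd $SEARCH_HOME/data/<collection>/live/log
sort url_errors.log | less      # scan for repeated URL patterns worth excluding
tail -n 100 stored.log          # check what the crawler stored most recently
grep -i "trap" crawl.log.*      # look for crawler trap prevention messages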
Optimise settings
There are many settings, covering various timeouts and limits, that can be used to optimise the web crawler.
If the crawl covers a large number of domains, crawl speed can often be increased by raising the number of crawler threads, as this sets an upper limit on the number of sites that can be crawled concurrently (the default is 20). Note: be sure to monitor how this affects server performance, as more threads will use more CPU. It may be necessary to increase the number of CPUs available to your Funnelback server.
If a site being crawled is sufficiently resourced, site profiles can be used to allow multiple concurrent requests against that site (be sure to get permission from the site owner first).
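As a sketch only, the crawler thread count is set in collection.cfg; the value below is illustrative and should be tuned while monitoring server load:

crawler.num_crawlers=40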
Database collections
Request a Funnelback-specific database view
The database owner should create a view that contains, in de-normalised form, all the data required for the Funnelback index. This allows Funnelback to run a simple SQL query (select * from VIEWNAME). Using a view also makes responsibilities clearer to the content owner, as they are in control of the view and can alter the fields as required.
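For example, the content owner might expose a de-normalised view joining the tables Funnelback needs. The table and column names below are purely illustrative:

CREATE VIEW funnelback_articles AS
SELECT a.id,
       a.title,
       a.body,
       a.published_date,
       c.name AS category,
       u.display_name AS author
FROM articles a
JOIN categories c ON c.id = a.category_id
JOIN users u ON u.id = a.author_id
WHERE a.is_published = 1;

Funnelback's query then remains select * from funnelback_articles, and future schema changes only require the owner to adjust the view.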
Custom collections
Custom gatherer should only implement gather logic
When writing a custom gatherer ensure that it only implements the logic required to gather the data from the data source.
Use the filter framework to perform all filtering tasks on custom data - this ensures that all filtering is performed in the same way with the same interchangeable filters, regardless of the collection type. The filter framework has been available for custom collections since Funnelback 15.12.
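For reference, the filters applied to gathered content are configured as a chain in collection.cfg. The chain below is only an illustration; the separator semantics, default chain and the set of available filter classes depend on the Funnelback version, so check the filter framework documentation before copying it:

filter.classes=TikaFilterProvider:ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider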
Meta collections
Shared configuration files
Sharing templates between profiles
Templates can be shared between profiles by referencing the master template using a FreeMarker <#include> directive. See: sharing templates across profiles.
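For example, a profile's simple.ftl can consist of a single include of the master template. The path below is illustrative; how it resolves depends on where the master template is kept and how the template loader is configured:

<#include "/conf/<collection>/_default/simple.ftl">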
Symbolic linking of configuration files
Do not use symbolic links to share configuration files between collections or profiles.
Symbolic linking should be avoided because it is unclear to the end user that the settings they are changing may impact other collections that use the same configuration file. Use of symbolic linking may also break newer product functionality (such as configuration that is published per entry rather than per file, as was previously done).
Symbolic linking of configuration files was previously advised as a best practice with meta collections. This is no longer the case, as the way Funnelback handles configuration has changed.
Prevent access to component collections
For collections that are not supposed to be queried directly, prevent access with:
access_restriction=127.0.0.1
access_alternate=<meta-collection-id>
(Extend access_restriction with any other internal ranges needed for monitoring.)
This prevents query access to the collection and redirects any accesses to the specified meta collection.
Always allow access from 127.0.0.1 (or localhost), as this is required for some internal features to work, such as the accessibility auditor and content auditor tools.
Disable analytics updates on meta components
Ensure analytics.scheduled_database_update=false is set on component collections that don't get accessed directly. This prevents Funnelback from building analytics reports for a collection that has no data.
Use a dedicated profile for system generated queries
Meta-collections commonly require system generated queries (e.g. extra searches, Ajax searches, generation of configuration such as structured auto-completion CSV). Those system generated queries often require a set of specific query processor options.
Use a dedicated profile for such queries. A dedicated profile has the following benefits:
- A padre_opts.cfg file can be set up to apply specific query processor options, rather than having to dynamically inject them with a hook script (simpler, less maintenance). An example is shown after this list.
- The query processor options can include the -log=false option to prevent queries to the profile from being logged.
- If the -log=false option cannot be used, the system generated queries will at least not be logged against the profile that contains the real usage analytics for the search.
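As an example, a padre_opts.cfg for such a profile might contain the following line. Only -log=false comes from the recommendation above; the other options are illustrative of the kind of query processor options an extra search or auto-completion generation job might need:

-log=false -num_ranks=1000 -SF=[title]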
Update workflow for meta collections
While meta collections have a workflow, they are not updated like other collections. They get updated when their configuration is saved in the administration interface, or when a component collection updates (as part of the meta-dependencies workflow). Because of this there are some caveats to their workflow.
Meta dependencies
Consider disabling the meta dependencies phase on all but one component collection in a meta collection. Meta dependencies is responsible for generating spelling and query completion for the meta collection. This can take a long time for meta collections with large indexes and can also lock indexes on component collections while the meta dependencies phase is running (which can cause problems with updates on other components).
It is probably acceptable for the meta collection's spelling/query completion to be updated once a day rather than on every component collection update, so linking it to the update of the most important component can improve the efficiency of the overall update. The collection.cfg setting to disable the meta dependencies phase is:
meta_dependencies=false
Push collections
Handle updates and deletions
The system that submits content to a push collection is responsible for adding, updating and deleting content:
- The push collection will not automatically expire any content - if an item needs to be removed from the index then the corresponding API call must be made to remove the item.
- The system that submits content must use appropriate URIs for the documents and use the same URI for any future update/delete operations. A sketch of these API calls is shown after this list.
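As a hedged sketch only - the exact Push API endpoint paths, parameters and authentication vary between Funnelback versions, so confirm them against the Push API documentation - the add/update and delete operations from the submitting system might look like this:

import requests

# Hypothetical host, collection name and credentials.
BASE = "https://funnelback.example.com/push-api/v2/collections/client-push"
AUTH = ("push-user", "push-password")

def add_or_update(uri, content, content_type="text/html"):
    # Add or update the document keyed by its URI; the same URI must be
    # reused for any later update or delete of this document.
    response = requests.put(
        BASE + "/documents",
        params={"key": uri},
        data=content,
        headers={"Content-Type": content_type},
        auth=AUTH,
    )
    response.raise_for_status()

def delete(uri):
    # Remove the document that was previously added with this URI.
    response = requests.delete(BASE + "/documents", params={"key": uri}, auth=AUTH)
    response.raise_for_status()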
Ensure errors are appropriately handled
The system that submits content to a push collection must handle a number of error cases that may be returned via the push API:
- The push API is unavailable.
- The push API returns an error.
In the event of an error the submitting system must decide what to do. This should be a combination of retrying the failed API call and a queuing mechanism, so that none of the operations are lost.
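A minimal sketch of that retry-and-queue idea, reusing the hypothetical add_or_update helper from the previous example; a real implementation would persist the queue so that operations survive a restart:

import time
from collections import deque

pending = deque()  # operations that could not be submitted and must be retried later

def submit_with_retry(uri, content, attempts=5, delay_seconds=5):
    for attempt in range(attempts):
        try:
            add_or_update(uri, content)  # hypothetical helper wrapping the Push API call
            return True
        except Exception:
            # Push API unavailable or returned an error: back off and retry.
            time.sleep(delay_seconds * (attempt + 1))
    pending.append((uri, content))  # keep the operation so it is not lost
    return False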
Instant updates
Instant updates provide a way of updating a collection between scheduled updates.
Instant updates are supported, with some limitations, for the following collection types: web, matrix, filecopy, trim and database.
Ensure instant update workflow is set correctly
Instant updates have special update phases that mimic some of the phases of a standard update. Collections that have standard workflow defined require extra configuration to ensure that any required workflow is also run when an instant update runs. E.g. a post-index command that modifies the index requires an equivalent post instant-index command that modifies the instant index.
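As a hedged illustration only - the exact key name for the instant-update counterpart is not reproduced here, so check the instant update workflow documentation for your version - a collection whose standard workflow includes a post-index command such as:

post_index_command=$SEARCH_HOME/conf/$COLLECTION_NAME/postindex.sh $CURRENT_VIEW

(postindex.sh being a hypothetical script) would need the same script configured against the corresponding post instant-index workflow hook, so that the instant index is modified in the same way.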