TRIMPush collection

Introduction

A TRIMPush collection is an evolution of the TRIM collection:

  • It uses an improved crawler with better reliability and performance.
  • The TRIM records and attachments are stored in a Push collection and can be searched as soon as they have been crawled.

For more background information about TRIM and Push collections, please consult the corresponding sections of the documentation.

This document assumes prior knowledge of TRIM collections and of how to set up the required environment (installing the TRIM Client and SDK, etc.).

Note: Crawling large TRIM repositories usually requires increasing the memory settings from their defaults, both for the filtering service and for Push collections. This can be done in SEARCH_HOME/services/daemon.service and SEARCH_HOME/services/continuous.service by locating the line containing the -Xmx...m setting. A good starting point is 512 MB for the filtering service and 1.5 GB for Push collections.
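To illustrate the kind of change involved, the following Python sketch rewrites the -Xmx value found in a piece of text. The helper name set_max_heap and the sample line are hypothetical: the actual contents of the service files are not shown here and may look different, so treat this as a sketch rather than a supported tool.

```python
import re

def set_max_heap(text: str, megabytes: int) -> str:
    """Replace every -Xmx<number>m setting in text with the given size in MB."""
    return re.sub(r"-Xmx\d+m", f"-Xmx{megabytes}m", text)

# Hypothetical line; the real service file contents may differ.
sample = "example.setting=-Xms32m -Xmx256m"

# 1.5 GB expressed in megabytes is 1536.
print(set_max_heap(sample, 1536))
```

Back up the service files before editing them, whether by hand or with a script.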

Creating a TRIMPush collection

TRIMPush collections store their data in a Push collection, so a Push collection must first be created.

TrimPush-CreatePush.png

The indexer_options for the collection must be updated to include the -forcexml option.

You might want to customise the metadata mappings for this collection to better suit TRIM records. To do so, edit xml.cfg and use, for example, the following mapping:

PADRE XML Mapping Version: 2
docurl,/trim_record/funnelback_url
docfiletype,/content_metadata/Funnelback.Format
t,1,,//recTitle
a,1,,//recCreator
d,1,,//dateLastUpdated
e,1,,//recRecordType
j,1,,//recNumber
S,4,,//funnelback_lockstring
+,,,/trim_record/content
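As a rough illustration of what these mappings express, the Python sketch below reads the first field (metadata class) and last field (XPath) of a few mapping lines and applies them to a sample record. It is not how PADRE processes xml.cfg: the sample XML is hypothetical, the middle fields are ignored, and only the "//name" form of path is handled.

```python
import xml.etree.ElementTree as ET

MAPPING = """t,1,,//recTitle
a,1,,//recCreator
j,1,,//recNumber"""

# Hypothetical record; real TRIM record XML will contain many more elements.
SAMPLE = ("<trim_record><recTitle>Annual report</recTitle>"
          "<recNumber>D12/345</recNumber></trim_record>")

def parse_mapping(text):
    """Return (metadata_class, xpath) pairs from the first and last fields."""
    pairs = []
    for line in text.splitlines():
        fields = line.split(",")
        pairs.append((fields[0], fields[-1]))
    return pairs

def extract(xml_text, pairs):
    """Return {metadata_class: text} for every mapping that matches."""
    root = ET.fromstring(xml_text)
    result = {}
    for metadata_class, xpath in pairs:
        # ElementTree needs a relative path, so "//name" becomes ".//name".
        node = root.find("." + xpath)
        if node is not None:
            result[metadata_class] = node.text
    return result

metadata = extract(SAMPLE, parse_mapping(MAPPING))
print(metadata)
```

In this sample the record has no recCreator element, so no value is produced for the "a" class.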

The TRIMPush collection itself can then be created, using the new Push collection as a target:

TrimPush-Create.png

TrimPush-Settings.png

The screenshot above shows the TRIMPush "create collection" page in the Funnelback Administration interface. The key fields to note are:

Database ID
The two-character alphanumeric ID of the TRIM database.
TRIM Workgroup Server
The name of your TRIM server.
TRIM Workgroup Server Port
The TCP port your TRIM server operates on.
Domain name, Username, Password
Credentials of the TRIM crawling user.
Gather documents beginning from
Documents registered or modified since this date will be gathered.
Stop gathering at
Stop the gathering when a specific date is reached. If blank, the crawler will gather everything up to the most recent record.
Select records on
Whether to select records based on their registration date (the first time they were checked in to TRIM), their creation date (the file creation date of the record's binary attachment) or their last updated date (the date of the latest modification in TRIM). Usually the registration date is selected for the initial crawl to retrieve all the content, then the last updated date is used for daily updates.
Document types
The file types you want extracted from the database. Note that records without electronic documents can still be gathered as well.
Request delay
The time, in milliseconds, between requests for records.
Num. of threads
Number of simultaneous connections to the TRIM server. See trim.threads (collection.cfg) for more details and gather.slowdown.threads (collection.cfg) to set up a throttle schedule.
Time span, Time span unit
How the date range to gather should be split (see below for an explanation).
Target push collection
The Push collection to use to store records.


Overview of the crawl process

The TRIMPush gatherer usually crawls backward: it starts with records created today (or on the date specified in the Stop gathering at setting) and works its way back to the Gather documents beginning from date. This makes the most recently created documents available for searching first.

Date range

The date range to gather is usually large on the very first crawl (for example, from 2000 to today). To speed up the crawl and improve reliability, the crawler splits this date range into smaller chunks, whose size is controlled by the Time span and Time span unit settings. Using a smaller time span makes each record selection faster but causes selections to happen more frequently. The best setting depends on the TRIM server capacity and the number of records to gather.
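The splitting described above can be sketched in Python as follows. This is only an illustration of the idea, not the crawler's actual code: the function name split_backward is hypothetical, the time span is simplified to a number of days, and the chunk boundaries are assumed to be inclusive calendar days.

```python
from datetime import date, timedelta

def split_backward(start: date, end: date, span_days: int):
    """Yield (chunk_start, chunk_end) pairs covering [start, end], newest first."""
    if span_days < 1:
        raise ValueError("span_days must be at least 1")
    chunk_end = end
    while chunk_end >= start:
        # Each chunk spans at most span_days days and never goes before start.
        chunk_start = max(start, chunk_end - timedelta(days=span_days - 1))
        yield chunk_start, chunk_end
        chunk_end = chunk_start - timedelta(days=1)

# Ten days split into chunks of four days, most recent chunk first.
chunks = list(split_backward(date(2000, 1, 1), date(2000, 1, 10), 4))
for chunk_start, chunk_end in chunks:
    print(chunk_start, "to", chunk_end)
```

With these values the range is covered by three chunks, the last of which is shorter because it is clipped at the start date. A smaller span produces more chunks, each selecting fewer records.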

Checkpoints

The crawler creates a checkpoint after each record is processed. If the crawl fails and is restarted, the crawler will resume from the latest checkpoint.

Checkpoints can be deleted by using the Advanced update link of the Update tab, from the Admin UI home page.
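The checkpoint mechanism can be pictured with the Python sketch below. It is a simplified model under stated assumptions, not the crawler's implementation: the checkpoint file format, the crawl function and the record identifiers are all hypothetical.

```python
import json
import os
import tempfile

def crawl(record_ids, checkpoint_path, process, fail_on=None):
    """Process record_ids in order, saving a checkpoint after each record."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(record_ids)):
        if record_ids[i] == fail_on:
            raise RuntimeError(f"simulated failure on {record_ids[i]}")
        process(record_ids[i])
        # Record where to resume if the crawl stops after this point.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + 1}, f)

records = ["REC-1", "REC-2", "REC-3", "REC-4"]
done = []
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    crawl(records, path, done.append, fail_on="REC-3")
except RuntimeError:
    pass  # the first run stops when it reaches REC-3

crawl(records, path, done.append)  # the restart resumes from the checkpoint
print(done)
```

In this model, deleting the checkpoint file makes the next run start again from the first record, which mirrors the effect of deleting checkpoints from the Admin UI.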

Updating TRIM Collections

The TRIMPush crawler attempts to impersonate the gathering user so that updates can be triggered from the Admin UI. In some cases this impersonation can fail (depending on the service user that runs Funnelback and on the Windows Domain configuration). If so, the update will have to be started from the command line, either as an Administrator or as the gathering user using the runas /user:DOMAIN\User ... command.

Record types and classification security

A separate utility will collect the available record types and classifications from the TRIM repository in order to enforce security based on these items.

This utility must be run regularly so that the available record types and classifications are kept in sync. The usual way to do this is to set up a post_update command on the TRIMPush collection so that they are exported every time the TRIMPush collection is updated, usually on a daily basis. When you create a new TRIMPush collection, the post_update command is automatically configured with:

post_update_command=$SEARCH_HOME\wbin\trim\Funnelback.TRIM.ExportSecurityInfo.exe $COLLECTION_NAME

This will place the record type and classification security files inside the live/databases/ folder of the target Push collection, as it is the Push collection that will be queried.

Extra collection options

There are some additional settings that affect the TRIMPush crawler adapter. These can be accessed by selecting the TRIMPush collection in the Administration Interface, going to the "Administer" tab and clicking on the "Edit Collection Settings" link. You can then click on the "TRIM extras" tab, which will display the following:

TrimPush-Extras.png

The Default live links and TRIM License number settings are identical to those of a TRIM collection. The other settings are specific to TRIMPush collections:

Statistics dump interval
The interval (in seconds) at which statistics will be written to the monitor.log file. Reducing this interval will produce more fine-grained monitoring information and graphs, but might add overhead.
Progress report interval
The interval (in seconds) at which the crawler will update the current progress status that is displayed on the Admin home page.
Web server work path
TRIM needs a temporary folder in which to extract attachments during the crawl. The location of this folder can be set here. If not set, the default will be used (usually C:\HPTRIM\WebServerWorkPath). This folder can grow quite large during a crawl.
Properties black list
List of record properties that should not be extracted (see below).
User fields black list
List of user fields that should not be extracted (see below).


Properties and user fields

By default the TRIM crawler will try to extract as many properties and user fields as possible for all records. In some circumstances it can be useful to skip some properties and user fields:

  • Some properties are expensive to compute, such as the ones involving date calculations. Skipping those properties will significantly improve the crawler speed.
  • Some properties should not be displayed in search results, or are not accessible to the crawl user.
  • Some properties are irrelevant to search on.

The Properties black list and User fields black list settings control which properties and user fields will be ignored during the crawl. Each list must contain one name per line, using the internal property name. To find out which property names are available, use the Collection tools link from the Administer tab:

TrimPush-PropertiesList.png
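The effect of a black list can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, recDateDue and recNotes are made-up property names used as examples, and ignoring blank lines is an assumption rather than documented behaviour.

```python
def parse_blacklist(text: str) -> set:
    """Return the names listed one per line, ignoring blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}

def filter_properties(properties: dict, blacklist: set) -> dict:
    """Return a copy of properties without the black-listed names."""
    return {name: value for name, value in properties.items()
            if name not in blacklist}

# Hypothetical black list content, one internal property name per line.
blacklist = parse_blacklist("recDateDue\n\nrecNotes\n")

# Hypothetical record properties.
record = {
    "recTitle": "Annual report",
    "recDateDue": "2012-01-01",
    "recNotes": "internal",
}

kept = filter_properties(record, blacklist)
print(kept)
```

Only the properties that are not on the black list remain, which is why black-listing expensive properties reduces the work done per record.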

Serving search results

The search results will be served from the Push collection, as it's the one that holds all the gathered data. Some TRIM settings might need to be replicated on the Push collection to have results served properly, such as trim.workgroup_server and trim.database. See the Push collections section for additional details.
