However, it is also possible to administer Funnelback from the command line. This is useful if other systems need to be integrated with Funnelback.
We assume in the following instructions that the $SEARCH_HOME environment variable is defined. This should point to your installation directory. By default, this is /opt/funnelback/ on Linux and C:\funnelback\ on Windows.
- bin contains all administration scripts.
- conf contains various global configuration files, as well as collection specific configuration files, under $SEARCH_HOME/conf/<collection name>.
- log contains global log files, such as create.log, which records the creation of collections, and delete.log, which records their deletion.
- web contains files relating to the admin console and public search interface (such as the cgi files). Web server configuration files are stored in $SEARCH_HOME/web/conf/.
- data contains collection specific data, such as gathered documents, indexes and log files.
The data area has the following structure:
- Each collection will have a subdirectory under data, containing live, offline, log and archive directories.
- The archive directory contains compressed query and click log files for the collection.
- The log directory contains logs that don't fit into any other category, such as update logs and reporting logs.
- The live and offline directories contain gathered and filtered documents (in "data"), indexes (in "idx") and collection specific logs (in "log").
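The layout described above can be sketched as a small shell function that prints the directories expected for one collection. This is only an illustration: the collection name "example" is hypothetical, and SEARCH_HOME falls back to the Linux default if it is not already set.

```shell
#!/bin/sh
# Sketch: print the per-collection directories described above.
# ASSUMPTION: "example" is a hypothetical collection name.
SEARCH_HOME="${SEARCH_HOME:-/opt/funnelback}"

collection_dirs() {
    collection="$1"
    base="$SEARCH_HOME/data/$collection"
    # Collection-level directories
    echo "$base/archive"
    echo "$base/log"
    # Each view holds documents (data), indexes (idx) and view-specific logs (log)
    for view in live offline; do
        for sub in data idx log; do
            echo "$base/$view/$sub"
        done
    done
}

collection_dirs example
```

Piping the output through a loop that tests each path with `[ -d ... ]` is one way to check that a collection's data area is complete.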
Creating a collection
All collection configuration files are created from a collection template at $SEARCH_HOME/conf/collection.cfg.default.
All configuration information for a collection is stored in a directory at $SEARCH_HOME/conf/<collection name>/. This includes the main collection.cfg file.
To create a collection from the command line, administrators can create the collection configuration directory, copy the collection template to collection.cfg in this directory, edit the copied file and then run create-collection.pl on it.
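The manual steps can be sketched in shell as follows. This is a minimal sketch, not a definitive procedure: the helper name prepare_collection_config and the collection name "example" are hypothetical, and it assumes create-collection.pl lives in $SEARCH_HOME/bin with the other administration scripts.

```shell
#!/bin/sh
# Sketch of the manual creation steps.
# ASSUMPTION: prepare_collection_config is a hypothetical helper, not part of Funnelback.
SEARCH_HOME="${SEARCH_HOME:-/opt/funnelback}"

prepare_collection_config() {
    collection="$1"
    conf_dir="$SEARCH_HOME/conf/$collection"
    template="$SEARCH_HOME/conf/collection.cfg.default"

    # Refuse to overwrite an existing collection configuration
    if [ -e "$conf_dir/collection.cfg" ]; then
        echo "collection.cfg already exists for $collection" >&2
        return 1
    fi

    # Create the configuration directory and copy the template into it
    mkdir -p "$conf_dir" || return 1
    cp "$template" "$conf_dir/collection.cfg" || return 1

    # Print the path of the new file so the caller can edit it
    echo "$conf_dir/collection.cfg"
}

# Usage on a Funnelback server (edit the file before creating the collection):
#   cfg=$(prepare_collection_config example) &&
#   "${EDITOR:-vi}" "$cfg" &&
#   "$SEARCH_HOME/bin/create-collection.pl" "$cfg"
```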
A separate convenience script, new-collection.pl, is available and will create the configuration directory and collection configuration file automatically. An optional start URL or location can be passed to this script, as well as a type, allowing the creation of web, local, filecopy, database collections, etc.
The created collection configuration should still be manually checked and edited to change default configuration options. The following options are especially important to check:
Creating a meta collection
A meta collection is one which has no data or indexes of its own but instead points to a set of underlying collections. To create a meta collection, administrators can use the new-collection.pl script, specifying a "meta" collection type.
The administrator must then create a meta.cfg file in the appropriate location: $SEARCH_HOME/conf/<collection name>/meta.cfg. This file is used to list the sub-collections which make up the meta collection.
The format is to list the internal names of the sub-collections, one per line. For example, the file might look like:
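As a minimal sketch, assuming a hypothetical meta collection named example-meta with two hypothetical sub-collections named intranet and website, the file could be generated as shown here. The resulting meta.cfg contains just those two names, one per line. The helper name write_meta_cfg is illustrative only.

```shell
#!/bin/sh
# Sketch: write a meta.cfg listing sub-collections, one internal name per line.
# ASSUMPTION: all collection names used here are hypothetical.
SEARCH_HOME="${SEARCH_HOME:-/opt/funnelback}"

write_meta_cfg() {
    meta="$1"
    shift
    conf_dir="$SEARCH_HOME/conf/$meta"

    # The configuration directory should already exist for the meta collection
    if [ ! -d "$conf_dir" ]; then
        echo "missing configuration directory: $conf_dir" >&2
        return 1
    fi

    # printf repeats the format once for each remaining argument
    printf '%s\n' "$@" > "$conf_dir/meta.cfg"
}

# Usage: write_meta_cfg example-meta intranet website
```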
You also need to create an index.sdinfo file which lists the full path to the index stems for the subsidiary collections. This file should be placed in $SEARCH_HOME/data/<collection name>/live/idx/ and $SEARCH_HOME/data/<collection name>/offline/idx/, and will look something like:
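The following sketch writes the same index.sdinfo into both views of a meta collection. It rests on two assumptions that should be checked against a real installation: that each sub-collection's index stem is the path $SEARCH_HOME/data/<sub-collection>/live/idx/index, and that the meta collection should point at the live view of each sub-collection. The helper name and collection names are hypothetical.

```shell
#!/bin/sh
# Sketch: write index.sdinfo for a meta collection in both live and offline views.
# ASSUMPTION: write_sdinfo and the collection names are hypothetical.
SEARCH_HOME="${SEARCH_HOME:-/opt/funnelback}"

write_sdinfo() {
    meta="$1"
    shift
    for view in live offline; do
        idx_dir="$SEARCH_HOME/data/$meta/$view/idx"
        if [ ! -d "$idx_dir" ]; then
            echo "missing index directory: $idx_dir" >&2
            return 1
        fi
        # Start from an empty file, then add one index stem per sub-collection
        : > "$idx_dir/index.sdinfo"
        for sub in "$@"; do
            # ASSUMPTION: the index stem is "<live idx directory>/index"
            echo "$SEARCH_HOME/data/$sub/live/idx/index" >> "$idx_dir/index.sdinfo"
        done
    done
}

# Usage: write_sdinfo example-meta intranet website
```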
Once this is done, the meta collection will be as up to date as its component sub-collections. You therefore do not need to run the update script for a meta collection.
Updating a collection
To update a collection, use the update.pl script, redirecting its status messages to an appropriately named update log, e.g. update-<collection>.log:
update.pl $SEARCH_HOME/conf/example/collection.cfg > $SEARCH_HOME/log/update-example.log 2>&1
Note that an update may take a significant amount of time, depending upon the update timeout, number of documents found and other factors.
During the update, messages will be logged to the appropriate logs in $SEARCH_HOME/data/<collection name>/offline/log/ and $SEARCH_HOME/data/<collection name>/log/.
To prevent multiple simultaneous updates of the same collection, update.pl will create a lock file at the start of an update. This lock file will be placed at $SEARCH_HOME/data/<collection name>/log/<collection name>.lock. A collection update will not occur unless update.pl can create and gain exclusive access to this lock file. The lock file is removed at the end of a successful update or if an error occurs during the update.
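An integrating system may want to check for the lock file before starting an update of its own. The sketch below only tests whether the lock file exists at the documented location. It assumes that a present lock file means an update is probably running, and it does not replace the exclusive locking that update.pl performs itself. The function name and the collection name "example" are hypothetical.

```shell
#!/bin/sh
# Sketch: report whether a collection appears to be updating.
# ASSUMPTION: update_in_progress is a hypothetical helper; it only checks that
# the lock file exists, not that a process actually holds it.
SEARCH_HOME="${SEARCH_HOME:-/opt/funnelback}"

update_in_progress() {
    collection="$1"
    [ -e "$SEARCH_HOME/data/$collection/log/$collection.lock" ]
}

if update_in_progress example; then
    echo "lock file found: an update of example appears to be running"
else
    echo "no lock file found for example"
fi
```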
The various update scripts will also write to a state file at $SEARCH_HOME/data/<collection name>/log/<collection name>.state. This state file will contain text indicating the state of the relevant collection:
An additional collection.state file is written to the $SEARCH_HOME/conf/<collection name>/ directory for web collections. This file contains the following parameter:
which stores the number of incremental gathers that will be done before a full gather is triggered. The value is decremented each time an incremental crawl is done and will be reset to the value of schedule.incremental_crawl_ratio when it reaches zero.
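The countdown can be illustrated with a few lines of shell arithmetic. This is a sketch of the behaviour as described, not Funnelback's own code: the function name next_remaining is hypothetical, the ratio of 3 is an example value, and resetting in the same step that reaches zero is an assumption about the exact timing.

```shell
#!/bin/sh
# Sketch: simulate the incremental-gather countdown.
# ASSUMPTION: next_remaining is a hypothetical helper.
# $1 = current counter, $2 = value of schedule.incremental_crawl_ratio
next_remaining() {
    count=$(( $1 - 1 ))
    # When the counter reaches zero, reset it to the configured ratio
    if [ "$count" -le 0 ]; then
        count="$2"
    fi
    echo "$count"
}

ratio=3
remaining=$ratio
for crawl in 1 2 3 4; do
    remaining=$(next_remaining "$remaining" "$ratio")
    echo "after incremental crawl $crawl: $remaining"
done
```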
Deleting a collection
Administrators may fully delete a collection using the delete-collection.pl script. This script will delete all data and configuration associated with the deleted collection:
- gathered documents
- configuration files
- scheduled updates
User configuration files are also edited to remove references to the deleted collection.
Command line scripts reference
Detailed internal documentation may be gained for many scripts through the standard Perl "perldoc" command.
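Since only some scripts carry embedded documentation, it can help to find out which ones do before calling perldoc. The sketch below looks for POD directives (lines starting with =pod or =head1 to =head4) in the .pl files under $SEARCH_HOME/bin. The helper names are hypothetical, and a script that documents itself in another way will not be detected.

```shell
#!/bin/sh
# Sketch: list administration scripts that appear to contain POD documentation.
# ASSUMPTION: has_pod and list_documented_scripts are hypothetical helpers.
SEARCH_HOME="${SEARCH_HOME:-/opt/funnelback}"

has_pod() {
    # POD directives start with "=" in the first column
    grep -Eq '^=(pod|head[1-4])' "$1"
}

list_documented_scripts() {
    for script in "$SEARCH_HOME"/bin/*.pl; do
        # Skip the unexpanded pattern when no scripts are found
        [ -f "$script" ] || continue
        if has_pod "$script"; then
            echo "$script"
        fi
    done
}

list_documented_scripts
```

Any path printed by the function can then be passed to perldoc, for example `perldoc <path to script>`.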
new-collection.pl creates a collection, including its collection.cfg file.
new-collection.pl <collection name> <collection type> [start url]
create-collection.pl creates a collection from an already existing collection.cfg file.
create-collection.pl <collection config>
delete-collection.pl deletes a collection, including its gathered documents, indexes, configuration, scheduled updates and logs. It also removes references to the now non-existent collection from user configuration files.
delete-collection.pl <collection config>
update.pl is a wrapper around the entire update process, and calls the appropriate update subscripts.
update.pl <collection config> [update type: -incremental, -reindex, …]
crawl.pl gathers documents from web collections.
crawl.pl <collection config> [update type: -check, -incremental, -instant-update]
crawl-stop.pl gracefully stops a web crawl.
crawl-stop.pl <collection config>
filecopy.pl gathers documents from filecopy collections.
filecopy.pl <collection config> [other options]
dbgather.pl gathers documents from database collections.
dbgather.pl <collection config> [--full] [other options]
index.pl calls Padre to index a collection's documents.
index.pl <collection config> [-reindex] [-instant-update]
index-warc-files.pl builds a new index from the WARC files in the live view of a collection that uses WARC files to cache gathered content. It is intended for use when upgrading between versions of Funnelback.
index-warc-files.pl <collection name>
make_report.pl processes a collection's data files, producing reports on their contents.
make_report.pl <--collection "collection config"> [--log] …
outliers-log-processing.pl updates the Trend Alerts reports for a collection (or all collections if none is specified).
outliers-log-processing.pl [--collection "collection name"]
swap-views.pl swaps the live and offline views of a collection after a successful update, placing the newly gathered and indexed data in live for querying, and safely storing the older gathered and indexed data in offline.
swap-views.pl <collection config> [-force]
archive-log.pl archives a collection's queries.log and clicks.log files to the collection's archive directory.
archive-log.pl <collection config> [view]
reports-load-queries-log.pl reads a collection's log files and stores them in a binary database for reporting purposes. The admin UI report frontend reads this database to display reports.
reports-load-queries-log.pl <--collection "collection internal name"> [-v] [-v] [-v] [-v]
reports-send-email.pl sends email query reports to users who have requested them for the specified collection (or for all if none was specified).
reports-send-email.pl [--collection "collection name"]
check_best_bets_links.pl checks that each link in each collection's best bets file is still available.
Update the location of the perl interpreter for all .cgi and .pl scripts
Trigger local or remote administrative tasks
Change a user's password.
$SEARCH_HOME/web/bin/change_password.sh <user> <password>