Document flags
This feature is not available in the Squiz DXP. Killing of documents from the index is supported by configuring the kill_exact.cfg or kill_partial.cfg data source configuration files.
Funnelback defines a set of document flags which track a number of properties for each document in the search index. These include whether a document has been identified as a duplicate or marked to be killed from the index.
The padre-fl program provides the ability to manipulate these document flags (without reindexing) and can be an efficient way of complying with requirements to quickly remove search results from a public website. This document describes the command line interface for manipulating document flags, which may be useful from automated scripts.
The most common use of padre-fl is to kill documents from the index. This sets a flag on the document that prevents it from being visible in search results.
padre-fl
padre-fl is the program responsible for setting and unsetting document flags, and is located in the bin directory within your Funnelback installation. padre-fl supports the following usage:
Usage1: /opt/funnelback/bin/padre-fl <index_stem> [-clearall|-clearbits|-clearkill|-killall|-show|-sumry|-quicken]
Usage2: /opt/funnelback/bin/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -unkill
Usage3: /opt/funnelback/bin/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -kill
Usage4: /opt/funnelback/bin/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -bits hexbits OR|AND|XOR
Usage5: /opt/funnelback/bin/padre-fl <index_stem> -kill-docnum-list <file_of_docnums>
Usage6: /opt/funnelback/bin/padre-fl -v
In each case, the <index_stem> should be set to the path of a Funnelback index, generally of the form install_path/data/collection_name/live/idx/index.
The <file_of_url_patterns> is a text file containing one URL pattern per line. Patterns are simple strings only: they do not support wildcard characters or regular expressions, and they must match the canonicalized URLs used within the index. kill_exact.cfg and kill_partial.cfg are files in this format and are used to automatically kill documents from the index when a data source update is run.
The <file_of_docnums> is a text file containing one document number per line. The document number for a specific document depends on the order in which documents are indexed, so it will change from update to update. This usage should only be used when the list of document numbers was generated from the same index that is then passed to padre-fl.
Usage 1 provides a number of generic options that can interact with the flags set on the index.
The -show and -sumry options provide an overview of the flags currently set on the index. With -show, 11 flags are shown for each document, representing the following:
- expired documents
- killed documents
- duplicate documents
- noindex documents
- filtered binary documents
- documents without an early binding security lock
- documents with paid ads
- unfiltered binary documents
- documents matching admin specified regex
- noarchive documents
- nosnippet documents
The -clearkill and -killall options clear or set, respectively, the kill bit for all documents in the index. -clearkill may be useful if kill bits have been set accidentally, and -killall provides a quick mechanism for removing all search results. -clearall provides a way to completely clear all flags.
The -quicken option produces an index.urls_sorted index file that is used to speed up the killing of documents when padre-fl is run repeatedly to kill documents from the index. This is mostly useful when working with push collections.
Usages 2 and 3 allow a set of specific documents to have their kill bits set or unset based on a list of URL patterns provided in an input file. By default, the URL patterns in the file are left-anchored (i.e. any URL which starts with the given pattern will be affected), but if the -exactmatch option is used, URLs will only be affected if they exactly match the given pattern.
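As a rough sketch of Usage 3 from an automated script, the following writes a small pattern file and kills only the exact URLs listed in it. The install path, collection name and URLs are assumptions for illustration (override PADRE_FL and INDEX_STEM for your environment); if the binary is not found, the script only prints the command it would have run.

```shell
# Hypothetical values: adjust to your installation and collection.
PADRE_FL="${PADRE_FL:-/opt/funnelback/bin/padre-fl}"
INDEX_STEM="${INDEX_STEM:-/opt/funnelback/data/example-collection/live/idx/index}"
PATTERNS="$(mktemp)"

# One URL per line; plain strings, no wildcards or regular expressions.
cat > "$PATTERNS" <<'EOF'
https://www.example.com/old-page.html
https://www.example.com/retired/report.pdf
EOF

# Kill only URLs that exactly match a line in the pattern file.
if [ -x "$PADRE_FL" ]; then
  "$PADRE_FL" "$INDEX_STEM" "$PATTERNS" -exactmatch -kill
else
  echo "padre-fl not found at $PADRE_FL; would run:"
  echo "$PADRE_FL $INDEX_STEM $PATTERNS -exactmatch -kill"
fi
```

Dropping -exactmatch would instead kill every URL that starts with one of the listed strings, and replacing -kill with -unkill (Usage 2) reverses the operation for the same patterns.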
Usage 5 will set the kill bit on a set of documents specified using the document number rather than the URL.
Usage 4 can be used to set arbitrary index flags. See the advanced usage section below.
Usage 6 prints out the padre-fl version.
Non-web data sources
In some cases, Funnelback must create a URL for documents which do not otherwise have one (e.g. records from a database data source). The simplest way to identify the URL of a specific document in this case is to view the search.json output of search results and observe the indexUrl property for each search result.
Pattern files
Kill patterns can be applied automatically during a collection update by setting up the following configuration files:
- kill_exact.cfg: defines a list of specific URLs to kill from the index. (padre-fl is run with the -kill and -exactmatch options.)
- kill_partial.cfg: defines a list of URL stems to kill. Any URL starting with these stems will be killed. (padre-fl is run with the -kill option.)
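The difference between the two files can be illustrated without touching an index. The sketch below only simulates the exact and left-anchored matching rules described above using standard tools (awk and grep); the URLs are made-up examples and padre-fl itself is not invoked.

```shell
# Made-up URLs standing in for documents in an index.
urls='https://www.example.com/news
https://www.example.com/news/2020/item.html
https://www.example.com/newsletter'

# A single pattern, as it might appear on one line of either file.
pattern='https://www.example.com/news'

# kill_partial.cfg behaviour: any URL that starts with the pattern matches.
partial=$(printf '%s\n' "$urls" | awk -v p="$pattern" 'index($0, p) == 1')

# kill_exact.cfg behaviour: only a URL identical to the pattern matches.
exact=$(printf '%s\n' "$urls" | grep -Fx -- "$pattern")

echo "partial match:"; printf '%s\n' "$partial"
echo "exact match:";   printf '%s\n' "$exact"
```

In this simulation the partial pattern matches all three URLs, including /newsletter, because matching is on plain string prefixes. A stem ending in a trailing slash would be needed to restrict the match to the /news/ section.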
Advanced usage
Clear individual document flags on a set of URLs
The following commands can be used to clear the respective flag bits in the index.
For each command:
- <INDEX-STEM>: the index stem for the collection's index.
- <URL-LIST>: a file containing the URL patterns of items to match when the command runs. The file contains strings that are matched against the start of URLs (same format as for a kill_partial.cfg). Alternatively, use - as the <URL-LIST> to specify a URL pattern on STDIN.
Clear expired documents flag
Use the following padre-fl command:
$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7fe AND
Clear killed documents flag
Use the following padre-fl command:
$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7fd AND
Clear duplicate documents flag
Use the following padre-fl command:
$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7fb AND
Clear noindex documents flag
Use the following padre-fl command:
$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7f7 AND
Clear filtered binary documents flag
Use the following padre-fl command:
$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7ef AND
Clear documents without an early binding security lock flag
Use the following padre-fl command:
$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7df AND
Clear documents with paid ads flag
Use the following padre-fl command:
$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7bf AND
Clear unfiltered binary documents flag
Use the following padre-fl command:
$ padre-fl <INDEX-STEM> <URL-LIST> -bits 77f AND
Clear documents matching admin specified regex flag
Use the following padre-fl command:
$ padre-fl <INDEX-STEM> <URL-LIST> -bits 6ff AND
Technical steps
When used with the -bits option, padre-fl combines the supplied hex string with the selected rows of the document flag table using the specified logical operation.
The flags table consists of 11 bits which are either set (1) or not set (0). The bits (from right to left) are:
- expired documents
- killed documents
- duplicate documents
- noindex documents
- filtered binary documents
- documents without an early binding security lock
- documents with paid ads
- unfiltered binary documents
- documents matching admin specified regex
- noarchive documents
- nosnippet documents
A document can have several flags set. These can be set or cleared using padre-fl with the -bits option, a bitmask to apply and the logical operation to use. Using this knowledge it is possible to set or clear multiple bits in one operation. The flags that are set can be viewed by running padre-fl with the -show option.
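As a rough illustration of the bit ordering listed above, the following sketch decodes a hexadecimal flag value into flag names. The function name flag_names and the short labels are made up for this example, and it assumes you already have a document's flags as a hex number; it does not parse the output of padre-fl -show, whose format is not described here.

```shell
# Hypothetical helper: list the flags set in a hex value, reading bits
# from right to left in the order given in the flags table above.
flag_names() {
  value=$(( 0x$1 ))
  bit=0
  for name in expired killed duplicate noindex filtered-binary \
              no-security-lock paid-ads unfiltered-binary \
              admin-regex noarchive nosnippet; do
    # Test whether the bit at the current position is set.
    if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
      echo "$name"
    fi
    bit=$(( bit + 1 ))
  done
}

flag_names 006   # prints "killed" then "duplicate"
```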
For example, clearing just the duplicate bit (as above) works as follows.
Figure out which column contains the duplicate bit and represent this as a binary number. The duplicate documents flag is the third bit from the right and corresponds to:
00000000100
To clear this bit, it needs to be combined (using a logical AND operation) with the following bitmask:
11111111011
or 7fb in hexadecimal.
This corresponds to the following padre-fl command:
$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7fb AND
Multiple flags can be cleared in one operation by following the same process with a mask in which each of the corresponding bits is set to 0.
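The mask arithmetic can be reproduced with shell arithmetic. This is a minimal sketch: clear_mask is a hypothetical helper, not part of Funnelback, and bit positions are counted from 0 at the right (so the expired flag is bit 0, the killed flag is bit 1 and the duplicate flag is bit 2).

```shell
# Hypothetical helper: build the hex AND mask that clears the given flag bits.
# There are 11 flag bits, so the all-ones starting value is 0x7ff.
clear_mask() {
  mask=$(( 0x7ff ))
  for bit in "$@"; do
    # Turn off the bit at this position, leaving all others set.
    mask=$(( mask & ~(1 << bit) ))
  done
  printf '%x\n' "$mask"
}

clear_mask 2      # duplicate flag only: prints 7fb
clear_mask 1 2    # killed and duplicate flags: prints 7f9
```

Following the pattern of the commands above, a value such as 7f9 would then be passed as the -bits argument together with the AND operation.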