Document flags

This feature is not available in the Squiz DXP.
Killing documents from the index is instead supported by configuring the kill_exact.cfg or kill_partial.cfg data source configuration files.

Funnelback defines a set of document flags which track a number of properties for each document in the search index. These include whether a document has been identified as a duplicate or marked to be killed from the index.

The padre-fl program provides the ability to manipulate these document flags (without reindexing) and can be an efficient way of complying with requirements to quickly remove search results from a public website. This document describes the command line interface for manipulating document flags, which may be useful in automated scripts.

The most common use of padre-fl is to kill documents from the index. This sets a flag on the document that prevents it from being visible in search results.

padre-fl

padre-fl is the program responsible for setting and unsetting document flags, and is located in the bin directory within your Funnelback installation. padre-fl supports the following usage:

Usage1: /opt/funnelback/bin/padre-fl <index_stem> [-clearall|-clearbits|-clearkill|-killall|-show|-sumry|-quicken]

Usage2: /opt/funnelback/bin/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -unkill

Usage3: /opt/funnelback/bin/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -kill

Usage4: /opt/funnelback/bin/padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -bits hexbits OR|AND|XOR

Usage5: /opt/funnelback/bin/padre-fl <index_stem> -kill-docnum-list <file_of_docnums>

Usage6: /opt/funnelback/bin/padre-fl -v

In each case, the <index_stem> should be set to the path of a Funnelback index, generally of the form install_path/data/collection_name/live/idx/index.
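As a minimal sketch of the path layout described above, the following shell function assembles an index stem from an installation path and a collection name. The function name index_stem and the example values are hypothetical; check the resulting path against your own installation.

```shell
# index_stem INSTALL_PATH COLLECTION_NAME
# Prints the index stem using the general form described above:
#   install_path/data/collection_name/live/idx/index
# (hypothetical helper, not part of Funnelback)
index_stem() {
  printf '%s/data/%s/live/idx/index\n' "$1" "$2"
}

# Example with assumed values for the install path and collection name.
index_stem /opt/funnelback example-collection
```

The printed path can then be passed as the <index_stem> argument, for example together with the -sumry option.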

The <file_of_url_patterns> is a text file containing one URL pattern per line. Note that patterns are simple strings only: they do not support wildcard characters or regular expressions, and they must match the canonicalized URLs that are used within the index. kill_exact.cfg and kill_partial.cfg are files in this format and are used to automatically kill documents from the index when a data source update is run.

<file_of_docnums> is a text file containing one document number per line. Note that the document number for a specific document depends on the order in which documents are indexed, meaning that it will change from update to update. This option should only be used when the list of document numbers was generated from the same index that is then passed to padre-fl.

Usage 1 provides a number of generic options that can interact with the flags set on the index.

The -show and -sumry options provide an overview of the flags currently set on the index. With -show, 11 flags are shown for each document, representing the following:

  1. expired documents

  2. killed documents

  3. duplicate documents

  4. noindex documents

  5. filtered binary documents

  6. documents without an early binding security lock

  7. documents with paid ads

  8. unfiltered binary documents

  9. documents matching admin specified regex

  10. noarchive documents

  11. nosnippet documents

The -clearkill and -killall options will set the kill bit for all documents in the index off or on respectively. -clearkill may be useful if kill bits have been set accidentally, and -killall provides a quick mechanism for removing all search results. -clearall provides a way to completely clear all flags.

The -quicken option produces an index.urls_sorted index file that is used to speed up the killing of documents when padre-fl is run repeatedly to kill documents from the index. This is mostly useful when working with push collections.

Usages 2 and 3 allow a set of specific documents to have their kill bits set or unset based on a list of URL patterns provided in an input file. By default, the URL patterns in the file are left-anchored (i.e. any URL which starts with the given pattern will be affected), but if the -exactmatch option is used, URLs will only be affected if they exactly match the given pattern.
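The difference between the two matching modes can be illustrated with a small shell function. This is only a sketch of the rule described above, not padre-fl's implementation; the function name url_matches and the example URLs are made up for illustration.

```shell
# url_matches PATTERN URL [exact]
# Succeeds if URL would be affected by PATTERN under the rule described above.
# Illustration only: this does not call padre-fl.
url_matches() {
  if [ "${3:-}" = "exact" ]; then
    # -exactmatch: the whole URL must equal the pattern
    [ "$2" = "$1" ]
  else
    # default: left-anchored, the URL must start with the pattern
    case "$2" in
      "$1"*) return 0 ;;
      *)     return 1 ;;
    esac
  fi
}

url_matches "https://example.com/news/" "https://example.com/news/2020/a.html" \
  && echo "left-anchored: affected"
url_matches "https://example.com/news/" "https://example.com/news/2020/a.html" exact \
  || echo "exact: not affected"
```

Because the pattern is compared as a literal string, characters such as * or ? in a pattern are not treated as wildcards in this sketch, in line with the pattern rules above.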

Usage 5 will set the kill bit on a set of documents specified using the document number rather than the URL.

Usage 4 can be used to set arbitrary index flags. See the advanced usage section below.

Usage 6 prints out the padre-fl version.

Non-web data sources

In some cases, Funnelback must create a URL for documents which do not otherwise have one (e.g. records from a database data source). The simplest way to identify the URL of a specific document in this case is to view the search.json output of search results and observe the indexUrl property for each search result.

Pattern files

Kill patterns can be applied automatically during a collection update by setting up the following configuration files:

  • kill_exact.cfg - defines a list of specific URLs to kill from the index. (padre-fl is run with the -kill and -exactmatch options).

  • kill_partial.cfg - defines a list of URL stems to kill. Any URL starting with these stems will be killed. (padre-fl is run with the -kill option).
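As an illustration of the two file formats, the following sketch writes example pattern files into a temporary directory. The URLs are invented, and the temporary directory stands in for the data source's configuration folder, whose real location depends on your installation.

```shell
# Stand-in for the data source configuration folder (assumption for this sketch).
conf_dir=$(mktemp -d)

# kill_exact.cfg: one complete URL per line; only these exact URLs are killed.
cat > "$conf_dir/kill_exact.cfg" <<'EOF'
https://www.example.com/about/old-page.html
https://www.example.com/contact.html
EOF

# kill_partial.cfg: one URL stem per line; every URL starting with a stem is killed.
cat > "$conf_dir/kill_partial.cfg" <<'EOF'
https://www.example.com/archive/
EOF

ls "$conf_dir"
```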

Advanced usage

Clear individual document flags on a set of URLs

The following commands can be used to clear the respective flag bits in the index.

For each command:

  • <INDEX-STEM> is the index stem for the collection’s index.

  • <URL-LIST> is a file containing the URL patterns of items to match when the command runs. The file contains strings that are matched against the start of URLs (same format as kill_partial.cfg). Alternatively, use - as the <URL-LIST> to supply URL patterns on STDIN.

Clear expired documents flag

Use the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7fe AND

Clear killed documents flag

Use the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7fd AND

Clear duplicate documents flag

Use the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7fb AND

Clear noindex documents flag

Use the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7f7 AND

Clear filtered binary documents flag

Use the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7ef AND

Clear documents without an early binding security lock flag

Use the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7df AND

Clear documents with paid ads flag

Use the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7bf AND

Clear unfiltered binary documents flag

Use the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 77f AND

Clear documents matching admin specified regex flag

Use the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 6ff AND

Clear noarchive documents flag

Use the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 5ff AND

Clear nosnippet documents flag

Use the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 3ff AND

Technical steps

When used with the -bits option, padre-fl combines the supplied hex string with the selected rows of the document flag table using the specified logical operation.

Each row of the flags table consists of 11 bits, each of which is either set (1) or not set (0). The bits (from right to left) are:

  1. expired documents

  2. killed documents

  3. duplicate documents

  4. noindex documents

  5. filtered binary documents

  6. documents without an early binding security lock

  7. documents with paid ads

  8. unfiltered binary documents

  9. documents matching admin specified regex

  10. noarchive documents

  11. nosnippet documents
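To make the bit positions concrete, the following sketch decodes a flags value into the names listed above. It assumes you already have the 11-bit value as a hexadecimal string; the function name decode_flags and the short flag labels are illustrative only and are not output produced by padre-fl.

```shell
# decode_flags HEX
# Prints one label per set bit, reading bits from right to left
# in the order listed above. Hypothetical helper for illustration.
decode_flags() {
  value=$(( 0x$1 ))
  pos=0
  for name in expired killed duplicate noindex filtered-binary \
              no-early-binding-lock paid-ads unfiltered-binary \
              admin-regex noarchive nosnippet; do
    # Test the bit at the current position (0-based from the right).
    if [ $(( (value >> pos) & 1 )) -eq 1 ]; then
      echo "$name"
    fi
    pos=$(( pos + 1 ))
  done
}

decode_flags 6   # binary 00000000110: prints "killed" then "duplicate"
```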

A document can have several flags set. These can be set or cleared by running padre-fl with the -bits option, supplying a bitmask and the logical operation to apply. Because the bitmask covers all 11 flags, it is possible to set or clear multiple bits in one operation. The flags currently set can be viewed by running padre-fl with the -show option.

For example, clearing just the duplicate bit (as above) works as follows.
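Setting flags follows from the same bit layout: an OR with a mask that has only the chosen bits set turns those bits on and leaves the others unchanged. The sketch below computes such a mask; set_mask is a hypothetical helper, and the padre-fl line in the comment uses the placeholders from this page rather than real values.

```shell
# set_mask POS...
# Prints the hex mask with the given flag positions set
# (1 = expired ... 11 = nosnippet, counted from the right).
set_mask() {
  mask=0
  for pos in "$@"; do
    mask=$(( mask | (1 << (pos - 1)) ))
  done
  printf '%x\n' "$mask"
}

set_mask 10 11   # noarchive and nosnippet: prints 600
# The result would be used with the OR operation, for example:
#   padre-fl <INDEX-STEM> <URL-LIST> -bits 600 OR
```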

Figure out which column contains the duplicate bit and represent this as a binary number. The duplicate documents flag is the third bit and corresponds to:

00000000100

To clear this bit, it needs to be combined (using a logical AND operation) with the following bitmask:

11111111011

or 7fb in hexadecimal.

This corresponds to the following padre-fl command:

$ padre-fl <INDEX-STEM> <URL-LIST> -bits 7fb AND

The same process can be applied to clear multiple flags in a single operation.
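For instance, a mask that clears several flags at once can be computed with shell arithmetic. This is a sketch; clear_mask is a hypothetical helper that starts from 7ff (all 11 bits set) and switches off each requested position.

```shell
# clear_mask POS...
# Prints the hex AND mask that clears the given flag positions
# (1 = expired ... 11 = nosnippet, counted from the right).
clear_mask() {
  mask=$(( 0x7ff ))
  for pos in "$@"; do
    # Switch off the bit for this position, keep all others.
    mask=$(( mask & ~(1 << (pos - 1)) ))
  done
  printf '%x\n' "$mask"
}

clear_mask 3     # duplicate only: prints 7fb
clear_mask 2 3   # killed and duplicate: prints 7f9
# The second result would be used as:
#   padre-fl <INDEX-STEM> <URL-LIST> -bits 7f9 AND
```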