Padre binaries and command line usage

Background

This page lists all the Padre binaries and their corresponding usage messages.

1. FineTune

Purpose: Tuning padre-sw ranking parameters based on a C-TEST file.

Usage: FineTune <collection>[.<profile>] ... [-perl_bin=path_to_perl_bin] [-help] [-verbose[=<level>]] [-timeout=<hours>] [-query_limit=<num_queries>] [-alpha=<f>] [-rvalues=<i>] [-adjust=<i>] [-sample=on|<number>][<mode> ...] [-conf] [-qp=padre_subpath] [-index_dir.<collection>=<index directory>] [-lock_file=<file to lock>]
   e.g. FineTune lse -daat -annieonly
   e.g. FineTune agosp.doha -timeout=7.5
        Use -daat0 to tune term-at-a-time.

        Timeouts and query limits:  Apply separately to each mode. Defaults
         are 5 hours and 1 million queries. After a timeout, or when the
         query limit has been exceeded, the best tuning found so far for
         that mode will be recorded in the .best file for the mode.

        -perl_bin=/path/to/perl used to set the path to perl binary to use

        -alpha sets the balance between success rate and wmum1 in tuning. [Dflt 0.75]
          Value must lie between 0 (ignore success rate) and 1 (ignore wmum1)

        -rvalues - sets no. of values to explore for optype=2 (real) [Dflt 11]

        -adjust - sets no. of steps to remove when adjusting exploration range
                   for optype=2 (real) dimensions. [Dflt 5]

        -conf extracts the mode to tune from collection.cfg. (N/A for multituning.)

        -help gives more detailed instructions and exits.

        -index_dir can be used to set the index directory for a particular collection.
         The directory must contain an index prefixed by 'index'.

        -lock_file can be used to lock a file for the entire duration of tuning. If
          the lock cannot be acquired, tuning will not start.

        -redirect_stdout can be used to redirect stdout to a given file.

        -redirect_stderr can be used to redirect stderr to a given file.

        -write_finish_time_to writes the tuning finish time in ISO-8601 to the given
          file.
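
As an illustration of how the options above combine, here is a minimal Python sketch that assembles a FineTune argument list. The helper name finetune_argv is hypothetical and not part of Padre; it only uses flags that appear in the usage message, and it enforces the stated 0 to 1 range for -alpha.

```python
def finetune_argv(collection, timeout_hours=None, query_limit=None,
                  alpha=None, modes=()):
    """Assemble a FineTune argument list (hypothetical helper, not part of Padre)."""
    argv = ["FineTune", collection]
    if timeout_hours is not None:
        argv.append(f"-timeout={timeout_hours}")
    if query_limit is not None:
        argv.append(f"-query_limit={query_limit}")
    if alpha is not None:
        # The usage message says alpha must lie between 0 and 1.
        if not 0 <= alpha <= 1:
            raise ValueError("alpha must lie between 0 and 1")
        argv.append(f"-alpha={alpha}")
    # Modes such as -daat or -annieonly are passed through unchanged.
    argv.extend(modes)
    return argv


print(finetune_argv("agosp.doha", timeout_hours=7.5))
```

The resulting list could be handed to subprocess.run on a machine where FineTune is installed; that step is left out here so the sketch runs anywhere.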

2. QiTune

Purpose: Tuning padre-sw query-independent ranking parameters based on a C-TEST file.

Usage: QiTune stem C-TESTfile

Given a PADRE index and a file of useful URLs (extracted from the
C-TEST file) compute a set of query-independent cool settings
suitable for passing to padre-do which (hopefully) optimise the
difference in ave scores between the useful docs and the general collection.

3. SpellTune

Purpose: Tuning PADRE spelling suggestion system based on a test file e.g. mycoll.spelltest.

Usage: SpellTune <collection>[.<profile>] ... [-tune_bsi] [-help] [-verbose[=<level>]] [-timeout=<hours>] [-query_limit=<num_queries>] [-rvalues=<i>] [-adjust=<i>] [-sample=on|<number>][-qp=padre_subpath]
   e.g. SpellTune lse -annieonly
   e.g. SpellTune agosp.doha -timeout=7.5
        (Timeouts and query limit defaults are 5 hours
         and 1 million queries. After a timeout, or when the
         query limit has been exceeded, the best tuning found so far
         will be recorded in the .bestspell file for the mode.)
        (-tune_bsi - tunes the build_spelling_index params.  Slow.)
        (-rvalues - sets no. of values to explore for optype=2 (real)) [Dflt 11]
        (-adjust - sets no. of steps to remove when adjusting exploration range
                   for optype=2 (real) dimensions.) [Dflt 5]
        (-help gives detailed instructions.)

4. annie-a

Annie version 1.13 (11 Mar 2010)
Purpose: Builds an annotation index for a collection, specified by <stem>, from a list of files in anchors.gz format.

Usage: annie-a <stem> [<stem_or_file> ...] [-phrasefile=<filename>] [-deb] [-hashbits <10..30>] [-maxlines <n>] [-wts <wt0> <wt1> <wt2> <wt3> <wt4>] [-stripstops] [-STOP=<filename>] [-canon] [-rejecturls] [-rejectnumeric] [-quicken] [-maxwds <i>] [-maxlen <i>] [-build_annou=on/off] [-build_lcache=on/off] [-nep_limit=0|1|2|3]
    <stem> must reference either a meta collection or a primary index.
    <stem_or_file> may be either a stem as above or the name of a file in anchors.gz format.
 In the case <stem> or <stem_or_file> is a meta collection, annie-a will look for the anchors.gz files from each of the component collections and use them for creating the annotation index for the collection specified by <stem>. If any anchors.gz file changes for a component collection, annie-a will need to be run again for the meta collection.

-quicken    improves query performance by using <coll id, doc id> pairs. The coll id depends on the sdinfo file, so if the sdinfo file is changed, annie-a must be run again for the meta collection with this option. It is recommended that the most recent collection is placed at the top of the sdinfo file.

5. annie-quicken

Purpose: Convert URL references in an annotation index into (component, docno) to speed query processing.

Usage: annie-quicken anno_stem index_stem

6. build_autoc

Purpose: Builds an auto-completion index file (.autoc) from a list of input files.

Usage: build_autoc stem input_file ... [-collection name -profile name] [-partials] [-label_organics] [-debug]
       e.g. build_autoc index example.csv
       where example.csv will be sorted and indexed into index.autoc.  Input_file(s)
       must end in .csv, .suggest, or .cfg

      -profile <name> - generate scoped .autoc file for the specified
                        profile. A previous run of build_autoc must have
                        been called with -index.
   -collection <name> - generate scoped .autoc file for the specified
                        collection. Both -profile and -collection need
                        to be specified when generating scoped suggestions
            -partials - this version allows multi-word organic
                        suggestions to be triggered either from the full
                        suggestions or from trailing word sequences.  E.g.
                        'big fat cat' triggered from 'fat cat' and 'cat' as
                        well as the full string. This option turns that on.
      -label_organics - present a category label for all the organic completions
      -sample <val>   - Sample postings of suggestion terms, to handle large
                        collections, <val> ranges 0 - 300; speeds up processing
                        with the effect of sampling the suggestions.
                        (1/val postings are used).

build_autoc supports the building of a single .autoc file from multiple input
files of the same or different types.  Files with very simple
format can be combined with hand-crafted files containing
complex actions.  Completion weights from a .suggest file
are automatically determined, while they can be manually specified
in a CSV file.  Completion weights from Best Bets
default to 100.

Note: this binary requires query processor options to be set via the environment.
Calling it directly from the command line or via a Funnelback workflow
is not supported.

SUGGEST FORMAT
--------------
.suggest files built by build_spelling_index can be supplied as
input.  Reasons for doing this include taking advantage of an index
optimised for completion purposes; and integrating automated spelling
suggestions with hand-crafted entries.

CFG FORMAT
----------
Input files with .cfg suffix are no longer supported.

CSV FORMAT
----------
Each line of a .csv file must contain eight fields (7 commas),
corresponding to: key, weight, display, display_type, category,
category_type, action, and action_type.  Fields except key and
weight may be empty.
Two meta characters are recognized within a field: backslash and
double quote.  These are handled as follows in the two cases:

(A) Unquoted text: A single backslash is not passed through, while
the character following it is passed through without applying
any tests. This means that a double backslash in input leads to
a single backslash in output and that commas or double quotes
preceded by a backslash do not have their normal meaning.
(B) Quoted text:  The double quotes beginning and ending a quoted
section are not passed through.  Within a quoted section a double
quote may be passed through by either doubling it ("") or by
preceding it with a backslash (\").

By these means it is possible to pass through HTML and/or JSON
containing quotes and/or commas.
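
The quoting rules (A) and (B) can be sketched as a small field splitter in Python. This is an illustration of the rules as described, not build_autoc's actual parser. The function name is hypothetical, and two details are assumptions: a backslash inside a quoted section escapes any following character (the text only mentions \"), and a trailing backslash at the end of a line is kept as-is.

```python
def split_autoc_csv_line(line):
    """Split one .csv line into fields using rules (A) and (B) above."""
    fields, buf, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            # Backslash is dropped; the next character passes through untested.
            buf.append(line[i + 1])
            i += 2
            continue
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    # A doubled quote inside a quoted section yields one quote.
                    buf.append('"')
                    i += 2
                    continue
                in_quotes = False  # closing quote is not passed through
            else:
                buf.append(ch)
        elif ch == '"':
            in_quotes = True  # opening quote is not passed through
        elif ch == ",":
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    fields.append("".join(buf))
    return fields


# Made-up sample line with eight fields (seven separating commas).
sample = r'key,5,"say ""hi""",,a\,b,,"x,y",'
print(split_autoc_csv_line(sample))
```

The sample values are placeholders chosen to exercise the escaping rules; they are not meaningful display or action types.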

7. build_match_only_index

Purpose: Create a match-only index, which build_autoc can use to build
profiled query suggestions.

Usage: build_match_only_index stem

8. build_spelling_index

Purpose: To build a spelling suggestion file (.suggest/.suggest2) for a collection.

Usage: build_spelling_index index_stem num_thresh [<metadata_class_names> [[<lexiweight>] [[<blacklist_file>] [<whitelist_file>]]]
       e.g. build_spelling_index index 2 [@,t,c]
       where the listed comma separated metadata class names,
       '@,t,c', are the ones to be scanned for
       suggestions.  '@' means use the .anno file.  '%' means use
       unfielded words from index.lex.  '+' means use phrases from
       index.phrases (if present).  If no fields are listed, "@,+,t,%"
       is assumed.
       num_thresh - minimum weight of suggestions recorded in suggest index
       lexiweight - controls the weight of lexicon suggestions relative
         to annotations.  wt = lexiweight * sqrt(df) (dflt lexiweight = 1.00)
       blacklist_file - manual list of suggestions which should NOT be included
       in the index. (one per line)
       whitelist_file - manual list of suggestions to include in the index.
       (one per line.)
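
The lexicon weighting formula above, wt = lexiweight * sqrt(df), can be worked through in a few lines of Python. The function names are hypothetical, and treating num_thresh as an inclusive minimum is an assumption; the usage message only calls it the "minimum weight".

```python
import math


def lexicon_weight(df, lexiweight=1.0):
    """Weight of a lexicon suggestion: lexiweight * sqrt(df)."""
    return lexiweight * math.sqrt(df)


def passes_threshold(weight, num_thresh):
    """Assumption: a suggestion is kept when its weight is at least num_thresh."""
    return weight >= num_thresh


# A term occurring in 16 documents gets weight 4.0 with the default lexiweight.
print(lexicon_weight(16))
print(passes_threshold(lexicon_weight(16), 2))
```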

9. csv2ctest

Purpose: To convert a tuning file in CSV format into a C-TEST file for use with e.g. FineTune.

Usage: csv2ctest infile.csv [-utils=equal|-utils=sqrt|-utils=recip] [-queryweights]

   Output in C-TEST format will be in infile.ctest

The input file is assumed to be a syntactically correct comma separated
value (CSV) file in which cells are separated by commas.  Double quotes
around all or part of a cell allow inclusion of commas.  The quotes are
stripped off before processing.  The input file may contain comment
lines starting with a hash.

The first column in infile.csv is always assumed to contain a query.
If no options are given, then the remaining columns contain desired
answer URLs for that query, in descending order of utility.  Utility
scores start at 4 and then gradually decline to 1: 4, 3, 2, 1, 1, 1 ...
This behaviour may be modified as follows:

  -utils=equal   - All of the answers are given equal utility values.
  -utils=sqrt    - Utility values drop off as 1/sqrt(rank).
  -utils=recip   - Utility values drop off faster -- as 1/rank.

  -queryweights  - if this is given, the second column is expected to
                   contain the numerical weight associated with the query
                   and the remaining columns contain the answer URLs
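
The utility schemes described above can be compared with a short Python sketch. It is an illustration only, not csv2ctest's code. The function name is hypothetical, and the constant 1.0 used for -utils=equal is an assumption, since the text does not say which equal value is assigned.

```python
import math


def utility_values(n_answers, scheme="default"):
    """Utility per answer rank (rank 1 is the first answer column)."""
    ranks = range(1, n_answers + 1)
    if scheme == "default":
        # 4, 3, 2, 1, 1, 1 ... as described for the no-option case.
        return [max(5 - r, 1) for r in ranks]
    if scheme == "equal":
        return [1.0 for _ in ranks]  # assumed constant
    if scheme == "sqrt":
        return [1 / math.sqrt(r) for r in ranks]
    if scheme == "recip":
        return [1 / r for r in ranks]
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("default", "equal", "sqrt", "recip"):
    print(scheme, utility_values(6, scheme))
```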

10. dump_annotation_file

Purpose: To display the contents of an annotation index in geek-readable form.

Usage: dump_annotation_file <annotationfile>

11. dump_autoc

Purpose: To display the contents of a query completion file in geek-readable format.

Usage: dump_autoc <stem|collection|autoc_file>

12. dump_suggestion_file

Purpose: To display the contents of a spelling suggestion file in geek-readable format.

Usage: dump_suggestion_file <index_stem>
 - dumps contents of <index_stem>.suggest

13. get_docnum_from_url

Purpose: Map a URL to that document's number within an index stem.

Usage: get_docnum_from_url <index_stem> <url>

Prints the docnum for a given URL to standard out.
Prints "notfound" if the URL is not found.
  <index_stem> - the common prefix (including path) of the index files
  <url>        - the URL to look up the document number for
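
A caller that captures this tool's standard output needs to tell a document number from the "notfound" marker. A minimal Python sketch follows; the function name is hypothetical, and it assumes the output contains nothing but the docnum or the word notfound.

```python
def parse_docnum(output):
    """Return the document number as an int, or None when the URL was not found."""
    text = output.strip()
    if text == "notfound":
        return None
    return int(text)


print(parse_docnum("1234\n"))
print(parse_docnum("notfound\n"))
```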

14. get_url_from_component_document_pair

Purpose: Within an index, output the URL of the document identified by component number and document number.

Usage: get_url_from_component_document_pair <index_stem> <component_number> <document_number>
Warning:  Doesn't handle nested .sdinfo files.  (Hierarchical meta collections.)

15. get_url_from_docnum

Purpose: Print the URL of a document in a primary index, given its document number. Inverse of get_docnum_from_url.

Usage: get_url_from_docnum <index_stem> <doc_num>

16. harvest_anchortext

Purpose: Extract a subset of entries in a list of anchors.gz files which match a specified pattern.

Usage 1: harvest_anchortext -targ|-text|-any|-source pattern anchor_text_file ...

Usage 2: harvest_anchortext -noneps <affiliates_file> anchor_text_file ...

Extracts a subset of lines in the anchor_text_files.  The composition
of the subset depends upon the match_type argument as follows:

-any - Any line (source or target) which matches pattern
-targ - Any target line whose URL target matches pattern
-text - Any target line whose anchortext matches pattern
-source - Any source line which matches pattern

+ In usage 2, links within the same SECD (single-entity-controlled-domain)
  are suppressed, as are links between affiliated pairs of hosts listed in
  the affiliates_file. If there is no affiliates_file, use '-'.

+ Whenever a target line matches the corresponding source line is also output.
+ Whenever a source line matches its corresponding target lines are also output.
+ NOTE: nepotistic links are included unless -noneps is used.

17. hierarchical_navpaths

Purpose: Extract hierarchical navigation paths from a list of anchors files.

Usage: hierarchical_navpaths <stem> [-verbose] [<anchor_text_file> ...]

Reads <stem>.anchors.gz, plus any additional anchortext files to
identify hierarchical navigation paths (HNPs).  These are output
to <stem>.hnp.anchors.gz in standard anchors.gz format:
<target_url> --- [H]<concatenated anchors from path>

+ NOTES:
    1. inter-host links are ignored.
    2. -verbose prints the actual HN paths to stdout.
    3. All targets in the .hnp.anchors file have http://hnp as source.

Warning: Not ready for use. Development of this utility is incomplete.

18. host_host_link_counts

Purpose: Analyse a list of anchors.gz files and report on frequencies of inter-host links.

Usage: host_host_link_counts [-targ|-source <pattern>] [-report] <stem> [anchor_text_file ...]

Reads the anchor_text_files and outputs a table of host-host links,
in descending order of link count.   By default, all lines are
processed, but a pattern can optionally be applied to either
targets or sources.

If -report is given, short and full HTML reports will be generated.

-targ - Process only links whose target host matches pattern
-source - Process only links whose source host matches pattern

Nowadays, the first option not starting with a - is an index stem.
A file <stem>.hosts is created with a table of host-related feature
scores which can be used in ranking.  The order of entries must
correspond to the hostnum order assigned by padre-iw.

+ NOTE: within-host links are excluded.

19. padre-arg-sw

Purpose: To help with conversion of padre-sw argument lists from old to key=value format

20. padre-cc

Purpose: To build an index.collapsig file to permit use of collapsed rankings.

Usage: padre-cc <index_stem> [-collapse_control=<string>] [-debug=on]
   Utility for building a .collapsig file of collapsing
   signatures.  If no control_string is given, a one-column
   file is built using the signatures from the .textsig file.

   The collapse_control string must consist of sets of sequences of
   metadata class names. Each set should be surrounded by square
   brackets, and sets should be separated by commas. Metadata class
   names are the elements of the sets and must be separated by
   commas.

    The characters $ and # may be used as special metadata class
   names and represent document summarisable text and
   document URL respectively.

     In future, it is planned to allow special metadata class
   names to be followed by a regular expression,
   indicating that only the part of the metadata string which matches
   the regex should be used in calculating the signature.

   Example current control string: '[$],[t,a]'.  In this case the .collapsig
   will have two signatures per document: Column 0 is the normal document
   signature and column 1 is a signature derived from the concatenation of
   metadata fields t and a, in that order.
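
The control string syntax above (bracketed sets separated by commas, class names separated by commas inside each set) can be generated with a small Python helper. The function name is hypothetical; the sketch only formats the string and does not check that the class names exist in an index.

```python
def build_collapse_control(sets):
    """Format a list of metadata class name lists as a collapse_control string."""
    if not sets or any(len(names) == 0 for names in sets):
        raise ValueError("each set needs at least one metadata class name")
    return ",".join("[" + ",".join(names) + "]" for names in sets)


# Reproduces the example control string from the usage message.
print(build_collapse_control([["$"], ["t", "a"]]))
```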

21. padre-ct

Purpose: Report on the titles in a PADRE index.  Eventually to improve them.

Usage: padre-ct <index_stem>

Warning: Development of this utility is not yet complete.

22. padre-cv

Allowed options:
  -h, --help                          produce help message
  -v, --verbose                       be verbose
  -e, --engine arg (=hnswlib)         vector storage engine to use (default: hnswlib)
  -a, --algorithm arg (=hnsw)         algorithm for vector storage (default: hnsw, alternate: bruteforce)
  -d, --dimensions arg (=384)         dimensions for each element in vector storage (default: 384 - correct size for e5 small v2 model)
  --max-elements arg (=1000000)       maximum elements (documents) for vector storage. (default: 1000000)
  -M, --M arg (=64)                   internal dimensionality of data. Affects memory usage. (default: 64)
  -c, --ef-construction arg (=10)     Controls index search speed/build speed tradeoff (default: 10)
  -s, --space-type arg (=ip)          Space type options: either l2 or ip (inner product, default)
  -f, --db-filename arg               filename of vector storage database (required)
  -o, --overwrite-db                  overwrites the file specified in db-filename, instead of loading it
  -i, --input-file arg                input database (required)
  --hash-collision-retries arg (=10)  hash collision retries (default: 10)
  --log-level arg (=info)             log level to use: info|trace|debug|warn|error|fatal|all|off
  --chamber-size arg (=33554432)      Chamber size for uncompressed data, used by warc decompressor, in bytes (default: 32*1024*1024 bytes)
  -g, --use-config                    Use config file for database as the actual parameter values for the pipeline

23. padre-cw

Purpose: To check the correctness of an index, compare two indexes, or display postings for a term within an index.

Usage0: padre-cw -v - print PADRE version
Usage1: padre-cw stem1 stem2 [-io] - Compare two indexes.
        -io means ignore diff.s in offsets into .idx
Usage2: padre-cw stem1 -show term - show postings for term.
        Also shows term before and afterward. (if applic.)
Usage3: padre-cw stem1 -check [-stemsuff] [-show_all] - Check index files for stem1 (default)
        use -stemsuff <suffix> to supply an additional suffix for the .idx and .dct files
        use -show_all to print every term's summary information.

24. padre-di

Purpose: To display the metadata for documents in an index.  (main purpose)

Usage: padre-di <index_stem> [-check]|[-trecids]|[-metao [<docno>]][-meta [<pattern>] ] | [-metad [<pattern>] ]
-check - check whether the document table appears to be internally consistent

-trecid - make a mapping between trec DOCNO stored in title field and URL

-meta [<pattern>] - print title and metadata information for each document
                    whose "URL" contains pattern (case-insensitive)
                    If no pattern is given, all docs are shown, in collection order.

-metad          - as for -meta but show document numbers.

-metao          - as for -meta but show all documents, in collection order starting
                  from docno (default zero).

-doc_per_meta   - prints in JSON the number of documents each metadata class appears in.

default - read in URLs and look them up, using sorted table

25. padre-do

Purpose: Print a permutation of the document numbers in an index corresponding to descending static score.

Usage: padre-do <stem> <docorderfile> [-deb] [-cool_param ...]
Output is a list of docnums in descending order of cool score,
printed to docorderfile.  cool_param values are expected to lie in 0 - 1.
Default values are the same as for padre-sw though. Of course,
query-dependent cool values cool0, 7, 12, 15, 16, 17, 18, 19 are ignored
because there is no query.
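
Since the output is meant to be a permutation of the document numbers, a quick sanity check on a docorderfile can be sketched in Python. Two assumptions are made here that the usage message does not state: the file holds one docnum per line, and document numbers start at zero. The function names are hypothetical.

```python
def read_docorder(lines):
    """Parse docnums from an iterable of text lines, skipping blank lines."""
    return [int(line) for line in lines if line.strip()]


def is_permutation(docnums):
    """True when docnums contains each of 0..n-1 exactly once."""
    return sorted(docnums) == list(range(len(docnums)))


order = read_docorder(["2\n", "0\n", "1\n"])
print(is_permutation(order))
```

With a real file, the lines could come from open(path) instead of the in-memory list used above.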

26. padre-fl

Purpose: Display or operate on the document flags in an index.

Usage1: padre-fl <index_stem> [-clearall|-clearbits|-clearkill|-killall|-show|-sumry|-quicken]

Usage2: padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -unkill

Usage3: padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -kill

Usage4: padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -bits hexbits OR|AND|XOR

Usage5: padre-fl <index_stem> -kill-docnum-list <file_of_docnums>

Usage6: padre-fl -v

Note: Specify '-' as the file of url patterns to supply a single URL to standard input.

27. padre-gs

Purpose: Display or manipulate document gscopes in an index.

Usage0: padre-gs -v|-V|-help   # print version info or detailed help
              on types of instructions and on program operation.
Usage1: padre-gs index_stem -clear   # clear all gscopes

Usage2: padre-gs index_stem -show   # show all gscopes

Usage3: padre-gs index_stem file_of_instructions [-separate] [other_gscope]
        [-regex|-url|-docnum] [-verbose] [-quiet] [-dont_backup]

Where:
      * index_stem may also be the name of a collection
      * file_of_instructions may be '-' to accept instructions from stdin
      * -separate indicates that gscope changes should be made to a
        copy of the .dt file first, and then copied over the original file
        when changes are complete. In this mode the number of gscope bits
        can NOT be expanded, so you must ensure that enough bits are available.
      * other_gscope specifies a gscope to be set on documents which
        end up with no gscopes set.
      * By default instruction patterns are expected to be regexes
        but this may be made explicit with -regex or altered with -url
        or -docnum.  Use padre-gs -help to obtain more information about
        instruction formats and pattern types.
      * gscope names may consist of alphanumeric ascii characters up to a length
        of 64 characters.
      * -dont_backup prevents backing up of the .dt file
      * -quiet suppresses the before and after summary of gscopes
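
The naming rule above (alphanumeric ASCII, at most 64 characters) can be checked before an instruction file is written. A minimal Python sketch follows, with a hypothetical function name; rejecting the empty string is an assumption.

```python
import re

# ASCII letters and digits only, 1 to 64 characters.
_GSCOPE_NAME = re.compile(r"[A-Za-z0-9]{1,64}")


def is_valid_gscope_name(name):
    """True when name satisfies the gscope naming rule described above."""
    return _GSCOPE_NAME.fullmatch(name) is not None


print(is_valid_gscope_name("news2024"))
print(is_valid_gscope_name("my-scope"))
```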

28. padre-i4u

Purpose: Display aggregated information about a URL from a PADRE index.

Usage: padre-i4u -v | padre-i4u stem=<stem_or_collname> [fields=<alnum_string>] [debug=<int>] [iters=<int>] [format=json/old] [coll=collection_name] url=<url> ...
Note: The functionality is implemented by a dynamic library which is usually called directly.
coll= option should be set to the name of the collection corresponding to the stem= option

29. padre-iv

Allowed options:
  -h, --help                          produce help message
  -v, --verbose                       be verbose
  -e, --engine arg (=hnswlib)         vector storage engine to use (default: hnswlib)
  -a, --algorithm arg (=hnsw)         algorithm for vector storage (default: hnsw, alternate: bruteforce)
  -d, --dimensions arg (=384)         dimensions for each element in vector storage (default: 384 - correct size for e5 small v2 model)
  --max-elements arg (=1000000)       maximum elements (documents) for vector storage. (default: 1000000)
  -M, --M arg (=64)                   internal dimensionality of data. Affects memory usage. (default: 64)
  -c, --ef-construction arg (=10)     Controls index search speed/build speed tradeoff (default: 10)
  -s, --space-type arg (=ip)          Space type options: either l2 or ip (inner product, default)
  -f, --db-filename arg               filename of vector storage database (required)
  -o, --overwrite-db                  overwrites the file specified in db-filename, instead of loading it
  -i, --input-file arg                input file for document source (required)
  -w, --warc                          input file is a warc file
  -u, --embedding-api-server-url arg (=http://127.0.0.1)
                                      URL for embedding API server (default: 'http://127.0.0.1/')
  -r, --route-embedding-api-server arg (=/embeddings/v1)
                                      route for embedding api server (default: '/embeddings/v1')
  -m, --model arg (=intfloat/e5-small-v2)
                                      embedding API NN model (default: 'intfloat/e5-small-v2')
  --hash-collision-retries arg (=10)  hash collision retries (default: 10)
  --dry-run                           Dry run, do not actually use/connect to an embedding server
  --log-level arg (=info)             log level to use: info|trace|debug|warn|error|fatal|all|off
  --chamber-size arg (=33554432)      Chamber size for uncompressed data, used by warc decompressor, in bytes (default: 32*1024*1024 bytes)
  --strip-html arg (=0)               Strip HTML tags from documents (0 = no strip (default), 1 = strip *.htm* files only, 2 = strip all files (dangerous))
  -p, --paragraph                     paragraph extraction, extracts text based off <p> html tags
  -x, --xfunnelback-strip             strip X-Funnelback headers from the warc body (with -w switch) or document (no -w switch)
  -z, --input-cache-db arg            Input cache database filename
  -g, --use-config                    Use config file for database as the actual parameter values for the pipeline

30. padre-iw

Purpose: Index a collection of documents

Usage1: padre-iw -V|-help|-helpadoc|-ixform            (print version or help info.)
Usage2: padre-iw [-f|-tar|-reo<pf>] <dir>|<file>|<url> <filestem> [<option>|<tfdir> ...]
Usage3: padre-iw -secondary_update <dir>|<file> <filestem>
<pf> is a text file containing a permutation of the document numbers in the original index.
<dir> is a hierarchical directory of optionally gzipped files.
<file> contains a list of names of optionally gzipped files.
-f says that <file> is a single datafile to be indexed. For historic reasons,
-tar means the same as -f.
     Files to be indexed may be tar or WARC files (optionally gzipped).
     Note that individual files in a tarfile are expected to be uncompressed
     text.  Gzipped files, unfiltered PDFs etc. are not supported yet.
-reo says that <file> is the stem of a previous index to be
     reordered and reindexed.   <pf> is a text file containing
     a permutation of the document numbers in the original index.  Eventually,
     it may be possible to compute the permutation internally. For now, it
     must be specified via <pf>.
<filestem> prefixes the names of output files.
<tfdir> is the dir. in which tmp files will be written.
-secondary_update creates a secondary index using the data directory specified, and using the options used in creating the primary index.

Available options:

A. Getting information about PADRE and its operation.
  -V       - Print PADRE version number and exit.
  -ixform  - Print index format version created by this indexer and exit.
  -help    - Print this list and exit.
  -helpadoc - Print this list in asciidoc format and exit.
  -debug   - Generate debugging output.
  -show_each_word_indexed - For debugging.  Show each word occurrence (with field) as it is indexed.
  -show_each_word_to_file - For debugging.  Print each word occurrence (with field) to <filestem>.words_in_docs
  -hashlog - Create a .hashlog file with incremental hashing stats.
  -quiet   - Use terse logging.
  -ankdebug - Generate debugging output relating to anchortext.
  -termdeb<term> - print debugging messages relating to the indexing of <term>.

B. Controlling what is indexed.
  -nometa   - Don't index any metadata except t, d and k (titles, dates and links).
  -nomdsfconcat - Don't concatenate strings in the mdsf file. Record first only. (Others are still indexed.)
  -diwimuu - Don't index words in made-up URLs (those constructed from filepath).
  -dias    - Don't index link anchor source as part of source documents (<a> only).
  -ibd     - Index all documents even if they appear to be binary.
  -ixcom   - Index words in HTML and XML comments.
  -select<num1>,<num2> - Index every num1th file/bundle starting from
            num2th(from zero).
  -select-doc-in-bundle=<interval>,<offset> - Index every <interval> document
            within a bundle starting from <offset> (which starts at zero).
            Only works with warc store.
  -tarpat<regex> - Filenames in a tarfile being indexed must match regex. Default is match-everything.
  -csv=<fsep><skipfirst>[<quote>] - Deprecated. Use the CSV to XML filter instead. Files which are
                            not clearly something else
                            are assumed to be CSV format.
                          fsep is ascii field separator, typically comma.
                            (tab is represented by t.)
                          skipfirst is either y or n, telling padre whether the first
                            line in a CSV file should be skipped.
                          quote is the character used to quote strings in fields
                            which may contain separators.  (You probably have to
                            escape it on the command line.)  If not specified,
                            no quote character is defined.  To include a quote
                            character within a quoted section, the quote may be doubled.
  -csv_fields=<comma_separated_descriptor_list> - Deprecated. Use the CSV to XML filter instead.
                          This is a list of comma
                          separated descriptors describing how to index each column of
                          the csv file.
                          To index terms in a column as document text use '-'.
                          To index terms in a column as metadata use the format:
                               <metadata class name><content type>
                          To skip terms in a column use 'X'.
                           For example: 't1,-,X', would set the first column to title,
                          the second column would be indexed as document content and the third
                          column would be skipped.
                           Content-type defined in this argument should be the same as the content type in
                           the metadata mappings
  -check_url_exclusion=<on|off> - URLs matching url_exclusion pattern will not be searchable.  (Default on.)
  -url_exclusion_pattern=<regex> - exclusion pattern to use if URLs are vetted.  (Default 'file://$SEARCH_HOME/')
  -filepath_exclusion_pattern=<regex> - exclusion pattern to use if files are to be excluded from indexing
            on the basis of the filepath.  If applicable, this is more efficient than excluding by URL
            because the URL can't be finally determined until the content has been scanned. (Default: not set)
  -index_subversion_dirs - Normally the .svn directories created by
       the subversion version control system are not indexed. Override this default.
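
The sampling behaviour of -select and -select-doc-in-bundle (every Nth item, starting from a zero-based offset) can be pictured with a short Python sketch. It illustrates the arithmetic only, under the assumption that items are numbered from zero in the order the indexer encounters them; the function name is hypothetical.

```python
def selected_indexes(total, interval, offset):
    """Zero-based positions chosen when taking every interval-th item from offset."""
    if interval < 1 or offset < 0:
        raise ValueError("interval must be >= 1 and offset must be >= 0")
    return list(range(offset, total, interval))


# With 10 items, an interval of 3 and an offset of 1, items 1, 4 and 7 are selected.
print(selected_indexes(10, 3, 1))
```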

C. Controlling how things are indexed.
  -noax    - Don't conflate accents.
  -unimap=<mapname> - specify a Unicode mapping to be applied when indexing
                      and when query processing. Supported values:
                      tosimplified, and totraditional. (Chinese only.)
  -deutsch=<i> - How much extra processing is done for umlaut and sz.
             0 - none.  München is indexed as München and Munchen
             1 - München is indexed as München, Muenchen and Munchen (Dflt)
              2 - As for 1 but also Muenchen is indexed as München, Muenchen and Munchen
             (As a side-effect to allow for compounds, SORT_SIGNIF is increased to 40.)
  -nz=<i> -      How much extra processing is done for Māori.
             0 - none.  Māori and Mäori are indexed as Māori or Mäori resp. and Maori (Dflt)
             1 - Māori is indexed as Māori, Maaori and Maori
                 Mäori is indexed as Mäori, Māori, Maaori and Maori
  -no_cjkt_grams - Suppress the indexing of bigrams/unigrams in CJKT text.  It is assumed that
                   said text has been pre-segmented into words, and that normal word-based indexing is needed.
  -QL_depth=<i> - Activate quicklinks on default pages of up to depth i. Use internal QL
                  defaults.  (Dflt 0 = Off)
  -QL_config=<f> - Activate quicklinks. Read quicklinks configuration options from
                  file f.
  -docscan_depth=<i> - When trying to determine doc type and charset
                  indexer will look up to i char.s into the fdoc.  (Dflt 20480)
  -forcexml - Use the XML parser on all documents.
  -case    - Store case information in postings. Currently unsupported. Note that setting
             this reduces the approximate max number of unique terms from ~950M to ~240M.
  -SORTSIG<num> - How many [UTF-8] characters in a word are significant. Default 30
  -dilw    - Don't index words or use words in summaries that are longer
             than what is set by -SORTSIG.
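
The -deutsch levels 0 and 1 described above can be pictured with a small Python sketch. This is only an illustration built from the München example in the option text, not the indexer's actual code: the function name and the mapping table are hypothetical, sz handling is left out because the usage message gives no example for it, and level 2 (the reverse mapping from Muenchen) is not modelled.

```python
# Hypothetical model of the -deutsch option, based only on the München example.
UMLAUTS = {
    "ä": ("ae", "a"), "ö": ("oe", "o"), "ü": ("ue", "u"),
    "Ä": ("Ae", "A"), "Ö": ("Oe", "O"), "Ü": ("Ue", "U"),
}


def deutsch_variants(word: str, level: int = 1) -> set[str]:
    """Return the forms under which a word would be indexed (levels 0 and 1 only)."""
    # Form with the diacritic simply removed, e.g. München -> Munchen.
    stripped = "".join(UMLAUTS[c][1] if c in UMLAUTS else c for c in word)
    variants = {word, stripped}
    if level >= 1:
        # Form with the conventional two-letter expansion, e.g. München -> Muenchen.
        expanded = "".join(UMLAUTS[c][0] if c in UMLAUTS else c for c in word)
        variants.add(expanded)
    return variants


print(sorted(deutsch_variants("München", level=0)))
print(sorted(deutsch_variants("München", level=1)))
```

A word without umlauts yields a single form at either level in this sketch.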

D. Controlling metadata indexing.
  -xml-config=<file> - <file> specifies a file defining XML indexing configurations in json format.
  -MM=<file> - <file> specifies a file defining metadata mappings for both HTML and XML documents.
  -XMF<file> - (Deprecated) <file> specifies a file defining XML field mappings.
  -MMF<file> - (Deprecated) <file> specifies a file defining meta tag mappings.
  -ifb       - Index a special word '$++' at the start and end of each metadata field (on by default).
  -noifb     - Do not index a special word '$++' at the start and end of each metadata field.
  -facet_item_sepchars=<string> - Which chars are used to separate metadata facet items.  [Dflt '|']
  -map[<f>]    - Map anchor text in source file to metafield f.  If <f> is absent,
                 outgoing anchortext is unfielded content. (dflt <f> absent)
  -EM<file>  - <file> is a file of external metadata, multiple files may be supplied by setting this multiple times.
  -externalMetaErrorThreshold=<num> - <num> is a configurable percentage for the error rate allowed in processing the external metadata file(s). It varies from 0 to 100. Default 10.
  -NIM       - Ignore explicitly specified internal metadata.
  -collfield=<f> - Index the name of a collection as metadata in each doc and assign to field f.
  -collection_name= - Set the name of the collection being indexed.
  -metadata_topk_capacity=<I> - Sets the maximum number of metadata names or XML paths padre
                                will keep track of for counting the most frequent metadata or
                                xpath that could be mapped.
  -metadata_topk_k=<I> - Sets the number of the most frequent metadata names or XML paths padre
                         should report on after indexing.

E. Controlling link and anchortext handling.
  -noank_record - Don't extract, record or index anchortext.
                  - .anchors.gz file not processed.  No link counts possible.
  -noank_index    - Extract and record but don't index anchortext.
                  - .anchors.gz file can be post-processed by annie-a
  -noank      - Temporary synonym for -noank_index.  Deprecated.
  -nocanon - Don't canonicalize URLs when storing URLs or matching anchortext.
  -canon.anchor_collapse=<on|off> - Controls whether PaDRE should canonicalise URLs with fragments (anchors).
            on:  PaDRE will drop the anchor symbol (#) and any characters that follow it.
            off: PaDRE will treat URLs with anchors as unique URLs.
  -dpdf    - Produce but don't process the anchors distilled file.
  -nep_action=<0|1|2>  - Action to take for nepotistic links.
                  0 - treat the same as other links.
                  1 - ignore links of types greater than nep_limit.
                  2 - limit the number of repetitions of links of types
                      greater than nep_limit. (dflt)
  -nep_limit=<0|1|2|3>  - Ignore nepotistic links of types greater than the limit.
                  0 - unaffiliated links from outside the target domain.
                  1 - links from a different host.
                  2 - links from the same or a closely affiliated host.
                  3 - dynamically generated links from such a host.
  -nep_cachebits=<i>  - Don't let the low-value link cache grow above 2^i
  -noaltanx - Don't index image alt as anchortext when an image is an anchor.
  -nosrcanx - Don't index image src as anchortext when an image is an anchor.
  -BL<f>   - <f> is a file of source URL patterns from which links should be ignored or treated with suspicion (Blacklist).
  -AD<f>   - <f> is a file of SECD (single entity controlled domain) affiliations.
             e.g. griffith.edu.au --> gu.edu.au
             Links to an affiliated SECD are classified as within-domain.
  -RP<f>   - <f> is a file of CGI parameters which should be removed from source and target URLs.
           - padre generates a regular expression from the lines in <f>.
           - if <f> is "conf_file" the regex will be taken from crawler.remove_parameters
             in the FunnelBack config file.
  -A<pat>  - <pat> is an acceptable link target pattern.
             - URLs not matching pat will not be stored in anchors.gz file.
             - if pat is "conf_file" pat will be taken from include_patterns
               in FunnelBack config file.
  -F<file>  - *<file> is an additional anchor text file.
  -FN<file>  - Like -F but source URLs need not be looked up.
  -RD<dir> - *<dir> is a directory in which to look
         - for redirects and duplicates files.
         - (produced by FunnelBack etc. & PADRE).
  -igmaf   - Ignore main anchors file.
  -mule<n> - Discard links to URL targets longer than <n> chars. Default is no limit.
  -rmat    - Record targets of failed anchor lookups via stdout.
 -create_phrase_metadata_terms=<b> - Enables the creation of phrase terms like "$++ foo bar $++" in the dictionary
                                     for metadata. These phrase terms can be used to speed up queries like a:"$++ foo bar $++".
                                     Phrases will only be created if indexing of field boundaries is enabled, which it is by default.
                                     Disabling may reduce indexing time and index size.
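
The effect of -canon.anchor_collapse described earlier in this section can be sketched in Python. This is a minimal illustration of fragment handling only, assuming that "on" means the URL is stored without its fragment; the function name is hypothetical and the sketch does not attempt PaDRE's other canonicalisation steps.

```python
from urllib.parse import urldefrag


def canonicalise(url: str, anchor_collapse: bool = True) -> str:
    """Hypothetical helper: drop the #fragment when anchor_collapse is on."""
    if anchor_collapse:
        # urldefrag splits off the fragment and returns the remaining URL.
        return urldefrag(url).url
    # With the option off, URLs differing only in their anchor stay distinct.
    return url


print(canonicalise("https://example.com/guide.html#install"))
print(canonicalise("https://example.com/guide.html#install", anchor_collapse=False))
```

With the option on, two links that differ only in their fragment collapse to the same stored URL.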

F. Controlling which index files are generated.
  -nomdsf  - Suppress generation of the .mdsf file.
  -nolex   - Suppress generation of the .lex file.
  -noqicf  - Suppress computation of QIC features and .qicf file.
  -nohostf - Suppress computation of host features and .ghosts file.
  -cleanup - Remove superfluous files from the index directory after index has completed.

G. Setting size limits.
  -GSB<n>  - How many gscope bytes to allow for. Default/Min: 8/2.
NOTE: This setting is no longer required as gscope bits are now auto-sized.
  -big<N>    - Multiply word table sizes by 2^N from base of 256K.  Default table size is 8M (ie. -big5).
  -small   - Divide word table sizes by 4 from base of 256K (i.e. use 64K).
  -chamb<num> - Set decompression chamber size to <num> MB. Default 32
  -RSDTF<num> - Set maximum characters in description & title fields in .results to <num>.  Default 256.
  -RSTAG<num> - Set number of bytes to reserve for tags in .results to num. Default 0.
  -RSTXT<num> - Set maximum characters in summarisable text per doc in .results to <num>.  Default 50000, maximum of 10000000.
  -W<num>  - Index-writing window will be <num> MB (Larger windows mean faster indexing at the expense of using more RAM). Default for a 64bit system is 2800
  -MWIPD<num>- Maximum words indexed per document (excluding anchors). By default all words are indexed
  -maxdocs<num>- Maximum no. of documents to index. Others are ignored.
  -mdsfml<n> - Set the number of bytes used for MetaData Summary Field Maximum Lengths. Fields larger than this number will be truncated. Default is 2048.
  -lock_string_mod_mode=[legacy|raw] - Sets how padre should modify the lock string before it is stored. 'legacy' mode removes some characters, replaces unquoted commas with new lines and removes consecutive new lines. 'raw' mode stores the lock string as is, up to the first null.
  -99%    - Limit on how full the word hash table can get.
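
The arithmetic behind -big<N> and -small can be checked with a few lines of Python. The sketch simply applies the figures quoted above (a 256K base, multiplied by 2^N or divided by 4); the function and constant names are hypothetical, and the unit of the table size is left as whatever the indexer counts.

```python
BASE_SIZE = 256 * 1024  # the "base of 256K" quoted in the option text
DEFAULT_BIG = 5         # the default table size is described as -big5


def word_table_size(big: int = DEFAULT_BIG, small: bool = False) -> int:
    """Hypothetical helper: table size implied by -big<N> or -small."""
    if small:
        return BASE_SIZE // 4   # -small: 256K / 4 = 64K
    return BASE_SIZE * 2 ** big  # -big<N>: 256K * 2^N


print(word_table_size())            # default, -big5
print(word_table_size(big=8))       # as used by -bigweb
print(word_table_size(small=True))  # -small
```

The default works out to 8M (256K * 32), which agrees with the figure given for -big5.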

H. Special indexing modes.
  -duplicate_urls=flag|ignore (Default is flag.)
            - Documents whose URL checksum is identical to that of another document
              are normally flagged and suppressed from results.
  -urlchecksums=case_sensitive|case_insensitive (Default is case_insensitive).
  -paidads  - If set, documents known to contain paid ads will be flagged specially (with the DOC_HAS_PAID_ADS flag).
  -doc_feature_regex=<Regex> - Documents matching the supplied pattern will be flagged as DOC_MATCHES_REGEX.
             The presence or absence of this feature can be used in the ranking function, controlled by cool29 and cool30.
  -iolap   - Overlap reading of bundles with processing them.
  -utf8input - Assume all input files whose charset is not specified are UTF-8 encoded. (Default is WINDOWS-1252.)
  -isoinput - Assume all input files whose charset is not specified are ISO_8859-1 encoded.
  -force_iso - Forcibly assume all input files are ISO_8859-1 encoded.
  -URLP<str> - When storing documents URLs, prepend <str>. (This is only used if the document does not indicate its own URL with a BASE HREF element, such as in local collections)
  -lmd       - HTTP LastModified date takes priority over metadata dates.
  -lmd_never - Completely ignore HTTP LastModified dates.
  -ignore_link_rel_canonical - Ignore canonical URL declarations in HTML link elements.
  -ignore_noindex - Ignore robots noindex meta element.
  -DT<str>   - Interpret <str> as start of new doc within bundle. (Not a regular expression).
               (note that there is a separate mechanism for XML).
  -annie[<exec>] - After normal indexing is complete, attempt to build an annotation index (annie)
                   and a spelling suggestion file.
                   Default executables are annie-a and build-spelling-index from whence padre-iw was run.
  -speller[<exec>] - Allows the explicit specification of a spelling_index builder to run after annie-a.
  -spelleroff - turns off spelling-index building even if annie-a runs.
  -spelling_threshold<i> - Annotations with fewer than i occurrences will not be considered
                           as spelling suggestions.  (dflt 1)
  -bigweb  - Space saving option for bundled large crawl indexes. Roughly equivalent to:
             -nomdsf -big8 -MWIPD2000 -W6000 -SORTSIG16 -nep_action=2 -nep_limit=2
             -nep_cachebits=20 -chamb64 -RSTXT2000 -mule128 -noaltanx -nosrcanx -nometa -quiet
              * A shorter average wordlength is assumed.
              * You can add e.g. -Axxx.com to cut anchor processing time.
              * (Don't forget to make dupredrex.txt in index directory.)

I. Miscellaneous options.
  -O<name> - <name> is the name of this organisation.
  -T<path> - Specify a large temporary filespace for use by the indexer.
  -redis_host=<str> - Hostname/IP of a Redis server where progress status should be written
  -redis_port=<i> - Port of the Redis server. Default is 6379

S. Security options.
  -security_level=<i> - Any non-zero value requires every document to have at least one lock. If set to 1 documents without locks will be excluded, if set to greater than 1 indexing will stop.
  -security_mindocs=<i> - Must be at least this number of docs with at least one lock.

     *** See also url_exclusion options in Section B above.

31. padre-mi

Purpose: To merge a list of PADRE indexes into a single such index.

Usage: padre-mi outstem instem instem ... [-overwrite] [-cleargscopes]
   -overwrite overrides protection against destroying existing outstem
   -cleargscopes clears all set gscopes from the resulting index

Make a merged index (outstem) from the list of at least two input indexes.

This version assumes that input indexes have exactly the same format,
i.e. that the index format strings are the same and that they have
identical numbers of gscope bits, numerical metadata fields and so on.
Future versions may check this compatibility, but currently exact
compatibility is assumed.  All manner of pestilence may descend upon you if
you use padre-mi on incompatible indexes.  You have been warned :-)

32. padre-qi

Purpose: To setup a query-independent-evidence file for use in query processing.

Usage: padre-qi index_stem file_of_url_patterns dflt_score [profile_name] [-verbose]

 - if a profile name is given, qiefile will be stem.qie_profile

Each URL in the index is matched against the patterns, in the
order in which they are listed in the pattern file.  Once a match
is found, matching ceases for that URL.  This behaviour can be
exploited to apply a general pattern (later in the file) if
no more specific pattern (earlier in the file) matches.
  To achieve exact matching use ^ (matches start of URL) and
$ (matches end of URL).

Lines in the patterns file consist of:

<qie score> <url-pattern>

qie-score  - a floating point number (assumed normalised to the range 0-1),
             specifying the qie score to be applied.

url-pattern   - a perl5 regular expression to be matched against name
                 strings in the .urls file (usually URLs).

Example:
0.25  ^(https://)?[^/]*nsw.gov.au/
1.0   ^(https://)?[^/]*wa.gov.au/
0.25  ^(https://)?[^/]*sa.gov.au/
0.25  ^(https://)?[^/]*nt.gov.au/
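
The first-match-wins behaviour described above can be sketched in Python using the example patterns. This is an illustration of the matching order only, not padre-qi itself: the function names are hypothetical, Python's `re` module stands in for perl5 regular expressions (the two agree for these simple patterns), and unanchored search semantics are an assumption.

```python
import re

PATTERN_FILE = """\
0.25  ^(https://)?[^/]*nsw.gov.au/
1.0   ^(https://)?[^/]*wa.gov.au/
0.25  ^(https://)?[^/]*sa.gov.au/
0.25  ^(https://)?[^/]*nt.gov.au/
"""


def load_patterns(text):
    """Parse '<qie score> <url-pattern>' lines, keeping file order."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        score, pattern = line.split(None, 1)
        rules.append((float(score), re.compile(pattern.strip())))
    return rules


def qie_score(url, rules, default_score):
    """Return the score of the first matching pattern, else the default."""
    for score, regex in rules:
        if regex.search(url):
            return score  # matching ceases at the first hit
    return default_score


rules = load_patterns(PATTERN_FILE)
for url in ("https://www.wa.gov.au/services",
            "https://www.nsw.gov.au/",
            "https://example.com/"):
    print(url, qie_score(url, rules, default_score=0.5))
```

Because matching stops at the first hit, a catch-all pattern placed last in the file acts as a fallback for URLs that no earlier, more specific pattern matched.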

33. padre-qs

Purpose: To generate query suggestions given an index and a partially typed query.

Usage: padre-qs -v | padre-qs stem=<stem>|collection=<collname> partial_query=<partial_query> [alpha=<f>] [show=<d>] [fmt=xml|json|json++] [callback=foo] [sort=0|1|2|3] [profile=<profile>] [debug=0|1|2|3], e.g.
padre-qs stem=/opt/funnelback/data/abc/live/idx/index partial_query=kevi alpha=0.5 show=10
Note: The functionality is implemented by a dynamic library which is usually called directly.
  - sort=0 (by weight), 1 (by length), 2 (in alphabetic order), 3 (by weighted combo of weight and length).
  - fmt=json => simple JSON array of suggestion strings;
       =json++ => full JSON object with all fields shown.
  - callback=foo => In JSON or JSON++ output, the response will be wrapped
         with the supplied callback (for JSONP).
  - show=<d> => how many suggestions to show.
  - alpha=<f> => if sort=3, score = alpha * weight + (1 - alpha) * length_score.
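
The sort=3 formula above can be tried out with a short Python sketch. The suggestion strings, weights and length scores below are invented purely for illustration, and how padre-qs derives length_score is not stated in the usage message, so it is treated here as a given input.

```python
def combined_score(weight: float, length_score: float, alpha: float) -> float:
    """score = alpha * weight + (1 - alpha) * length_score, as quoted for sort=3."""
    return alpha * weight + (1 - alpha) * length_score


def rank(suggestions, alpha: float, show: int):
    """Order (text, weight, length_score) tuples by combined score, best first."""
    ranked = sorted(
        suggestions,
        key=lambda s: combined_score(s[1], s[2], alpha),
        reverse=True,
    )
    return [text for text, _, _ in ranked[:show]]


# Hypothetical suggestions: (text, weight, length_score)
suggestions = [
    ("kevin rudd", 1.0, 0.5),
    ("kevin", 0.5, 1.0),
    ("kevin07", 0.25, 0.75),
]

print(rank(suggestions, alpha=1.0, show=10))  # weight only
print(rank(suggestions, alpha=0.0, show=10))  # length score only
```

Setting alpha to 1 reduces the ordering to weight alone and alpha to 0 to the length score alone; values in between blend the two.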

34. padre-query-parser

Usage: padre-query-parser -query=[Query to canon]
Returns to standard out a mostly canonicalised query.

35. padre-rf

Purpose: Generate a relevance-feedback query given a list of relevant documents in a collection.  Powers the Explore feature.

Usage: padre-rf -v | padre-rf -idx_stem=<index_stem> [-exp=<7..50>] [-comp=<comp_num> -dox=<docnum_list> | -url=<url>] ...

Details of available options:
R. -collection=<X>                      - The name of a collection, either meta
                                          or primary.
R. -script=<S>                          - Name of the CGI script to which
                                          padre-rf.cgi should redirect. (dflt "(null)")
R. -idx_stem=<Y>                        - The index stem for this collection,
                                          either meta or primary.  [Not CGI]
R. -exp=<I>                             - Maximum complexity of generated query
                                          (no. of words). (range 7 - 50) (dflt 10)
R. -deb_rf=<I>                          - Activate debugging output.  Higher
                                          values give more verbose output. (range 0 - 10) (dflt 0)
R. -comp=<I>                            - Component number within a meta
                                          collection. (range 0 - unlimited) (dflt 0)
R. -dox=<D>                             - Comma separated list of document
                                          numbers within current component.
R. -url=<E>                             - URL of document to be included in
                                          generation of RF query.

36. padre-show

Purpose: Display the contents of a padre index file in readable format.

Usage: padre-show <padre_index_file>
 -- if poss. displays contents of index file in text form.

 e.g. padre-show index.urls

37. padre-sk

Purpose: Create a skip block index from a regular padre index.

Usage: padre-sk <stem> <skip>

  Output will be in <stem>.idx_skip and <stem>.dct_skip

  <stem>   String: the index stem to use
  <skip>   Integer: the minimum number of postings between each skip block

38. padre-sr

Purpose: Display all or part of the content of the .results file (title, URL, description metadata, and candidate sentences for summary generation).

Usage: padre-sr stem|results_file [-titleonly] [-unco] [-ifff|-embedded|-text|-html|-textsigs]
      [starting_doc|starting_url] [urlpat=<regex>] [num_docs_to_show]

padre-sr sequentially reads the .results file and outputs all or part of the file to stdout in a choice of formats:

     . html (default)
     . embedded (incomplete html suitable for embedding in another html document)
     . text
     . textsigs (generate stem.textsigs file suitable for neardup detection.)
If -titleonly is given only the document titles are output. (not applic. to textsigs)

Use -unco to specify that the input doc. is in old uncompressed format.

If a starting document number or URL is given, output commences
only when that point in the file is reached.  Output continues
to the end of the file unless num_docs_to_show is given.
If urlpat= is given, only documents whose URL matches the pattern are
considered for display.  Case-sensitive unless specified otherwise
in the pattern.  Don't include 'http://' in the pattern.

39. padre-sv

Allowed options:
  -h, --help                          produce help message
  -v, --verbose                       be verbose
  -e, --engine arg (=hnswlib)         vector storage engine to use (default: hnswlib)
  -a, --algorithm arg (=hnsw)         algorithm for vector storage (default: hnsw, alternate: bruteforce)
  -d, --dimensions arg (=384)         dimensions for each element in vector storage (default: 384 - correct size for e5 small v2 model)
  --max-elements arg (=1000000)       maximum elements (documents) for vector storage. (default: 1000000)
  -M, --M arg (=64)                   internal dimensionality of data. Affects memory usage. (default: 64)
  -c, --ef-construction arg (=10)     Controls index search speed/build speed tradeoff (default: 10)
  -s, --space-type arg (=ip)          Space type options: either l2 or ip (inner product, default)
  -f, --db-filename arg               filename of vector storage database (required)
  -u, --embedding-api-server-url arg (=http://127.0.0.1)
                                      URL for embedding API server (default: 'http://127.0.0.1/')
  -r, --route-embedding-api-server arg (=/embeddings/v1)
                                      route for embedding api server (default: '/embeddings/v1')
  -m, --model arg (=intfloat/e5-small-v2)
                                      embedding API NN model (default: 'intfloat/e5-small-v2')
  --hash-collision-retries arg (=10)  hash collision retries (default: 10)
  --log-level arg (=info)             log level to use: info|trace|debug|warn|error|fatal|all|off
  -q, --query arg                     query for search vector to process
  --topk arg (=10)                    top K number of documents to return
  --chamber-size arg (=33554432)      Chamber size for uncompressed data, used by warc decompressor, in bytes (default: 32*1024*1024 bytes)
  --disable-embeddings                disables printing of embeddings details
  -g, --use-config                    Use config file for database as the actual parameter values for the pipeline

40. padre-sw

Purpose: Process queries using a PADRE index.

Usage: padre-sw <filestem> [option ...]
<filestem> - the common prefix of all the index files, or possibly
             the name of a Funnelback collection.

Available options:

A. Getting information about PADRE and its operation.
  -V       - Print version number and exit.
  -ixform  - Print index format version expected by this
             query processor and exit.
  -help    - Print this list.

Notation:
---------
 <B> - Boolean.  Will be interpreted as TRUE unless arg is 'off',
       'false' or '0' (case insensitive).
 <I> - Integer.  eg. 7 or 100000. Whole number in specified range.
 <F> - Number.  e.g. 1 or 0.537 or 99.5.  Some inputs of this type
       are constrained to lie within [0.0 - 1.0].
 <C> - Character.  e.g. a or A or : A single character.
 <S> - String.  eg. abc or "a b c".  Quotes needed around the
       key and value if spaces or punctuation included - for
       example: "-optionname=a b c".
 <K> - Key/value pair. These options take a key and a value,
       for example, -optionname.KEY=VALUE
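
The <B> rule above (TRUE unless the argument is 'off', 'false' or '0', case insensitive) is compact enough to express directly. The following Python sketch restates that rule; the function name is hypothetical and it is not taken from the padre-sw source.

```python
FALSE_VALUES = {"off", "false", "0"}


def parse_boolean(arg: str) -> bool:
    """Interpret a <B> argument: TRUE unless it is 'off', 'false' or '0'."""
    return arg.lower() not in FALSE_VALUES


for value in ("on", "OFF", "False", "0", "1", "yes"):
    print(value, parse_boolean(value))
```

Under this rule any value outside the three listed strings counts as TRUE.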

I. Contextual navigation options:

   -categorise_clusters=<B>             - Whether contextual navigation suggestions are grouped by type.
   -cnto=<F>                            - Set contextual navigation time-out to s seconds (s floating point). Processing
                                          may be omitted entirely if elapsed time for a query already exceeds s seconds.
                                          (dflt 1.0). (range 0.000000 - unlimited)
   -contextual_navigation=<B>           - Whether or not to activate the contextual navigation system.
   -contextual_navigation_fields=<S>    - String s lists the metadata fields, separated by commas surrounded by square
                                          brackets, to scan for contextual navigation suggestions. (dflt '[c,t]'). Note
                                          that scanning of document text can be suppressed by including a minus, for
                                          example '[-,c,t]'.
   -max_phrase_length=<I>               - Maximum length (in words) of contextual navigation suggestions. (range 3 - 7)
   -max_phrases=<I>                     - After this number of candidate phrases have been checked, contextual navigation
                                          processing will stop. (range 0 - unlimited)
   -max_results_to_examine=<I>          - Maximum number of search results to scan for contextual navigation suggestions. (range 0 - 200)
   -site_max_clusters=<I>               - Maximum number of site clusters to present in contextual navigation. (range 0 - unlimited)
   -topic_max_clusters=<I>              - Maximum number of topic clusters to present in contextual navigation. (range 0 - unlimited)
   -type_max_clusters=<I>               - Maximum number of type clusters to present in contextual navigation. (range 0 - unlimited)
J. Geospatial options:

   -geospatial_ranges=<B>               - Calculate geospatial distance from origin and bounding box ranges when
                                          geospatial data is configured and available.
   -maxdist=<F>                         - Exclude results not within <f> km of origin. (range 0.000000 - unlimited)
   -origin=<S>                          - <lat,long> Set origin to lat, long (floating point degrees).
K. Informational options:

   -canq=<B>                            - Write reordered queries to log.  (dflt off)
   -countIndexedTerms=<S>               - Metadata fields to have their indexed terms counted in the result set (DAAT
                                          only). Unlike rmcf multiple term occurrences in a single document are counted
                                          e.g. if metadata 'author' has 'Bob Ada|Bob|Bob' in two documents the resulting
                                          counts would be 'Ada': 2, 'Bob': 6. As this counts indexed terms long terms may
                                          be truncated depending on the indexer options used. To count fields 'a' and
                                          'c', set this to '[a,c]'.  [Not CGI]
   -countUniqueByGroup=<S>              - Counts the number of unique metadata values grouped by another metadata.
                                          Syntax: -countUniqueByGroup=[classToCount]:[groupBy],[classToCount]:[groupBy].
                                          Example: -countUniqueByGroup=[author]:[project] would show us the number of
                                          authors contributing to each project. classToCount is a regex and will be
                                          expanded to all matching metadata classes e.g. [author.*]:[project] might
                                          expand to -countUniqueByGroup=[author]:[project],[authors]:[project].  [Not CGI]
   -count_dates=<S>                     - Report facet counts for dates such as 'today', 'last week', 'this year'. Note
                                          that date categories may overlap.  Only value currently supported is 'd'.
   -count_urls=<I>                      - Display counts of results grouped by the URL path (Up to depth i). If <I> is 0,
                                          then the default value is used. Dflt 5. If <I> is not present count urls is
                                          turned off.  [Not CGI]
   -docsPerColl=<B>                     - Show the number of documents each collection contributed to the result set.
   -rmcf=<S>                            - Metadata fields to have their words counted in result sets (fields representing
                                          facets). If metadata 'author' has 'Bob Ada|Bob|Bob' in two documents the counts
                                          would be 'Bob Ada': 2 'Bob': 2. To count fields 'a' and 'c', set this to
                                          '[a,c]'.
   -rmrf=<S>                            - Numerical and geospatial fields listed will have their ranges calculated in
                                          result sets. To see the ranges of field 'height' and the bounding box
                                          geospatial field 'X' set this to '[height,X]'.
   -showtimes=<B>                       - Print elapsed times for each stage of query processing.
   -sum=<S>                             - The sum of a numeric metadata in result set. Syntax: -sum=[sumOn],[sumOn].
                                          Example: -sum=[size] would sum all values of numeric metadata 'size' in the
                                          result set. Note sumOn may be a regex which expands sumOn to all matching
                                          metadata classes e.g. -sum=[size.*] might expand to -sum=[sizeInKb],[sizeLoc].  [Not CGI]
   -sumByGroup=<S>                      - The sum of a numeric metadata by a group. Syntax:
                                          -sumByGroup=[sumOn]:[groupBy],[sumOn]:[groupBy]. Example:
                                          -sumByGroup=[size]:[project] would sum all values of numeric metadata 'size'
                                          grouped by 'project' giving output project 'Foo' has size '128', project 'Bar'
                                          has size '12'. Note sumOn may be a regex which expands sumOn to all matching
                                          metadata classes e.g. -sumByGroup=[size.*]:[project] might expand to
                                          -sumByGroup=[sizeInKb]:[project],[sizeLoc]:[project].  [Not CGI]
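
The difference between -rmcf and -countIndexedTerms in the 'Bob Ada|Bob|Bob' example above can be reproduced with a small Python sketch. It models only the counting rules as worded in the option descriptions: the function names are hypothetical, '|' is assumed to be the facet item separator, and real indexed terms may be case-folded or truncated in ways not shown here.

```python
from collections import Counter

# Two documents whose 'author' metadata is the value from the example.
author_values = ["Bob Ada|Bob|Bob", "Bob Ada|Bob|Bob"]


def rmcf_counts(values, sep="|"):
    """Count each distinct facet item once per document."""
    counts = Counter()
    for value in values:
        counts.update(set(value.split(sep)))
    return counts


def indexed_term_counts(values, sep="|"):
    """Count every occurrence of every word, including repeats within a document."""
    counts = Counter()
    for value in values:
        for item in value.split(sep):
            counts.update(item.split())
    return counts


print(dict(rmcf_counts(author_values)))
print(dict(indexed_term_counts(author_values)))
```

The first function counts whole facet items per document, while the second counts individual words with repeats, which is why 'Bob' reaches 6 in one case and 2 in the other.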
L. Logging options:

   -ip_to_log=<S>                       - What form of ip to include in log files: (nothing|ip|ip_hash|remote_user).
   -log=<B>                             - Write query log entries (dflt on).  [Not CGI]
   -qlog_file=<S>                       - If writing query log entries, write them to <FILE>.  [Not CGI]
   -username=<S>                        - A string identifying the current user to be used in padre's query log.
M. Miscellaneous options:

   -countgbits=<S>                      - s is either "all" or a comma-separated list of gscope bitnumbers for which
                                          counts are needed. (Bits numbered from zero.)
   -exit_on_bad_component=<B>           - Fail when a component has an incompatible index relative to the first (rather
                                          than skip).
   -flock=<B>                           - Use flock when locking the query logfile.  If set to no, lockf is used instead.
                                          Default on Solaris is 'no', all other systems 'yes'.
   -mat=<I>                             - Set matchset size to n million (dflt 24). Only need to increase on very large
                                          collections. (range 0 - 2147) [Not CGI]
   -ndt=<B>                             - Don't do tests on docs, e.g. phantom, zombie, *scope, binary, expired.  [Not CGI]
   -unbuf=<B>                           - Don't buffer the standard output stream. In some specific cases, setting this
                                          to 'no' can improve performance.
   -view=<S>                            - The collection view to perform the query against when in CGI mode. Normally
                                          'live' (default), 'offline' or 'snapshot###'.
N. Presentation options:

   -EORDER=<I>                          - Specify presentation order of query biased summary excerpts. 0: natural order
                                          in doc. 1: sorted by score. (dflt 0) (range 0 - 1)
   -MBL=<I>                             - Set buffer length per displayed metadata field to n bytes (dflt 250 bytes).
                                          Warning: setting very large values will increase query processor memory demands
                                          and may cause problems. (range 1 - unlimited)
   -SBL=<I>                             - Set summary buffer length to n bytes.  (dflt 250 bytes) (range 1 - unlimited)
   -SF=<S>                              - Metadata fields to include in summaries. (if applicable). To include fields
                                          `author` and `d` set this to `[author,d]`. This option also supports regex: to
                                          include all metadata classes set this to `[.*]`; to include fields prefixed
                                          with `Fun` and metadata class `author` set this to `[Fun.*,author]`.
   -SHLM=<I>                            - Select highlighting method within snippets in XML. 0 - No highlighting ; 1 -
                                          HTML strong tags ; 2 - Show highlighting regexp. and unhighlighted summary
                                          [dflt];  5 - Use HTML strong tags but remove accents from summary before
                                          highlighting, provided query was not accented. (range 0 - 7)
   -SM=<S>                              - Summary mode. Possible values are 'both' (or 'def') - Display description or
                                          query-bias summary and metadata fields listed in the 'SF' option; 'snip' -
                                          display a generated snippet; 'meta' - display metadata fields listed in 'SF'.;
                                          'qb' - display a query-biased summary; 'auto' - Print metadata codes if
                                          specified in user query.; 'off' - Turn off all summaries.
   -SQE=<I>                             - Set max no. of query biased summary excerpts to n  (dflt 3). (range 1 - 10000)
   -all_summary_text=<B>                - Is text used for generating summaries required in the result
   -countUniqueByGroupSensitive=<B>     - Treat group names and metadata items case sensitively (default no).  [Not CGI]
   -ctest_mode=<I>                      - Controls behaviour of padre-sw when -ctest is used.  0: no internal evaluation;
                                           1 - internal evaluation only.  Output is brief plain text report of measures;
                                          2 - internal evaluation only. Output in plain text with QBQ output followed by
                                          measures;  3 - internal evaluation plus normal CTOUT output in XML (with
                                          measures presented as comments) (range 0 - 3)
   -explain=<B>                         - Explain rankings by showing score components. (Note that -explain=on turns off
                                          result set diversification).
   -explore=<I>                         - Show 'explore' links against results.  The value specifies how many terms to
                                          include in the expanded query. (range 7 - 50)
   -gscoperesult=<S>                    - Specifies the bit number that results will be set to in -res gscope or -res
                                          docnums modes (dflt 1).
   -mdsfhl=<B>                          - Are query terms only highlighted in MDSF metadata summaries
   -num_ranks=<I>                       - Limit number of results to n (min = 0, dflt = 10). (range 0 - unlimited)
   -num_tiers=<I>                       - Limit number of result list tiers to n (min = 0 (no limit), max = 50, dflt no
                                          limit) (range 0 - 50)
   -qieval=<F>                          - Set the value presented for query independent evidence when using the qiecfg
                                          result format. (dflt 0.5). (range 0.000000 - 1.000000)
   -qwhl=<S>                            - Determines which parts of a search result are highlighted.  S - snippet, M -
                                          metadata, U - URL, T - title.  E.g. -qwhl=MUT
   -res=<S>                             - Set result format. Possible values are:  `trec`, `web`, `xml`, `urls`, `qiez`,
                                          `qieo`, `gscope`, `docnums`, `ctest`, `qiecfg` or `flcfg`. Note: setting res to
                                          docnums, flcfg, gscope, qiecfg, qieo or qiez will override any num_ranks
                                          setting so that all results are returned.
   -results_in_facet_categories=<I>     - Include the specified number of pre-computed search results under the rmc count
                                          element for metadata facet categories. (range 0 - 100)
   -rmc_maxperfield=<I>                 - Set maximum number of RMC items to display per field at n (dflt 100). (range 0 - unlimited)
   -rmc_sensitive=<B>                   - Treat facet categories (RMC items) case sensitively (default no).  [Not CGI]
   -show_qsyntax_tree=<B>               - Include an SVG representation of the query-as-processed in output.
   -start_rank=<I>                      - Present results starting from n (dflt 1). (range 1 - unlimited)
   -sumByGroupSensitive=<B>             - Treat group names case sensitively (default no).  [Not CGI]
   -tierbars=<B>                        - Display tierbars in result list output (XML and HTML). When turned on (for all
                                          -res modes) and -sort is used, results will be first sorted by tier then by the
                                          sorting mode, otherwise if -sortall is used then all results will be sorted
                                          regardless of tier.
   -translucent_DLS_fields=<S>          - Metadata fields which are translucent. Translucent fields are visible on
                                          documents which the user can not see. To include fields 'a' and 'd' set this to
                                          '[a,d]'. If collapsing is enabled and the collapsing signature contains only
                                          fields defined here, then collapsing will be permitted on documents the user can
                                          not see.   [Not CGI]
O. Query interpretation options:

   -STOP=<S>                            - Use the stoplist specified in <file> (one word per line)  [Not CGI]
   -binary=<I>                          - Determines whether or not binary documents are returned in the results.  0 -
                                          show all documents;  1 - show only binary documents;  2 - show only non-binary
                                          documents. (range 0 - 3)
   -clive=<S>                           - Dynamic metacollections.  Specifies a component name within a .sdinfo file(s)
                                          to make active. Can be set multiple times to enable multiple collections.
   -daat_termination_type=<I>           - Selects how DAAT early exit is determined.  0 - try for d results with every
                                          metafield and every component;  1 - try for d results over every component but
                                          not necessarily every metafield;  2 - stop as soon as d results are obtained.
                                          (d is the parameter to -daat.) (range 0 - 2)
   -daat_timeout=<F>                    - Impose a soft timeout (in seconds) on the time taken by the DAAT machinery for
                                          one query. (range 0.000000 - 3600.000000) [Not CGI]
   -dont_estimate_full_matches=<B>      - In DAAT mode don't guess the number of full matches when the DAAT depth did not
                                          let us process an entire postings list.
   -events=<B>                          - Must be set if event search is to be used
   -fmo=<B>                             - Present full matches only.
   -lang=<S>                            - If a 2-character language code is specified by this means, then stemmers etc
                                          specific to that language will be used, IF AVAILABLE.  It is also permissible
                                          to use a 5-character code like en_GB, but padre behaviour will be the same as
                                          for en.  Specifying lang also makes title and metadata sorting of results
                                          locale-specific, however support for this on Windows platforms is limited and
                                          problematic.
   -loose=<I>                           - Phrase looseness in words (min = 0, dflt = 0). (range 0 - unlimited)
   -max_qbatch=<I>                      - Terminate batch query processing after the specified number of queries have
                                          been processed. (range 1 - unlimited)
   -max_terms=<I>                       - Truncate queries after the specified number of terms.  If the query is
                                          reordered, truncation will occur after reordering. (range 1 - unlimited)
   -min_truncated_len=<I>               - The text part of a query term with a right truncation operator must have at
                                          least this length.  E.g. if min_truncated_len were 4 funnel* would be accepted
                                          but fun* would be processed as fun. (range 0 - 20) [Not CGI]
   -noexpired=<B>                       - Exclude expired docs from results.  (Nullified by -zom)  [Not CGI]
   -nulqok=<B>                          - An empty query submitted via CGI will be processed as a null query. The system
                                          query must be empty as well.  (dflt is to ignore the request).  [Not CGI]
   -phrase_prox_word_limit=<I>          - Phrase or proximity terms with more than this number of words will be shortened
                                          by deleting words from the right.  E.g. If this limit were 4 then `to be or not
                                          to be` would be processed as `to be or not` (range 1 - unlimited) [Not CGI]
   -prox=<I>                            - Proximity limit in words (min = 0, dflt = 15). (range 0 - unlimited)
   -qsup=<S>                            - When blending queries, determines sources of supplementary queries to be tried,
                                          with corresponding weights assigned to each source (ranging from 0 to 1).  No
                                          spaces. 'off' may be specified to disable supplementary queries.  E.g.
                                          -qsup=SPEL/0.9+USUK/0.4+SYNS/0.1+LANG/0.1. Available sources are: SPEL
                                          (spelling suggestions); USUK (table of spelling differences between US and UK
                                          English); SYNS (synonyms as defined by the blending.cfg file); LANG
                                          (experimental German decompounding); VSYN (vector synonyms as defined by the
                                          vector_blends.cfg file)
   -query_reorder=<B>                   - Reorder terms in query so that the most discriminating (least common) appear
                                          first. Often coupled with -max_terms=
   -ras=<I>                             - Remove any stopwords from the query. Possible values: 0 - remove none; 1 -
                                          remove dynamically depending on the query; 2 - remove all stopwords (dflt 1). (range 0 - 2)
   -service_volume=<S>                  - Either 'high' or 'low'.  A convenience setting to increase or reduce allowable
                                          query complexity and timeouts according to service volumes -- large or small
                                          indexes, high or low query volumes.  [Not CGI]
   -stem=<I>                            - Controls stemming of queries. 0 - do not stem (dflt); 1 - do not stem (replaces
                                          obsolete option); 2 - Stem all query words (light - English/French
                                          plural/singular only); 3 - Stem all query words (heavier). (range 0 - 3)
   -stem_lconly=<B>                     - When stemming, stem only lowercase query words (to avoid stemming proper names
                                          and acronyms).
   -strip_invalid_utf8=<B>              - Normally, invalid UTF-8 characters are removed during indexing.  If this hasn't
                                          happened, this option allows them to be removed from result packets.
   -synonyms=<B>                        - If set, the query processor will expand queries using thesaurus in synonyms.cfg.
   -truncation_allowed=<I>              - Enables the use of the * operator (binary valued). It is only valid when used
                                          with an option that disables DAAT mode, such as -service_volume='lo' or -daat=0.
                                          When enabled, all contexts are available, such as: *:funnelback, funnel*, *back,
                                          and *:*elba*. (range 0 - 3) [Not CGI]
   -wildcard_thresh=<I>                 - If the postings list for a term is longer than the specified value (in MB) it
                                          will be treated as a wildcard. (range 0 - unlimited)
   -zom=<B>                             - Include docs in results even if noindex or killed.
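
The -qsup value format and the term rewriting described for -min_truncated_len and -phrase_prox_word_limit can be illustrated with a short Python sketch. This is not padre code: the function names are hypothetical and the logic only mirrors the examples given in the option descriptions above.

```python
def parse_qsup(value):
    """Parse a -qsup value such as 'SPEL/0.9+USUK/0.4' into {source: weight}."""
    if value == "off":
        return {}  # 'off' disables supplementary queries
    weights = {}
    for part in value.split("+"):
        source, weight_text = part.split("/")
        weight = float(weight_text)
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {source} must be between 0 and 1")
        weights[source] = weight
    return weights


def apply_min_truncated_len(term, min_len):
    """Drop a trailing * when the text before it is shorter than min_len."""
    if term.endswith("*") and len(term) - 1 < min_len:
        return term[:-1]
    return term


def apply_phrase_word_limit(phrase, limit):
    """Shorten a phrase to at most `limit` words by deleting words from the right."""
    return " ".join(phrase.split()[:limit])
```

With min_len set to 4, apply_min_truncated_len keeps funnel* and turns fun* into fun, matching the example in the -min_truncated_len description.
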
P. Query source options:

   -ctest=<S>                           - Read a batch of queries from testfile (in C_TEST format). Sets output format to
                                          RM_CTEST, but that may be overridden. (See es.csiro.au/C-TEST/ for information
                                          about C-TEST.)  [Not CGI]
   -s=<S>                               - System-generated query inserted behind the scenes by a form or front-end.
Q. Quicklinks options:

   -QL=<I>                              - Activate QuickLinks facility for default pages down to the specified level. 0 -
                                          off;  1 - server root pages; 2 - next level down. (range 0 - 5)
   -QL_rank=<I>                         - If QuickLinks capability is active, show quick links for search results down to
                                          the specified rank. (range 1 - unlimited)
   -QL_rank_is_relative=<B>             - If true, the value of QL_rank will be interpreted relative to the start_rank.
                                          E.g. if QL_rank=2, the first two results on each page may show QuickLinks.
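
The interaction between -QL_rank, -QL_rank_is_relative and -start_rank can be sketched as follows. This is a reading of the descriptions above, not padre's implementation, and the function name is hypothetical.

```python
def quicklink_ranks(ql_rank, start_rank=1, relative=False):
    """Return the result ranks that may show QuickLinks.

    Absolute mode: ranks 1..ql_rank.
    Relative mode: the first ql_rank results of the page starting at start_rank.
    """
    first = start_rank if relative else 1
    return range(first, first + ql_rank)
```

For a page starting at rank 11 with QL_rank=2, the relative mode covers ranks 11 and 12, while the absolute mode still covers only ranks 1 and 2.
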
R. Ranking options:

   -SameSiteSuppressionExponent=<F>     - Same site suppression penalty exponent (dflt 0.5, recommended range 0.2 - 0.7). (range 0.000000 - unlimited)
   -SameSiteSuppressionOffset=<I>       - Number of additional documents from a site beyond the first that are allowed
                                          their full score before applying a same site suppression penalty (dflt 0) (range 0 - 1000)
   -absscores=<B>                       - Report content scores as % of max possible Okapi score (Intended for use with
                                          -vsimple=on).
   -anniemode=<I>                       - Control the use of annotation indexes. 0 - do not use annotation indexes ; 1 -
                                          Process queries using annotation indexes only; 2 - Process queries using
                                          annotation indexes, falling back to normal indexes if insufficient results.
                                          (Most query op.s stripped.) 3 - Process queries using both annotation and
                                          normal indexes (Most operators stripped from queries.). Default 0. (range 0 - 3)
   -b=<F>                               - Set Okapi b to f (dflt 0.75) (range 0.000000 - unlimited)
   -cgscope1=<S>                        - Documents matching this gscope expression (reverse Polish) can be upweighted
                                          with -cool.68.  Those not matching can be upweighted with -cool.70.
   -cgscope2=<S>                        - Documents matching this gscope expression (reverse Polish) can be upweighted
                                          with -cool.69.  Those not matching can be upweighted with -cool.71.
   -cool=<B>                            - Whether to use topic distillation scoring (cool and cooler). Dflt on.
   -cool.<K=V>                            - cool.N=V Set a value for the Nth tune parameter.
               Possible values for N are:
       0 | content: Content weight
       1 | onlink: On-site link weight
       2 | offlink: Off-site link weight
       3 | urllen: URL length weight
       4 | qie: External evidence (QIE) weight
       5 | date_proximity: Date proximity weight
       6 | urltype: URL attractiveness
       7 | annie: Annotation weight (ANNIE)
       8 | domain_weight: Domain weight
       9 | geoprox: Proximity to origin
       10 | nonbin: Non-binariness
       11 | no_ads: Advertisements
       12 | imp_phrase: Implicit phrase matching
       13 | consistency: Consistency of evidence
       14 | log_annie: Logarithm of annotation weight
       15 | anlog_annie: Absolute-normalised logarithm of annotation weight
       16 | annie_rank: Annotation rank
       17 | BM25F: Field-weighted Okapi score
       18 | an_okapi: Absolute-normalised Okapi score.
       19 | BM25F_rank: Field-weighted Okapi rank
       20 | mainhosts: Main hosts bias
       21 | comp_wt: Data source component weighting
       22 | document_number: Document number in the index
       23 | host_incoming_link_score: Host incoming link score
       24 | host_click_score: Host click score
       25 | host_linking_hosts_score: Host linking hosts score
       26 | host_linked_hosts_score: Host linked hosts score
       27 | host_rank_in_crawl_order_score: Host rank in crawl
       28 | host_domain_shallowness_score: Domain shallowness
       29 | doc_matches_regex: Document URL matches regex pattern
       30 | doc_does_not_match_regex: Document URL does not match regex pattern
       31 | titleWords: Normalized title words
       32 | contentWords: Normalized content words
       33 | compressionFactor: Normalized compressibility of document text
       34 | entropy: Normalized document entropy
       35 | stopwordFraction: Normalized stop word fraction
       36 | stopwordCover: Normalized stop word cover
       37 | averageTermLen: Normalized average term length
       38 | distinctWords: Normalized distinct words
       39 | maxFreq: Normalized maximum term frequency
       40 | titleWords_neg: Negative normalized title words
       41 | contentWords_neg: Negative normalized content words
       42 | compressionFactor_neg: Negative normalized compressibility of document text
       43 | entropy_neg: Negative normalized document entropy
       44 | stopwordFraction_neg: Negative normalized stop word fraction
       45 | stopwordCover_neg: Negative normalized stop word cover
       46 | averageTermLen_neg: Negative normalized average term length
       47 | distinctWords_neg: Negative normalized distinct words
       48 | maxFreq_neg: Negative normalized maximum term frequency
       49 | titleWords_abs: Absolute title words
       50 | contentWords_abs: Absolute content words
       51 | compressionFactor_abs: Absolute compressibility of document text
       52 | entropy_abs: Absolute document entropy
       53 | stopwordFraction_abs: Absolute stop word fraction
       54 | stopwordCover_abs: Absolute stop word cover
       55 | averageTermLen_abs: Absolute average term length
       56 | distinctWords_abs: Absolute distinct words
       57 | maxFreq_abs: Absolute maximum term frequency
       58 | titleWords_abs_neg: Negative absolute title words
       59 | contentWords_abs_neg: Negative absolute content words
       60 | compressionFactor_abs_neg: Negative absolute compressibility of document text
       61 | entropy_abs_neg: Negative absolute document entropy
       62 | stopwordFraction_abs_neg: Negative absolute stop word fraction
       63 | stopwordCover_abs_neg: Negative absolute stop word cover
       64 | averageTermLen_abs_neg: Negative absolute average term length
       65 | distinctWords_abs_neg: Negative absolute distinct words
       66 | maxFreq_abs_neg: Negative absolute maximum term frequency
       67 | lexical_span_score: Lexical span
       68 | doc_matches_cgscope1: Document matches `cgscope1`
       69 | doc_matches_cgscope2: Document matches `cgscope2`
       70 | doc_does_not_match_cgscope1: Document does not match `cgscope1`
       71 | doc_does_not_match_cgscope2: Document does not match `cgscope2`
       72 | raw_annie: Raw ANNIE
   -daat=<I>                            - Specifies the maximum number of full matches for Document-At-A-Time processing.
                                          If set to 0, Term-At-A-Time is used instead (dflt 5000). (range 0 - 10000000)
   -diversity_rank_limit=<I>            - Diversification won't alter ranking beyond rank n (default 200, min 10). (range 10 - unlimited)
   -facet_url_prefix=<S>                - Present only results whose URL is prefixed by the given URL. Note that the
                                          scheme and hostname part are case insensitive, for URI with scheme smb:// the
                                          entire prefix is case insensitive. The behaviour of this option may change in
                                          the future to suit facets, this should not be used outside of faceted
                                          navigation.  [Not CGI]
   -gscope1=<S>                         - Present only results whose gscope bits match reverse Polish expression `e`
                                          (Bits numbered from zero). If set to `off`, disable any previous expression.
   -k1=<F>                              - Set Okapi K1 to <f>. (dflt 2.0) (range 0.000000 - unlimited)
   -kmod=<I>                            - Select special scoring function i for special fields.  0 = normal, 1 = AF1
                                          (dflt 1). (range 0 - 1)
   -lscope=<S>                          - Present only results whose URL matches a sort-of left-anchored pattern.
   -lscorrect=<B>                       - Whether to correct link scores across meta collection components (default yes).
   -main_homepage_factor=<F>            - Penalise score of the homepage of a single-entity-controlled domain to prevent
                                          over-representation in result sets.  E.g. www.anu.edu.au/ in an index of ANU.
                                          (dflt 0.90) (range 0.000000 - 1.000000)
   -meta_suppression_field=<S>          - If same_meta_suppression is activated, the specified metadata field will be the
                                          field to which it applies.  Only one metadata field can be treated in this way.
   -near_dup_factor=<F>                 - The query processor will penalise a result which is a near-duplicate of a
                                          previous result  by multiplying by the factor specified. The penalty stiffens
                                          with more repetition. (dflt 0.5) (range 0.000000 - 1.000000)
   -promote_urls=<S>                    - Insert the specified URLs at or near the top of the results list for a query.
                                          Value is a space separated list of URLs.  URLs must correspond to those
                                          recorded by padre-iw.  (dflt Inactive)
   -quanta=<I>                          - Set the number of possible score quantisation levels for each cool variable.
                                          In general, a high number should give more accurate ranking but may slow query
                                          processing. (range 10 - 100000)
   -rank_limit=<I>                      - Limit highest rank requestable to n (dflt 1,000,000,000). (range 10 - unlimited)
   -ranking_profile=<I>                 - Choose a profile of settings for the ranking function.  0 - current default; 1
                                          - Standard BM25; 2 - Traditional (pre-12.0) Funnelback.  Setting a profile does
                                          not override explicit settings. (range 0 - 100) [Not CGI]
   -recency_decay_vals=<S>              - <z,w,m,y,d,c,m> - Define how recency scores decay with time. z, w, m, y, d, c, m
                                          are floats in the range 0 - 1, which specify the recency score assigned to
                                          documents 0 days, 1 wk, 1 mth, 1 yr, 1 dec, 1 cen, 1 mill. old. (dflt
                                          1.0,0.75,0.5,0.25,0.025,0.0025) Recency scores between key values linearly
                                          interpolated. Past the millennium, recency scores are 1/daysold.
   -reference_date=<S>                  - If specified, recency is based on this date rather than that of most recent
                                          doc. Format is <yyyymmdd>, or 'today'.
   -remove_urls=<S>                     - Prevent the specified URLs from appearing in the results for a query.  Value is
                                          a space separated list of URLs.  URLs must correspond to those recorded by
                                          padre-iw.  (dflt Inactive)
   -sco=<S>                             - <n>[<classes>] Set doc scoring mode to n, using the classes specified.  Most
                                          common values: 0 - score using doc text only ;  1 - no scoring.  Produce an
                                          unordered set of results ; 2 - score using anchortext and URLs as well,
                                          upweight titles (or whatever fields are configured with -specf). For example to
                                          automatically look in fields 'u' and 'v' for the query terms set -sco=2[u,v]
   -scope=<S>                           - Present only results whose URL satisfies the include/exclude scopes included in
                                          list (comma separated). e.g. -scope=anu.edu.au,-anu.edu.au/archives
   -sort=<S>                            - Sort top results by <string>. Possible values: 'date', 'adate' (ascending
                                          date), 'title', 'dtitle' (descending title), 'size' (file size), 'dsize'
                                          (descending filesize), 'url', 'durl' (descending url), 'coll' (collection name,
                                          then score), 'dcoll' (descending collection name, then score), 'meta<f>' (by
                                          metadata field f, then score), 'dmeta<f>' (descending metadata field f, then
                                          score), 'shuffle' (random to avoid bias), 'collapse_count' (to order by the
                                          number of collapsed documents, with the largest collapsed set first),
                                          'acollapse_count' (with the largest collapsed set last), 'prox' (for geo
                                          search:  Sort top results by proximity to origin), 'dprox' (for geo search:
                                          Sort top results by descending proximity to origin), 'score_ignoring_tiers'
                                          (descending score, ignoring any tiers. Only useful with sortall.) (dflt is
                                          case-insensitive for title and meta). '-sort=' turns off sorting.
   -sort_sensitive=<B>                  - Use case-sensitive sorting when sorting results by title or metadata strings.
   -sortall=<B>                         - Include partial matches in the resorting performed by -sort.
   -specf=<S>                           - Fields listed in string s, as a list of comma separated fields surrounded by
                                          square brackets, will be scored specially and added to query when using the
                                          -sco=2 mode (dflt '[k,K]').
   -sss_defeat_pattern=<S>              - URLs matching the specified pattern (currently a simple string match) will not
                                          be subject to samesite suppression.
   -static_cool_exponent=<F>            - Control the extent to which static scores are attenuated with length of query.
                                          0 => no attenuation; 1 => max attenuation. Attenuation by len ** -f. (range 0.000000 - 1.000000)
   -unknown_daysold=<I>                 - A doc with unknown date is assumed to be d days old (for recency calcs) (dflt
                                          366). (range 0 - unlimited)
   -use_Paik=<B>                        - Use the tf.idf scheme proposed by Jiaul Paik at SIGIR 2013 rather than the more
                                          conventional BM25 variant.
   -use_secds=<B>                       - When working with domain-importance features in ranking, use SECDs if value is
                                          on, and raw domain names otherwise.
   -vsimple=<S>                         - Very simple ranking. If set to 'on', equivalent to -sco=0 -cool=off -SSS=0
                                          -kmod=0.
   -weight_only_fields=<S>              - Documents will not be retrieved in DAAT mode if they only match unfielded query
                                          terms in one or more of the implicit fields listed here.  For example,
                                          specifying '[K,k]' will stop the query 'Monica Lewinski' matching a document
                                          solely because of click data or referring anchortext.
   -wmeta.<K=V>                           - wmeta.C=F Set upweighting factors for metadata class scoring. C - metadata
                                          class;  F - weight to set. (dflt 0.5 for 'k' and 'K', 1 for everything else).
   -xscope=<S>                          - Present only results whose URL exactly matches the provided URL (after
                                          canonicalization).
S. Ranking - Result diversification options:

   -SSS=<I>                             - Same site suppression depth: 0 - no suppression (dflt);  2 - hosts and their
                                          top level dir's; 10 - org domain (includes sub-domains) e.g. defence.gov.au. (range 0 - 10)
   -neardup=<F>                         - Near duplicates in ranking are multiplied by f. Setting f to 1 turns off
                                          near-dup detection. (range 0.000000 - 1.000000)
   -repetitiousness_factor=<F>          - Penalise a repetitious result by multiplying by the factor specified.
                                          (Repetitiousness may involve same-site, same component or repeated metadata.)
                                          The penalty stiffens with more repetition. Setting to 1 turns this off.  (dflt
                                          1.0) (range 0.000000 - 1.000000)
   -same_collection_suppression=<F>     - While searching a meta-collection, penalise the second result from the same
                                          primary collection as a previous result by multiplying by the factor specified.
                                          The penalty stiffens with more repetition. Setting to 0 turns this off. (dflt 0) (range 0.000000 - 1.000000)
   -same_meta_suppression=<F>           - Penalise the second result with the same value for a specified metafield as a
                                          previous result by multiplying by the factor specified. The penalty stiffens
                                          with more repetition. Setting to 0 turns this off. (dflt 0) (range 0.000000 - 1.000000)
   -title_dup_factor=<F>                - The query processor will penalise a result which has exactly the same title as
                                          a previous result by multiplying by the factor specified. The penalty stiffens
                                          with more repetition. Setting to 1 turns this off. (dflt 0.5) (range 0.000000 - 1.000000)
T. Result collapsing options:

   -collapsed_docs_sort=<S>             - Sort collapsed results by <string>. Possible values: 'date', 'adate' (ascending
                                          date), 'title', 'dtitle' (descending title), 'size' (file size), 'dsize'
                                          (descending filesize), 'url', 'durl' (descending url), 'coll' (collection name,
                                          then score), 'dcoll' (descending collection name, then score), 'meta<f>' (by
                                          metadata field f, then score), 'dmeta<f>' (descending metadata field f, then
                                          score), 'shuffle' (random to avoid bias), 'prox' (for geo search:  Sort
                                          collapsed results by proximity to origin), 'dprox' (for geo search:  Sort
                                          collapsed results by descending proximity to origin), 'score_ignoring_tiers'
                                          (descending score, ignoring any tiers. Only useful with sortall.)
   -collapsing=<B>                      - Activate collapsing. Collapsing will be based on document content ('$') unless
                                          a collapsing_sig value is specified. Note that use of this option will disable
                                          result set diversification.
   -collapsing_SF=<S>                   - Metadata fields to include in display for collapsed documents (assuming
                                          collapsing_num_ranks is non-zero).  (dflt no fields). To view metadata fields
                                          'id' and 'a' set this to '[id,a]'.
   -collapsing_label=<S>                - Label to indicate why items have been collapsed.  (dflt "which are very
                                          similar")
   -collapsing_num_ranks=<I>            - Specify how many collapsed results are to be shown under the uncollapsed ones.
                                          (dflt 0) (range 0 - 1000)
   -collapsing_scoped=<B>               - Scope to only documents which have been collapsed on. Default is off.
   -collapsing_sig=<S>                  - The collapsing_control segment to use when collapsing.  E.g. "[a,p]", collapse
                                          on author+publisher. The value must correspond to one segment of the
                                          indexing.collapse_fields string. (A segment is a comma separated list of fields
                                          surrounded by square brackets) (dflt '[$]' (Collapsing on document content.))
U. Security options:

   -dls_internal_test=<I>               - This allows testing of the padre side of the custom document level security
                                          mechanism. There is no call out to an external function. The value is
                                          interpreted as a combination of bits:  1 bit - dls_internal_test is active/not
                                          active; 2 bit - selects whether MINRESULTS mode is used or not. During internal
                                          testing, every odd numbered document in the original ranking is arbitrarily
                                          treated as inaccessible. (range 0 - unlimited)
   -ipreject=<S>                        - `QUERY_LIMIT,WINDOW_SECONDS,UPPER_QUERY_LIMIT` - Use an IP rejector to limit
                                          requests from a single machine.  Allow `QUERY_LIMIT` queries per
                                          `WINDOW_SECONDS`, don't record more than `UPPER_QUERY_LIMIT` queries.  [Not CGI]
   -ldLibraryPath=<S>                   - Full path to security plugin library  [Not CGI]
   -locking_model=<S>                   - Name of locking model, either "trim" or "sharepoint".  [Not CGI]
   -no_security=<B>                     - Disable DLS, available as a command line option.  [Not CGI]
   -secPlugin=<S>                       - Name of security plugin library  [Not CGI]
   -translucent_DLS=<B>                 - Enables translucent DLS (DAAT only).  [Not CGI]
   -userkeys=<S>                        - Conduct this search with security keys specified by s. The format is
                                          '<collectionName>;key<delim>' where delim is either ',' or a new line. Spaces
                                          are removed. For example: 'c1;k1<newline>c2;k1,c2;k2'  [Not CGI]
V. Spelling options:

   -spelling=<B>                        - Activate spelling suggestion mechanism.
   -spelling_alpha=<F>                  - Set the weighting between 'closeness to the query' and support in the
                                          collection for a candidate suggestion. Big alpha, high weight on closeness to
                                          the query. (dflt 0.7) (range 0.000000 - 1.000000)
   -spelling_blend_thresh=<F>           - Confidence threshold for automatically blending results for a query suggestion
                                          with those from the user's original query. (dflt 0.67) (range 0.000000 - 1.000000)
   -spelling_difflen_thresh=<I>         - Don't make suggestions more than i characters longer or shorter than query.
                                          (dflt 2) (range 0 - 1000)
   -spelling_dym_thresh=<F>             - Confidence threshold for making a 'Did you mean' suggestion. (dflt 0.5) (range 0.000000 - 1.000000)
   -spelling_edist_constant=<F>         - Don't make suggestions whose edit distance from the query exceeds f +
                                          query_length * spelling_edist_proportion. (dflt 1) (range 0.000000 - 1000.000000)
   -spelling_edist_proportion=<F>       - Don't make suggestions whose edit distance from the query exceeds
                                          spelling_edist_constant + query_length * f (0<=f<=1). (dflt 0.25) (range 0.000000 - 1.000000)
   -spelling_fullmatch_trigger_const=<F>- Don't look for suggestions if there are at least f * log10(num docs) full
                                          matches. (dflt 30.0) (range 0.000000 - unlimited)
   -spelling_include_context=<B>        - Include the non-corrected part of the query in the suggestion link. (dflt 1)
   -spelling_min_querylen=<I>           - Suggestions not made for queries shorter than this. (dflt 2) (range 1 - 1000)
   -spelling_wt_thresh=<F>              - Sets a threshold that determines if a spelling suggestion is returned. If the
                                          generated spelling suggestion weight is less than this, the suggestion is not
                                          returned. (dflt 0.01) (range 0.000000 - 100.000000)
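
The edit-distance, length-difference and full-match limits above reduce to simple arithmetic. The Python sketch below works through them with the documented defaults. The function names are hypothetical and are not part of padre-sw, and how padre rounds or compares these values internally is not stated in the usage message.

```python
import math

def max_edit_distance(query_length, edist_constant=1.0, edist_proportion=0.25):
    """Largest edit distance allowed for a suggestion:
    spelling_edist_constant + query_length * spelling_edist_proportion."""
    return edist_constant + query_length * edist_proportion

def fullmatch_trigger(num_docs, trigger_const=30.0):
    """Full-match count at or above which no suggestions are looked for:
    spelling_fullmatch_trigger_const * log10(num docs)."""
    return trigger_const * math.log10(num_docs)

def length_ok(query, suggestion, difflen_thresh=2):
    """True if the suggestion is within spelling_difflen_thresh characters
    of the query length (longer or shorter)."""
    return abs(len(suggestion) - len(query)) <= difflen_thresh

# An 8-character query with default settings allows an edit distance of 1 + 8 * 0.25.
print(max_edit_distance(8))
# A collection of one million documents with the default constant of 30.
print(fullmatch_trigger(1_000_000))
```

With the defaults, longer queries tolerate proportionally more edits, while the full-match trigger grows only logarithmically with collection size.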
W. TREC specific options:

   -trec_runid=<S>                      - For TREC participation: Each result in TREC format will include this runid.
   -trec_topic=<I>                      - For TREC participation: The first query in a batch will get this topic number.
                                          Each new query will increase the number by one. (range 0 - unlimited)
   -trecids=<B>                         - For TREC participation: Each result in TREC format will use the TREC docno
                                          rather than a URL

41. padre-term-list-generator

Allowed options:
  -h, --help                        produce help message
  -v, --verbose                     be verbose
  -o, --output-file arg             Output filename (e.g. index.query_terms)
  --log-level arg (=info)           Log level to use: info|trace|debug|warn|error|fatal|all|off
  --column arg (=2)                 Column for CSV data to extract, starting from 0 (first column)
  --chamber-size arg (=1073741824)  Chamber size for uncompressed data, used by warc decompressor, in bytes (default: 1GB, 1*1024*1024*1024 bytes)
  --stop-words arg                  Stop words file to use
  --case-sensitive-compare          Case-sensitive compare with stop words list
  --days-to-include arg (=90)       Number of days to include

42. padre-topk

Missing required argument -input=<input file>
List the top-k most frequent items
Usage: padre-topk -input=<input file> [-capacity=<integer>] [-k=<integer>]

Where:
  * -input: is the file of items where each item is delimited by new line
  * -capacity: this is the limit as to the number of items that will be held in memory at once.
  * -k: this is the number of items that will be shown at the end.

Efficiently (compared to some other algorithms) estimates the count of the most frequent top-k items
e.g. for a,b,c,a,b,a the top-2 would be a with a count of 3 followed by b with a count of 2
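
The usage message does not name the algorithm padre-topk uses. As an illustration only, the Python sketch below uses the Space-Saving algorithm, one well-known way to estimate top-k counts while holding at most a fixed number of items in memory. The function name and the default values for k and capacity are assumptions.

```python
def top_k(items, k=10, capacity=1000):
    """Estimate the k most frequent items, tracking at most `capacity` items."""
    counts = {}
    for item in items:
        if item in counts:
            counts[item] += 1
        elif len(counts) < capacity:
            counts[item] = 1
        else:
            # Table is full: evict the item with the smallest count and let
            # the newcomer inherit that count plus one (Space-Saving rule).
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]

print(top_k("a,b,c,a,b,a".split(","), k=2, capacity=10))
```

When capacity is at least the number of distinct items the counts are exact. With a smaller capacity the reported counts can be over-estimates, which is the trade-off for bounded memory.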

43. padre-vector-synonym

Allowed options:
  -h, --help                          produce help message
  -v, --verbose                       be verbose
  -e, --engine arg (=hnswlib)         vector storage engine to use (default: hnswlib)
  -a, --algorithm arg (=hnsw)         algorithm for vector storage (default: hnsw, alternate: bruteforce)
  -d, --dimensions arg (=1024)        dimensions for each element in vector storage (default: 1024 - correct size for AWS cli cohere model)
  --max-elements arg (=1000000)       maximum elements (documents) for vector storage. (default: 1000000)
  -M, --M arg (=64)                   internal dimensionality of data. Affects memory usage. (default: 64)
  -c, --ef-construction arg (=10)     Controls index search speed/build speed tradeoff (default: 10)
  -s, --space-type arg (=ip)          Space type options: either l2 or ip (inner product, default)
  -f, --db-filename arg               filename of vector storage database (required)
  -o, --overwrite-db                  overwrites the file specified in db-filename, instead of loading it
  -u, --embedding-api-server-url arg (=http://127.0.0.1)
                                      URL for embedding API server (default: 'http://127.0.0.1/')
  -r, --route-embedding-api-server arg (=/embeddings/v1)
                                      route for embedding api server (default: '/embeddings/v1')
  -m, --model arg (=cohere.embed-english-v3)
                                      embedding API NN model
  --hash-collision-retries arg (=10)  hash collision retries (default: 10)
  --dry-run                           Dry run, do not actually use/connect to an embedding server
  --log-level arg (=info)             log level to use: info|trace|debug|warn|error|fatal|all|off
  --chamber-size arg (=33554432)      Chamber size for uncompressed data, used by warc decompressor, in bytes (default: 32*1024*1024 bytes)
  -g, --use-config                    Use config file for database as the actual parameter values for the pipeline
  --stop-words arg                    Stop words file to use
  --index-stem arg                    Index stem to use for .dctwds or .sdinfo
  --term-list arg                     Term list file to use
  --topk arg (=10)                    top K number of documents to return
  --max-score arg (=1)                Maximum score for synonym search
  --api-token arg                     api token for embedding-api server (not used at the moment)
  --output-file arg                   Output filename of the (default named) index.vector_blends file
  --fbe-option                        Uses the my-funnelback-embeddings docker image, instead of AWS CLI (NB: need to change dimensions to 384 to work properly with FBE, also need to specify model as "intfloat/e5-small-v2")
  --aws-sdk                           Uses the AWS C++ SDK to connect to AWS instead of AWS CLI -- currently disabled, using AWS CLI instead
  --aws-cli                           Uses the AWS CLI to connect to AWS

44. pan-look

Purpose: Efficient location of all lines in a sorted text file which match a prefix.

Usage: pan-look <prefix> <file name>
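
A sorted file allows prefix matches to be located by binary search rather than a full scan. The Python sketch below shows the idea on an in-memory list of lines. It is not the pan-look implementation: the function name is hypothetical, and it assumes the lines are sorted in plain code-point order.

```python
import bisect

def look(prefix, sorted_lines):
    """Return all lines in an ascending sorted list that start with prefix."""
    # Binary search for the first line that is >= prefix.
    start = bisect.bisect_left(sorted_lines, prefix)
    matches = []
    # Matching lines are contiguous, so stop at the first non-match.
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break
        matches.append(line)
    return matches

lines = ["apple", "apply", "banana", "band", "bandit", "cat"]
print(look("ban", lines))
```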

45. phrasefinder

Purpose: Extract frequently occurring word tuples ('phrases') from a collection.

Usage: phrasefinder <stem> [-unco] [-hash_limit=<i>] [-num_to_show=<i>] [-max_tuple=<i>] [-debug]

phrasefinder reads the .results file of <stem> and locates candidate word tuples
('phrases') up to a configurable maximum length.  Phrases are sequences
of up to max_tuple words which are unbroken by anything other than space.

Candidates are stored in a hashtable and counted.  At most
hash_limit candidates are stored.  Once this limit is reached, the program
exits. (Useful for testing or for limiting execution time and virtual
memory requirements.)  When processing finishes, the top num_to_show
candidates are sorted in descending order of frequency and output in
<stem>.phrases, in the same format as the .lex file, with word breaks
represented by hyphens.
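
The counting scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the phrasefinder source: it reads plain text rather than a .results file, counts tuples of 2 to max_tuple words, uses arbitrary default values, and returns what has been counted so far when the candidate limit is reached.

```python
import re
from collections import Counter

def find_phrases(text, max_tuple=3, hash_limit=100000, num_to_show=10):
    """Count word tuples and return the most frequent ones, hyphen-joined."""
    counts = Counter()
    # Anything other than word characters and spaces breaks a phrase.
    for segment in re.split(r"[^\w ]+", text.lower()):
        words = segment.split()
        for n in range(2, max_tuple + 1):
            for i in range(len(words) - n + 1):
                phrase = "-".join(words[i:i + n])
                # Stop as soon as a new candidate would exceed the limit.
                if phrase not in counts and len(counts) >= hash_limit:
                    return counts.most_common(num_to_show)
                counts[phrase] += 1
    # Descending order of frequency, as in the description above.
    return counts.most_common(num_to_show)

sample = "information retrieval is fun. information retrieval works"
print(find_phrases(sample, num_to_show=1))
```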

46. run-with-flock

Purpose: Runs a command with an exclusive file lock held.

Usage: run-with-flock file_lock_path lock_acquired_path cmd arg0 arg1 ... argn


This will open (creating it if necessary) 'file_lock_path' and take an
exclusive file lock on it. It will then create the file
'lock_acquired_path'. If all of that succeeds, the given command is
executed while the lock is held.
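
The sequence above can be sketched in Python for POSIX systems as follows. This is an illustration only: the function name is hypothetical, and whether run-with-flock blocks while waiting for the lock or fails immediately is not stated in the usage message, so blocking is assumed here.

```python
import fcntl
import subprocess

def run_with_flock(file_lock_path, lock_acquired_path, cmd):
    """Run cmd (a list of arguments) while holding an exclusive file lock."""
    # Open the lock file, creating it if it does not exist.
    with open(file_lock_path, "a") as lock_file:
        # Assumption: wait until the exclusive lock can be taken.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        # Create the marker file to signal that the lock is now held.
        open(lock_acquired_path, "w").close()
        # The lock stays held until lock_file is closed at the end of the block.
        return subprocess.run(cmd).returncode
```

A call such as `run_with_flock("/tmp/x.lock", "/tmp/x.acquired", ["echo", "locked"])` would return the exit status of the command. The paths in that call are placeholders.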

47. show_annotations_for_doc

Purpose: Given an annotation index, summarise the annotations applied to a given URL.

Usage: show_annotations_for_doc <index_stem>|<collection> <URL>|<URL64>|<DOCNO> [-csv|-html|-xml]
  - Show the annotations applying to the specified URL and their weights.
  - Don't forget to quote or escape shell metacharacters in a URL!
  - Default output format is XML.

48. url_tagger

Purpose: Apply the tags in a tag mapping file to a PADRE index.

Usage 1: url_tagger stem (-clear|<url-tags-file>)

Usage 2: url_tagger -v

url_tagger -clear clears all tags from all documents in the index.
url_tagger <url-tags-file> takes a url-tags file and applies it to
the relevant entries in <stem>.results.

url_tagger -v shows version information.

Lines in the url-tags file are in the form: <url> <comma-separated-taglist>
It is assumed that <url> contains no space and tags contain no commas.
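
The url-tags line format can be read with a few lines of Python. The sketch below is a hypothetical helper for checking a url-tags file before passing it to url_tagger, not part of the tool itself. It assumes that any run of whitespace separates the URL from the tag list and that blank lines are ignored, neither of which is stated in the usage message.

```python
def parse_url_tags(lines):
    """Map each URL to its list of tags, given '<url> <tag1,tag2,...>' lines."""
    tags_by_url = {}
    for line in lines:
        parts = line.split(None, 1)  # URL, then the rest of the line
        if not parts:
            continue  # skip blank lines (assumption)
        url = parts[0]
        taglist = parts[1].strip() if len(parts) > 1 else ""
        # Tags contain no commas, so a plain split is sufficient.
        tags_by_url[url] = [tag for tag in taglist.split(",") if tag]
    return tags_by_url

example = ["http://example.com/a news,sport", "", "http://example.com/b faq"]
print(parse_url_tags(example))
```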