PADRE binaries usage

This page lists all the PADRE binaries and their corresponding usage message.

Usage documentation for all PADRE executables

1. FineTune

Purpose: Tuning padre-sw ranking parameters based on a C-TEST file.

Usage: FineTune <collection>[.<profile>] ... [-help] [-verbose[=<level>]] [-timeout=<hours>] [-query_limit=<num_queries>] [-alpha=<f>] [-rvalues=<i>] [-adjust=<i>] [-sample=on|<number>] [<mode> ...] [-conf] [-qp=padre_subpath] [-index_dir.<collection>=<index directory>] [-lock_file=<file to lock>]
   e.g. FineTune lse -daat -annieonly
   e.g. FineTune agosp.doha -timeout=7.5
        (Use -daat0 to tune term-at-a-time.)
        (Timeouts and query limits:  Apply separately to each mode. Defaults
         are 5 hours and 1 million queries. After a timeout, or when the
         query limit has been exceeded, the best tuning found so far for
         that mode will be recorded in the .best file for the mode.)
        (-alpha sets the balance between success rate and wmum1 in tuning. [Dflt 0.75]
          Value must lie between 0 (ignore success rate) and 1 (ignore wmum1))
        (-rvalues - sets no. of values to explore for optype=2 (real)) [Dflt 11]
        (-adjust - sets no. of steps to remove when adjusting exploration range
                   for optype=2 (real) dimensions.) [Dflt 5]
        (-conf extracts the mode to tune from collection.cfg. (N/A for multituning.))
        (-help gives more detailed instructions and exits.)
        (-index_dir can be used to set the index directory for a particular collection.
         The directory must contain an index prefixed by 'index'.)
        (-lock_file can be used to lock a file for the entire duration of tuning. If
          the lock cannot be acquired, tuning will not start.)
        (-redirect_stdout can be used to redirect stdout to a given file.)
        (-redirect_stderr can be used to redirect stderr to a given file.)
        (-write_finish_time_to writes the tuning finish time in ISO-8601 to the given
          file.)
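
The -alpha description above only says that 0 ignores success rate and 1 ignores wmum1. A minimal sketch of one way such a balance could work, assuming a simple linear blend (the blend itself and the function name tuning_objective are assumptions, not FineTune's documented formula):

```python
def tuning_objective(success_rate: float, wmum1: float, alpha: float = 0.75) -> float:
    """Blend two tuning measures; alpha=0 ignores success_rate, alpha=1 ignores wmum1.

    ASSUMPTION: a linear blend. The usage text does not state the real formula.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * success_rate + (1.0 - alpha) * wmum1


if __name__ == "__main__":
    # With the documented default of 0.75, success rate dominates the score.
    print(tuning_objective(0.8, 0.4))
```

Both endpoints behave as the usage text describes: at alpha=0 only wmum1 contributes, and at alpha=1 only the success rate does.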

2. QiTune

Purpose: Tuning padre-sw query-independent ranking parameters based on a C-TEST file.

Usage: QiTune stem C-TESTfile

Given a PADRE index and a file of useful URLs (extracted from the
C-TEST file), compute a set of query-independent cool settings
suitable for passing to padre-do which (hopefully) optimise the
difference in average scores between the useful docs and the general collection.


3. SpellTune

Purpose: Tuning PADRE spelling suggestion system based on a test file e.g. mycoll.spelltest.

Usage: SpellTune <collection>[.<profile>] ... [-tune_bsi] [-help] [-verbose[=<level>]] [-timeout=<hours>] [-query_limit=<num_queries>] [-rvalues=<i>] [-adjust=<i>] [-sample=on|<number>] [-qp=padre_subpath]
   e.g. SpellTune lse -annieonly
   e.g. SpellTune agosp.doha -timeout=7.5
        (Timeouts and query limit defaults are 5 hours 
         and 1 million queries. After a timeout, or when the
         query limit has been exceeded, the best tuning found so far
         will be recorded in the .bestspell file for the mode.)
        (-tune_bsi - tunes the build_spelling_index params.  Slow.)
        (-rvalues - sets no. of values to explore for optype=2 (real)) [Dflt 11]
        (-adjust - sets no. of steps to remove when adjusting exploration range
                   for optype=2 (real) dimensions.) [Dflt 5]
        (-help gives detailed instructions.)




4. annie-a

Annie version 1.13 (11 Mar 2010)
Purpose: Builds an annotation index for a collection, specified by <stem>, from a list of files in anchors.gz format.

Usage: annie-a <stem> [<stem_or_file> ...] [-phrasefile=<filename>] [-deb] [-hashbits <10..30>] [-maxlines <n>] [-wts <wt0> <wt1> <wt2> <wt3> <wt4>] [-stripstops] [-STOP=<filename>] [-canon] [-rejecturls] [-rejectnumeric] [-quicken] [-maxwds <i>] [-maxlen <i>] [-build_annou=on/off] [-build_lcache=on/off] [-nep_limit=0|1|2|3]
    <stem> must reference either a meta collection or a primary index.  
    <stem_or_file> may be either a stem as above or the name of a file in anchors.gz format.
    If <stem> or <stem_or_file> is a meta collection, annie-a will look for the anchors.gz files from each of the component collections and use them to create the annotation index for the collection specified by <stem>. If any anchors.gz file changes for a component collection, annie-a will need to be run again for the meta collection.

-quicken    improves query performance by using <coll id, doc id> pairs. The coll id depends on the sdinfo file: if the sdinfo file is changed, annie-a will need to be run again for the meta collection with this option. It is recommended that the most recent collection be placed at the top of the sdinfo file.


5. annie-quicken

Purpose: Convert URL references in an annotation index into (component, docno) to speed query processing.

Usage: annie-quicken anno_stem index_stem

6. build_autoc

Purpose: To build a query completion file (.autoc) from a list of input files.

Usage: build_autoc stem input_file ... [-collection name -profile name] [-partials] [-label_organics] [-debug]
       e.g. build_autoc index blah.csv
       where blah.csv will be sorted and indexed into index.autoc.  Input_file(s)
       must end in .csv, .suggest, or .cfg

      -profile <name> - generate scoped .autoc file for the specified
                        profile. A previous run of build_autoc must have
                        been called with -index.
   -collection <name> - generate scoped .autoc file for the specified
                        collection. Both -profile and -collection need
                        to be specified when generating scoped suggestions.
            -partials - this version allows multi-word organic
                        suggestions to be triggered either from the full
                        suggestions or from trailing word sequences.  E.g.
                        'big fat cat' triggered from 'fat cat' and 'cat' as
                        well as the full string. This option turns that on.
      -label_organics - present a category label for all the organic completions
      -sample <val>   - Sample postings of suggestion terms, to handle large
                        collections, <val> ranges 0 - 300; speeds up processing
                        with the effect of sampling the suggestions. 
                        (1/val postings are used).
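
The -partials behaviour described above ('big fat cat' triggered from 'fat cat' and 'cat') can be sketched as follows. This is an illustration of trailing word sequences only; the function name partial_triggers is hypothetical and build_autoc's real tokenisation may differ:

```python
def partial_triggers(suggestion: str) -> list[str]:
    """Return the full suggestion followed by each of its trailing word sequences."""
    words = suggestion.split()
    return [" ".join(words[i:]) for i in range(len(words))]


if __name__ == "__main__":
    print(partial_triggers("big fat cat"))
    # prints ['big fat cat', 'fat cat', 'cat']
```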

Note: build_autoc can now build a single .autoc file from multiple input
files of the same or different types.  Files with very simple
format can be combined with hand-crafted files containing
complex actions.  Completion weights from a .suggest file
are automatically determined, while they can be manually specified
in a CSV file.  Completion weights from Best Bets 
default to 100.

SUGGEST FORMAT
--------------
.suggest files built by build_spelling_index can be supplied as
input.  Reasons for doing this include taking advantage of an index
optimised for completion purposes; and integrating automated spelling
suggestions with hand-crafted entries.

CFG FORMAT
----------
Input files with .cfg suffix are expected to be in Best Bets
format.  Only exact-match lines (beginning with '+') are considered
and the target of each such line is recorded as a suggestion.

CSV FORMAT
----------
Each line of a .csv file must contain eight fields (7 commas),
corresponding to: key, weight, display, display_type, category,
category_type, action, and action_type.  Fields except key and
weight may be empty.
Two meta characters are recognized within a field: backslash and
double quote.  These are handled as follows in the two cases:

(A) Unquoted text: A single backslash is not passed through, while
the character following it is passed through without applying
any tests. This means that a double backslash in input leads to
a single backslash in output and that commas or double quotes
preceded by a backslash do not have their normal meaning.
(B) Quoted text:  The double quotes beginning and ending a quoted
section are not passed through.  Within a quoted section a double
quote may be passed through by either doubling it ("") or by
preceding it with a backslash (\").

By these means it is possible to pass through HTML and/or JSON
containing quotes and/or commas.
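
The quoting rules (A) and (B) above can be sketched as a small field splitter. This is not build_autoc's parser; split_autoc_csv_line is a hypothetical name, and the handling of a backslash inside quotes that is not followed by a double quote (kept literally here) is an assumption, since the rules above do not cover it:

```python
def split_autoc_csv_line(line: str) -> list[str]:
    """Split one CSV line into fields using rules (A) and (B)."""
    fields: list[str] = []
    current: list[str] = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        nxt = line[i + 1:i + 2]  # next character, or "" at end of line
        if in_quotes:
            # Rule (B): "" or \" inside quotes yields one literal double quote.
            if ch in ('"', "\\") and nxt == '"':
                current.append('"')
                i += 2
                continue
            if ch == '"':
                in_quotes = False  # closing quote is not passed through
            else:
                current.append(ch)  # commas (and lone backslashes) are literal
        else:
            if ch == "\\":
                # Rule (A): drop the backslash, pass the next character untested.
                current.append(nxt)
                i += 2
                continue
            if ch == '"':
                in_quotes = True  # opening quote is not passed through
            elif ch == ",":
                fields.append("".join(current))
                current = []
            else:
                current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


if __name__ == "__main__":
    sample = r'key,10,"say ""hi"", ok",a\,b,c\\d,,"x\"y",'
    print(split_autoc_csv_line(sample))
```

The sample line has seven separating commas and so yields the eight fields the format requires; the escaped comma in a\,b and the comma inside the quoted section do not split fields.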


7. build_match_only_index

Purpose: Create a match-only index, which build_autoc can use to build
profiled query suggestions.

Usage: build_match_only_index stem


8. build_spelling_index

Purpose: To build a spelling suggestion file (.suggest/.suggest2) for a collection.

Usage: build_spelling_index index_stem num_thresh [<metadata_class_names> [<lexiweight> [<blacklist_file> [<whitelist_file>]]]]
       e.g. build_spelling_index index 2 [@,t,c]
       where the listed comma separated metadata class names,
       '@,t,c', are the ones to be scanned for
       suggestions.  '@' means use the .anno file.  '%' means use
       unfielded words from index.lex.  '+' means use phrases from
       index.phrases (if present).  If no fields are listed, "@,+,t,%"
       is assumed.
       num_thresh - minimum weight of suggestions recorded in suggest index
       lexiweight - controls the weight of lexicon suggestions relative
         to annotations.  wt = lexiweight * sqrt(df) (dflt lexiweight = 1.00)
       blacklist_file - manual list of suggestions which should NOT be included
       in the index. (one per line)
       whitelist_file - manual list of suggestions to include in the index.
       (one per line.)
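
The lexicon weighting formula above, wt = lexiweight * sqrt(df), together with the num_thresh cut-off, can be illustrated as below. The function names are hypothetical, and treating num_thresh as an inclusive minimum is an assumption based on the wording "minimum weight":

```python
import math


def lexicon_weight(df: int, lexiweight: float = 1.0) -> float:
    """Weight of a lexicon suggestion: lexiweight * sqrt(document frequency)."""
    return lexiweight * math.sqrt(df)


def is_recorded(weight: float, num_thresh: float) -> bool:
    """ASSUMPTION: a suggestion is kept when its weight is at least num_thresh."""
    return weight >= num_thresh


if __name__ == "__main__":
    for df in (1, 4, 16):
        wt = lexicon_weight(df)
        print(df, wt, is_recorded(wt, num_thresh=2))
```

With num_thresh set to 2 and the default lexiweight, a lexicon word would need to occur in at least 4 documents to be recorded under this reading.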

9. csv2ctest

Purpose: To convert a tuning file in CSV format into a C-TEST file for use with e.g. FineTune.

Usage: csv2ctest infile.csv [-utils=equal|-utils=sqrt|-utils=recip] [-queryweights]

   Output in C-TEST format will be in infile.ctest

The input file is assumed to be a syntactically correct comma separated
value (CSV) file in which cells are separated by commas.  Double quotes
around all or part of a cell allow inclusion of commas.  The quotes are
stripped off before processing.  The input file may contain comment
lines starting with a hash.

The first column in infile.csv is always assumed to contain a query.
If no options are given, then the remaining columns contain desired
answer URLs for that query, in descending order of utility.  Utility
scores start at 4 and then gradually decline to 1: 4, 3, 2, 1, 1, 1 ...
This behaviour may be modified as follows:

  -utils=equal   - All of the answers are given equal utility values.
  -utils=sqrt    - Utility values drop off as 1/sqrt(rank).
  -utils=recip   - Utility values drop off faster -- as 1/rank.

  -queryweights  - if this is given, the second column is expected to
                   contain the numerical weight associated with the query
                   and the remaining columns contain the answer URLs
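
The utility schemes described above can be sketched as a function of the 1-based answer rank. The function name utility is hypothetical, and the constant used for -utils=equal (1.0) is an assumption, since the text only says the values are equal:

```python
import math


def utility(rank: int, scheme: str = "default") -> float:
    """Utility of the answer URL at a 1-based rank under each documented scheme."""
    if rank < 1:
        raise ValueError("rank is 1-based")
    if scheme == "default":
        return float(max(1, 5 - rank))  # 4, 3, 2, 1, 1, 1 ...
    if scheme == "equal":
        return 1.0  # ASSUMPTION: the actual constant is not documented
    if scheme == "sqrt":
        return 1.0 / math.sqrt(rank)
    if scheme == "recip":
        return 1.0 / rank
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    for scheme in ("default", "equal", "sqrt", "recip"):
        print(scheme, [round(utility(r, scheme), 3) for r in range(1, 7)])
```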


10. dump_annotation_file

Purpose: To display the contents of an annotation index in geek-readable form.

Usage: dump_annotation_file <annotationfile>

11. dump_autoc

Purpose: To display the contents of a query completion file in geek-readable format.

Usage: dump_autoc <stem|collection|autoc_file>

12. dump_suggestion_file

Purpose: To display the contents of a spelling suggestion file in geek-readable format.

Usage: dump_suggestion_file <index_stem>
 - dumps contents of <index_stem>.suggest

13. get_docnum_from_url

Purpose: Map a URL to that document's number within an index stem.

Usage: get_docnum_from_url <index_stem> <url>

Prints the docnum for a given URL to standard out.
Prints "notfound" if the URL is not found.
  <index_stem> - the common prefix (including path) of the index files
  <url>        - the URL to look up the document number for

14. get_url_from_component_document_pair

Purpose: Within an index, output the URL of the document identified by component number and document number.

Usage: get_url_from_component_document_pair <index_stem> <component_number> <document_number>
Warning:  Doesn't handle nested .sdinfo files.  (Hierarchical meta collections.)


15. get_url_from_docnum

Purpose: Print the URL of a document in a primary index, given its document number. Inverse of get_docnum_from_url.

Usage: get_url_from_docnum <index_stem> <doc_num>

16. harvest_anchortext

Purpose: Extract a subset of entries in a list of anchors.gz files which match a specified pattern.

Usage 1: harvest_anchortext -targ|-text|-any|-source pattern anchor_text_file ...

Usage 2: harvest_anchortext -noneps <affiliates_file> anchor_text_file ...

Extracts a subset of lines in the anchor_text_files.  The composition
of the subset depends upon the match_type argument as follows:

-any - Any line (source or target) which matches pattern
-targ - Any target line whose URL target matches pattern
-text - Any target line whose anchortext matches pattern
-source - Any source line which matches pattern

+ In usage 2, links within the same SECD (single-entity-controlled-domain)
  are suppressed, as are links between affiliated pairs of hosts listed in
  the affiliates_file. If there is no affiliates_file, use '-'.

+ Whenever a target line matches, the corresponding source line is also output.
+ Whenever a source line matches its corresponding target lines are also output.
+ NOTE: nepotistic links are included unless -noneps is used.


17. hierarchical_navpaths

Purpose: Extract hierarchical navigation paths from a list of anchors files.

Usage: hierarchical_navpaths <stem> [-verbose] [<anchor_text_file> ...] 

Reads <stem>.anchors.gz, plus any additional anchortext files to
identify hierarchical navigation paths (HNPs).  These are output
to <stem>.hnp.anchors.gz in standard anchors.gz format:
<target_url> --- [H]<concatenated anchors from path>

+ NOTES: 
    1. inter-host links are ignored.
    2. -verbose prints the actual HN paths to stdout.
    3. All targets in the .hnp.anchors file have http://hnp as source.

Warning: Not ready for use. Development of this utility is incomplete.


18. host_host_link_counts

Purpose: Analyse a list of anchors.gz files and report on frequencies of inter-host links.

Usage: host_host_link_counts [-targ|-source <pattern>] [-report] <stem> [anchor_text_file ...]

Reads the anchor_text_files and outputs a table of host-host links,
in descending order of link count.   By default, all lines are
processed, but a pattern can optionally be applied to either
targets or sources. 

If -report is given, short and full HTML reports will be generated.

-targ - Process only links whose target host matches pattern
-source - Process only links whose source host matches pattern

Nowadays, the first option not starting with a - is an index stem.
A file <stem>.hosts is created with a table of host-related feature
scores which can be used in ranking.  The order of entries must
correspond to the hostnum order assigned by padre-iw.

+ NOTE: within-host links are excluded.


19. padre-arg-sw

Purpose: To help with conversion of padre-sw argument lists from old to key=value format


20. padre-cc

Purpose: To build an index.collapsig file to permit use of collapsed rankings.

Usage: padre-cc <index_stem> [-collapse_control=<string>] [-debug=on]
   Utility for building a .collapsig file of collapsing
   signatures.  If no control_string is given, a one-column
   file is built using the signatures from the .textsig file.

   The collapse_control string must consist of sets of sequences of
   metadata class names. Each set should be surrounded by square
   brackets, and sets should be separated by commas. Metadata class
   names are the elements of the sets and must be separated by
   commas. 

   The characters $ and # may be used as special metadata class
   names and represent document summarisable text and
   document URL respectively.

   In future, it is planned to allow special metadata class
   names to be followed by a regular expression,
   indicating that only the part of the metadata string which matches
   the regex should be used in calculating the signature.

   Example current control string: '[$],[t,a]'.  In this case the .collapsig
   will have two signatures per document: Column 0 is the normal document
   signature and column 1 is a signature derived from the concatenation of
   metadata fields t and a, in that order.
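
The structure of a collapse_control string, as described above, can be illustrated with a small parser that returns one list of metadata class names per signature column. This is a sketch only: parse_collapse_control is a hypothetical helper, not part of padre-cc, and it does no validation of class names:

```python
import re


def parse_collapse_control(control: str) -> list[list[str]]:
    """Split '[$],[t,a]' into one list of metadata class names per bracketed set."""
    sets = re.findall(r"\[([^\[\]]*)\]", control)
    return [[name.strip() for name in s.split(",")] for s in sets]


if __name__ == "__main__":
    columns = parse_collapse_control("[$],[t,a]")
    for column, names in enumerate(columns):
        print(f"column {column}: signature over {names}")
```

For the example string, column 0 is built from '$' (document summarisable text) and column 1 from fields t and a, in that order.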

21. padre-ct

Purpose: Report on the titles in a PADRE index.  Eventually to improve them.

Usage: padre-ct <index_stem> 

Warning: Development of this utility is not yet complete.

22. padre-cw

Purpose: To check the correctness of an index, compare two indexes, or display postings for a term within an index.

Usage0: padre-cw -v - print PADRE version
Usage1: padre-cw stem1 stem2 [-io] - Compare two indexes.
        -io means ignore diff.s in offsets into .idx
Usage2: padre-cw stem1 -show term - show postings for term.
        Also shows term before and afterward. (if applic.)
Usage3: padre-cw stem1 -check [-stemsuff <suffix>] - Check index files for stem1 (default)
        use -stemsuff <suffix> to supply an additional suffix for the .idx and .dct files

23. padre-di

Purpose: To display the metadata for documents in an index.  (main purpose)

Usage: padre-di <index_stem> [-check]|[-trecids]|[-metao [<docno>]][-meta [<pattern>] ] | [-metad [<pattern>] ]
-check - check whether the document table appears to be internally consistent

-trecid - make a mapping between trec DOCNO stored in title field and URL

-meta [<pattern>] - print title and metadata information for each document
                    whose "URL" contains pattern (case-insensitive)
                    If no pattern is given, all docs are shown, in collection order.

-metad          - as for -meta but show document numbers.

-metao          - as for -meta but show all documents, in collection order starting
                  from docno (default zero).

default - read in URLs and look them up, using sorted table


24. padre-do

Purpose: Print a permutation of the document numbers in an index corresponding to descending static score.

Usage: padre-do <stem> <docorderfile> [-deb] [-cool_param ...] 
Output is a list of docnums in descending order of cool score,
printed to docorderfile.  cool_param values are expected to lie in 0 - 1.
Default values are the same as for padre-sw though. Of course,
query-dependent cool values cool0, 7, 12, 15, 16, 17, 18, 19 are ignored
because there is no query.


25. padre-fl

Purpose: Display or operate on the document flags in an index.

Usage1: padre-fl <index_stem> [-clearall|-clearbits|-clearkill|-killall|-show|-sumry]

Usage2: padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -unkill

Usage3: padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -kill

Usage4: padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -bits hexbits OR|AND|XOR

Usage5: padre-fl <index_stem> -kill-docnum-list <file_of_docnums>

Usage6: padre-fl -v

Note: Specify '-' as the file of url patterns to supply a single URL to standard input.
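
Usage4 combines a hexadecimal bit pattern with an operator. A minimal sketch of what that combination plausibly means for a single document's flag word, assuming the operator is applied between the existing flags and the parsed hexbits (this reading and the function name apply_bits are assumptions, not padre-fl's documented behaviour):

```python
def apply_bits(flags: int, hexbits: str, op: str) -> int:
    """Combine a document's flag word with a hex mask using OR, AND or XOR."""
    mask = int(hexbits, 16)  # accepts '3' as well as '0x3'
    if op == "OR":
        return flags | mask
    if op == "AND":
        return flags & mask
    if op == "XOR":
        return flags ^ mask
    raise ValueError("op must be OR, AND or XOR")


if __name__ == "__main__":
    for op in ("OR", "AND", "XOR"):
        print(op, format(apply_bits(0b0101, "3", op), "04b"))
```

Under this reading OR sets bits, AND keeps only the bits in the mask, and XOR toggles them.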


26. padre-gs

Purpose: Display or manipulate document gscope bits in an index.

Usage0: padre-gs -v|-V|-help   # print version info or detailed help
              on types of instructions and on program operation.
Usage1: padre-gs index_stem -clear   # clear all gscope bits

Usage2: padre-gs index_stem -show   # show all gscope bits

Usage3: padre-gs index_stem file_of_instructions [-separate] [other_bit_number]
        [-regex|-url|-docnum] [-verbose]

Where:
      * index_stem may also be the name of a collection
      * file_of_instructions may be '-' to accept instructions from stdin
      * -separate indicates that gscope changes should be made to a
        copy of the .dt file first, and then copied over the original file
        when changes are complete.
      * other_bit_number specifies a bit to be set on documents which
        end up with no bits set.
      * By default instruction patterns are expected to be regexes
        but this may be made explicit with -regex or altered with -url
        or -docnum.  Use padre-gs -help to obtain more information about
        instruction formats and pattern types.


27. padre-i4u

Purpose: Display aggregated information about a URL from a PADRE index.

Usage: padre-i4u -v | padre-i4u stem=<stem_or_collname> [fields=<alnum_string>] [debug=<int>] [iters=<int>] url=<url> ...
Note: The functionality is implemented by a dynamic library which is usually called directly.

28. padre-iw

FUNNELBACK_PADRE_15.1.1609.0-full MDPLFS (Web/Ent) $Revision: 42926 $ [64 bit]
Today is: 20160308 (according to the OS)
Purpose: Index a collection of documents

Usage1: padre-iw -V|-help|-ixform            (print version or help info.)
Usage2: padre-iw [-f|-tar|-reo<pf>] <dir>|<file>|<url> <filestem> [<option>|<tfdir> ...]
Usage3: padre-iw -secondary_update <dir>|<file> <filestem>
<pf> is a text file containing a permutation of the document numbers in the original index.
<dir> is a hierarchical directory of optionally gzipped files.
<file> contains a list of names of optionally gzipped files.
-f says that <file> is a single datafile to be indexed. For historic reasons
-tar means the same as -f.
     Files to be indexed may be tar or WARC files (optionally gzipped).
     Note that individual files in a tarfile are expected to be uncompressed
     text.  Gzipped files, unfiltered PDFs etc. are not supported yet.
-reo says that <file> is the stem of a previous index to be
     reordered and reindexed.   <pf> is a text file containing
     a permutation of the document numbers in the original index.  Eventually,
     it may be possible to compute the permutation internally. For now, it
     must be specified via <pf>.
<filestem> prefixes the names of output files.
<tfdir> is the dir. in which tmp files will be written.
-secondary_update creates a secondary index using the data directory specified, and using the options used in creating the primary index.

Available options:

A. Getting information about PADRE and its operation.
  -V       - Print PADRE version number and exit.
  -ixform  - Print index format version created by this indexer and exit.
  -help    - Print this list and exit.
  -debug   - Generate debugging output.
  -show_each_word_indexed - For debugging.  Show each word occurrence (with field) as it is indexed.
  -show_each_word_to_file - For debugging.  Print each word occurrence (with field) to <filestem>.words_in_docs
  -hashlog - Create a .hashlog file with incremental hashing stats.
  -quiet   - Use terse logging.
  -ankdebug - Generate debugging output relating to anchortext.
  -termdeb<term> - print debugging messages relating to the indexing of <term>.

B. Controlling what is indexed.
  -nometa   - Don't index any metadata except t, d and k (titles, dates and links).
  -nomdsfconcat - Don't concatenate strings in the mdsf file. Record first only. (Others are still indexed.)
  -diwimuu - Don't index words in made-up URLs (those constructed from filepath).
  -dias    - Don't index link anchor source as part of source documents (<a> only).
  -ibd     - Index all documents even if they appear to be binary.
  -ixcom   - Index words in HTML and XML comments. 
  -select<num1>,<num2> - Index every num1th file/bundle starting from
            num2th(from zero).
  -select-doc-in-bundle=<interval>,<offset> - Index every <interval>th document
            within a bundle starting from <offset> (which starts at zero).
            Only works with warc store.
  -tarpat<regex> - Filenames in a tarfile being indexed must match regex. Default is match-everything.
  -csv=<fsep><skipfirst>[<quote>] - Files which are not clearly something else
                            are assumed to be CSV format.
                          fsep is ascii field separator, typically comma.
                            (tab is represented by t.)
                          skipfirst is either y or n, telling padre whether the first
                            line in a CSV file should be skipped.
                          quote is the character used to quote strings in fields
                            which may contain separators.  (You probably have to
                            escape it on the command line.)  If not specified,
                            no quote character is defined.  To include a quote
                            character within a quoted section, the quote may be doubled.
  -csv_fields=<comma_separated_descriptor_list> - This is a list of comma
                          separated descriptors describing how to index each column of
                          the csv file.
                          To index terms in a column as document text use '-'.
                          To index terms in a column as metadata use the format:
                               <metadata class name><content type>
                          To skip terms in a column use 'X'.
                          For example: 't1,-,X' would set the first column to title,
                          the second column would be indexed as document content and the third
                          column would be skipped.
                          Content-type defined in this argument should be the same as the content
                          type in metamap.cfg
  -check_url_exclusion=<on|off> - URLs matching url_exclusion pattern will not be searchable.  (Default on.)
  -url_exclusion_pattern=<regex> - exclusion pattern to use if URLs are vetted.  (Default 'file://$SEARCH_HOME/')
  -filepath_exclusion_pattern=<regex> - exclusion pattern to use if files are to be excluded from indexing
            on the basis of the filepath.  If applicable, this is more efficient than excluding by URL
            because the URL can't be finally determined until the content has been scanned. (Default: not set)
  -index_subversion_dirs - Normally the .svn directories created by
       the subversion version control system are not indexed. Override this default.
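
The -select<num1>,<num2> option in the list above picks every num1th file or bundle, starting from the num2th (counting from zero). That sampling can be sketched as a slice; select_bundles is a hypothetical name and the ordering of the input list is assumed to be the order in which padre-iw would visit the files:

```python
def select_bundles(paths: list[str], num1: int, num2: int) -> list[str]:
    """Return every num1th path, starting from zero-based position num2."""
    if num1 < 1 or num2 < 0:
        raise ValueError("num1 must be >= 1 and num2 must be >= 0")
    return paths[num2::num1]


if __name__ == "__main__":
    files = [f"bundle{i}.warc" for i in range(7)]
    # Every 3rd bundle starting from position 1: positions 1 and 4.
    print(select_bundles(files, 3, 1))
```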

C. Controlling how things are indexed.
  -noax    - Don't conflate accents.
  -unimap=<mapname> - specify a Unicode mapping to be applied when indexing
                      and when query processing. Supported values: 
                      tosimplified, and totraditional. (Chinese only.)
  -deutsch=<i> - How much extra processing is done for umlaut and sz. 
             0 - none.  München is indexed as München and Munchen
             1 - München is indexed as München, Muenchen and Munchen (Dflt)
             2 - As for 1 but also Muenchen is indexed as München, Muenchen and Munchen
             (As a side-effect to allow for compounds, SORT_SIGNIF is increased to 40.)
  -nz=<i> -      How much extra processing is done for Māori. 
             0 - none.  Māori and Mäori are indexed as Māori or Mäori resp. and Maori (Dflt)
             1 - Māori is indexed as Māori, Maaori and Maori
                 Mäori is indexed as Mäori, Māori, Maaori and Maori
  -no_cjkt_grams - Suppress the indexing of bigrams/unigrams in CJKT text.  It is assumed that
                   said text has been pre-segmented into words, and that normal word-based indexing is needed.
  -QL_depth=<i> - Activate quicklinks on default pages of up to depth i. Use internal QL
                  defaults.  (Dflt 0 = Off)
  -QL_config=<f> - Activate quicklinks. Read quicklinks configuration options from
                  file f.
  -docscan_depth=<i> - When trying to determine doc type and charset
                  indexer will look up to i chars into the doc.  (Dflt 20480)
  -forcexml - Use the XML parser on all documents.
  -case    - Store case information in postings. Currently unsupported.
  -SORTSIG<num> - How many [UTF-8] characters in a word are significant. Default 20
  -dilw    - Don't index words or use words in summaries that are longer 
             than what is set by -SORTSIG.
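
The -deutsch levels 0 and 1 in the list above can be illustrated by generating the index variants of a word. This sketch covers umlauts only: the reverse mapping of level 2 and the handling of sz are left out because the text gives no worked example for them, and the function names are hypothetical:

```python
import unicodedata

# Umlaut expansions used at level 1 (keys written as escapes for a-, o-, u-umlaut).
EXPANSIONS = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
}


def strip_accents(word: str) -> str:
    """Remove combining marks, so that u-umlaut becomes a plain u."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def index_variants(word: str, deutsch: int = 1) -> list[str]:
    """Forms under which a word would be indexed at -deutsch level 0 or 1."""
    word = unicodedata.normalize("NFC", word)
    variants = [word]
    if deutsch >= 1:
        expanded = "".join(EXPANSIONS.get(c, c) for c in word)
        if expanded not in variants:
            variants.append(expanded)
    stripped = strip_accents(word)
    if stripped not in variants:
        variants.append(stripped)
    return variants


if __name__ == "__main__":
    print(index_variants("M\u00fcnchen", deutsch=0))
    print(index_variants("M\u00fcnchen", deutsch=1))
```

For München this yields the two forms listed for level 0 and the three forms listed for level 1.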

D. Controlling metadata indexing.
  -XMF<file> - <file> specifies a file defining XML field mappings.
  -MMF<file> - <file> specifies a file defining meta tag mappings.
  -ifb       - Index a special word '$++' at the start and end of each metadata field (on by default).
  -noifb     - Do not index a special word '$++' at the start and end of each metadata field.
  -facet_item_sepchars=<string> - Which chars are used to separate metadata facet items.  [Dflt '|']
  -map[<f>]    - Map anchor text in source file to metafield f.  If <f> is absent,
                 outgoing anchortext is unfielded content. (dflt <f> absent)
  -EM<file>  - <file> is a file of external metadata.
  -NIM       - Ignore explicitly specified internal metadata.
  -collfield=<f> - Index the name of a collection as metadata in each doc and assign to field f.
  -collection_name= - Set the name of the collection being indexed.

E. Controlling link and anchortext handling.
  -noank_record - Don't extract, record or index anchortext.
                  - .anchors.gz file not processed.  No link counts possible.
  -noank_index    - Extract and record but don't index anchortext.
                  - .anchors.gz file can be post-processed by annie-a
  -noank      - Temporary synonym for -noank_index.  Deprecated.
  -dpdf    - Produce but don't process the anchors distilled file.
  -nep_action=<0|1|2>  - Action to take for nepotistic links. 
                  0 - treat the same as other links.
                  1 - ignore links of types greater than nep_limit.
                  2 - limit the number of repetitions of links of types
                      greater than nep_limit. (dflt)
  -nep_limit=<0|1|2|3>  - Ignore nepotistic links of types greater than the limit. 
                  0 - unaffiliated links from outside the target domain.
                  1 - links from a different host.
                  2 - links from the same or a closely affiliated host.
                  3 - dynamically generated links from such a host.
  -nep_cachebits=<i>  - Don't let the low-value link cache grow above 2^i
  -noaltanx - Don't index image alt as anchortext when an image is an anchor.
  -nosrcanx - Don't index image src as anchortext when an image is an anchor.
  -BL<f>   - <f> is a file of source URL patterns from which links should be ignored or treated with suspicion (Blacklist).
  -AD<f>   - <f> is a file of SECD (single entity controlled domain) affiliations.
             e.g. griffith.edu.au --> gu.edu.au
             Links to an affiliated SECD are classified as within-domain.
  -RP<f>   - <f> is a file of CGI parameters which should be removed from source and target URLs.
           - padre generates a regular expression from the lines in <f>.
           - if <f> is "conf_file" the regex will be taken from crawler.remove_parameters
             in the FunnelBack config file.
  -A<pat>  - <pat> is an acceptable link target pattern.
             - URLs not matching pat will not be stored in anchors.gz file.
             - if pat is "conf_file" pat will be taken from include_patterns
               in FunnelBack config file.
  -F<file>  - *<file> is an additional anchor text file.
  -FN<file>  - Like -F but source URLs need not be looked up.
  -RD<dir> - *<dir> is a directory in which to look
         - for redirects and duplicates files.
         - (produced by FunnelBack etc. & PADRE).
  -igmaf   - Ignore main anchors file.
  -mule<n> - Discard links to URL targets longer than <n> chars. Default is no limit.
  -rmat    - Record targets of failed anchor lookups via stdout.
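
The interaction of -nep_action and -nep_limit in the list above can be sketched as a decision per link, given the link's type (0 to 3). The function name and the returned labels are hypothetical; how padre-iw actually classifies a link into a type, and how many repetitions it allows under action 2, are not covered here:

```python
def nepotistic_link_decision(link_type: int, nep_action: int, nep_limit: int) -> str:
    """Return 'keep', 'ignore' or 'limit' for a link of the given type (0..3)."""
    if link_type not in range(4) or nep_limit not in range(4):
        raise ValueError("link_type and nep_limit must be 0, 1, 2 or 3")
    if nep_action == 0:
        return "keep"  # treat the same as other links
    if link_type <= nep_limit:
        return "keep"  # only types greater than the limit are affected
    if nep_action == 1:
        return "ignore"
    if nep_action == 2:
        return "limit"  # repetitions of such links are limited
    raise ValueError("nep_action must be 0, 1 or 2")


if __name__ == "__main__":
    for link_type in range(4):
        print(link_type, nepotistic_link_decision(link_type, nep_action=2, nep_limit=1))
```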

F. Controlling which index files are generated.
  -nomdsf  - Suppress generation of the .mdsf file.
  -nolex   - Suppress generation of the .lex file.
  -noqicf  - Suppress computation of QIC features and .qicf file.
  -nohostf - Suppress computation of host features and .ghosts file.
  -cleanup - Remove superfluous files from the index directory after index has completed.

G. Setting size limits.
  -GSB<n>  - How many gscope bytes to allow for.  Default/Min: 8/2.
  -big<N>    - Multiply word table sizes by 2^N from base of 256K.  Default table size is 8M (ie. -big5).
  -small   - Divide word table sizes by 4 from base of 256K (i.e. use 64K).
  -chamb<num> - Set decompression chamber size to <num> MB. Default 32
  -RSDTF<num> - Set maximum characters in description & title fields in .results to <num>.  Default 256.
  -RSTAG<num> - Set number of bytes to reserve for tags in .results to num. Default 0.
  -RSTXT<num> - Set maximum characters in summarisable text per doc in .results to <num>.  Default 50000.
  -W<num>  - Index-writing window will be <num> MB (Larger windows mean faster indexing at the expense of using more RAM). Default for a 64bit system is 2800
  -MWIPD<num>- Maximum words indexed per document (excluding anchors). By default all words are indexed
  -maxdocs<num>- Maximum no. of documents to index. Others are ignored.
  -mdsfml<n> - Set maximum length for strings in .mdsf file. (dflt 2048 bytes)
  -99%    - Limit on how full the word hash table can get.
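
The -big<N> and -small arithmetic above can be checked with a short sketch. This is
only an illustration of the stated sizes, not padre-iw code; it assumes "K" means
1024, and the function names are hypothetical.

```python
# Base word table size quoted in the usage text ("256K").
# Assumption: K = 1024 entries.
BASE_WORD_TABLE_SIZE = 256 * 1024


def word_table_size(big_n: int) -> int:
    """Size implied by -big<N>: the base multiplied by 2^N."""
    if big_n < 0:
        raise ValueError("N must be non-negative")
    return BASE_WORD_TABLE_SIZE << big_n


def small_table_size() -> int:
    """Size implied by -small: the base divided by 4."""
    return BASE_WORD_TABLE_SIZE // 4


if __name__ == "__main__":
    print(word_table_size(5) // (1024 * 1024), "M for -big5")
    print(small_table_size() // 1024, "K for -small")
```

With these assumptions -big5 gives 8M, matching the stated default, and -small gives 64K.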

H. Special indexing modes.
  -duplicate_urls=flag|ignore (Default is flag.)
            - Documents whose URL checksum is identical to that of another document
              are normally flagged and suppressed from results.
  -urlchecksums=case_sensitive|case_insensitive (Default is case_insensitive).
  -paidads  - If set, documents known to contain paid ads will be flagged specially (with the DOC_HAS_PAID_ADS flag).
  -doc_feature_regex=<Regex> - Documents matching the supplied pattern will be flagged as DOC_MATCHES_REGEX.
             The presence or absence of this feature can be used in the ranking function, controlled by cool29 and cool30.
  -iolap   - Overlap reading of bundles with processing them.
  -utf8input - Assume all input files whose charset is not specified are UTF-8 encoded. (Default is WINDOWS-1252.)
  -isoinput - Assume all input files whose charset is not specified are ISO_8859-1 encoded.
  -force_iso - Forcibly assume all input files are ISO_8859-1 encoded.
  -URLP<str> - When storing document URLs, prepend <str>. (This is only used if the document does not indicate its own URL with a BASE HREF element, such as in local collections.)
  -lmd       - HTTP LastModified date takes priority over metadata dates.
  -lmd_never - Completely ignore HTTP LastModified dates.
  -future_dates_ok - If this option is not given, document dates in the future will be ignored
                     in determining the reference point for document recency calculations.
  -DT<str>   - Interpret <str> as start of new doc within bundle. (Not a regular expression).
               (note that there is a separate mechanism for XML).
  -annie[<exec>] - After normal indexing is complete, attempt to build an annotation index (annie)
                   and a spelling suggestion file.
                   Default executables are annie-a and build-spelling-index from whence padre-iw was run.
  -speller[<exec>] - Allows the explicit specification of a spelling_index builder to run after annie-a.
  -spelleroff - Turns off spelling-index building even if annie-a runs.
  -spelling_threshold<i> - Annotations with fewer than i occurrences will not be considered
                           as spelling suggestions.  (dflt 1)
  -bigweb  - Space saving option for bundled large crawl indexes. Roughly equivalent to:
             -nomdsf -big8 -MWIPD2000 -W6000 -SORTSIG16 -nep_action=2 -nep_limit=2 
             -nep_cachebits=20 -chamb64 -RSTXT2000 -mule128 -noaltanx -nosrcanx -nometa -quiet
              * A shorter average wordlength is assumed.
              * You can add e.g. -Axxx.com to cut anchor processing time.
              * (Don't forget to make dupredrex.txt in index directory.)

I. Miscellaneous options.
  -O<name> - <name> is the name of this organisation.
  -T<path> - Specify a large temporary filespace for use by the indexer.
  -redis_host=<str> - Hostname/IP of a Redis server where progress status should be written
  -redis_port=<i> - Port of the Redis server. Default is 6379

S. Security options.
  -security_level=<i> - Any non-zero value requires every document to have at least one lock. If set to 1, documents without locks will be excluded; if set to greater than 1, indexing will stop.
  -security_mindocs=<i> - There must be at least this number of docs with at least one lock.

     *** See also url_exclusion options in Section B above.
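
The -security_level rule in Section S can be read as a small decision function. The
sketch below only restates the text above; the function name and the returned
labels are hypothetical, and levels are assumed to be non-negative integers.

```python
def security_action(security_level: int, lock_count: int) -> str:
    """Return what happens to one document under -security_level.

    "index"   - the document is indexed normally
    "exclude" - the document is left out of the index (level 1)
    "stop"    - indexing stops (level greater than 1)
    """
    if security_level == 0 or lock_count > 0:
        # Level 0 imposes no requirement; a locked document always passes.
        return "index"
    if security_level == 1:
        return "exclude"
    return "stop"


if __name__ == "__main__":
    for level in (0, 1, 2):
        print(level, security_action(level, lock_count=0))
```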


29. padre-mi

Purpose: To merge a list of PADRE indexes into a single such index.

Usage: padre-mi outstem instem instem ... [-overwrite]
   -overwrite overrides protection against destroying existing outstem

Make a merged index (outstem) from the list of at least two input indexes.

This version assumes that input indexes have exactly the same format,
i.e. that the index format strings are the same and that they have
identical numbers of gscope bits, numerical metadata fields and so on.
Future versions may check this compatibility, but currently exact
compatibility is assumed.  All manner of pestilence may descend upon you if
you use padre-mi on incompatible indexes.  You have been warned :-)


30. padre-qi

Purpose: To set up a query-independent-evidence file for use in query processing.

Usage: padre-qi index_stem file_of_url_patterns dflt_score [profile_name] [-verbose]

 - if a profile name is given, qiefile will be stem.qie_profile

Each URL in the index is matched against the patterns, in the
order in which they are listed in the pattern file.  Once a match
is found, matching ceases for that URL.  This behaviour can be
exploited to apply a general pattern (later in the file) if
no more specific pattern (earlier in the file) matches.
  To achieve exact matching use ^ (matches start of URL) and
$ (matches end of URL).

Lines in the patterns file consist of:

<qie score> <url-pattern>

qie-score  - a floating point number (assumed normalised to the range 0-1), 
             specifying the qie score to be applied.

url-pattern   - a perl5 regular expression to be matched against name
                 strings in the .urls file (usually URLs).

Example:
0.25  ^(https://)?[^/]*nsw.gov.au/
1.0   ^(https://)?[^/]*wa.gov.au/
0.25  ^(https://)?[^/]*sa.gov.au/
0.25  ^(https://)?[^/]*nt.gov.au/
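
The first-match-wins behaviour described above can be sketched as follows. This is
an illustration, not padre-qi itself: it uses Python's re module in place of perl5
regular expressions (adequate for simple patterns like these), and the function
names and the default score of 0.5 are assumptions.

```python
import re


def load_patterns(lines):
    """Parse '<qie score> <url-pattern>' lines, keeping file order."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        score_text, pattern = line.split(None, 1)
        patterns.append((float(score_text), re.compile(pattern)))
    return patterns


def qie_score(url, patterns, default_score):
    """Return the score of the first matching pattern, else the default."""
    for score, regex in patterns:
        if regex.search(url):
            return score  # matching ceases at the first hit
    return default_score


EXAMPLE_LINES = [
    "0.25  ^(https://)?[^/]*nsw.gov.au/",
    "1.0   ^(https://)?[^/]*wa.gov.au/",
]

if __name__ == "__main__":
    pats = load_patterns(EXAMPLE_LINES)
    for url in ("https://www.nsw.gov.au/", "https://www.wa.gov.au/health",
                "https://example.com/"):
        print(url, qie_score(url, pats, 0.5))
```

Because matching stops at the first hit, a catch-all pattern placed last in the file
acts as a fallback for URLs that no earlier, more specific pattern matched.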


31. padre-qs

Purpose: To generate query suggestions given an index and a partially typed query.

Usage: padre-qs -v | padre-qs stem=<stem>|collection=<collname> partial_query=<partial_query> [alpha=<f>] [show=<d>] [fmt=xml|json|json++] [callback=foo] [sort=0|1|2|3] [profile=<profile>] [debug=0|1|2|3]
   e.g. padre-qs stem=/opt/funnelback/data/abc/live/idx/index partial_query=kevi alpha=0.5 show=10
Note: The functionality is implemented by a dynamic library which is usually called directly.
  - sort=0 (by weight), 1 (by length), 2 (in alphabetic order), 3 (by weighted combo of weight and length).
  - fmt=json => simple JSON array of suggestion strings;
       =json++ => full JSON object with all fields shown.
  - callback=foo => For JSON or JSON++ output, wraps
         the response with the supplied callback (for JSONP).
  - show=<d> => how many suggestions to show.
  - alpha=<f> => if sort=3, score = alpha * weight + (1 - alpha) * length_score.
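
The sort=3 formula above can be illustrated with a short sketch. It is not the
padre-qs implementation: how length_score is derived is not stated here, so it is
taken as an input, and the sketch assumes that higher scores rank first. All names
are hypothetical.

```python
def combined_score(weight, length_score, alpha):
    """score = alpha * weight + (1 - alpha) * length_score (sort=3)."""
    return alpha * weight + (1.0 - alpha) * length_score


def rank_suggestions(suggestions, alpha, show):
    """Order (text, weight, length_score) tuples by combined score.

    Assumption: a higher combined score is better. At most `show`
    suggestion strings are returned.
    """
    ranked = sorted(
        suggestions,
        key=lambda s: combined_score(s[1], s[2], alpha),
        reverse=True,
    )
    return [text for text, _weight, _length in ranked[:show]]


if __name__ == "__main__":
    candidates = [("kevin", 0.2, 0.9), ("kevlar", 0.9, 0.1)]
    print(rank_suggestions(candidates, alpha=1.0, show=10))
    print(rank_suggestions(candidates, alpha=0.0, show=10))
```

With alpha=1 only the weight counts; with alpha=0 only the length score counts.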

32. padre-rf

Purpose: Generate a relevance-feedback query given a list of relevant documents in a collection.  Powers the Explore feature.

Usage: padre-rf -v | padre-rf -idx_stem=<index_stem> [-exp=<7..50>] [-comp=<comp_num> -dox=<docnum_list> | -url=<url>] ...

Details of available options:
R. -collection=<X>                      - The name of a collection, either meta 
                                          or primary. 
R. -script=<S>                          - Name of the CGI script to which 
                                          padre-rf.cgi should redirect. (dflt "(null)")
R. -idx_stem=<Y>                        - The index stem for this collection, 
                                          either meta or primary.  [Not CGI]
R. -exp=<I>                             - Maximum complexity of generated query 
                                          (no. of words). (range 7 - 50) (dflt 10)
R. -deb_rf=<I>                          - Activate debugging output.  Higher 
                                          values give more verbose output. (range 0 - 10) (dflt 0)
R. -comp=<I>                            - Component number within a meta 
                                          collection. (range 0 - unlimited) (dflt 0)
R. -dox=<D>                             - Comma separated list of document 
                                          numbers within current component. 
R. -url=<E>                             - URL of document to be included in 
                                          generation of RF query. 

33. padre-show

Purpose: Display the contents of a padre index file in readable format.

Usage: padre-show <padre_index_file>
 -- if poss. displays contents of index file in text form.

 e.g. padre-show index.urls


34. padre-sk

Purpose: Create a skip block index from a regular padre index.

Usage: padre-sk <stem> <skip>

  Output will be in <stem>.idx_skip and <stem>.dct_skip

  <stem>   String: the index stem to use
  <skip>   Integer: the minimum number of postings between each skip block

35. padre-sr

Purpose: Display all or part of the content of the .results file. (Title, URL, Description metadata, and candidate sentences for summary generation.)

Usage: padre-sr stem|results_file [-titleonly] [-unco] [-ifff|-embedded|-text|-html|-textsigs]
      [starting_doc|starting_url] [urlpat=<regex>] [num_docs_to_show]

padre-sr sequentially reads the .results file and outputs all or part of the file to stdout in a choice of formats:

     . html (default)
     . embedded (incomplete html suitable for embedding in another html document)
     . text
     . textsigs (generate stem.textsigs file suitable for neardup detection.)
If -titleonly is given only the document titles are output. (not applic. to textsigs)

Use -unco to specify that the input doc. is in old uncompressed format.

If a starting document number or URL is given, output commences
only when that point in the file is reached.  Output continues
to the end of the file unless num_docs_to_show is given.
If urlpat= is given, only documents whose URL matches the pattern are
considered for display.  Case-sensitive unless specified otherwise
in the pattern.  Don't include 'http://' in the pattern. 


36. padre-sw

Purpose: Process queries using a PADRE index.

Usage: padre-sw <filestem> [option ...]
<filestem> - the common prefix of all the index files, or possibly
             the name of a Funnelback collection.

Available options:

A. Getting information about PADRE and its operation.
  -V       - Print version number and exit.
  -ixform  - Print index format version expected by this
             query processor and exit.
  -help    - Print this list.

Notation:
---------
 <B> - Boolean.  Will be interpreted as TRUE unless arg is 'off',
       'false' or '0' (case insensitive).
 <I> - Integer.  eg. 7 or 100000. Whole number in specified range.
 <F> - Number.  e.g. 1 or 0.537 or 99.5.  Some inputs of this type
       are constrained to lie within [0.0 - 1.0].
 <C> - Character.  e.g. a or A or : A single character.
 <S> - String.  eg. abc or "a b c".  Quotes needed around the
       key and value if spaces or punctuation included - for
       example: "-optionname=a b c".
 <K> - Key/value pair. These options take a key and a value,
       for example, -optionname.KEY=VALUE
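
The <B> rule in the Notation above can be sketched as a one-line test. This follows
the Notation text only (TRUE unless the argument is 'off', 'false' or '0'); the
function name is hypothetical and the real parser may accept other spellings.

```python
# Values the Notation lists as FALSE (compared case-insensitively).
FALSE_VALUES = {"off", "false", "0"}


def parse_boolean_option(arg: str) -> bool:
    """Interpret a <B> argument: TRUE unless it is off, false or 0."""
    return arg.lower() not in FALSE_VALUES


if __name__ == "__main__":
    for value in ("on", "OFF", "False", "0", "1", "true"):
        print(value, parse_boolean_option(value))
```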

I. Contextual navigation options:

   -categorise_clusters=<B>             - Whether contextual navigation 
                                          suggestions are grouped by type. 
   -cnto=<F>                            - Set contextual navigation time-out to s 
                                          seconds (s floating point). Processing 
                                          may be omitted entirely if elapsed time 
                                          for a query already exceeds s seconds. 
                                          (dflt 5.0). (range 0.000000 - unlimited)
   -contextual_navigation=<B>           - Whether or not to activate the 
                                          contextual navigation system. 
   -contextual_navigation_fields=<S>    - String s lists the metadata fields, 
                                          separated by commas surrounded by 
                                          square brackets, to scan for contextual 
                                          navigation suggestions. (dflt '[c,t]'). 
                                          Note that scanning of document text can 
                                          be suppressed by including a minus, for 
                                          example '[-,c,t]'. 
   -max_phrase_length=<I>               - Maximum length (in words) of contextual 
                                          navigation suggestions. (range 3 - 7)
   -max_phrases=<I>                     - After this number of candidate phrases 
                                          have been checked, contextual 
                                          navigation processing will stop. (range 0 - unlimited)
   -max_results_to_examine=<I>          - Maximum number of search results to 
                                          scan for contextual navigation 
                                          suggestions. (range 0 - 200)
   -site_max_clusters=<I>               - Maximum number of site clusters to 
                                          present in contextual navigation. (range 0 - unlimited)
   -topic_max_clusters=<I>              - Maximum number of topic clusters to 
                                          present in contextual navigation. (range 0 - unlimited)
   -type_max_clusters=<I>               - Maximum number of type clusters to 
                                          present in contextual navigation. (range 0 - unlimited)
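
The bracketed field list taken by -contextual_navigation_fields can be parsed as in
the sketch below. It only mirrors the two examples given above ('[c,t]' and
'[-,c,t]'); the function name and the returned tuple are hypothetical.

```python
def parse_cn_fields(value: str):
    """Split a value such as '[-,c,t]' into (scan_document_text, fields).

    A '-' entry means document text is not scanned; every other entry
    is treated as a metadata field name.
    """
    text = value.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected a list in square brackets")
    items = [item.strip() for item in text[1:-1].split(",") if item.strip()]
    scan_document_text = "-" not in items
    fields = [item for item in items if item != "-"]
    return scan_document_text, fields


if __name__ == "__main__":
    print(parse_cn_fields("[c,t]"))
    print(parse_cn_fields("[-,c,t]"))
```
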
J. Geospatial options:

   -geospatial_ranges=<B>               - Calculate geospatial distance from 
                                          origin and bounding box ranges when 
                                          geospatial data is configured and 
                                          available. 
   -maxdist=<F>                         - Exclude results not within <f> km of 
                                          origin. (range 0.000000 - unlimited)
   -origin=<S>                          - <lat,long> Set origin to lat, long 
                                          (floating point degrees). 
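
The effect of -origin with -maxdist can be sketched as a distance filter. The text
above does not say which distance formula padre-sw uses; this sketch assumes
great-circle (haversine) distance on a sphere of radius 6371 km, and all names are
hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius


def parse_origin(value: str):
    """Parse an -origin value of the form 'lat,long' (degrees)."""
    lat_text, long_text = value.split(",")
    return float(lat_text), float(long_text)


def distance_km(origin, point):
    """Haversine distance between two (lat, long) pairs in degrees."""
    lat1, lon1 = (math.radians(v) for v in origin)
    lat2, lon2 = (math.radians(v) for v in point)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_maxdist(origin, point, maxdist_km):
    """True if the point would be kept by -maxdist=<maxdist_km>."""
    return distance_km(origin, point) <= maxdist_km


if __name__ == "__main__":
    origin = parse_origin("0.0,0.0")
    print(round(distance_km(origin, (0.0, 1.0)), 2), "km")
    print(within_maxdist(origin, (0.0, 1.0), 100.0))
```
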
K. Informational options:

   -canq=<B>                            - Write reordered queries to log.  (dflt 
                                          off) 
   -count_dates=<S>                     - Report facet counts for dates such as 
                                          'today', 'last week', 'this year'. Note 
                                          that date categories may overlap.  Only 
                                          value currently supported is 'd'. 
   -count_urls=<I>                      - Display counts of results grouped by 
                                          the URL path (Up to depth i). If <I> is 
                                          not present or 0, then the default 
                                          value is used. Dflt 5.  [Not CGI]
   -rmcf=<S>                            - Metadata fields to have their words 
                                          counted in result sets (fields 
                                          representing facets). To count fields 
                                          'a' and 'c', set this to '[a,c]'. 
   -rmrf=<S>                            - Numerical and geospatial fields listed 
                                          will have their ranges calculated in 
                                          result sets. To see the ranges of field 
                                          'height' and the bounding box 
                                          geospatial field 'X' set this to 
                                          '[height,X]'. 
   -showtimes=<B>                       - Print elapsed times for each stage of 
                                          query processing. 
L. Logging options:

   -ip_to_log=<S>                       - What form of ip to include in log 
                                          files: (nothing|ip|ip_hash|remote_user). 
   -log=<B>                             - Write query log entries (dflt on).  [Not CGI]
   -qlog_file=<S>                       - If writing query log entries, write 
                                          them to <FILE>.  [Not CGI]
   -username=<S>                        - A string identifying the current user 
                                          to be used in padre's query log. 
M. Miscellaneous options:

   -countgbits=<S>                      - s is either "all" or a comma-separated 
                                          list of gscope bitnumbers for which 
                                          counts are needed. (Bits numbered from 
                                          zero.) 
   -exit_on_bad_component=<B>           - Fail when a component has an 
                                          incompatible index relative to the 
                                          first (rather than skip). 
   -flock=<B>                           - Use flock when locking the query 
                                          logfile.  If set to no, lockf is used 
                                          instead. Default on Solaris is 'no', 
                                          all other systems 'yes'. 
   -mat=<I>                             - Set matchset size to n million (dflt 
                                          24). Only need to increase on very 
                                          large collections. (range 0 - 2147) [Not CGI]
   -ndt=<B>                             - Don't do tests on docs, e.g. phantom, 
                                          zombie, *scope, binary, expired.  [Not CGI]
   -unbuf=<B>                           - Don't buffer the standard output 
                                          stream. In some specific cases, setting 
                                          this to 'no' can improve performance. 
   -view=<S>                            - The collection view to perform the 
                                          query against when in CGI mode. 
                                          Normally 'live' (default), 'offline' or 
                                          'snapshot###'. 
N. Presentation options:

   -EORDER=<I>                          - Specify presentation order of query 
                                          biased summary excerpts. 0: natural 
                                          order in doc. 1: sorted by score. (dflt 
                                          0) (range 0 - 1)
   -MBL=<I>                             - Set buffer length per displayed 
                                          metadata field to n bytes (dflt 250 
                                          bytes). Warning: setting very large 
                                          values will increase query processor 
                                          memory demands and may cause problems. (range 1 - unlimited)
   -SBL=<I>                             - Set summary buffer length to n bytes.  
                                          (dflt 250 bytes) (range 1 - unlimited)
   -SF=<S>                              - Metadata fields to include in 
                                          summaries. (if applicable). To include 
                                          fields 'a' and 'd' set this to '[a,d]'. 
   -SHLM=<I>                            - Select highlighting method within 
                                          snippets in XML. 0 - No highlighting ; 
                                          1 - HTML strong tags ; 2 - Show 
                                          highlighting regexp. and unhighlighted 
                                          summary  [dflt];  5 - Use HTML strong 
                                          tags but remove accents from summary 
                                          before highlighting, provided query was 
                                          not accented. (range 0 - 7)
   -SM=<S>                              - Summary mode 
                                          (off;snip;debug;meta;qb;def;auto;both) 
                                          - both means qb and meta. 
   -SQE=<I>                             - Set max no. of query biased summary 
                                          excerpts to n  (dflt 3). (range 1 - 10000)
   -bb=<B>                              - If set, the query processor may 
                                          insert "best bets" (formerly known as 
                                          "featured pages") suggestions from 
                                          best_bets.cfg. 
   -ctest_mode=<I>                      - Controls behaviour of padre-sw when 
                                          -ctest is used.  0: no internal 
                                          evaluation;  1 - internal evaluation 
                                          only.  Output is brief plain text 
                                          report of measures;  2 - internal 
                                          evaluation only. Output in plain text 
                                          with QBQ output followed by measures;  
                                          3 - internal evaluation plus normal 
                                          CTOUT output in XML (with measures 
                                          presented as comments) (range 0 - 3)
   -explain=<B>                         - Explain rankings by showing score 
                                          components. (Note that -explain=on 
                                          turns off result set diversification). 
   -explore=<I>                         - Show 'explore' links against results.  
                                          The value specifies how many terms to 
                                          include in the expanded query. (range 7 - 50)
   -gscoperesult=<S>                    - Specifies the bit number that results 
                                          will be set to in -res gscope or -res 
                                          docnums modes (dflt 1). 
   -mdsfhl=<B>                          - Whether query terms are only highlighted 
                                          in MDSF metadata summaries. 
   -num_ranks=<I>                       - Limit number of results to n (min = 0, 
                                          dflt = 10). (range 0 - unlimited)
   -num_tiers=<I>                       - Limit number of result list tiers to n 
                                          (min = 0 (no limit), max = 50, dflt no 
                                          limit) (range 0 - 50)
   -qieval=<F>                          - Set the value presented for query 
                                          independent evidence when using the 
                                          qiecfg result format. (dflt 0.5). (range 0.000000 - 1.000000)
   -qwhl=<S>                            - Determines which parts of a search 
                                          result are highlighted.  S - snippet, M 
                                          - metadata, U - URL, T - title.  E.g. 
                                          -qwhl=MUT 
   -res=<S>                             - Set result format. Possible values are: 
                                           trec, mail, web, html, xml, urls, 
                                          qiez, qieo, gscope, docnums, ctest, 
                                          qiecfg or flcfg. 
   -results_in_facet_categories=<I>     - Include the specified number of 
                                          pre-computed search results under the 
                                          rmc count element for metadata facet 
                                          categories. (range 0 - 100)
   -rmc_maxperfield=<I>                 - Set maximum number of RMC items to 
                                          display per field at n (dflt 100). (range 0 - unlimited)
   -rmc_sensitive=<B>                   - Treat facet categories (RMC items) case 
                                          sensitively (default no).  [Not CGI]
   -show_qsyntax_tree=<B>               - Include an SVG representation of the 
                                          query-as-processed in output. 
   -start_rank=<I>                      - Present results starting from n (dflt 
                                          1). (range 1 - unlimited)
   -tierbars=<B>                        - Display tierbars in result list output 
                                          (XML and HTML). When turned on (for all 
                                          -res modes) and -sort is used, results 
                                          will be first sorted by tier then by 
                                          the sorting mode, otherwise if -sortall 
                                          is used then all results will be sorted 
                                          regardless of tier. 
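
How -start_rank and -num_ranks select a window of results can be sketched as a
slice over the ranked list. This is an illustration only, using the defaults quoted
above (start at rank 1, show 10); the function name is hypothetical and it assumes
-num_ranks=0 yields no results.

```python
def page_of_results(ranked_results, start_rank=1, num_ranks=10):
    """Return the results from rank start_rank, at most num_ranks of them.

    Ranks are 1-based, as in -start_rank (range 1 - unlimited).
    """
    if start_rank < 1:
        raise ValueError("start_rank must be at least 1")
    if num_ranks < 0:
        raise ValueError("num_ranks must be non-negative")
    first = start_rank - 1  # convert the 1-based rank to a 0-based index
    return ranked_results[first:first + num_ranks]


if __name__ == "__main__":
    results = ["doc%d" % rank for rank in range(1, 26)]
    # Second page of ten: ranks 11 to 20.
    print(page_of_results(results, start_rank=11, num_ranks=10))
```
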
O. Query interpretation options:

   -STOP=<S>                            - Use the stoplist specified in <file> 
                                          (one word per line)  [Not CGI]
   -binary=<I>                          - Determines whether or not binary 
                                          documents are returned in the results.  
                                          0 - show all documents;  1 - show only 
                                          binary documents;  2 - show only 
                                          non-binary documents. (range 0 - 3)
   -clive=<S>                           - Dynamic metacollections.  Specifies the 
                                          number (from 0) of one component within 
                                          the .sdinfo file(s) to make active. 
   -daat_termination_type=<I>           - Selects how DAAT early exit is 
                                          determined.  0 - try for d results with 
                                          every metafield and every component;  1 
                                          - try for d results over every 
                                          component but not necessarily every 
                                          metafield;  2 - stop as soon as d 
                                          results are obtained.  (d is the 
                                          parameter to -daat.) (range 0 - 2)
   -daat_timeout=<F>                    - Impose a soft timeout (in seconds) on 
                                          the time taken by the DAAT machinery 
                                          for one query. (range 0.000000 - 3600.000000) [Not CGI]
   -dont_estimate_full_matches=<B>      - In DAAT mode don't guess the number of 
                                          full matches when the DAAT depth did 
                                          not let us process an entire postings 
                                          list. 
   -events=<B>                          - Must be set if event search is to be 
                                          used 
   -fmo=<B>                             - Present full matches only. 
   -lang=<S>                            - If a 2-character language code is 
                                          specified by this means, then stemmers 
                                          etc specific to that language will be 
                                          used, IF AVAILABLE.  It is also 
                                          permissible to use a 5-character code 
                                          like en_GB, but padre behaviour will be 
                                          the same as for en.  Specifying lang 
                                          also makes title and metadata sorting 
                                          of results locale-specific, however 
                                          support for this on Windows platforms 
                                          is limited and problematic. 
   -loose=<I>                           - Phrase looseness in words (min = 0, 
                                          dflt = 0). (range 0 - unlimited)
   -max_qbatch=<I>                      - Terminate batch query processing after 
                                          the specified number of queries have 
                                          been processed. (range 1 - unlimited)
   -max_terms=<I>                       - Truncate queries after the specified 
                                          number of terms.  If the query is 
                                          reordered, truncation will occur after 
                                          reordering. (range 1 - unlimited)
   -min_truncated_len=<I>               - The text part of a query term with a 
                                          right truncation operator must have at 
                                          least this length.  E.g. if 
                                          min_truncated_len were 4 funnel* would 
                                          be accepted but fun* would be processed 
                                          as fun. (range 0 - 20) [Not CGI]
   -noexpired=<B>                       - Exclude expired docs from results.  
                                          (Nullified by -zom)  [Not CGI]
   -nulqok=<B>                          - An empty query submitted via CGI will 
                                          be processed as a null query. The 
                                          system query must be empty as well.  
                                          (dflt is to ignore the request).  [Not CGI]
   -phrase_prox_word_limit=<I>          - Phrase or proximity terms with more 
                                          than this number of words will be 
                                          shortened by deleting words from the 
                                          right.  E.g. If this limit were 4 then 
                                          `to be or not to be` would be processed 
                                          as `to be or not` (range 1 - unlimited) [Not CGI]
   -prox=<I>                            - Proximity limit in words (min = 0, dflt 
                                          = 15). (range 0 - unlimited)
   -qsup=<S>                            - When blending queries, determines 
                                          sources of supplementary queries to be 
                                          tried, with corresponding weights 
                                          assigned to each source (ranging from 0 
                                          to 1).  No spaces. 'off' may be 
                                          specified to disable supplementary 
                                          queries.  E.g. 
                                          -qsup=SPEL/0.9+USUK/0.4+SYNS/0.1+LANG/0.
                                          Sources: SPEL 
                                          (spelling suggestions); USUK (table of 
                                          spelling differences between US and UK 
                                          English); SYNS (synonyms as defined by 
                                          the blending.cfg file); LANG 
                                          (experimental German decompounding)  
   -query_reorder=<B>                   - Reorder terms in query so that the most 
                                          discriminating (least common) appear 
                                          first. Often coupled with -max_terms=  
   -ras=<I>                             - Remove any stopwords from the query. 
                                          Possible values: 0 - remove none; 1 - 
                                          remove dynamically depending on the 
                                          query; 2 - remove all stopwords (dflt 
                                          1). (range 0 - 2)
   -service_volume=<S>                  - Either 'high' or 'low'.  A convenience 
                                          setting to increase or reduce allowable 
                                          query complexity and timeouts according 
                                          to service volumes -- large or small 
                                          indexes, high or low query volumes.  [Not CGI]
   -stem=<I>                            - Controls stemming of queries. 0 - do 
                                          not stem (dflt); 1 - do not stem 
                                          (replaces obsolete option); 2 - Stem 
                                          all query words (light - English/French 
                                          plural/singular only); 3 - Stem all 
                                          query words (heavier). (range 0 - 3)
   -stem_lconly=<B>                     - When stemming, stem only lowercase 
                                          query words (to avoid stemming proper 
                                          names and acronyms).  
   -strip_invalid_utf8=<B>              - Normally, invalid UTF-8 characters are 
                                          removed during indexing.  If this 
                                          hasn't happened, this option allows 
                                          them to be removed from result packets. 
   -synonyms=<B>                        - If set, the query processor will expand 
                                          queries using thesaurus in synonyms.cfg. 
   -truncation_allowed=<I>              - Enables the use of the * operator, 
                                          binary valued, it is only valid in use 
                                          with an option that disables DAAT mode 
                                          such as, -service_volume='lo' or 
                                          -daat=0. When applied all contexts are 
                                          available such as: *:funnelback, 
                                          funnel*, *back, and *:*elba*. (range 0 - 3) [Not CGI]
   -wildcard_thresh=<I>                 - If the postings list for a term is 
                                          longer than the specified value (in MB) 
                                          it will be treated as a wildcard. (range 0 - unlimited)
   -zom=<B>                             - Include docs in results even if noindex 
                                          or killed. 
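
The -qsup value format described above (source/weight pairs joined by '+', with no spaces, or the word 'off') can be illustrated with a small parser. This is only a sketch: parse_qsup is a hypothetical helper, not part of PADRE, and PADRE's own parsing and error handling may differ. The sample value uses the first three pairs of the example given for -qsup.

```python
def parse_qsup(value):
    """Parse a -qsup style string into a {source: weight} dict.

    Hypothetical helper for illustration only.
    """
    if value == "off":
        return {}  # supplementary queries disabled
    if " " in value:
        raise ValueError("-qsup value must not contain spaces")
    weights = {}
    for part in value.split("+"):
        source, _, weight_text = part.partition("/")
        weight = float(weight_text)  # raises ValueError if the weight is missing
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {source} must lie between 0 and 1")
        weights[source] = weight
    return weights


example = parse_qsup("SPEL/0.9+USUK/0.4+SYNS/0.1")
```
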
P. Query source options:

   -ctest=<S>                           - Read a batch of queries from testfile 
                                          (in C_TEST format). Sets output format 
                                          to RM_CTEST, but that may be 
                                          overridden. (See es.csiro.au/C-TEST/ 
                                          for information about C-TEST.)  [Not CGI]
   -s=<S>                               - System-generated query inserted behind 
                                          the scenes by a form or front-end. 
Q. Quicklinks options:

   -QL=<I>                              - Activate QuickLinks facility for 
                                          default pages down to the specified 
                                          level. 0 - off;  1 - server root pages; 
                                          2 - next level down. (range 0 - 5)
   -QL_rank=<I>                         - If QuickLinks capability is active, 
                                          show quick links for search results 
                                          down to the specified rank. (range 1 - unlimited)
   -QL_rank_is_relative=<B>             - If true, the value of QL_rank will be 
                                          interpreted relative to the start_rank. 
                                          E.g. if QL_rank=2, the first two 
                                          results on each page may show 
                                          QuickLinks. 
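
The interaction between -QL_rank and -QL_rank_is_relative can be sketched as follows. The function name is hypothetical, and the sketch assumes that ranks are 1-based and that start_rank is the rank of the first result on the current page. It is not PADRE's implementation.

```python
def may_show_quicklinks(rank, start_rank, ql_rank, rank_is_relative):
    """Return True if the result at absolute rank `rank` may show QuickLinks.

    Assumes 1-based ranks; start_rank is the first rank on the current page.
    """
    if rank_is_relative:
        # Position of the result within the current page.
        position = rank - start_rank + 1
    else:
        position = rank
    return 1 <= position <= ql_rank
```

With QL_rank=2 and relative ranking on, the results at ranks 11 and 12 on a page starting at rank 11 qualify. With relative ranking off, only ranks 1 and 2 of the whole result list qualify.
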
R. Ranking options:

   -SSS=<I>                             - Same site suppression depth: 0 - no 
                                          suppression (dflt for non-web 
                                          collections.);  2 - hosts and their top 
                                          level dir's (dflt for web and meta 
                                          collections); 10 - special meaning for 
                                          big Web applications. (range 0 - 10)
   -SameSiteSuppressionExponent=<F>     - Same site suppression penalty exponent 
                                          (dflt 0.5, recommended range 0.2 - 0.7). (range 0.000000 - unlimited)
   -SameSiteSuppressionOffset=<I>       - Number of additional documents from a 
                                          site beyond the first that are allowed 
                                          their full score before applying a same 
                                          site suppression penalty (dflt 0) (range 0 - 1000)
   -absscores=<B>                       - Report content scores as % of max 
                                          possible Okapi score (Intended for use 
                                          with -vsimple=on). 
   -anniemode=<I>                       - Control the use of annotation indexes. 
                                          0 - do not use annotation indexes ; 1 - 
                                          Process queries using annotation 
                                          indexes only; 2 - Process queries using 
                                          annotation indexes, falling back to 
                                          normal indexes if insufficient results. 
                                           (Most query op.s stripped.) 3 - 
                                          Process queries using both annotation 
                                          and normal indexes (Most operators 
                                          stripped from queries.). Default 0. (range 0 - 3)
   -b=<F>                               - Set Okapi b to f (dflt 0.75) (range 0.000000 - unlimited)
   -cgscope1=<S>                        - Documents matching this gscope 
                                          expression (reverse Polish) can be 
                                          upweighted with -cool.68.  Those not 
                                          matching can be upweighted with 
                                          -cool.70. 
   -cgscope2=<S>                        - Documents matching this gscope 
                                          expression (reverse Polish) can be 
                                          upweighted with -cool.69.  Those not 
                                          matching can be upweighted with 
                                          -cool.71. 
   -cool=<B>                            - Whether to use topic distillation 
                                          scoring (cool and cooler). Dflt on. 
   -cool.<K=V>                            - cool.N=V Set a value for the Nth tune 
                                          parameter. 
               Possible values for N are:
        0 | content: content weight
        1 | onlink: onsite link weight
        2 | offlink: offsite link weight
        3 | urllen: URL length weight
        4 | qie: external evidence (qie) weight
        5 | recency: recency weight
        6 | urltype: URL attractiveness (Homepages favoured. Copyright pages and URLs with lots of punctuation deprecated.)
        7 | annie: annotation weight (annie)
        8 | domain_weight: weight associated with this domain
        9 | geoprox: geographical proximity to origin
       10 | nonbin: non-binariness (1 for html, xml, txt, 0 otherwise)
       11 | no_ads: freedom from ads
       12 | imp_phrase: implicit phrase match score
       13 | consistency: consistency of evidence.  (Extra reward for docs with non-zero scores on both content and annie.)
       14 | log_annie: logarithm of annotation weight (log(annie))
       15 | anlog_annie: absolute-normalised logarithm of annotation weight.   
       16 | annie_rank: annotation rank = (k - rank) / k, where k = 2 x highest rank requested; if rank > k, rank = k
       17 | BM25F: field-weighted Okapi score
       18 | an_okapi: absolute-normalised Okapi score.
       19 | BM25F_rank: field-weighted Okapi rank.
       20 | mainhosts: bias in favour of principal servers (web search only).
       21 | comp_wt: component collection weighting. (meta collections only).
       22 | document_number: document number in the crawl. An early position in the crawl may correlate with importance
       23 | host_incoming_link_score
       24 | host_click_score
       25 | host_linking_hosts_score
       26 | host_linked_hosts_score
       27 | host_rank_in_crawl_order_score
       28 | host_domain_shallowness_score
       29 | doc_matches_regex: document matches administrator supplied regex
       30 | doc_does_not_match_regex: document does not match administrator supplied regex
       31 | titleWords: number of words in title
       32 | contentWords: number of indexed words in document
       33 | compressionFactor: compressibility of document text
       34 | entropy: entropy of document
       35 | stopwordFraction: fraction of stopwords in the document
       36 | stopwordCover: fraction of stopword list present in the document
       37 | averageTermLen: average term length
       38 | distinctWords: number of distinct words in the document
       39 | maxFreq: frequency of most frequently occurring term
       40 | titleWords_neg: Neg number of words in title
       41 | contentWords_neg: Neg number of indexed words in document
       42 | compressionFactor_neg: Neg compressibility of document text
       43 | entropy_neg: Neg entropy of document
       44 | stopwordFraction_neg: Neg fraction of stopwords in the document
       45 | stopwordCover_neg: Neg fraction of stopword list present in the document
       46 | averageTermLen_neg: Neg average term length
       47 | distinctWords_neg: Neg number of distinct words in the document
       48 | maxFreq_neg: Neg frequency of most frequently occurring term
       49 | titleWords_abs: Abs number of words in title
       50 | contentWords_abs: Abs number of indexed words in document
       51 | compressionFactor_abs: Abs compressibility of document text
       52 | entropy_abs: Abs entropy of document
       53 | stopwordFraction_abs: Abs fraction of stopwords in the document
       54 | stopwordCover_abs: Abs fraction of stopword list present in the document
       55 | averageTermLen_abs: Abs average term length
       56 | distinctWords_abs: Abs number of distinct words in the document
       57 | maxFreq_abs: Abs frequency of most frequently occurring term
       58 | titleWords_abs_neg: Neg abs number of words in title
       59 | contentWords_abs_neg: Neg abs number of indexed words in document
       60 | compressionFactor_abs_neg: Neg abs compressibility of document text
       61 | entropy_abs_neg: Neg abs entropy of document
       62 | stopwordFraction_abs_neg: Neg abs fraction of stopwords in the document
       63 | stopwordCover_abs_neg: Neg abs fraction of stopword list present in the document
       64 | averageTermLen_abs_neg: Neg abs average term length
       65 | distinctWords_abs_neg: Neg abs number of distinct words in the document
       66 | maxFreq_abs_neg: Neg abs frequency of most frequently occurring term
       67 | lexical_span_score
       68 | doc_matches_cgscope1: Documents which match gscope defined by -cgscope1 (if defined)
       69 | doc_matches_cgscope2: Documents which match gscope defined by -cgscope2 (if defined)
       70 | doc_does_not_match_cgscope1: Documents which do not match gscope defined by -cgscope1 (if defined)
       71 | doc_does_not_match_cgscope2: Documents which do not match gscope defined by -cgscope2 (if defined)
       72 | raw_annie: Untransformed annie score linearly scaled to 0..1
   -daat=<I>                            - Specifies the maximum number of full 
                                          matches for Document-At-A-Time 
                                          processing. If set to 0, Term-At-A-Time 
                                          is used instead (dflt 5000). (range 0 - 10000000)
   -diversity_rank_limit=<I>            - Diversification won't alter ranking 
                                          beyond rank n (default 200, min 10). (range 10 - unlimited)
   -gscope1=<S>                         - Present only results whose gscope bits 
                                          match reverse Polish expression e (Bits 
                                          numbered from zero). If set to 'off', 
                                          disable any previous expression. 
   -k1=<F>                              - Set Okapi K1 to <f>. (dflt 2.0) (range 0.000000 - unlimited)
   -kmod=<I>                            - Select special scoring function i for 
                                          special fields.  0 = normal, 1 = AF1 
                                          (dflt 1). (range 0 - 1)
   -lscope=<S>                          - Present only results whose URL matches 
                                          a sort-of left-anchored pattern. 
   -lscorrect=<B>                       - Whether to correct link scores across 
                                          meta collection components (default 
                                          yes). 
   -main_homepage_factor=<F>            - Penalise score of the homepage of a 
                                          single-entity-controlled domain to 
                                          prevent over representation in results 
                                          sets.  E.g. www.anu.edu.au/ in an index 
                                          of ANU. (range 0.000000 - 1.000000)
   -meta_suppression_field=<S>          - If same_meta_suppression is activated, 
                                          the specified metadata field will be 
                                          the field to which it applies.  Only 
                                          one metadata field can be treated in 
                                          this way. 
   -near_dup_factor=<F>                 - The query processor will penalise a 
                                          result which is a near-duplicate of a 
                                          previous result  by multiplying by the 
                                          factor specified. The penalty stiffens 
                                          with more repetition. (range 0.000000 - 1.000000)
   -neardup=<F>                         - Near duplicates in ranking are 
                                          multiplied by f. f=1 turns off near-dup 
                                          detection. (range 0.000000 - 1.000000)
   -promote_urls=<S>                    - Insert the specified URLs at or near 
                                          the top of the results list for a 
                                          query.  Value is a space separated list 
                                          of URLs.  URLs must correspond to those 
                                          recorded by padre-iw.  (dflt Inactive) 
   -quanta=<I>                          - Set the number of possible score 
                                          quantisation levels for each cool 
                                          variable.  In general, a high number 
                                          should give more accurate ranking but 
                                          may slow query processing. (range 10 - 100000)
   -rank_limit=<I>                      - Limit highest rank requestable to n 
                                          (dflt 1,000,000,000). (range 10 - unlimited)
   -ranking_profile=<I>                 - Choose a profile of settings for the 
                                          ranking function.  0 - current default; 
                                          1 - Standard BM25; 2 - Traditional 
                                          (pre-12.0) Funnelback.  Setting a 
                                          profile does not override explicit 
                                          settings. (range 0 - 100) [Not CGI]
   -recency_decay_vals=<S>              - <z,w,m,y,d,c,m> - Define how recency 
                                          scores decay with time. z, w, m, y, d, 
                                          c, m are floats in the range 0 - 1,  
                                          which specify the recency score 
                                          assigned to documents, 0 days, 1 wk, 1 
                                          mth, 1 yr, 1 dec, 1 cen, 1 mill. old. 
                                          (dflt 1.0,0.75,0.5,0.25,0.025,0.0025) 
                                          Recency scores between key values 
                                          linearly interpolated. Past the 
                                          millennium, recency scores are 
                                          1/daysold. 
   -reference_date=<S>                  - If specified, recency is based on this 
                                          date rather than that of most recent 
                                          doc. Format is <yyyymmdd>, or 'today'. 
   -remove_urls=<S>                     - Prevent the specified URLs from 
                                          appearing in the results for a query.  
                                          Value is a space separated list of 
                                          URLs.  URLs must correspond to those 
                                          recorded by padre-iw.  (dflt Inactive) 
   -repetitiousness_factor=<F>          - Penalise a repetitious result by 
                                          multiplying by the factor specified. 
                                          (Repetitiousness may involve same-site, 
                                          same component or repeated metadata.) 
                                          The penalty stiffens with more 
                                          repetition. (range 0.000000 - 1.000000)
   -same_collection_suppression=<F>     - While searching a meta-collection, 
                                          penalise the second result from the 
                                          same primary collection as a previous 
                                          result by multiplying by the factor 
                                          specified. The penalty stiffens with 
                                          more repetition. (range 0.000000 - 1.000000)
   -same_meta_suppression=<F>           - Penalise the second result with the 
                                          same value for a specified metafield as 
                                          a previous result by multiplying by the 
                                          factor specified. The penalty stiffens 
                                          with more repetition. (range 0.000000 - 1.000000)
   -sco=<S>                             - <n>[<classes>] Set doc scoring mode to 
                                          n, using the classes specified.  Most 
                                          common values: 0 - score using doc text 
                                          only ;  1 - no scoring.  Produce an 
                                          unordered set of results ; 2 - score 
                                          using anchortext and URLs as well, 
                                          upweight titles (or whatever fields are 
                                          configured with -specf). For example to 
                                          automatically look in fields 'u' and 'v' 
                                          for the query terms set -sco=2[u,v] 
   -scope=<S>                           - Present only results whose URL 
                                          satisfies the include/exclude scopes 
                                          included in list (comma separated). 
                                          e.g. 
                                          -scope=anu.edu.au,-anu.edu.au/archives 
   -sort=<S>                            - Sort top results by <string>. Possible 
                                          values: 'date', 'adate' (ascending 
                                          date), 'title', 'dtitle' (descending 
                                          title), 'size' (file size), 'dsize' 
                                          (descending filesize), 'url', 'durl' 
                                          (descending url), 'coll' (collection 
                                          name, then score), 'dcoll' (descending 
                                          collection name, then score), 'meta<f>' 
                                          (by metadata field f, then 
                                          score), 'dmeta<f>' (descending metadata 
                                          field f, then score), 'shuffle' (random 
                                          to avoid bias), 'collapse_count' (to 
                                          order by the number of collapsed 
                                          documents, with the largest collapsed 
                                          set first), 'acollapse_count' (with the 
                                          largest collapsed set last), 'prox' 
                                          (for geo search:  Sort top results by 
                                          proximity to origin), 'dprox' (for geo 
                                          search:  Sort top results by descending 
                                          proximity to origin).  (dflt is 
                                          case-insensitive for title and meta) 
   -sort_sensitive=<B>                  - Use case-sensitive sorting when sorting 
                                          results by title or metadata strings. 
   -sortall=<B>                         - Include partial matches in the 
                                          resorting performed by -sort. 
   -specf=<S>                           - Fields listed in string s, as a list of 
                                          comma separated fields surrounded by 
                                          square brackets, will be scored 
                                          specially and added to query when using 
                                          the -sco=2 mode (dflt '[k,K]'). 
   -sss_defeat_pattern=<S>              - URLs matching the specified pattern 
                                          (currently a simple string match) will 
                                          not be subject to samesite suppression.  
   -static_cool_exponent=<F>            - Control the extent to which static 
                                          scores are attenuated with length of 
                                          query.  0 => no attenuation; 1 => max 
                                          attenuation. Attenuation by len ** -f. (range 0.000000 - 1.000000)
   -title_dup_factor=<F>                - The query processor will penalise a 
                                          result which has exactly the same title 
                                          as a previous result by multiplying by 
                                          the factor specified. The penalty 
                                          stiffens with more repetition. (range 0.000000 - 1.000000)
   -unknown_daysold=<I>                 - A doc with unknown date is assumed to 
                                          be d days old (for recency calcs) (dflt 
                                          366). (range 0 - unlimited)
   -use_Paik=<B>                        - Use the tf.idf scheme proposed by Jiaul 
                                          Paik at SIGIR 2013 rather than the more 
                                          conventional BM25 variant. 
   -use_secds=<B>                       - When working with domain-importance 
                                          features in ranking, use SECDs if value 
                                          is on, and raw domain names otherwise.  
   -vsimple=<S>                         - Very simple ranking. If set to 'on', 
                                          equivalent to -sco=0 -cool=off -SSS=0 
                                          -kmod=0. 
   -weight_only_fields=<S>              - Documents will not be retrieved in DAAT 
                                          mode if they only match unfielded query 
                                          terms in one or more of the implicit 
                                          fields listed here.  For example, 
                                          specifying '[K,k]' will stop the query 
                                          'Monica Lewinski' matching a document 
                                          solely because of click data or 
                                          referring anchortext. 
   -wmeta.<K=V>                           - wmeta.C=F Set upweighting factors for 
                                          metadata class scoring. C - metadata 
                                          class;  F - weight to set. (dflt 0.5 
                                          for 'k' and 'K', 1 for everything else). 
   -xscope=<S>                          - Present only results whose URL exactly 
                                          matches the provided URL (after 
                                          canonicalisation). 
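
The piecewise-linear decay described for -recency_decay_vals can be sketched as below. This is an illustration under stated assumptions, not PADRE's code: the day counts used for a month, year, decade, century and millennium are assumed, the function name is hypothetical, and the sample values are made up. Note that the documented default lists six values while seven key ages are named, so the seventh value in the sample is purely illustrative.

```python
from bisect import bisect_right

# Key ages in days: 0 days, 1 week, 1 month, 1 year, 1 decade, 1 century,
# 1 millennium. The exact day counts are assumptions for this sketch.
KEY_AGES = [0, 7, 30, 365, 3650, 36500, 365000]


def recency_score(days_old, decay_vals):
    """Interpolate a recency score for a document that is days_old days old.

    decay_vals holds seven floats (z, w, m, y, d, c, m), one per key age.
    """
    if len(decay_vals) != len(KEY_AGES):
        raise ValueError("expected one value per key age")
    if days_old <= 0:
        return decay_vals[0]
    if days_old > KEY_AGES[-1]:
        return 1.0 / days_old  # past the millennium
    i = bisect_right(KEY_AGES, days_old)  # index of first key age > days_old
    if i == len(KEY_AGES):
        return decay_vals[-1]  # exactly one millennium old
    lo_age, hi_age = KEY_AGES[i - 1], KEY_AGES[i]
    lo_val, hi_val = decay_vals[i - 1], decay_vals[i]
    frac = (days_old - lo_age) / (hi_age - lo_age)
    return lo_val + frac * (hi_val - lo_val)


# Hypothetical settings; the last value is invented for illustration.
sample_vals = [1.0, 0.75, 0.5, 0.25, 0.025, 0.0025, 0.00025]
```

Under these assumptions a document 3.5 days old falls halfway between the 0-day and 1-week key values.
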
S. Result collapsing options:

   -collapsing=<B>                      - Activate collapsing. Collapsing will be 
                                          based on document content ('$') unless 
                                          a collapsing_sig value is specified. 
                                          Note that use of this option will 
                                          disable result set diversification. 
   -collapsing_SF=<S>                   - Metadata fields to include in display 
                                          for collapsed documents (assuming 
                                          collapsing_num_ranks is non-zero).  
                                          (dflt no fields). To view metadata 
                                          fields 'id' and 'a' set this to 
                                          '[id,a]'. 
   -collapsing_label=<S>                - Label to indicate why items have been 
                                          collapsed.  (dflt "which are very 
                                          similar") 
   -collapsing_num_ranks=<I>            - Specify how many collapsed results are 
                                          to be shown under the uncollapsed ones. 
                                           (dflt 0) (range 0 - 1000)
   -collapsing_scoped=<B>               - Scope to only documents which have been 
                                          collapsed on. Default is off. 
   -collapsing_sig=<S>                  - The collapsing_control segment to use 
                                          when collapsing.  E.g. "[a,p]", 
                                          collapse on author+publisher. The value 
                                          must correspond to one segment of the 
                                          indexing.collapse_fields string. (A 
                                          segment is a comma separated list of 
                                          fields surrounded by square brackets) 
                                          (dflt '[$]' (Collapsing on document 
                                          content.)) 
T. Security options:

   -dls_internal_test=<I>               - This allows testing of the padre side 
                                          of the custom document level security 
                                          mechanism. There is no call out to an 
                                          external function. The value is 
                                          interpreted as a combination of bits:  
                                          1 bit - dls_internal_test is active/not 
                                          active; 2 bit - selects whether 
                                          MINRESULTS mode is used or not. During 
                                          internal testing, every odd numbered 
                                          document in the original ranking is 
                                          arbitrarily treated as inaccessible. (range 0 - unlimited)
   -ipreject=<S>                        - <queryLimit>,<windowSeconds>,<upperQuery

                                          requests from a single machine.  Allow 
                                          <queryLimit> queries per 
                                          <windowSeconds>, don't record more 
                                          than <upperQueryLimit> queries.  [Not CGI]
   -ldLibraryPath=<S>                   - Full path to security plugin library  [Not CGI]
   -locking_model=<S>                   - Name of locking model, either "trim" or 
                                          "sharepoint".  [Not CGI]
   -no_security=<B>                     - Disable DLS, available as a command 
                                          line option.  [Not CGI]
   -secPlugin=<S>                       - Name of security plugin library  [Not CGI]
   -secPluginScript=<S>                 - Name of security plugin script  [Not CGI]
   -userkeys=<S>                        - Conduct this search with security keys 
                                          specified by s.  [Not CGI]
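
The bit interpretation described for -dls_internal_test can be sketched as follows. The function names are hypothetical, and the sketch assumes that "1 bit" and "2 bit" mean the bits with values 1 and 2, and that documents are numbered from 1 in the original ranking. The effect of MINRESULTS mode is not modelled.

```python
def decode_dls_internal_test(value):
    """Split the option value into (active, minresults) flags."""
    active = bool(value & 1)      # bit with value 1: test active or not
    minresults = bool(value & 2)  # bit with value 2: MINRESULTS mode
    return active, minresults


def simulate_internal_test(ranking, value):
    """Drop every odd-numbered document (1-based) when the test is active."""
    active, _ = decode_dls_internal_test(value)
    if not active:
        return list(ranking)
    return [doc for pos, doc in enumerate(ranking, start=1) if pos % 2 == 0]
```
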
U. Spelling options:

   -spelling=<B>                        - Activate spelling suggestion mechanism. 
   -spelling_alpha=<F>                  - Set the weighting between 'closeness to 
                                          the query' and support in the 
                                          collection for a candidate suggestion. 
                                          Big alpha, high weight on closeness to 
                                          the query. (range 0.000000 - 1.000000)
   -spelling_blend_thresh=<F>           - Confidence threshold for automatically 
                                          blending results for a query suggestion 
                                          with those from the user's original 
                                          query. (range 0.000000 - 1.000000)
   -spelling_difflen_thresh=<I>         - Don't make suggestions more than i 
                                          characters longer or shorter than query. (range 0 - 1000)
   -spelling_dym_thresh=<F>             - Confidence threshold for making a 'Did 
                                          you mean' suggestion. (range 0.000000 - 1.000000)
   -spelling_edist_constant=<F>         - Don't make suggestions whose edit 
                                          distance from the query exceeds f + 
                                          query_length * spelling_edist_proportion (range 0.000000 - 1000.000000)
   -spelling_edist_proportion=<F>       - Don't make suggestions whose edit 
                                          distance from the query exceeds 
                                          spelling_edist_constant + query_length 
                                          * f (0<=f<=1) (range 0.000000 - 1.000000)
   -spelling_fullmatch_trigger_const=<F>- Don't look for suggestions if there are 
                                          at least f * log10(num docs) full 
                                          matches. (range 0.000000 - unlimited)
   -spelling_include_context=<B>        - Include the non-corrected part of the 
                                          query in the suggestion link. 
   -spelling_min_querylen=<I>           - Suggestions not made for queries 
                                          shorter than this. (range 1 - 1000)
   -spelling_wt_thresh=<F>              - Don't make suggestions whose weight is 
                                          less than this.  Weight is complex to 
                                          explain, sorry. 
                                          (range 0.000000 - 100.000000)
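
The arithmetic behind the edit-distance and full-match options above can be
sketched in a few lines of Python. This is only an illustration built from the
option descriptions: the function names are hypothetical, and PADRE's real
implementation may differ (for example in how query length is measured).

```python
import math


def max_edit_distance(query_length, edist_constant, edist_proportion):
    # Assumed reading of the option text: suggestions whose edit distance
    # from the query exceeds this bound are not made.
    return edist_constant + query_length * edist_proportion


def should_look_for_suggestions(full_matches, num_docs, trigger_const):
    # Assumed reading of the option text: don't look for suggestions when
    # there are at least trigger_const * log10(num_docs) full matches.
    # num_docs must be greater than zero.
    return full_matches < trigger_const * math.log10(num_docs)
```

With spelling_edist_constant=1 and spelling_edist_proportion=0.2, a
10-character query would tolerate suggestions up to roughly 3 edits away.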
V. TREC specific options:

   -trec_runid=<S>                      - For TREC participation: Each result in 
                                          TREC format will include this runid. 
   -trec_topic=<I>                      - For TREC participation: The first query 
                                          in a batch will get this topic number. 
                                          Each new query will increase the number 
                                          by one. (range 0 - unlimited)
   -trecids=<B>                         - For TREC participation: Each result in 
                                          TREC format will use the TREC docno 
                                          rather than a URL.


37. pan-look

Purpose: Efficient location of all lines in a sorted text file which match a prefix.

Usage: pan-look <prefix> <file name>
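
As an illustration of the general technique (not pan-look's actual code), a
prefix lookup over sorted lines can use binary search to find the first
candidate and then scan forward while lines still match. The sketch below
works on an in-memory list and assumes the lines are sorted in the same order
Python uses to compare strings; the function name is hypothetical.

```python
import bisect


def lines_with_prefix(sorted_lines, prefix):
    # Binary search for the first line that is >= prefix. In a sorted list
    # every line starting with prefix sits in one contiguous run from there.
    start = bisect.bisect_left(sorted_lines, prefix)
    matches = []
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break  # past the end of the matching run
        matches.append(line)
    return matches
```

For example, calling lines_with_prefix on the list
["apple", "banana", "band", "cat"] with prefix "ban" returns the two lines
"banana" and "band".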

38. phrasefinder

Purpose: Extract frequently occurring word tuples ('phrases') from a collection.

Usage: phrasefinder <stem> [-unco] [-hash_limit=<i>] [-num_to_show=<i>] [-max_tuple=<i>] [-debug]

phrasefinder reads the .results file of <stem> and locates candidate word tuples
('phrases') up to a configurable maximum length.  Phrases are sequences
of up to max_tuple words which are unbroken by anything other than space.

Candidates are stored in a hashtable and counted.  At most hash_limit
candidates are stored; once this limit is reached, the program
exits. (Useful for testing or for limiting execution time and virtual
memory requirements.)  When processing finishes, the top num_to_show
candidates are sorted in descending order of frequency and written to
<stem>.phrases, in the same format as the .lex file, with word breaks
represented by hyphens.
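
The counting scheme described above can be sketched as follows. This is a
simplified illustration, not phrasefinder's implementation: it assumes the
input has already been split into runs of space-separated words, that a tuple
has at least two words, and that hitting the candidate limit simply stops
processing. It does not read .results files or reproduce the .lex output
format, and the function names are hypothetical.

```python
from collections import Counter


def count_phrases(runs, max_tuple, hash_limit):
    counts = Counter()
    for run in runs:
        words = run.split()
        for n in range(2, max_tuple + 1):           # tuple lengths 2..max_tuple
            for i in range(len(words) - n + 1):
                phrase = "-".join(words[i:i + n])   # word breaks become hyphens
                if phrase not in counts and len(counts) >= hash_limit:
                    return counts                   # candidate limit reached
                counts[phrase] += 1
    return counts


def top_phrases(runs, max_tuple=3, hash_limit=1_000_000, num_to_show=10):
    counts = count_phrases(runs, max_tuple, hash_limit)
    # Descending frequency; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:num_to_show]
```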


39. show_annotations_for_doc

Purpose: Given an annotation index, summarise the annotations applied to a given URL.

Usage: show_annotations_for_doc <index_stem>|<collection> <URL>|<URL64>|<DOCNO> [-csv|-html|-xml]
  - Show the annotations applying to the specified URL and their weights.
  - Don't forget to quote or escape shell metacharacters in a URL!
  - Default output format is XML.


40. url_tagger

Purpose: Apply the tags in a tag mapping file to a PADRE index.

Usage 1: url_tagger <stem> (-clear|<url-tags-file>)

Usage 2: url_tagger -v

url_tagger -clear clears all tags from all documents in the index.
url_tagger <url-tags-file> takes a url-tags file and applies it to
the relevant entries in <stem>.results.

url_tagger -v shows version information.

Lines in the url-tags file are in the form: <url> <comma-separated-taglist>
It is assumed that <url> contains no spaces and tags contain no commas.
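
To make the url-tags line format concrete, here is a minimal Python sketch of
a parser for one line. It is an illustration only: the function name is
hypothetical, and it assumes the URL and the tag list are separated by
whitespace, which is how the form above reads. url_tagger's own parsing may
be stricter or more lenient.

```python
def parse_url_tags_line(line):
    # Split once on whitespace: the URL contains no spaces, so everything
    # after the first gap is the comma-separated tag list.
    parts = line.strip().split(None, 1)
    if not parts:
        return None  # blank line
    url = parts[0]
    raw_tags = parts[1].split(",") if len(parts) > 1 else []
    # Tags contain no commas; drop empty entries left by stray commas.
    tags = [t.strip() for t in raw_tags if t.strip()]
    return url, tags
```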