Padre binaries and command line usage
This page lists all the Padre binaries and their corresponding usage messages.
Purpose: Tuning padre-sw ranking parameters based on a C-TEST file. Usage: FineTune <collection>[.<profile>] ... [-perl_bin=path_to_perl_bin] [-help] [-verbose[=<level>]] [-timeout=<hours>] [-query_limit=<num_queries>] [-alpha=<f>] [-rvalues=<i>] [-adjust=<i>] [-sample=on|<number>][<mode> ...] [-conf] [-qp=padre_subpath] [-index_dir.<collection>=<index directory>] [-lock_file=<file to lock>] e.g. FineTune lse -daat -annieonly e.g. FineTune agosp.doha -timeout=7.5 Use -daat0 to tune term-at-a-time. Timeouts and query limits: Apply separately to each mode. Defaults are 5 hours and 1 million queries. After a timeout, or when the query limit has been exceeded, the best tuning found so far for that mode will be recorded in the .best file for the mode. -perl_bin=/path/to/perl sets the path to the perl binary to use. -alpha sets the balance between success rate and wmum1 in tuning. [Dflt 0.75] Value must lie between 0 (ignore success rate) and 1 (ignore wmum1). -rvalues - sets no. of values to explore for optype=2 (real) [Dflt 11] -adjust - sets no. of steps to remove when adjusting exploration range for optype=2 (real) dimensions. [Dflt 5] -conf extracts the mode to tune from collection.cfg. (N/A for multituning.) -help gives more detailed instructions and exits. -index_dir can be used to set the index directory for a particular collection. The directory must contain an index prefixed by 'index'. -lock_file can be used to lock a file for the entire duration of tuning; if the lock cannot be acquired, tuning will not start. -redirect_stdout can be used to redirect stdout to a given file. -redirect_stderr can be used to redirect stderr to a given file. -write_finish_time_to writes the tuning finish time in ISO-8601 to the given file.
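The usage message only says that -alpha balances success rate against wmum1, with 0 ignoring success rate and 1 ignoring wmum1. A linear blend is one formula consistent with those two endpoints. The sketch below assumes that form; the function name and the exact combination are illustrative and are not confirmed to be what FineTune computes.

```python
def blended_score(success_rate: float, wmum1: float, alpha: float = 0.75) -> float:
    """Hypothetical linear blend of the two tuning measures.

    alpha = 0 ignores success rate, alpha = 1 ignores wmum1,
    matching the endpoints described in the usage message.
    The linear form itself is an assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * success_rate + (1.0 - alpha) * wmum1
```

Under this assumption the default of 0.75 weights success rate three times as heavily as wmum1.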
Purpose: Tuning padre-sw query-independent ranking parameters based on a C-TEST file. Usage: QiTune stem C-TESTfile Given a PADRE index and a file of useful URLs (extracted from the C-TEST file) compute a set of query-independent cool settings suitable for passing to padre-do which (hopefully) optimise the difference in ave scores between the useful docs and the general collection.
Purpose: Tuning PADRE spelling suggestion system based on a test file e.g. mycoll.spelltest. Usage: SpellTune <collection>[.<profile>] ... [-tune_bsi] [-help] [-verbose[=<level>]] [-timeout=<hours>] [-query_limit=<num_queries>] [-rvalues=<i>] [-adjust=<i>] [-sample=on|<number>][-qp=padre_subpath] e.g. SpellTune lse -annieonly e.g. SpellTune agosp.doha -timeout=7.5 (Timeouts and query limit defaults are 5 hours and 1 million queries. After a timeout, or when the query limit has been exceeded, the best tuning found so far will be recorded in the .bestspell file for the mode.) (-tune_bsi - tunes the build_spelling_index params. Slow.) (-rvalues - sets no. of values to explore for optype=2 (real)) [Dflt 11] (-adjust - sets no. of steps to remove when adjusting exploration range for optype=2 (real) dimensions.) [Dflt 5] (-help gives detailed instructions.)
Annie version 1.13 (11 Mar 2010) Purpose: Builds an annotation index for a collection, specified by <stem>, from a list of files in anchors.gz format. Usage: annie-a <stem> [<stem_or_file> ...] [-phrasefile=<filename>] [-deb] [-hashbits <10..30>] [-maxlines <n>] [-wts <wt0> <wt1> <wt2> <wt3> <wt4>] [-stripstops] [-STOP=<filename>] [-canon] [-rejecturls] [-rejectnumeric] [-quicken] [-maxwds <i>] [-maxlen <i>] [-build_annou=on/off] [-build_lcache=on/off] [-nep_limit=0|1|2|3] <stem> must reference either a meta collection or a primary index. <stem_or_file> may be either a stem as above or the name of a file in anchors.gz format. In the case <stem> or <stem_or_file> is a meta collection, annie-a will look for the anchors.gz files from each of the component collections and use them for creating the annotation index for the collection specified by <stem>. If any anchors.gz file changes for a component collection, annie-a will need to be run again for the meta collection. -quicken improves query performance by using <coll id, doc id> pairs. The coll id is dependent on the sdinfo file, if the sdinfo file is changed annie-a will need to be run again for the meta collection with this option. It is recommended that the most recent collection is placed at the top of the sdinfo file.
Purpose: Convert URL references in an annotation index into (component, docno) to speed query processing. Usage: annie-quicken anno_stem index_stem
Purpose: Builds an auto-completion index file (.autoc) from a list of input files. Usage: build_autoc stem input_file ... [-collection name -profile name] [-partials] [-label_organics] [-debug] e.g. build_autoc index example.csv where example.csv will be sorted and indexed into index.autoc. Input_file(s) must end in .csv, .suggest, or .cfg -profile <name> - generate scoped .autoc file for the specified profile. A previous run of build_autoc must have been called with -index. -collection <name> - generate scoped .autoc file for the specified collection. Both -profile and -collection need to be specified when generating scoped suggestions. -partials - this version allows multi-word organic suggestions to be triggered either from the full suggestions or from trailing word sequences. E.g. 'big fat cat' triggered from 'fat cat' and 'cat' as well as the full string. This option turns that on. -label_organics - present a category label for all the organic completions -sample <val> - Sample postings of suggestion terms, to handle large collections, <val> ranges 0 - 300; speeds up processing with the effect of sampling the suggestions. (1/val postings are used). build_autoc supports the building of a single .autoc file from multiple input files of the same or different types. Files with very simple format can be combined with hand-crafted files containing complex actions. Completion weights from a .suggest file are automatically determined, while they can be manually specified in a CSV file. Completion weights from Best Bets default to 100. Note: this binary requires query processor options to be set via the environment; calling it directly from the command line or via Funnelback workflow is not supported. SUGGEST FORMAT -------------- .suggest files built by build_spelling_index can be supplied as input.
Reasons for doing this include taking advantage of an index optimised for completion purposes; and integrating automated spelling suggestions with hand-crafted entries. CFG FORMAT ---------- Input files with .cfg suffix are no longer supported CSV FORMAT ---------- Each line of a .csv file must contain eight fields (7 commas), corresponding to: key, weight, display, display_type, category, category_type, action, and action_type. Fields except key and weight may be empty. Two meta characters are recognized within a field: backslash and double quote. These are handled as follows in the two cases: (A) Unquoted text: A single backslash is not passed through, while the character following it is passed through without applying any tests. This means that a double backslash in input leads to a single backslash in output and that commas or double quotes preceded by a backslash do not have their normal meaning. (B) Quoted text: The double quotes beginning and ending a quoted section are not passed through. Within a quoted section a double quote may be passed through by either doubling it ("") or by preceding it with a backslash (\"). By these means it is possible to pass through HTML and/or JSON containing quotes and/or commas
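The backslash and double-quote rules for .csv input can be sketched as a small field splitter. This is an illustration of rules (A) and (B) as described above, not the parser build_autoc actually uses. The function name is hypothetical, and one detail is an assumption: inside a quoted section, a backslash that is not followed by a double quote is kept literally, since the text does not say how that case is handled.

```python
from typing import List


def split_autoc_csv_line(line: str) -> List[str]:
    """Split one CSV line into fields using the rules described above."""
    fields: List[str] = []
    current: List[str] = []
    in_quotes = False
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and line[i + 1] == '"':
                    current.append('"')   # rule (B): "" yields one quote
                    i += 2
                    continue
                in_quotes = False         # closing quote is not passed through
            elif ch == "\\" and i + 1 < n and line[i + 1] == '"':
                current.append('"')       # rule (B): \" yields one quote
                i += 2
                continue
            else:
                current.append(ch)        # assumption: other chars kept as-is
        else:
            if ch == "\\":
                if i + 1 < n:
                    current.append(line[i + 1])  # rule (A): next char, untested
                i += 2
                continue
            if ch == '"':
                in_quotes = True          # opening quote is not passed through
            elif ch == ",":
                fields.append("".join(current))
                current = []
            else:
                current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields
```

A line with seven separating commas yields the eight fields key, weight, display, display_type, category, category_type, action and action_type; commas inside quotes or after a backslash do not split.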
Purpose: create a match only index, which build_autoc can use to build profiled query suggestions. Usage: build_match_only_index stem
Purpose: To build a spelling suggestion file (.suggest/.suggest2) for a collection. Usage: build_spelling_index index_stem num_thresh [<metadata_class_names> [[<lexiweight>] [[<blacklist_file>] [<whitelist_file>]]] e.g. build_spelling_index index 2 [@,t,c] where the listed comma separated metadata class names, '@,t,c', are the ones to be scanned for suggestions. '@' means use the .anno file. '%' means use unfielded words from index. lex. + means use phrases from index.phrases (if present). If no fields are listed, "@,+,t,%" is assumed. num_thresh - minimum weight of suggestions recorded in suggest index lexiweight - controls the weight of lexicon suggestions relative to annotations. wt = lexiweight * sqrt(df) (dflt lexiweight = 1.00) blacklist_file - manual list of suggestions which should NOT be included in the index. (one per line) whitelist_file - manual list of suggestions to include in the index. (one per line.)
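The lexicon weighting formula wt = lexiweight * sqrt(df) can be worked through in a few lines. The function names are hypothetical, and treating num_thresh as an inclusive lower bound on the weight is an assumption based on the phrase "minimum weight".

```python
import math


def lexicon_weight(df: int, lexiweight: float = 1.0) -> float:
    """Weight of a lexicon suggestion: lexiweight * sqrt(df)."""
    return lexiweight * math.sqrt(df)


def passes_threshold(df: int, num_thresh: float, lexiweight: float = 1.0) -> bool:
    """Assumed rule: keep a suggestion when its weight is at least num_thresh."""
    return lexicon_weight(df, lexiweight) >= num_thresh
```

With the default lexiweight of 1.00 and num_thresh of 2, a term would need a document frequency of at least 4 to be recorded under this reading.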
Purpose: To convert a tuning file in CSV format into a C-TEST file for use with e.g. FineTune. Usage: csv2ctest: infile.csv [-utils=recip|-utils=equal] [-queryweights] Output in C-TEST format will be in infile.ctest The input file is assumed to be a syntactically correct comma separated value (CSV) file in which cells are separated by commas. Double quotes around all or part of a cell allow inclusion of commas. The quotes are stripped off before processing. The input file may contain comment lines starting with a hash. The first column in infile.csv is always assumed to contain a query. If no options are given, then the remaining columns contain desired answer URLs for that query, in descending order of utility. Utility scores start at 4 and then gradually decline to 1: 4, 3, 2, 1, 1, 1 ... This behaviour may be modified as follows: -utils=equal - All of the answers are given equal utility values. -utils=sqrt - Utility values drop off as 1/sqrt(rank). -utils=recip - Utility values drop off faster -- as 1/rank. -queryweights - if this is given, the second column is expected to contain the numerical weight associated with the query and the remaining columns contain the answer URLs
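The utility schemes above can be compared with a short sketch. The function name is hypothetical; ranks are assumed to start at 1 for the first answer URL, and the constant used for -utils=equal (1.0) is an assumption, since the text only says the values are equal.

```python
import math


def utility(rank: int, scheme: str = "default") -> float:
    """Utility assigned to the answer at a given rank (1 = first answer column)."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    if scheme == "default":
        return float(max(5 - rank, 1))   # 4, 3, 2, 1, 1, 1 ...
    if scheme == "equal":
        return 1.0                       # assumed constant value
    if scheme == "sqrt":
        return 1.0 / math.sqrt(rank)     # drops off as 1/sqrt(rank)
    if scheme == "recip":
        return 1.0 / rank                # drops off as 1/rank
    raise ValueError("unknown scheme: " + scheme)
```

For the fourth answer this gives 1.0 by default, 0.5 with sqrt and 0.25 with recip, which shows how much faster recip discounts lower-ranked answers.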
Purpose: To display the contents of an annotation index in geek-readable form. Usage: dump_annotation_file <annotationfile>
Purpose: To display the contents of a query completion file in geek-readable format. Usage: dump_autoc <stem|collection|autoc_file>
Purpose: To display the contents of a spelling suggestion file in geek-readable format. Usage: dump_suggestion_file <index_stem> - dumps contents of <index_stem>.suggest
Purpose: Map a URL to that document's number within an index stem. Usage: get_docnum_from_url <index_stem> <url> Prints the docnum for a given URL to standard out. Prints "notfound" if the URL is not found. <index_stem> - the common prefix (including path) of the index files <url> - the URL to look up the document number for
Purpose: Within an index, output the URL of the document identified by component number and document number. Usage: get_url_from_component_document_pair <index_stem> <component_number> <document_number> Warning: Doesn't handle nested .sdinfo files. (Hierarchical meta collections.)
Purpose: Print the URL of a document in a primary index, given its document number. Inverse of get_docnum_from_url. Usage: get_url_from_docnum <index_stem> <doc_num>
Purpose: Extract a subset of entries in a list of anchors.gz files which match a specified pattern. Usage 1: harvest_anchortext -targ|-text|-any|-source pattern anchor_text_file ... Usage 2: harvest_anchortext -noneps <affiliates_file> anchor_text_file ... Extracts a subset of lines in the anchor_text_files. The composition of the subset depends upon the match_type argument as follows: -any - Any line (source or target) which matches pattern -targ - Any target line whose URL target matches pattern -text - Any target line whose anchortext matches pattern -source - Any source line which matches pattern + In usage 2, links within the same SECD (single-entity-controlled-domain) are suppressed, as are links between affiliated pairs of hosts listed in the affiliates_file. If there is no affiliates_file, use '-'. + Whenever a target line matches the corresponding source line is also output. + Whenever a source line matches its corresponding target lines are also output. + NOTE: nepotistic links are included unless -noneps is used.
Purpose: Extract hierarchical navigation paths from a list of anchors files. Usage: hierarchical_navpaths <stem> [-verbose] [<anchor_text_file> ...] Reads <stem>.anchors.gz, plus any additional anchortext files to identify hierarchical navigation paths (HNPs). These are output to <stem>.hnp.anchors.gz in standard anchors.gz format: <target_url> --- [H]<concatenated anchors from path> + NOTES: 1. inter-host links are ignored. 2. -verbose prints the actual HN paths to stdout. 3. All targets in the .hnp.anchors file have http://hnp as source. Warning: Not ready for use. Development of this utility is incomplete.
Purpose: Analyse a list of anchors.gz files and report on frequencies of inter-host links. Usage: host_host_link_counts [-targ|-source <pattern>] [-report] <stem> [anchor_text_file ...] Reads the anchor_text_files and outputs a table of host-host links, in descending order of link count. By default, all lines are processed, but a pattern can optionally be applied to either targets or sources. If -report is given, short and full HTML reports will be generated. -targ - Process only links whose target host matches pattern -source - Process only links whose source host matches pattern Nowadays, the first option not starting with a - is an index stem. A file <stem>.hosts is created with a table of host-related feature scores which can be used in ranking. The order of entries must correspond to the hostnum order assigned by padre-iw. + NOTE: within-host links are excluded.
Purpose: To help with conversion of padre-sw argument lists from old to key=value format
Purpose: To build an index.collapsig file to permit use of collapsed rankings. Usage: padre-cc <index_stem> [-collapse_control=<string>] [-debug=on] Utility for building a .collapsig file of collapsing signatures. If no control_string is given, a one-column file is built using the signatures from the .textsig file. The collapse_control string must consist of sets of sequences of metadata class names. Each set should be surrounded by square brackets, and sets should be separated by commas. Metadata class names are the elements of the sets and must be separated by commas. The characters $ and # may be used as special metadata class names and represent document summarisable text and document URL respectively. In future, it is planned to allow special metadata class names to be followed by a regular expression, indicating that only the part of the metadata string which matches the regex should be used in calculating the signature. Example current control string: '[$],[t,a]'. In this case the .collapsig will have two signatures per document: Column 0 is the normal document signature and column 1 is a signature derived from the concatenation of metadata fields t and a, in that order.
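The structure of a collapse_control string can be illustrated with a small parser that returns one list of metadata class names per signature column. This is a lenient sketch with a hypothetical function name; padre-cc's own parsing and validation rules are not documented here and may differ.

```python
import re
from typing import List


def parse_collapse_control(control: str) -> List[List[str]]:
    """Split a control string such as '[$],[t,a]' into its bracketed sets."""
    sets = re.findall(r"\[([^\[\]]*)\]", control)
    if not sets:
        raise ValueError("expected at least one [set] in the control string")
    # Each set becomes a list of class names; '$' and '#' are kept as-is.
    return [[name.strip() for name in s.split(",") if name.strip()] for s in sets]
```

For '[$],[t,a]' this yields two sets, corresponding to column 0 (summarisable text) and column 1 (fields t and a concatenated in that order).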
Purpose: Report on the titles in a PADRE index. Eventually to improve them. Usage: padre-ct <index_stem> Warning: Development of this utility is not yet complete.
Purpose: To check the correctness of an index, compare two indexes, or display postings for a term within an index. Usage0: padre-cw -v - print PADRE version Usage1: padre-cw stem1 stem2 [-io] - Compare two indexes. -io means ignore diff.s in offsets into .idx Usage2: padre-cw stem1 -show term - show postings for term. Also shows term before and afterward. (if applic.) Usage3: padre-cw stem1 -check [-stemsuff] [-show_all] - Check index files for stem1 (default) use -stemsuff <suffix> to supply an additional suffix for the .idx and dct files use -show_all to print every term's summary information.
Purpose: To display the metadata for documents in an index. (main purpose) Usage: padre-di <index_stem> [-check]|[-trecids]|[-metao [<docno>]][-meta [<pattern>] ] | [-metad [<pattern>] ] -check - check whether the document table appears to be internally consistent -trecid - make a mapping between trec DOCNO stored in title field and URL -meta [<pattern>] - print title and metadata information for each document whose "URL" contains pattern (case-insensitive) If no pattern is given, all docs are shown, in collection order. -metad - as for -meta but show document numbers. -metao - as for -meta but show all documents, in collection order starting from docno (default zero). -doc_per_meta - prints in JSON the number of documents each metadata class appears in. default - read in URLs and look them up, using sorted table
Purpose: Print a permutation of the document numbers in an index corresponding to descending static score. Usage: padre-do <stem> <docorderfile> [-deb] [-cool_param ...] Output is a list of docnums in descending order of cool score, printed to docorderfile. cool_param values are expected to lie in 0 - 1. Default values are the same as for padre-sw though. Of course, query-dependent cool values cool0, 7, 12, 15, 16, 17, 18, 19 are ignored because there is no query.
Purpose: Display or operate on the document flags in an index. Usage1: padre-fl <index_stem> [-clearall|-clearbits|-clearkill|-killall|-show|-sumry|-quicken] Usage2: padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -unkill Usage3: padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -kill Usage4: padre-fl <index_stem> <file_of_url_patterns> [-exactmatch] -bits hexbits OR|AND|XOR Usage5: padre-fl <index_stem> -kill-docnum-list <file_of_docnums> Usage6: padre-fl -v Note: Specify '-' as the file of url patterns to supply a single URL to standard input.
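The effect of the three operators accepted by -bits can be shown on plain integers. This is only an illustration of OR, AND and XOR against a hexadecimal mask; the function name is hypothetical, and the width and meaning of the real flag bits in the index are not described here.

```python
import operator

_OPS = {"OR": operator.or_, "AND": operator.and_, "XOR": operator.xor}


def apply_bits(flags: int, hexbits: str, op: str) -> int:
    """Combine existing flag bits with a hex mask using OR, AND or XOR."""
    if op not in _OPS:
        raise ValueError("op must be OR, AND or XOR")
    mask = int(hexbits, 16)  # accepts forms such as '0c' or '0x0c'
    return _OPS[op](flags, mask)
```

OR sets the masked bits, AND keeps only bits present in the mask, and XOR toggles the masked bits.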
Purpose: Display or manipulate document gscopes in an index. Usage0: padre-gs -v|-V|-help # print version info or detailed help on types of instructions and on program operation. Usage1: padre-gs index_stem -clear # clear all gscopes Usage2: padre-gs index_stem -show # show all gscopes Usage3: padre-gs index_stem file_of_instructions [-separate] [other_gscope] [-regex|-url|-docnum] [-verbose] [-quiet] [-dont_backup] Where: * index_stem may also be the name of a collection * file_of_instructions may be '-' to accept instructions from stdin * -separate indicates that gscope changes should be made to a copy of the .dt file first, and then copied over the original file when changes are complete. In this mode the number of gscope bits can NOT be expanded; you must ensure that enough bits are available. * other_gscope specifies a gscope to be set on documents which end up with no gscopes set. * By default instruction patterns are expected to be regexes but this may be made explicit with -regex or altered with -url or -docnum. Use padre-gs -help to obtain more information about instruction formats and pattern types. * gscope names may consist of alphanumeric ascii characters up to a length of 64 characters. * -dont_backup prevents backing up of the .dt file * -quiet suppresses the before and after summary of gscopes
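The naming rule for gscopes (alphanumeric ASCII, at most 64 characters) is easy to check before generating an instructions file. The helper below is a hypothetical pre-flight check, not part of padre-gs; rejecting empty names is an assumption.

```python
import re

# One to 64 ASCII letters or digits; the lower bound of 1 is an assumption.
_GSCOPE_NAME = re.compile(r"[A-Za-z0-9]{1,64}")


def is_valid_gscope_name(name: str) -> bool:
    """Return True if name fits the documented gscope naming rule."""
    return _GSCOPE_NAME.fullmatch(name) is not None
```

Running such a check over the names in an instructions file can catch names with punctuation or excessive length before padre-gs is invoked.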
Purpose: Display aggregated information about a URL from a PADRE index. Usage: padre-i4u -v | padre-i4u stem=<stem_or_collname> [fields=<alnum_string>] [debug=<int>] [iters=<int>] [format=json/old] [coll=collection_name] url=<url> ... Note: The functionality is implemented by a dynamic library which is usually called directly. coll= option should be set to the name of the collection corresponding to the stem= option
Purpose: Index a collection of documents. Usage1: padre-iw -V|-help|-helpadoc|-ixform (print version or help info.) Usage2: padre-iw [-f|-tar|-reo<pf>] <dir>|<file>|<url> <filestem> [<option>|<tfdir> ...] Usage3: padre-iw -secondary_update <dir>|<file> <filestem> <pf> is a text file containing a permutation of the document numbers in the original index. <dir> is a hierarchical directory of optionally gzipped files. <file> contains a list of names of optionally gzipped files. -f says that <file> is a single datafile to be indexed. For historic reasons -tar means the same as -f. Files to be indexed may be tar or WARC files (optionally gzipped). Note that individual files in a tarfile are expected to be uncompressed text. Gzipped files, unfiltered PDFs etc. are not supported yet. -reo says that <file> is the stem of a previous index to be reordered and reindexed. <pf> is a text file containing a permutation of the document numbers in the original index. Eventually, it may be possible to compute the permutation internally. For now, it must be specified via <pf>. <filestem> prefixes the names of output files. <tfdir> is the dir. in which tmp files will be written. -secondary_update creates a secondary index using the data directory specified, and using the options used in creating the primary index. Available options: A. Getting information about PADRE and its operation. -V - Print PADRE version number and exit. -ixform - Print index format version created by this indexer and exit. -help - Print this list and exit. -helpadoc - Print this list in asciidoc format and exit. -debug - Generate debugging output. -show_each_word_indexed - For debugging. Show each word occurrence (with field) as it is indexed. -show_each_word_to_file - For debugging. Print each word occurrence (with field) to <filestem>.words_in_docs -hashlog - Create a .hashlog file with incremental hashing stats. -quiet - Use terse logging. -ankdebug - Generate debugging output relating to anchortext.
-termdeb<term> - print debugging messages relating to the indexing of <term>. B. Controlling what is indexed. -nometa - Don't index any metadata except t, d and k (titles, dates and links). -nomdsfconcat - Don't concatenate strings in the mdsf file. Record first only. (Others are still indexed.) -diwimuu - Don't index words in made-up URLs (those constructed from filepath). -dias - Don't index link anchor source as part of source documents (<a> only). -ibd - Index all documents even if they appear to be binary. -ixcom - Index words in HTML and XML comments. -select<num1>,<num2> - Index every num1th file/bundle starting from num2th (from zero). -select-doc-in-bundle=<interval>,<offset> - Index every <interval> document within a bundle starting from <offset> (which starts at zero). Only works with warc store. -tarpat<regex> - Filenames in a tarfile being indexed must match regex. Default is match-everything. -csv=<fsep><skipfirst>[<quote>] - Deprecated. Use the CSV to XML filter instead. Files which are not clearly something else are assumed to be CSV format. fsep is ascii field separator, typically comma. (tab is represented by t.) skipfirst is either y or n, telling padre whether the first line in a CSV file should be skipped. quote is the character used to quote strings in fields which may contain separators. (You probably have to escape it on the command line.) If not specified, no quote character is defined. To include a quote character within a quoted section, the quote may be doubled. -csv_fields=<comma_separated_descriptor_list> - Deprecated. Use the CSV to XML filter instead. This is a list of comma separated descriptors describing how to index each column of the csv file. To index terms in a column as document text use '-'. To index terms in a column as metadata use the format: <metadata class name><content type> To skip terms in a column use 'X'.
For example: 't1,-,X', would set the first column to title, the second column would be indexed as document content and the third column would be skipped. Content-type defined in this argument should be the same as the content type in the metadata mappings. -check_url_exclusion=<on|off> - URLs matching url_exclusion pattern will not be searchable. (Default on.) -url_exclusion_pattern=<regex> - exclusion pattern to use if URLs are vetted. (Default 'file://$SEARCH_HOME/') -filepath_exclusion_pattern=<regex> - exclusion pattern to use if files are to be excluded from indexing on the basis of the filepath. If applicable, this is more efficient than excluding by URL because the URL can't be finally determined until the content has been scanned. (Default: not set) -index_subversion_dirs - Normally the .svn directories created by the subversion version control system are not indexed. Override this default. C. Controlling how things are indexed. -noax - Don't conflate accents. -unimap=<mapname> - specify a Unicode mapping to be applied when indexing and when query processing. Supported values: tosimplified, and totraditional. (Chinese only.) -deutsch=<i> - How much extra processing is done for umlaut and sz. 0 - none. München is indexed as München and Munchen 1 - München is indexed as München, Muenchen and Munchen (Dflt) 2 - As for 1 but also Muenchen is indexed as München, Muenchen and Munchen (As a side-effect to allow for compounds, SORT_SIGNIF is increased to 40.) -nz=<i> - How much extra processing is done for Māori. 0 - none. Māori and Mäori are indexed as Māori or Mäori resp. and Maori (Dflt) 1 - Māori is indexed as Māori, Maaori and Maori Mäori is indexed as Mäori, Māori, Maaori and Maori -no_cjkt_grams - Suppress the indexing of bigrams/unigrams in CJKT text. It is assumed that said text has been pre-segmented into words, and that normal word-based indexing is needed. -QL_depth=<i> - Activate quicklinks on default pages of up to depth i. Use internal QL defaults.
(Dflt 0 = Off) -QL_config=<f> - Activate quicklinks. Read quicklinks configuration options from file f. -docscan_depth=<i> - When trying to determine doc type and charset indexer will look up to i char.s into the fdoc. (Dflt 20480) -forcexml - Use the XML parser on all documents. -case - Store case information in postings. Currently unsupported. Note that setting this reduces the approximate max number of unique terms from ~950M to ~240M. -SORTSIG<num> - How many [UTF-8] characters in a word are significant. Default 30 -dilw - Don't index words or use words in summaries that are longer than what is set by -SORTSIG. D. Controlling metadata indexing. -xml-config=<file> - <file> specifies a file defining XML indexing configurations in json format. -MM=<file> - <file> specifies a file defining metadata mappings for both HTML and XML documents. -XMF<file> - (Deprecated) <file> specifies a file defining XML field mappings. -MMF<file> - (Deprecated) <file> specifies a file defining meta tag mappings. -ifb - Index a special word '$++' at the start and end of each metadata field (on by default). -noifb - Do not index a special word '$++' at the start and end of each metadata field. -facet_item_sepchars=<string> - Which chars are used to separate metadata facet items. [Dflt '|'] -map[<f>] - Map anchor text in source file to metafield f. If <f> is absent, outgoing anchortext is unfielded content. (dflt <f> absent) -EM<file> - <file> is a file of external metadata, multiple files may be supplied by setting this multiple times. -NIM - Ignore explicitly specified internal metadata. -collfield=<f> - Index the name of a collection as metadata in each doc and assign to field f. -collection_name= - Set the name of the collection being indexed. -metadata_topk_capacity=<I> - Sets the maximum number of metadata names or XML paths padre will keep track of for counting the most frequent metadata or xpath that could be mapped. 
-metadata_topk_k=<I> - Sets the number of the most frequent metadata names or XML paths padre should report on after indexing. E. Controlling link and anchortext handling. -noank_record - Don't extract, record or index anchortext. - .anchors.gz file not processed. No link counts possible. -noank_index - Extract and record but don't index anchortext. - .anchors.gz file can be post-processed by annie-a -noank - Temporary synonym for -noank_index. Deprecated. -nocanon - Don't canonicalize URLs when storing URLs or matching anchortext. -canon.anchor_collapse=<on|off> - Controls whether PaDRE should canonicalise URLs with fragments (anchors) on: PaDRE will drop any characters with the anchor symbol (#). off: PaDRE will treat URLs with anchors as unique URLs. -dpdf - Produce but don't process the anchors distilled file. -nep_action=<0|1|2> - Action to take for nepotistic links. 0 - treat the same as other links. 1 - ignore links of types greater than nep_limit. 2 - limit the number of repetitions of links of types greater than nep_limit. (dflt) -nep_limit=<0|1|2|3> - Ignore nepotistic links of types greater than the limit. 0 - unaffiliated links from outside the target domain. 1 - links from a different host. 2 - links from the same or a closely affiliated host. 3 - dynamically generated links from such a host. -nep_cachebits=<i> - Don't let the low-value link cache grow above 2^i -noaltanx - Don't index image alt as anchortext when an image is an anchor. -nosrcanx - Don't index image src as anchortext when an image is an anchor. -BL<f> - <f> is a file of source URL patterns from which links should be ignored or treated with suspicion (Blacklist). -AD<f> - <f> is a file of SECD (single entity controlled domain) affiliations. e.g. griffith.edu.au --> gu.edu.au Links to an affiliated SECD are classified as within-domain. -RP<f> - <f> is a file of CGI parameters which should be removed from source and target URLs. 
- padre generates a regular expression from the lines in <f>. - if <f> is "conf_file" the regex will be taken from crawler.remove_parameters in the FunnelBack config file. -A<pat> - <pat> is an acceptable link target pattern. - URLs not matching pat will not be stored in anchors.gz file. - if pat is "conf_file" pat will be taken from include_patterns in FunnelBack config file. -F<file> - *<file> is an additional anchor text file. -FN<file> - Like -F but source URLs need not be looked up. -RD<dir> - *<dir> is a directory in which to look - for redirects and duplicates files. - (produced by FunnelBack etc. & PADRE). -igmaf - Ignore main anchors file. -mule<n> - Discard links to URL targets longer than <n> chars. Default is no limit. -rmat - Record targets of failed anchor lookups via stdout. -create_phrase_metadata_terms=<b> - Enables the creation of phrase terms like "$++ foo bar $++" in the dictionary for metadata. These phrase terms can be used to speed up queries like a:"$++ foo bar $++". Phrases will only be created if indexing of field boundaries is enabled, which it is by default. Disabling may reduce indexing time and index size. F. Controlling which index files are generated. -nomdsf - Suppress generation of the .mdsf file. -nolex - Suppress generation of the .lex file. -noqicf - Suppress computation of QIC features and .qicf file. -nohostf - Suppress computation of host features and .ghosts file. -cleanup - Remove superfluous files from the index directory after index has completed. G. Setting size limits. -GSB<n> - How many gscope bytes to allow for. Default/Min: 8/2. NOTE: This setting is no longer required as gscope bits are now auto-sized. -big<N> - Multiply word table sizes by 2^N from base of 256K. Default table size is 8M (i.e. -big5). -small - Divide word table sizes by 4 from base of 256K (i.e. use 64K). -chamb<num> - Set decompression chamber size to <num> MB.
Default 32 -RSDTF<num> - Set maximum characters in description & title fields in .results to <num>. Default 256. -RSTAG<num> - Set number of bytes to reserve for tags in .results to num. Default 0. -RSTXT<num> - Set maximum characters in summarisable text per doc in .results to <num>. Default 50000, maximum of 10000000. -W<num> - Index-writing window will be <num> MB (Larger windows mean faster indexing at the expense of using more RAM). Default for a 64bit system is 2800 -MWIPD<num>- Maximum words indexed per document (excluding anchors). By default all words are indexed -maxdocs<num>- Maximum no. of documents to index. Others are ignored. -mdsfml<n> - Set the number of bytes used for MetaData Summary Field Maximum Lengths. Fields larger than this number will be truncated. Default is 2048. -lock_string_mod_mode=[legacy|raw] - Sets how padre should modify the lockstring before it is stored, 'legacy' mode which removes some characters, replaces unquoted commas into new lines and removes consecutive new lines. 'raw' mode stores the lock string as is up to the first null. -99% - Limit on how full the word hash table can get. H. Special indexing modes. -duplicate_urls=flag|ignore (Default is flag.) - Documents whose URL checksum is identical to that of another document are normally flagged and suppressed from results. -urlchecksums=case_sensitive|case_insensitive (Default is case_insensitive). -paidads - If set, documents known to contain paid ads will be flagged specially (with the DOC_HAS_PAID_ADS flag). -doc_feature_regex=<Regex> - Documents matching the supplied pattern will be flagged as DOC_MATCHES_REGEX. The presence or absence of this feature can be used in the ranking function, controlled by cool29 and cool30. -iolap - Overlap reading of bundles with processing them. -utf8input - Assume all input files whose charset is not specified are UTF-8 encoded. (Default is WINDOWS-1252.) 
-isoinput - Assume all input files whose charset is not specified are ISO_8859-1 encoded. -force_iso - Forcibly assume all input files are ISO_8859-1 encoded. -URLP<str> - When storing document URLs, prepend <str>. (This is only used if the document does not indicate its own URL with a BASE HREF element, such as in local collections) -lmd - HTTP LastModified date takes priority over metadata dates. -lmd_never - Completely ignore HTTP LastModified dates. -ignore_link_rel_canonical - Ignore canonical URL declarations in HTML link elements. -ignore_noindex - Ignore robots noindex meta element. -DT<str> - Interpret <str> as start of new doc within bundle. (Not a regular expression). (note that there is a separate mechanism for XML). -annie[<exec>] - After normal indexing is complete, attempt to build an annotation index (annie) and a spelling suggestion file. Default executables are annie-a and build-spelling-index from whence padre-iw was run. -speller[<exec>] - Allows the explicit specification of a spelling_index builder to run after annie-a. -spelleroff - Turns off spelling-index building even if annie-a runs. -spelling_threshold<i> - Annotations with fewer than i occurrences will not be considered as spelling suggestions. (dflt 1) -bigweb - Space saving option for bundled large crawl indexes. Roughly equivalent to: -nomdsf -big8 -MWIPD2000 -W6000 -SORTSIG16 -nep_action=2 -nep_limit=2 -nep_cachebits=20 -chamb64 -RSTXT2000 -mule128 -noaltanx -nosrcanx -nometa -quiet * A shorter average wordlength is assumed. * You can add e.g. -Axxx.com to cut anchor processing time. * (Don't forget to make dupredrex.txt in index directory.) I. Miscellaneous options. -O<name> - <name> is the name of this organisation. -T<path> - Specify a large temporary filespace for use by the indexer. -redis_host=<str> - Hostname/IP of a Redis server where progress status should be written. -redis_port=<i> - Port of the Redis server. Default is 6379. S. Security options.
-security_level=<i> - Any non-zero value requires every document to have at least one lock. If set to 1, documents without locks will be excluded; if set to greater than 1, indexing will stop. -security_mindocs=<i> - There must be at least this number of docs with at least one lock. *** See also url_exclusion options in Section B above.
Purpose: To merge a list of PADRE indexes into a single such index. Usage: padre-mi outstem instem instem ... [-overwrite] [-cleargscopes] -overwrite overrides protection against destroying existing outstem. -cleargscopes clears all set gscopes from the resulting index. Make a merged index (outstem) from the list of at least two input indexes. This version assumes that input indexes have exactly the same format, i.e. that the index format strings are the same and that they have identical numbers of gscope bits, numerical metadata fields and so on. Future versions may check this compatibility, but currently exact compatibility is assumed. All manner of pestilence may descend upon you if you use padre-mi on incompatible indexes. You have been warned :-)
Purpose: To set up a query-independent-evidence file for use in query processing. Usage: padre-qi index_stem file_of_url_patterns dflt_score [profile_name] [-verbose] - if a profile name is given, qiefile will be stem.qie_profile Each URL in the index is matched against the patterns, in the order in which they are listed in the pattern file. Once a match is found, matching ceases for that URL. This behaviour can be exploited to apply a general pattern (later in the file) if no more specific pattern (earlier in the file) matches. To achieve exact matching use ^ (matches start of URL) and $ (matches end of URL). Lines in the patterns file consist of: <qie-score> <url-pattern> qie-score - a floating point number (assumed normalised to the range 0-1), specifying the qie score to be applied. url-pattern - a perl5 regular expression to be matched against name strings in the .urls file (usually URLs). Example:
0.25 ^(https://)?[^/]*nsw.gov.au/
1.0 ^(https://)?[^/]*wa.gov.au/
0.25 ^(https://)?[^/]*sa.gov.au/
0.25 ^(https://)?[^/]*nt.gov.au/
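The first-match-wins behaviour described for padre-qi can be sketched in a few lines of Python. This is only an illustration of the matching order and the default score, not padre-qi's implementation: the function names `parse_patterns` and `qie_score` are hypothetical, the sample pattern text is made up for the example, and Python's `re` module stands in for perl5 regular expressions (the two agree for simple patterns like these).

```python
import re


def parse_patterns(text):
    """Parse '<qie-score> <url-pattern>' lines into (score, regex) pairs,
    preserving file order (hypothetical helper, for illustration only)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        score, pattern = line.split(None, 1)
        rules.append((float(score), re.compile(pattern)))
    return rules


def qie_score(url, rules, default_score):
    """Return the score of the first pattern that matches the URL.
    Matching stops at the first hit; the default applies if nothing matches."""
    for score, regex in rules:
        if regex.search(url):
            return score
    return default_score


# A specific pattern first, then a more general fallback later in the file.
PATTERNS = """\
1.0 ^(https://)?[^/]*wa.gov.au/
0.25 ^(https://)?[^/]*gov.au/
"""
RULES = parse_patterns(PATTERNS)

if __name__ == "__main__":
    for url in ("https://www.wa.gov.au/about",
                "https://www.nsw.gov.au/",
                "https://example.com/"):
        print(url, qie_score(url, RULES, 0.5))
```

Because the specific `wa.gov.au` pattern is listed before the general `gov.au` one, a Western Australian URL takes the higher score, while other government URLs fall through to the general pattern and everything else receives the default.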
Purpose: To generate query suggestions given an index and a partially typed query. Usage: padre-qs -v | padre-qs stem=<stem>|collection=<collname> partial_query=<partial_query> [alpha=<f>] [show=<d>] [fmt=xml|json|json++] [callback=foo] [sort=0|1|2|3] [profile=<profile>] [debug=0|1|2|3] e.g. padre-qs stem=/opt/funnelback/data/abc/live/idx/index partial_query=kevi alpha=0.5 show=10 Note: The functionality is implemented by a dynamic library which is usually called directly. - sort=0 (by weight), 1 (by length), 2 (in alphabetic order), 3 (by weighted combo of weight and length). - fmt=json => simple JSON array of suggestion strings; =json++ => full JSON object with all fields shown. - callback=foo => In JSON or JSON++ output, wraps the response with the supplied callback (for JSONP). - show=<d> => how many suggestions to show. - alpha=<f> => if sort=3, score = alpha * weight + (1 - alpha) * length_score.
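The sort=3 formula, score = alpha * weight + (1 - alpha) * length_score, can be illustrated with a small Python sketch. The usage message does not say how length_score is derived or in which direction results are ordered, so the sketch takes length_score as a given input and assumes higher scores come first; the function names and the sample suggestions are hypothetical.

```python
def combined_score(weight, length_score, alpha):
    """Weighted combination of weight and length score, as in sort=3."""
    return alpha * weight + (1 - alpha) * length_score


def rank_suggestions(suggestions, alpha, show):
    """Order (text, weight, length_score) tuples by combined score,
    highest first (an assumption), and keep the first `show` texts."""
    ordered = sorted(
        suggestions,
        key=lambda s: combined_score(s[1], s[2], alpha),
        reverse=True,
    )
    return [text for text, _, _ in ordered[:show]]


# Made-up sample data: (suggestion, weight, length_score).
SAMPLE = [
    ("kevin", 0.9, 0.2),
    ("kevin rudd", 0.4, 0.9),
]

if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, rank_suggestions(SAMPLE, alpha, show=10))
```

With alpha=1 only the weight counts and with alpha=0 only the length score counts; values in between trade the two off.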
Usage: padre-query-parser -query=[Query to canon] Writes a mostly canonicalised version of the query to standard output.
Purpose: Generate a relevance-feedback query given a list of relevant documents in a collection. Powers the Explore feature. Usage: padre-rf -v | padre-rf -idx_stem=<index_stem> [-exp=<7..50>] [-comp=<comp_num> -dox=<docnum_list> | -url=<url>] ... Details of available options: R. -collection=<X> - The name of a collection, either meta or primary. R. -script=<S> - Name of the CGI script to which padre-rf.cgi should redirect. (dflt "(null)") R. -idx_stem=<Y> - The index stem for this collection, either meta or primary. [Not CGI] R. -exp=<I> - Maximum complexity of generated query (no. of words). (range 7 - 50) (dflt 10) R. -deb_rf=<I> - Activate debugging output. Higher values give more verbose output. (range 0 - 10) (dflt 0) R. -comp=<I> - Component number within a meta collection. (range 0 - unlimited) (dflt 0) R. -dox=<D> - Comma separated list of document numbers within current component. R. -url=<E> - URL of document to be included in generation of RF query.
Purpose: Display the contents of a padre index file in readable format. Usage: padre-show <padre_index_file> -- if poss. displays contents of index file in text form. e.g. padre-show index.urls
Purpose: Create a skip block index from a regular padre index. Usage: padre-sk <stem> <skip> Output will be in <stem>.idx_skip and <stem>.dct_skip <stem> String: the index stem to use <skip> Integer: the minimum number of postings between each skip block
Purpose: Display all or part of the content of the .results file (title, URL, description metadata, and candidate sentences for summary generation). Usage: padre-sr stem|results_file [-titleonly] [-unco] [-ifff|-embedded|-text|-html|-textsigs] [starting_doc|starting_url] [urlpat=<regex>] [num_docs_to_show] padre-sr sequentially reads the .results file and outputs all or part of the file to stdout in a choice of formats: . html (default) . embedded (incomplete html suitable for embedding in another html document) . text . textsigs (generate stem.textsigs file suitable for neardup detection.) If -titleonly is given, only the document titles are output. (not applic. to textsigs) Use -unco to specify that the input doc. is in old uncompressed format. If a starting document number or URL is given, output commences only when that point in the file is reached. Output continues to the end of the file unless num_docs_to_show is given. If urlpat= is given, only documents whose URL matches the pattern are considered for display. Case-sensitive unless specified otherwise in the pattern. Don't include 'http://' in the pattern.
Purpose: Process queries using a PADRE index. Usage: padre-sw <filestem> [option ...] <filestem> - the common prefix of all the index files, or possibly the name of a Funnelback collection. Available options: A. Getting information about PADRE and its operation. -V - Print version number and exit. -ixform - Print index format version expected by this query processor and exit. -help - Print this list. Notation: --------- <B> - Boolean. Will be interpreted as TRUE unless arg is 'off', 'false' or '0' (case insensitive). <I> - Integer. eg. 7 or 100000. Whole number in specified range. <F> - Number. e.g. 1 or 0.537 or 99.5. Some inputs of this type are constrained to lie within [0.0 - 1.0]. <C> - Character. e.g. a or A or : A single character. <S> - String. eg. abc or "a b c". Quotes needed around the key and value if spaces or punctuation included - for example: "-optionname=a b c". <K> - Key/value pair. These options take a key and a value, for example, -optionname.KEY=VALUE I. Contextual navigation options: -categorise_clusters=<B> - Whether contextual navigation suggestions are grouped by type. -cnto=<F> - Set contextual navigation time-out to s seconds (s floating point). processing may be omitted entirely if elapsed time for a query already exceeds s seconds. (dflt 1.0). (range 0.000000 - unlimited) -contextual_navigation=<B> - Whether or not to activate the contextual navigation system. -contextual_navigation_fields=<S> - String s lists the metadata fields, separated by commas surrounded by square brackets, to scan for contextual navigation suggestions. (dflt '[c,t]'). Note that scanning of document text can be suppressed by including a minus, for example '[-,c,t]'. -max_phrase_length=<I> - Maximum length (in words) of contextual navigation suggestions. (range 3 - 7) -max_phrases=<I> - After this number of candidate phrases have been checked, contextual navigation processing will stop. 
(range 0 - unlimited) -max_results_to_examine=<I> - Maximum number of search results to scan for contextual navigation suggestions. (range 0 - 200) -site_max_clusters=<I> - Maximum number of site clusters to present in contextual navigation. (range 0 - unlimited) -topic_max_clusters=<I> - Maximum number of topic clusters to present in contextual navigation. (range 0 - unlimited) -type_max_clusters=<I> - Maximum number of type clusters to present in contextual navigation. (range 0 - unlimited) J. Geospatial options: -geospatial_ranges=<B> - Calculate geospatial distance from origin and bounding box ranges when geospatial data is configured and available. -maxdist=<F> - Exclude results not within <f> km of origin. (range 0.000000 - unlimited) -origin=<S> - <lat,long> Set origin to lat, long (floating point degrees). K. Informational options: -canq=<B> - Write reordered queries to log. (dflt off) -countIndexedTerms=<S> - Metadata fields to have their indexed terms counted in the result set (DAAT only). Unlike rmcf multiple term occurrences in a single document are counted e.g. if metadata 'author' has 'Bob Ada|Bob|Bob' in two documents the resulting counts would be 'Ada': 2, 'Bob': 6. As this counts indexed terms long terms may be truncated depending on the indexer options used. To count fields 'a' and 'c', set this to '[a,c]'. [Not CGI] -countUniqueByGroup=<S> - Counts the number of unique metadata values grouped by another metadata. Syntax: -countUniqueByGroup=[classToCount]:[groupBy],[classToCount]:[groupBy]. Example: -countUniqueByGroup=[author]:[project] would show us the number of authors contributing to each project. classToCount is a regex and will be expanded to all matching metadata classes e.g. [author.*]:[project] might expand to -countUniqueByGroup=[author]:[project],[authors]:[project]. [Not CGI] -count_dates=<S> - Report facet counts for dates such as 'today', 'last week', 'this year'. Note that date categories may overlap. 
Only value currently supported is 'd'. -count_urls=<I> - Display counts of results grouped by the URL path (Up to depth i). If <I> is 0, then the default value is used. Dflt 5. If <I> is not present count urls is turned off. [Not CGI] -docsPerColl=<B> - Show the number of documents each collection contributed to the result set. -rmcf=<S> - Metadata fields to have their words counted in result sets (fields representing facets). If metadata 'author' has 'Bob Ada|Bob|Bob' in two documents the counts would be 'Bob Ada': 2, 'Bob': 2. To count fields 'a' and 'c', set this to '[a,c]'. -rmrf=<S> - Numerical and geospatial fields listed will have their ranges calculated in result sets. To see the ranges of field 'height' and the bounding box geospatial field 'X' set this to '[height,X]'. -showtimes=<B> - Print elapsed times for each stage of query processing. -sum=<S> - The sum of a numeric metadata in result set. Syntax: -sum=[sumOn],[sumOn]. Example: -sum=[size] would sum all values of numeric metadata 'size' in the result set. Note sumOn may be a regex, which is expanded to all matching metadata classes e.g. -sum=[size.*] might expand to -sum=[sizeInKb],[sizeLoc]. [Not CGI] -sumByGroup=<S> - The sum of a numeric metadata by a group. Syntax: -sumByGroup=[sumOn]:[groupBy],[sumOn]:[groupBy]. Example: -sumByGroup=[size]:[project] would sum all values of numeric metadata 'size' grouped by 'project' giving output project 'Foo' has size '128', project 'Bar' has size '12'. Note sumOn may be a regex, which is expanded to all matching metadata classes e.g. -sumByGroup=[size.*]:[project] might expand to -sumByGroup=[sizeInKb]:[project],[sizeLoc]:[project]. [Not CGI] L. Logging options: -ip_to_log=<S> - What form of ip to include in log files: (nothing|ip|ip_hash|remote_user). -log=<B> - Write query log entries (dflt on). [Not CGI] -qlog_file=<S> - If writing query log entries, write them to <FILE>.
[Not CGI] -username=<S> - A string identifying the current user to be used in padre's query log. M. Miscellaneous options: -countgbits=<S> - s is either "all" or a comma-separated list of gscope bitnumbers for which counts are needed. (Bits numbered from zero.) -exit_on_bad_component=<B> - Fail when a component has an incompatible index relative to the first (rather than skip). -flock=<B> - Use flock when locking the query logfile. If set to no, lockf is used instead. Default on Solaris is 'no', all other systems 'yes'. -mat=<I> - Set matchset size to n million (dflt 24). Only need to increase on very large collections. (range 0 - 2147) [Not CGI] -ndt=<B> - Don't do tests on docs, e.g. phantom, zombie, *scope, binary, expired. [Not CGI] -unbuf=<B> - Don't buffer the standard output stream. In some specific cases, setting this to 'no' can improve performance. -view=<S> - The collection view the perform the query against when in CGI mode. Normally 'live' (default), 'offline' or 'snapshot###'. N. Presentation options: -EORDER=<I> - Specify presentation order of query biased summary excerpts. 0: natural order in doc. 1: sorted by score. (dflt 0) (range 0 - 1) -MBL=<I> - Set buffer length per displayed metadata field to n bytes (dflt 250 bytes). Warning: setting very large values will increase query processor memory demands and may cause problems. (range 1 - unlimited) -SBL=<I> - Set summary buffer length to n bytes. (dflt 250 bytes) (range 1 - unlimited) -SF=<S> - Metadata fields to include in summaries. (if applicable). To include fields `author` and `d` set this to `[author,d]`. This option also supports regex to include all metadata classes set this to `[.\*]` to include fields prefixed with `Fun` and metadata class `author` set `[Fun.*,author]`. -SHLM=<I> - Select highlighting method within snippets in XML. 0 - No highlighting ; 1 - HTML strong tags ; 2 - Show highlighting regexp. 
and unhighlighted summary [dflt]; 5 - Use HTML strong tags but remove accents from summary before highlighting, provided query was not accented. (range 0 - 7) -SM=<S> - Summary mode. Possible values are 'both' (or 'def') - Display description or query-bias summary and metadata fields listed in the 'SF' option; 'snip' - display a generated snippet; 'meta' - display metadata fields listed in 'SF'.; 'qb' - display a query-biased summary; 'auto' - Print metadata codes if specified in user query.; 'off' - Turn off all summaries. -SQE=<I> - Set max no. of query biased summary excerpts to n (dflt 3). (range 1 - 10000) -all_summary_text=<B> - Is text used for generating summaries required in the result -countUniqueByGroupSensitive=<B> - Treat group names and metadata items case sensitively (default no). [Not CGI] -ctest_mode=<I> - Controls behaviour of padre-sw when -ctest is used. 0: no internal evaluation; 1 - internal evaluation only. Output is brief plain text report of measures; 2 - internal evaluation only. Output in plain text with QBQ output followed by measures; 3 - internal evaluation plus normal CTOUT output in XML (with measures presented as comments) (range 0 - 3) -explain=<B> - Explain rankings by showing score components. (Note that -explain=on turns off result set diversification). -explore=<I> - Show 'explore' links against results. The value specifies how many terms to include in the expanded query. (range 7 - 50) -gscoperesult=<S> - Specifies the bit number that results will be set to in -res gscope or -res docnums modes (dflt 1). -mdsfhl=<B> - Are query terms only highlighted in MDSF metadata summaries -num_ranks=<I> - Limit number of results to n (min = 0, dflt = 10). (range 0 - unlimited) -num_tiers=<I> - Limit number of result list tiers to n (min = 0 (no ,limit), max = 50, dflt no limit) (range 0 - 50) -qieval=<F> - Set the value presented for query independent evidence when using the qiecfg result format. (dflt 0.5). 
(range 0.000000 - 1.000000) -qwhl=<S> - Determines which parts of a search result are highlighted. S - snippet, M - metadata, U - URL, T - title. E.g. -qwhl=MUT -res=<S> - Set result format. Possible values are: `trec`, `web`, `xml`, `urls`, `qiez`, `qieo`, `gscope`, `docnums`, `ctest`, `qiecfg` or `flcfg`. -results_in_facet_categories=<I> - Include the specified number of pre-computed search results under the rmc count element for metadata facet categories. (range 0 - 100) -rmc_maxperfield=<I> - Set maximum number of RMC items to display per field at n (dflt 100). (range 0 - unlimited) -rmc_sensitive=<B> - Treat facet categories (RMC items) case sensitively (default no). [Not CGI] -show_qsyntax_tree=<B> - Include an SVG representation of the query-as-processed in output. -start_rank=<I> - Present results starting from n (dflt 1). (range 1 - unlimited) -sumByGroupSensitive=<B> - Treat group names case sensitively (default no). [Not CGI] -tierbars=<B> - Display tierbars in result list output (XML and HTML). When turned on (for all -res modes) and -sort is used, results will be first sorted by tier then by the sorting mode, otherwise if -sortall is used then all results will be sorted regardless of tier. -translucent_DLS_fields=<S> - Metadata fields which are translucent. Translucent fields are visible on documents which the user can not see. To include fields 'a' and 'd' set this to '[a,d]'. If collapsing is enabled and the collapsing signature contains only fields defined here than collapsing will be permitted on documents the user can not see. [Not CGI] O. Query interpretation options: -STOP=<S> - Use the stoplist specified in <file> (one word per line) [Not CGI] -binary=<I> - Determines whether or not binary documents are returned in the results. 0 - show all documents; 1 - show only binary documents; 2 - show only non-binary documents. (range 0 - 3) -clive=<S> - Dynamic metacollections. Specifies a component name within a .sdinfo file(s) to make active. 
Can be set multiple times to enable multiple collections. -daat_termination_type=<I> - Selects how DAAT early exit is determined. 0 - try for d results with every metafield and every component; 1 - try for d results over every component but not necessarily every metafield; 2 - stop as soon as d results are obtained. (d is the parameter to -daat.) (range 0 - 2) -daat_timeout=<F> - Impose a soft timeout (in seconds) on the time taken by the DAAT machinery for one query. (range 0.000000 - 3600.000000) [Not CGI] -dont_estimate_full_matches=<B> - In DAAT mode don't guess the number of full matches when the DAAT depth did not let us process an entire postings list. -events=<B> - Must be set if event search is to be used. -fmo=<B> - Present full matches only. -lang=<S> - If a 2-character language code is specified by this means, then stemmers etc specific to that language will be used, IF AVAILABLE. It is also permissible to use a 5-character code like en_GB, but padre behaviour will be the same as for en. Specifying lang also makes title and metadata sorting of results locale-specific, however support for this on Windows platforms is limited and problematic. -loose=<I> - Phrase looseness in words (min = 0, dflt = 0). (range 0 - unlimited) -max_qbatch=<I> - Terminate batch query processing after the specified number of queries have been processed. (range 1 - unlimited) -max_terms=<I> - Truncate queries after the specified number of terms. If the query is reordered, truncation will occur after reordering. (range 1 - unlimited) -min_truncated_len=<I> - The text part of a query term with a right truncation operator must have at least this length. E.g. if min_truncated_len were 4 funnel* would be accepted but fun* would be processed as fun. (range 0 - 20) [Not CGI] -noexpired=<B> - Exclude expired docs from results. (Nullified by -zom) [Not CGI] -nulqok=<B> - An empty query submitted via CGI will be processed as a null query. The system query must be empty as well.
(dflt is to ignore the request). [Not CGI] -phrase_prox_word_limit=<I> - Phrase or proximity terms with more than this number of words will be shortened by deleting words from the right. E.g. If this limit were 4 then `to be or not to be` would be processed as `to be or not` (range 1 - unlimited) [Not CGI] -prox=<I> - Proximity limit in words (min = 0, dflt = 15). (range 0 - unlimited) -qsup=<S> - When blending queries, determines sources of supplementary queries to be tried, with corresponding weights assigned to each source (ranging from 0 to 1). No spaces. 'off' may be specified to disable supplementary queries. E.g. -qsup=SPEL/0.9+USUK/0.4+SYNS/0.1+LANG/0.1. Available sources are: SPEL (spelling suggestions); USUK (table of spelling differences between US and UK English); SYNS (synonyms as defined by the blending.cfg file); LANG (experimental German decompounding) -query_reorder=<B> - Reorder terms in query so that the most discriminating (least common) appear first. Often coupled with -max_terms= -ras=<I> - Remove any stopwords from the query. Possible values: 0 - remove none; 1 - remove dynamically depending on the query; 2 - remove all stopwords (dflt 1). (range 0 - 2) -service_volume=<S> - Either 'high' or 'low'. A convenience setting to increase or reduce allowable query complexity and timeouts according to service volumes -- large or small indexes, high or low query volumes. [Not CGI] -stem=<I> - Controls stemming of queries. 0 - do not stem (dflt); 1 - do not stem (replaces obsolete option); 2 - Stem all query words (light - English/French plural/singular only); 3 - Stem all query words (heavier). (range 0 - 3) -stem_lconly=<B> - When stemming, stem only lowercase query words (to avoid stemming proper names and acronyms). -strip_invalid_utf8=<B> - Normally, invalid UTF-8 characters are removed during indexing. If this hasn't happened, this option allows them to be removed from result packets.
-synonyms=<B> - If set, the query processor will expand queries using thesaurus in synonyms.cfg. -truncation_allowed=<I> - Enables the use of the * operator, binary valued, it is only valid in use with an option that disables DAAT mode such as, -service_volume='lo' or -daat=0. When applied all contexts are available such as: *:funnelback, funnel*, *back, and *:*elba*. (range 0 - 3) [Not CGI] -wildcard_thresh=<I> - If the postings list for a term is longer than the specified value (in MB) it will be treated as a wildcard. (range 0 - unlimited) -zom=<B> - Include docs in results even if noindex or killed. P. Query source options: -ctest=<S> - Read a batch of queries from testfile (in C_TEST format). Sets output format to RM_CTEST, but that may be overridden. (See es.csiro.au/C-TEST/ for information about C-TEST.) [Not CGI] -s=<S> - System-generated query inserted behind the scenes by a form or front-end. Q. Quicklinks options: -QL=<I> - Activate QuickLinks facility for default pages down to the specified level. 0 - off; 1 - server root pages; 2 - next level down. (range 0 - 5) -QL_rank=<I> - If QuickLinks capability is active, show quick links for search results down to the specified rank. (range 1 - unlimited) -QL_rank_is_relative=<B> - If true, the value of QL_rank will be interpreted relative to the start_rank. E.g. if QL_rank=2, the first two results on each page may show QuickLinks. R. Ranking options: -SameSiteSuppressionExponent=<F> - Same site suppression penalty exponent (dflt 0.5, recommended range 0.2 - 0.7). (range 0.000000 - unlimited) -SameSiteSuppressionOffset=<I> - Number of additional documents from a site beyond the first that are allowed their full score before applying a same site suppression penalty (dflt 0) (range 0 - 1000) -absscores=<B> - Report content scores as % of max possible Okapi score (Intended for use with -vsimple=on). -anniemode=<I> - Control the use of annotation indexes. 
0 - do not use annotation indexes ; 1 - Process queries using annotation indexes only; 2 - Process queries using annotation indexes, falling back to normal indexes if insufficient results. (Most query op.s stripped.) 3 - Process queries using both annotation and normal indexes (Most operators stripped from queries.). Default 0. (range 0 - 3) -b=<F> - Set Okapi b to f (dflt 0.75) (range 0.000000 - unlimited) -cgscope1=<S> - Documents matching this gscope expression (reverse Polish) can be upweighted with -cool.68. Those not matching can be upweighted with -cool.70. -cgscope2=<S> - Documents matching this gscope expression (reverse Polish) can be upweighted with -cool.69. Those not matching can be upweighted with -cool.71. -cool=<B> - Whether to use topic distillation scoring (cool and cooler). Dflt on. -cool.<K=V> - cool.N=V Set a value for the Nth tune parameter. Possible values for N are: 0 | content: Content weight 1 | onlink: On-site link weight 2 | offlink: Off-site link weight 3 | urllen: URL length weight 4 | qie: External evidence (QIE) weight 5 | date_proximity: Date proximity weight 6 | urltype: URL attractiveness 7 | annie: Annotation weight (ANNIE) 8 | domain_weight: Domain weight 9 | geoprox: Proximity to origin 10 | nonbin: Non-binariness 11 | no_ads: Advertisements 12 | imp_phrase: Implicit phrase matching 13 | consistency: Consistency of evidence 14 | log_annie: Logarithm of annotation weight 15 | anlog_annie: Absolute-normalised logarithm of annotation weight 16 | annie_rank: Annotation rank 17 | BM25F: Field-weighted Okapi score 18 | an_okapi: Absolute-normalised Okapi score. 
19 | BM25F_rank: Field-weighted Okapi rank 20 | mainhosts: Main hosts bias 21 | comp_wt: Data source component weighting 22 | document_number: Document number in the index 23 | host_incoming_link_score: Host incoming link score 24 | host_click_score: Host click score 25 | host_linking_hosts_score: Host linking hosts score 26 | host_linked_hosts_score: Host linked host score 27 | host_rank_in_crawl_order_score: Host rank in crawl 28 | host_domain_shallowness_score: Domain shallowness 29 | doc_matches_regex: Document URL matches regex pattern 30 | doc_does_not_match_regex: Document URL does not match regex pattern 31 | titleWords: Normalized title words 32 | contentWords: Normalized content words 33 | compressionFactor: Normalized compressibility of document text 34 | entropy: Normalized document entropy 35 | stopwordFraction: Normalized stop word fraction 36 | stopwordCover: Normalized stop word cover 37 | averageTermLen: Normalized average term length 38 | distinctWords: Normalized distinct words 39 | maxFreq: Normalized maximum term frequency 40 | titleWords_neg: Negative normalized title words 41 | contentWords_neg: Negative normalized content words 42 | compressionFactor_neg: Negative normalized compressibility of document text 43 | entropy_neg: Negative normalized document entropy 44 | stopwordFraction_neg: Negative normalized stop word fraction 45 | stopwordCover_neg: Negative normalized stop word cover 46 | averageTermLen_neg: Negative normalized average term length 47 | distinctWords_neg: Negative normalized distinct words 48 | maxFreq_neg: Negative normalized maximum term frequency 49 | titleWords_abs: Absolute title words 50 | contentWords_abs: Absolute content words 51 | compressionFactor_abs: Absolute compressibility of document text 52 | entropy_abs: Absolute document entropy 53 | stopwordFraction_abs: Absolute stop word fraction 54 | stopwordCover_abs: Absolute stop word cover 55 | averageTermLen_abs: Absolute average term length 56 | 
distinctWords_abs: Absolute distinct words 57 | maxFreq_abs: Absolute maximum term frequency 58 | titleWords_abs_neg: Negative absolute title words 59 | contentWords_abs_neg: Negative absolute content words 60 | compressionFactor_abs_neg: Negative absolute compressibility of document text 61 | entropy_abs_neg: Negative absolute document entropy 62 | stopwordFraction_abs_neg: Negative absolute stop word fraction 63 | stopwordCover_abs_neg: Negative absolute stop word cover 64 | averageTermLen_abs_neg: Negative absolute average term length 65 | distinctWords_abs_neg: Negative absolute distinct words 66 | maxFreq_abs_neg: Negative absolute maximum term frequency 67 | lexical_span_score: Lexical span 68 | doc_matches_cgscope1: Document matches `cgscope1` 69 | doc_matches_cgscope2: Document matches `cgscope2` 70 | doc_does_not_match_cgscope1: Document does not match `cgscope1` 71 | doc_does_not_match_cgscope2: Document does not match `cgscope2` 72 | raw_annie: Raw ANNIE -daat=<I> - Specifies the maximum number of full matches for Document-At-A-Time processing. If set to 0, Term-At-A-Time is used instead (dflt 5000). (range 0 - 10000000) -diversity_rank_limit=<I> - Diversification won't alter ranking beyond rank n (default 200, min 10). (range 10 - unlimited) -facet_url_prefix=<S> - Present only results whose URL is prefixed by the given URL. Note that the scheme and hostname part are case insensitive, for URI with scheme smb:// the entire prefix is case insensitive. The behaviour of this option may change in the future to suit facets, this should not be used outside of faceted navigation. [Not CGI] -gscope1=<S> - Present only results whose gscope bits match reverse Polish expression `e` (Bits numbered from zero). If set to `off`, disable any previous expression. -k1=<F> - Set Okapi K1 to <f>. (dflt 2.0) (range 0.000000 - unlimited) -kmod=<I> - Select special scoring function i for special fields. 0 = normal, 1 = AF1 (dflt 1). 
(range 0 - 1) -lscope=<S> - Present only results whose URL matches a sort-of left-anchored pattern. -lscorrect=<B> - Whether to correct link scores across meta collection components (default yes). -main_homepage_factor=<F> - Penalise score of the homepage of a single-entity-controlled domain to prevent over representation in results sets. E.g. www.anu.edu.au/ in an index of ANU. (dflt 0.90) (range 0.000000 - 1.000000) -meta_suppression_field=<S> - If same_meta_suppression is activated, the specified metadata field will be the field to which it applies. Only one metadata field can be treated in this way. -near_dup_factor=<F> - The query processor will penalise a result which is a near-duplicate of a previous result by multiplying by the factor specified. The penalty stiffens with more repetition. (dflt 0.5) (range 0.000000 - 1.000000) -promote_urls=<S> - Insert the specified URLs at or near the top of the results list for a query. Value is a space separated list of URLs. URLs must correspond to those recorded by padre-iw. (dflt Inactive) -quanta=<I> - Set the number of possible score quantisation levels for each cool variable. In general, a high number should give more accurate ranking but may slow query processing. (range 10 - 100000) -rank_limit=<I> - Limit highest rank requestable to n (dflt 1,000,000,000). (range 10 - unlimited) -ranking_profile=<I> - Choose a profile of settings for the ranking function. 0 - current default; 1 - Standard BM25; 2 - Traditional (pre-12.0) Funnelback. Setting a profile does not override explicit settings. (range 0 - 100) [Not CGI] -recency_decay_vals=<S> - <z,w,m,y,d,c,m> - Define how recency scores decay with time. z w, m, y, d, c, m are floats in the range 0 - 1, which specify the recency score assigned to documents, 0 days, 1 wk, 1 mth, 1 yr, 1 dec, 1 cen, 1 mill. old. (dflt 1.0,0.75,0.5,0.25,0.025,0.0025) Recency scores between key values linearly interpolated. Past the millennium, recency scores are 1/daysold. 
-reference_date=<S> - If specified, recency is based on this date rather than that of most recent doc. Format is <yyyymmdd>, or 'today'. -remove_urls=<S> - Prevent the specified URLs from appearing in the results for a query. Value is a space separated list of URLs. URLs must correspond to those recorded by padre-iw. (dflt Inactive) -sco=<S> - <n>[<classes>] Set doc scoring mode to n, using the classes specified. Most common values: 0 - score using doc text only ; 1 - no scoring. Produce an unordered set of results ; 2 - score using anchortext and URLs as well, upweight titles (or whatever fields are configured with -specf). For example to automatically look in fields 'u' and 'v' for the query terms set -sco=2[u,v] -scope=<S> - Present only results whose URL satisfies the include/exclude scopes included in list (comma separated). e.g. -scope=anu.edu.au,-anu.edu.au/archives -sort=<S> - Sort top results by <string>. Possible values: 'date', 'adate' (ascending date), 'title', 'dtitle' (descending title), 'size' (file size), 'dsize' (descending filesize), 'url', 'durl' (descending url), 'coll' (collection name, then score), 'dcoll' (descending collection name, then score), 'meta<f>' (by metadata field f, then score), 'dmeta<f>' (descending metadata field f, then score), 'shuffle' (random to avoid bias), 'collapse_count' (to order by the number of collapsed documents, with the largest collapsed set first), 'acollapse_count' (with the largest collapsed set last), 'prox' (for geo search: Sort top results by proximity to origin), 'dprox' (for geo search: Sort top results by descending proximity to origin), 'score_ignoring_tiers' (descending score, ignoring any tiers. Only useful with sortall.) (dflt is case-insensitive for title and meta). '-sort=' turns off sorting. -sort_sensitive=<B> - Use case-sensitive sorting when sorting results by title or metadata strings. -sortall=<B> - Include partial matches in the resorting performed by -sort. 
-specf=<S> - Fields listed in string s, as a list of comma separated fields surrounded by square brackets, will be scored specially and added to query when using the -sco=2 mode (dflt '[k,K]'). -sss_defeat_pattern=<S> - URLs matching the specified pattern (currently a simple string match) will not be subject to samesite suppression. -static_cool_exponent=<F> - Control the extent to which static scores are attenuated with length of query. 0 => no attenuation; 1 => max attenuation. Attenuation by len ** -f. (range 0.000000 - 1.000000) -unknown_daysold=<I> - A doc with unknown date is assumed to be d days old (for recency calcs) (dflt 366). (range 0 - unlimited) -use_Paik=<B> - Use the tf.idf scheme proposed by Jiaul Paik at SIGIR 2013 rather than the more conventional BM25 variant. -use_secds=<B> - When working with domain-importance features in ranking, use SECDs if value is on, and raw domain names otherwise. -vsimple=<S> - Very simple ranking. If set to 'on', equivalent to -sco=0 -cool=off -SSS=0 -kmod=0. -weight_only_fields=<S> - Documents will not be retrieved in DAAT mode if they only match unfielded query terms in one or more of the implicit fields listed here. For example, specifying '[K,k]' will stop the query 'Monica Lewinski' matching a document solely because of click data or referring anchortext. -wmeta.<K=V> - wmeta.C=F Set upweighting factors for metadata class scoring. C - metadata class; F - weight to set. (dflt 0.5 for 'k' and 'K', 1 for everything else). -xscope=<S> - Present only results whose URL exactly matches the provided URL (after canonicalization). S. Ranking - Result diversification options: -SSS=<I> - Same site suppression depth: 0 - no suppression (dflt); 2 - hosts and their top level dir's; 10 - org domain (includes sub-domains) e.g. defence.gov.au. (range 0 - 10) -neardup=<F> - Near duplicates in ranking are multiplied by f. Setting f to 1 turns off near-dup detection. 
(range 0.000000 - 1.000000) -repetitiousness_factor=<F> - Penalise a repetitious result by multiplying by the factor specified. (Repetitiousness may involve same-site, same component or repeated metadata.) The penalty stiffens with more repetition. Setting to 1 turns this off. (dflt 1.0) (range 0.000000 - 1.000000) -same_collection_suppression=<F> - While searching a meta-collection, penalise the second result from the same primary collection as a previous result by multiplying by the factor specified. The penalty stiffens with more repetition. Setting to 0 turns this off. (dflt 0) (range 0.000000 - 1.000000) -same_meta_suppression=<F> - Penalise the second result with the same value for a specified metafield as a previous result by multiplying by the factor specified. The penalty stiffens with more repetition. Setting to 0 turns this off. (dflt 0) (range 0.000000 - 1.000000) -title_dup_factor=<F> - The query processor will penalise a result which has exactly the same title as a previous result by multiplying by the factor specified. The penalty stiffens with more repetition. Setting to 1 turns this off. (dflt 0.5) (range 0.000000 - 1.000000) T. Result collapsing options: -collapsing=<B> - Activate collapsing. Collapsing will be based on document content ('$') unless a collapsing_sig value is specified. Note that use of this option will disable result set diversification. -collapsing_SF=<S> - Metadata fields to include in display for collapsed documents (assuming collapsing_num_ranks is non-zero). (dflt no fields). To view metadata fields 'id' and 'a' set this to '[id,a]'. -collapsing_label=<S> - Label to indicate why items have been collapsed. (dflt "which are very similar") -collapsing_num_ranks=<I> - Specify how many collapsed results are to be shown under the uncollapsed ones. (dflt 0) (range 0 - 1000) -collapsing_scoped=<B> - Scope to only documents which have been collapsed on. Default is off. 
-collapsing_sig=<S> - The collapsing_control segment to use when collapsing. E.g. "[a,p]", collapse on author+publisher. The value must correspond to one segment of the indexing.collapse_fields string. (A segment is a comma separated list of fields surrounded by square brackets) (dflt '[$]' (Collapsing on document content.)) U. Security options: -dls_internal_test=<I> - This allows testing of the padre side of the custom document level security mechanism. There is no call out to an external function. The value is interpreted as a combination of bits: 1 bit - dls_internal_test is active/not active; 2 bit - selects whether MINRESULTS mode is used or not. During internal testing, every odd numbered document in the original ranking is arbitrarily treated as inaccessible. (range 0 - unlimited) -ipreject=<S> - `QUERY_LIMIT,WINDOW_SECONDS,UPPER_QUERY_LIMIT` - Use an IP rejector to limit requests from a single machine. Allow `QUERY_LIMIT` queries per `WINDOW_SECONDS`, don't record more than `UPPER_QUERY_LIMIT` queries. [Not CGI] -ldLibraryPath=<S> - Full path to security plugin library [Not CGI] -locking_model=<S> - Name of locking model, either "trim" or "sharepoint". [Not CGI] -no_security=<B> - Disable DLS, available as a command line option. [Not CGI] -secPlugin=<S> - Name of security plugin library [Not CGI] -translucent_DLS=<B> - Enables translucent DLS DAAT only. [Not CGI] -userkeys=<S> - Conduct this search with security keys specified by s. The format is '<collectionName>;key<delim>' where delim is either ',' or new line, spaces are removed for example 'c1;k1 c2;k1,c2;k2' [Not CGI] V. Spelling options: -spelling=<B> - Activate spelling suggestion mechanism. -spelling_alpha=<F> - Set the weighting between 'closeness to the query' and support in the collection for a candidate suggestion. Big alpha, high weight on closeness to the query. 
(dflt 0.7) (range 0.000000 - 1.000000) -spelling_blend_thresh=<F> - Confidence threshold for automatically blending results for a query suggestion with those from the user's original query. (dflt 0.67) (range 0.000000 - 1.000000) -spelling_difflen_thresh=<I> - Don't make suggestions more than i characters longer or shorter than query. (dflt 2) (range 0 - 1000) -spelling_dym_thresh=<F> - Confidence threshold for making a 'Did you mean' suggestion. (dflt 0.5) (range 0.000000 - 1.000000) -spelling_edist_constant=<F> - Don't make suggestions whose edit distance from the query exceeds f + query_length * spelling_edist_proportion. (dflt 1) (range 0.000000 - 1000.000000) -spelling_edist_proportion=<F> - Don't make suggestions whose edit distance from the query exceeds spelling_edist_constant + query_length * f (0<=f<=1). (dflt 0.25) (range 0.000000 - 1.000000) -spelling_fullmatch_trigger_const=<F> - Don't look for suggestions if there are at least f * log10(num docs) full matches. (dflt 30.0) (range 0.000000 - unlimited) -spelling_include_context=<B> - Include the non-corrected part of the query in the suggestion link. (dflt 1) -spelling_min_querylen=<I> - Suggestions not made for queries shorter than this. (dflt 2) (range 1 - 1000) -spelling_wt_thresh=<F> - Sets a threshold that determines if a spelling suggestion is returned. If the generated spelling suggestion weight is less than this, the suggestion is not returned. (dflt 0.01) (range 0.000000 - 100.000000) W. TREC specific options: -trec_runid=<S> - For TREC participation: Each result in TREC format will include this runid. -trec_topic=<I> - For TREC participation: The first query in a batch will get this topic number. Each new query will increase the number by one. 
(range 0 - unlimited) -trecids=<B> - For TREC participation: Each result in TREC format will use the TREC docno rather than a URL
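The edit-distance cut-off described for -spelling_edist_constant and -spelling_edist_proportion is a simple linear function of query length. The sketch below only restates that arithmetic with the documented defaults (1 and 0.25); the function names are hypothetical, and how padre-sw measures query length or edit distance internally is not stated in the usage message.

```python
def max_edit_distance(query_length, edist_constant=1.0, edist_proportion=0.25):
    """Largest edit distance a suggestion may have from the query.

    Mirrors: spelling_edist_constant + query_length * spelling_edist_proportion.
    """
    return edist_constant + query_length * edist_proportion


def suggestion_allowed(edit_distance, query_length,
                       edist_constant=1.0, edist_proportion=0.25):
    """A suggestion is rejected only if its distance exceeds the cut-off."""
    limit = max_edit_distance(query_length, edist_constant, edist_proportion)
    return edit_distance <= limit
```

With the defaults, an 8-character query allows suggestions up to an edit distance of 3.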
Missing required argument -input=<input file> List the top-k most frequent items. Usage: padre-topk -input=<input file> [-capacity=<integer>] [-k=<integer>] Where: * -input: the file of items, where each item is delimited by a new line. * -capacity: the limit on the number of items that will be held in memory at once. * -k: the number of items that will be shown at the end. Efficiently (compared to some other algorithms) estimates the counts of the top-k most frequent items, e.g. for a,b,c,a,b,a the top-2 would be a with a count of 3 followed by b with a count of 2.
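The usage message does not say which algorithm padre-topk uses. As an illustration of how a bounded-memory top-k estimate can work, here is a minimal sketch of the Space-Saving technique, which is an assumption on our part and not necessarily the tool's method. The function name and the default values for k and capacity are hypothetical.

```python
def top_k(items, k=10, capacity=1000):
    """Estimate the k most frequent items while tracking at most `capacity` items."""
    counts = {}
    for item in items:
        if item in counts:
            counts[item] += 1
        elif len(counts) < capacity:
            counts[item] = 1
        else:
            # Table is full: evict the item with the smallest count and let
            # the newcomer inherit that count plus one (may overestimate).
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]
```

When the capacity is at least the number of distinct items, the counts are exact; with a smaller capacity they become upper-bound estimates.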
Purpose: Efficient location of all lines in a sorted text file which match a prefix. Usage: pan-look <prefix> <file name>
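Prefix lookup in sorted data is typically done with a binary search for the first candidate line followed by a forward scan. The sketch below shows that idea on an in-memory list; it is not pan-look's implementation, the function name is hypothetical, and it assumes the lines are sorted in Python's default string order.

```python
from bisect import bisect_left


def look(prefix, sorted_lines):
    """Return all lines in a sorted list that start with `prefix`."""
    # Binary search for the first line that is >= prefix.
    start = bisect_left(sorted_lines, prefix)
    matches = []
    # Matching lines are contiguous, so stop at the first non-match.
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break
        matches.append(line)
    return matches
```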
Purpose: Extract frequently occurring word tuples ('phrases') from a collection. Usage: phrasefinder <stem> [-unco] [-hash_limit=<i>] [-num_to_show=<i>] [-max_tuple=<i>] [-debug] phrasefinder reads the .results file of <stem> and locates candidate word tuples ('phrases') up to a configurable maximum length. Phrases are sequences of up to max_tuple words which are unbroken by anything other than space. Candidates are stored in a hashtable and counted. A limit of hash_limit candidates is stored. Once this is reached, the program exits. (Useful for testing or for limiting execution time and virtual memory requirements.) When processing finishes, the top num_to_show candidates are sorted in descending order of frequency and output in <stem>.phrases, in the same format as the .lex file, with word breaks represented by hyphens.
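The counting scheme described above can be sketched as follows. This is an illustration only: it works on a plain string rather than a .results file, and it assumes that tuples contain at least two words, that text is lowercased, and that hitting the candidate limit simply stops the scan. The function name and default values are hypothetical.

```python
import re
from collections import Counter


def find_phrases(text, max_tuple=3, hash_limit=100_000, num_to_show=10):
    """Count word tuples that are unbroken by anything other than spaces."""
    counts = Counter()
    full = False
    # Anything other than word characters and spaces breaks a phrase.
    for run in re.split(r"[^\w ]+", text.lower()):
        words = run.split()
        for n in range(2, max_tuple + 1):
            for i in range(len(words) - n + 1):
                phrase = "-".join(words[i:i + n])  # hyphens mark word breaks
                if phrase not in counts and len(counts) >= hash_limit:
                    full = True  # candidate limit reached: stop scanning
                    break
                counts[phrase] += 1
            if full:
                break
        if full:
            break
    # Descending frequency; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:num_to_show]
```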
Purpose: Runs a command with an exclusive file lock held. Usage: run-with-flock file_lock_path lock_acquired_path cmd arg0 arg1 ... argn This will open (and create) 'file_lock_path', take an exclusive file lock on that path, and then create the file 'lock_acquired_path'. If all of that succeeds, the given command will be executed while the lock is held.
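The sequence described above (open the lock file, take the exclusive lock, create the marker file, run the command) can be sketched in Python for Unix-like systems. This is not the run-with-flock source; the function name is hypothetical, and it assumes a blocking lock, which the usage message does not specify.

```python
import fcntl
import subprocess


def run_with_flock(file_lock_path, lock_acquired_path, cmd):
    """Run `cmd` (a list of arguments) while holding an exclusive lock."""
    # Open the lock file, creating it if it does not exist yet.
    with open(file_lock_path, "a") as lock_file:
        # Blocks until the exclusive lock is granted (assumption).
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        # Signal to other processes that the lock is now held.
        open(lock_acquired_path, "w").close()
        # The lock stays held until lock_file is closed on leaving the block.
        return subprocess.run(cmd).returncode
```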
Purpose: Given an annotation index, summarise the annotations applied to a given URL. Usage: show_annotations_for_doc <index_stem>|<collection> <URL>|<URL64>|<DOCNO> [-csv|-html|-xml] - Show the annotations applying to the specified URL and their weights. - Don't forget to quote or escape shell metacharacters in a URL! - Default output format is XML.
Purpose: Apply the tags in a tag mapping file to a PADRE index. Usage 1: url_tagger stem (-clear|<url-tags-file>) Usage 2: url_tagger -v url_tagger -clear clears all tags from all documents in the index. url_tagger <url-tags-file> takes a url-tags file and applies it to the relevant entries in <stem>.results. url_tagger -v shows version information. Lines in the url-tags file are in the form: <url> <comma-separated-taglist> It is assumed that <url> contains no spaces and tags contain no commas.
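A url-tags file in the form described above is straightforward to read or generate. The sketch below parses such lines into a mapping, for example to check a file before passing it to url_tagger. The function name is hypothetical, and skipping blank lines and trimming whitespace around tags are assumptions rather than documented behaviour.

```python
def parse_url_tags(lines):
    """Parse '<url> <comma-separated-taglist>' lines into {url: [tags]}."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        # The URL contains no spaces, so the first whitespace ends it.
        parts = line.split(None, 1)
        url = parts[0]
        taglist = parts[1] if len(parts) > 1 else ""
        # Tags contain no commas, so a plain split is enough.
        mapping[url] = [t.strip() for t in taglist.split(",") if t.strip()]
    return mapping
```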