Indexer options (collection.cfg)


This parameter specifies additional options to pass to the indexer when indexing a collection. The PArallel Document Retrieval Engine (PADRE) indexer can be finely controlled through a large set of options, all of which can be supplied through this collection configuration parameter. The available options are listed below, grouped by purpose.


  • Indexing will not occur if the indexer is given an invalid option.
  • Indexer options can affect Funnelback's performance, so change them with caution.
  • Options in group A are generally useful only when running PADRE from a command line and usually should not be included in indexer_options.

A. Getting information about PADRE and its operation

Option Explanation
-V Print PADRE version number and exit.
-ixform Print the index format version created by this indexer and exit.
-help Print this list and exit.
-debug Generate debugging output.
-show Show code bits generated (for debugging).
-quiet Use terse logging.
-ankdebug Generate debugging output relating to anchortext.
-termdeb<term> Print debugging messages relating to the indexing of <term>.

B. Controlling what is indexed

Option Explanation
-nometa Don't index any metadata except t, d and k (titles, dates and links).
-dias Don't index link anchor source as part of source documents (<a> only).
-ibd Index all documents even if they appear to be binary.
-ixcom Index words in HTML and XML comments.
-select<num1>,<num2> Index every <num1>th document/bundle, starting from the <num2>th (counting from zero).
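
The sampling performed by -select can be pictured with a short Python sketch. It is only an illustration under an assumed reading of the option text: documents are numbered from zero, the first one kept is at position num2, and every num1-th document after that is kept. The function name `selected_positions` is hypothetical and is not part of PADRE.

```python
def selected_positions(total_docs, num1, num2):
    """Zero-based positions kept under the assumed reading of -select<num1>,<num2>.

    Assumption (not confirmed by the option table): start at position num2,
    then keep every num1-th document or bundle after it.
    """
    if num1 < 1 or num2 < 0:
        raise ValueError("num1 must be >= 1 and num2 must be >= 0")
    return list(range(num2, total_docs, num1))


if __name__ == "__main__":
    # With 10 documents, -select3,1 would keep positions 1, 4 and 7
    # under this reading.
    print(selected_positions(10, 3, 1))
```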

C. Controlling how things are indexed

Option Explanation
-noax Don't conflate accents.
-QL_depth=<num> Activate quicklinks on default pages of up to depth <num>. Use internal QL defaults. (Default 0 = Off).
-QL_config=<file> Activate quicklinks. Read quicklinks configuration options from file <file>.
-forcexml Use the XML parser on all documents.
-vbyte Use variable-byte compression of inverted file. (Default).
-elias Use Elias gamma compression of inverted file.
-case Store case information in postings. Requires Elias compression. Queries will not be affected unless "-case" is also passed to the query processor.
-SORTSIG<num> How many [UTF-8] characters in a word are significant.
-dilw Don't index very long words.

D. Controlling metadata indexing

Option Explanation
-nomdsfconcat Don't concatenate stored metadata strings (store only the first). Note: subsequent strings are still indexed.
-XMF<file> <file> specifies a file defining XML field mappings.
-MMF<file> <file> specifies a file defining meta tag mappings.
-ifb Index a special word '$++' (index field boundary) at the start and end of each metadata field (used in facets).
-facet_item_sepchars=<string> Specify which chars are used to separate metadata facet items. The default value is the pipe character (i.e. |)
-mapA Map anchor text in source file to A:
-EM<file> <file> is a file of external metadata.
-NIM Ignore explicitly specified internal metadata.
-collfield=<f> Index the name of a collection as metadata in each document and assign to field f.

E. Controlling link and anchortext handling

Option Explanation
-noank_record Don't extract, record or index anchortext.
-noank_index Extract and record but don't index anchortext.
-dpdf Produce but don't process the anchors distilled file.
-nep_action={0,1,2} Controls handling of nepotistic links.
0 - Handle as normal links,
1 - Ignore links of types greater than nep_limit.
2 - Limit the number of repetitions of links of types greater than nep_limit (default).
-nep_limit={0,1,2,3} Controls the types of links which are considered to be nepotistic.
0 - Unaffiliated links from outside the target domain.
1 - Links from a different host.
2 - Links from the same or a closely affiliated host.
3 - Dynamically generated links from such a host.
-nep_cachebits={i} Limits the size of the 'low value' link cache to 2^i.
-noaltanx Don't index image alt as anchortext when an image is an anchor.
-nosrcanx Don't index image src as anchortext when an image is an anchor.
-BL<f> <f> is a file of source URL patterns from which links should be ignored or treated with suspicion (Blacklist).
-AD<f> <f> is a file of SECD (single entity controlled domain) affiliations. Links to an affiliated SECD are classified as within-domain.
-RP<f> <f> is a file of CGI parameters which should be removed from source and target URLs. The special value conf_file can be used (i.e. "-RPconf_file") to tell the indexer to use the value of crawler.remove_parameters from collection.cfg instead of an external file.
-A<pat> <pat> is an acceptable link target pattern.
-F<file> *<file> is an additional anchor text file.
-FN<file> Like -F but source URLs need not be looked up.
-RD<dir> *<dir> is a directory in which to look for redirects and duplicates files (produced by Funnelback etc. & PADRE).
-igmaf Ignore main anchors file.
-mule<n> Discard links to URL targets longer than <n> chars. Default is no limit.
-rmat Record targets of failed anchor lookups via stdout.
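
The interaction between -nep_action and -nep_limit described in the table above can be sketched in Python. This is an illustrative model only, not PADRE's implementation. It assumes link types are numbered 0 to 3 as in the -nep_limit list, and that "repetitions" means repeated links to the same target. The `max_repeats` parameter and the function name are hypothetical, since the table does not state the actual repetition limit.

```python
from collections import Counter


def apply_nepotism_policy(links, nep_action, nep_limit, max_repeats=1):
    """Filter (target_url, link_type) pairs according to the table's wording.

    Assumptions: link_type uses the 0..3 numbering from -nep_limit, and
    max_repeats is a made-up cap standing in for PADRE's unstated limit.
    """
    kept = []
    seen = Counter()
    for target, link_type in links:
        nepotistic = link_type > nep_limit
        if not nepotistic or nep_action == 0:
            # Action 0 handles every link as a normal link.
            kept.append((target, link_type))
        elif nep_action == 1:
            # Action 1 ignores links of types greater than nep_limit.
            continue
        elif nep_action == 2:
            # Action 2 limits repetitions of such links.
            seen[target] += 1
            if seen[target] <= max_repeats:
                kept.append((target, link_type))
        else:
            raise ValueError("nep_action must be 0, 1 or 2")
    return kept


if __name__ == "__main__":
    sample = [("a", 0), ("a", 2), ("a", 2), ("b", 3)]
    print(apply_nepotism_policy(sample, nep_action=1, nep_limit=1))
    print(apply_nepotism_policy(sample, nep_action=2, nep_limit=1))
```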

F. Controlling which index files are generated

Option Explanation
-nomdsf Suppress generation of the .mdsf file.
-nolex Suppress generation of the .lex file.
-exlens Create .dlx file with explicit lengths for each document field.
-cleanup Remove superfluous files from the index directory after index has completed.
-nosigs Suppress the calculation of document text signatures and the production of .textsig file.

G. Setting size limits

Option Explanation
-GSB<n> How many bytes to allocate for gscope flags. The default is 8 bytes (i.e. 64 flags) and the minimum is 2 bytes (i.e. 16 flags).
-big<N> Multiply word table sizes by 2^N from base of 256K. Default table size is 8M (i.e. -big5).
-small Divide word table sizes by 4 from base of 256K (i.e. use 64K).
-chamb<num> Set decompression chamber size to <num> MB.
-RSDTF<num> Set maximum characters in description & title fields in .results to <num>. Default 256.
-RSTXT<num> Set maximum characters in summarisable text per doc in .results to <num>. Default 10000.
-W<num> Index-writing window will be <num> MB (Larger windows mean faster indexing at the expense of using more RAM).
-MWIPD<num> Maximum words indexed per document (excluding anchors).
-maxdocs<num> Maximum no. of documents to index. Others are ignored.
-mdsfml<n> Set maximum length for strings in .mdsf file. Default 512.
-99% Limit on how full the word hash table can get.
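
The arithmetic behind -GSB, -big and -small can be checked with a few lines of Python. This is a sketch of the figures quoted in the table, assuming that K means 1024 and M means 1024 x 1024. The function names are hypothetical and are not part of PADRE.

```python
BASE_WORD_TABLE = 256 * 1024  # the 256K base quoted for -big and -small


def gscope_flags(gsb_bytes):
    """Number of gscope flags for -GSB<n>: 8 flags per byte, minimum 2 bytes."""
    if gsb_bytes < 2:
        raise ValueError("the table gives a minimum of 2 bytes")
    return gsb_bytes * 8


def word_table_size_big(n):
    """Word table size for -big<N>: the base multiplied by 2^N."""
    return BASE_WORD_TABLE * 2 ** n


def word_table_size_small():
    """Word table size for -small: the base divided by 4."""
    return BASE_WORD_TABLE // 4


if __name__ == "__main__":
    print(gscope_flags(8))                           # default: 64 flags
    print(word_table_size_big(5) // (1024 * 1024))   # default -big5: 8 (M)
    print(word_table_size_small() // 1024)           # -small: 64 (K)
```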

H. Special indexing modes

Option Explanation
-future_dates_ok Allow indexing of documents with dates in the future (instead of assuming the current date in these cases).
-paidads If set, documents known to contain paid ads will be flagged specially (with the DOC_HAS_PAID_ADS flag).
-nz<num> Adjusts special processing for Māori, specifically handling of macrons. Valid <num> values are 0 (default - no special processing), 1 (some processing) and 2 (all processing).
-iolap Overlap reading of bundles with processing them.
-utf8input Assume all input files whose charset is not specified are UTF-8 encoded.
-isoinput Assume all input files whose charset is not specified are ISO_8859-1 encoded (This is the default).
-force_iso Forcibly assume all input files are ISO_8859-1 encoded.
-URLP<str> When storing document URLs, prepend <str>. (This is only used if the document does not indicate its own URL with a BASE HREF element.)
-lmd HTTP LastModified date takes priority over metadata dates.
-DT<str> Interpret <str> as start of new doc within bundle. (Not a regular expression). (note that there is a separate mechanism for XML).
-annie[<exec>] After normal indexing is complete, build an ANNIE (annotation) index using the specified executable. Default executable is annie-i
-bigweb Space saving option for bundled large crawl indexes. Roughly equivalent to: -nomdsf -big7 -MWIPD2000 -W2000 -SORTSIG16 -chamb64 -RSTXT2000 -mule128 -noaltanx -nosrcanx -nometa -quiet . A broader definition is taken of link nepotism. You can add e.g. to cut anchor processing time. (Don't forget to make dupredrex.txt in index directory.)

I. Miscellaneous options

Option Explanation
-O<name> <name> is the name of this organisation.
-T<path> Specify a large temporary filespace for use by the indexer.

S. Security options

Option Explanation
-check_url_exclusion=<on/off> URLs matching url_exclusion pattern will not be searchable. (Default on.)
-url_exclusion_pattern=<regex> Exclusion pattern to use if URLs are vetted. (Default 'file://$SEARCH_HOME/')

Default value


That is, no additional options.


To allocate 80 gscope flags (10 bytes x 8 flags per byte) and index no more than 1000 documents:

indexer_options=-GSB10 -maxdocs1000
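
The same line can be derived from the desired number of gscope flags. The Python sketch below rounds the flag count up to whole bytes and applies the 2-byte minimum from the -GSB description. The helper name `build_indexer_options` is hypothetical, and the sketch does not check the result against PADRE itself.

```python
def build_indexer_options(gscope_count, max_docs):
    """Compose an indexer_options line from a flag count and a document cap."""
    # Round up to whole bytes (8 flags per byte), respecting the 2-byte minimum.
    gsb_bytes = max(2, -(-gscope_count // 8))
    return "indexer_options=-GSB{} -maxdocs{}".format(gsb_bytes, max_docs)


if __name__ == "__main__":
    # Reproduces the example above for 80 flags and 1000 documents.
    print(build_indexer_options(80, 1000))
```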
