Indexer options (collection.cfg)

Description

This option passes additional command-line options to the indexer when a collection is indexed. The indexer, PADRE (PArallel Document Retrieval Engine), can be finely controlled through a large set of options, any of which can be specified in this collection configuration parameter. The available options are listed below.

Caveats

  • Indexing will not occur if the indexer is given an invalid option.
  • Indexer options can affect Funnelback's performance, so change them with caution.
  • Options in group A are generally useful only when running PADRE from a command line and usually should not be included in indexer_options.

A. Getting information about PADRE and its operation

Option Explanation
-V Print PADRE version number and exit.
-ixform Print the index format version created by this indexer and exit.
-help Print this list and exit.
-debug Generate debugging output.
-show Show code bits generated (for debugging).
-quiet Use terse logging.
-ankdebug Generate debugging output relating to anchortext.
-termdeb<term> Print debugging messages relating to the indexing of <term>.

B. Controlling what is indexed

Option Explanation
-nometa Don't index any metadata except t, d and k (titles, dates and links).
-dias Don't index link anchor source as part of source documents (<a> only).
-ibd Index all documents even if they appear to be binary.
-ixcom Index words in HTML and XML comments.
-select<num1>,<num2> Index every <num1>th document/bundle, starting from the <num2>th (counting from zero).
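
As a rough illustration of -select<num1>,<num2>, the Python sketch below lists which zero-based document positions would be picked. It assumes the option means "take position num2, then every num1th position after it". Both that reading and the function name selected_positions are assumptions made for illustration, not a description of PADRE's implementation.

```python
def selected_positions(num1: int, num2: int, total: int) -> list[int]:
    """Zero-based positions picked by a -select<num1>,<num2>-style rule.

    Assumed reading: start at position num2 and step by num1.
    """
    if num1 < 1 or num2 < 0:
        raise ValueError("num1 must be >= 1 and num2 must be >= 0")
    return list(range(num2, total, num1))


# With 40 documents, "-select10,3" would sample four of them under this reading.
print(selected_positions(10, 3, 40))  # [3, 13, 23, 33]
```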

C. Controlling how things are indexed

Option Explanation
-noax Don't conflate accents.
-QL_depth=<num> Activate quicklinks on default pages of up to depth <num>. Use internal QL defaults. (Default 0 = Off).
-QL_config=<file> Activate quicklinks. Read quicklinks configuration options from file <file>.
-forcexml Use the XML parser on all documents.
-vbyte Use variable-byte compression of inverted file. (Default).
-elias Use Elias gamma compression of inverted file.
-case Store case information in postings. Requires Elias compression. Queries will not be affected unless "-case" is also passed to the query processor.
-SORTSIG<num> How many [UTF-8] characters in a word are significant.
-dilw Don't index very long words.

D. Controlling metadata indexing

Option Explanation
-nomdsfconcat Don't concatenate stored metadata strings (store only the first). Note: subsequent strings are still indexed.
-XMF<file> <file> specifies a file defining XML field mappings.
-MMF<file> <file> specifies a file defining meta tag mappings.
-ifb Index a special word '$++' (index field boundary) at the start and end of each metadata field (used in facets).
-facet_item_sepchars=<string> Specify which characters are used to separate metadata facet items. The default value is the pipe character (i.e. |).
-mapA Map anchor text in source file to A:
-EM<file> <file> is a file of external metadata.
-NIM Ignore explicitly specified internal metadata.
-collfield=<f> Index the name of a collection as metadata in each document and assign to field f.

E. Controlling link and anchortext handling

Option Explanation
-noank_record Don't extract, record or index anchortext.
-noank_index Extract and record but don't index anchortext.
-dpdf Produce but don't process the anchors distilled file.
-nep_action={0,1,2} Controls handling of nepotistic links.
0 - Handle as normal links.
1 - Ignore links of types greater than nep_limit.
2 - Limit the number of repetitions of links of types greater than nep_limit (default).
-nep_limit={0,1,2,3} Controls the types of links which are considered to be nepotistic.
0 - Unaffiliated links from outside the target domain.
1 - Links from a different host.
2 - Links from the same or a closely affiliated host.
3 - Dynamically generated links from such a host.
-nep_cachebits={i} Limits the size of the 'low value' link cache to 2^i.
-noaltanx Don't index image alt as anchortext when an image is an anchor.
-nosrcanx Don't index image src as anchortext when an image is an anchor.
-BL<f> <f> is a file of source URL patterns from which links should be ignored or treated with suspicion (Blacklist).
-AD<f> <f> is a file of SECD (single entity controlled domain) affiliations. e.g. griffith.edu.au --> gu.edu.au. Links to an affiliated SECD are classified as within-domain.
-RP<f> <f> is a file of CGI parameters which should be removed from source and target URLs. Instead of naming an external file, the special value conf_file (i.e. "-RPconf_file") tells the indexer to use the value of crawler.remove_parameters from collection.cfg.
-A<pat> <pat> is an acceptable link target pattern.
-F<file> *<file> is an additional anchor text file.
-FN<file> Like -F but source URLs need not be looked up.
-RD<dir> *<dir> is a directory in which to look for redirects and duplicates files (produced by Funnelback etc. and PADRE).
-igmaf Ignore main anchors file.
-mule<n> Discard links to URL targets longer than <n> chars. Default is no limit.
-rmat Record targets of failed anchor lookups via stdout.
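
The interaction between -nep_action and -nep_limit can be sketched in Python as follows. This is only one reading of the table above: it assumes link types are numbered 0 to 3 in the same order as the -nep_limit values, and that a link counts as nepotistic when its type is greater than nep_limit. The names LINK_TYPES and link_treatment are hypothetical. No default is given for nep_limit because the table does not state one.

```python
# Link types, numbered as in the -nep_limit rows above (assumed mapping).
LINK_TYPES = {
    0: "unaffiliated link from outside the target domain",
    1: "link from a different host",
    2: "link from the same or a closely affiliated host",
    3: "dynamically generated link from such a host",
}


def link_treatment(link_type: int, nep_limit: int, nep_action: int = 2) -> str:
    """Return how a link would be treated under the assumed reading."""
    if link_type not in LINK_TYPES or nep_limit not in LINK_TYPES:
        raise ValueError("link_type and nep_limit must be 0, 1, 2 or 3")
    if nep_action not in (0, 1, 2):
        raise ValueError("nep_action must be 0, 1 or 2")
    # Action 0 treats every link normally; otherwise only types above the limit are affected.
    if nep_action == 0 or link_type <= nep_limit:
        return "normal"
    return "ignore" if nep_action == 1 else "limit repetitions"


print(link_treatment(3, nep_limit=2))                # limit repetitions
print(link_treatment(3, nep_limit=2, nep_action=1))  # ignore
print(link_treatment(1, nep_limit=2, nep_action=1))  # normal
```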

F. Controlling which index files are generated

Option Explanation
-nomdsf Suppress generation of the .mdsf file.
-nolex Suppress generation of the .lex file.
-exlens Create .dlx file with explicit lengths for each document field.
-cleanup Remove superfluous files from the index directory after index has completed.
-nosigs Suppress the calculation of document text signatures and the production of .textsig file.

G. Setting size limits

Option Explanation
-GSB<n> How many bytes to allocate for gscope flags. The default is 8 bytes (i.e. 64 flags) and the minimum is 2 bytes (i.e. 16 flags).
-big<N> Multiply word table sizes by 2^N from base of 256K. Default table size is 8M (i.e. -big5).
-small Divide word table sizes by 4 from base of 256K (i.e. use 64K).
-chamb<num> Set decompression chamber size to <num> MB.
-RSDTF<num> Set maximum characters in description & title fields in .results to <num>. Default 256.
-RSTXT<num> Set maximum characters in summarisable text per doc in .results to <num>. Default 10000.
-W<num> Index-writing window will be <num> MB (Larger windows mean faster indexing at the expense of using more RAM).
-MWIPD<num> Maximum words indexed per document (excluding anchors).
-maxdocs<num> Maximum no. of documents to index. Others are ignored.
-mdsfml<n> Set maximum length for strings in .mdsf file. Default 512.
-99% Limit on how full the word hash table can get.
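
The arithmetic behind -GSB, -big and -small can be checked with a short Python sketch. It uses only the figures stated in the table above (8 flags per byte, a 256K base word table). The function names are hypothetical, and the sizes are in whatever units the table intends.

```python
BASE_WORD_TABLE = 256 * 1024  # the 256K base stated for -big and -small


def gscope_flags(gsb_bytes: int) -> int:
    """Number of gscope flags available for -GSB<n>: one flag per bit."""
    if gsb_bytes < 2:
        raise ValueError("the stated minimum for -GSB is 2 bytes")
    return gsb_bytes * 8


def word_table_size(big_n: int) -> int:
    """Word table size for -big<N>: the base multiplied by 2^N."""
    return BASE_WORD_TABLE * 2 ** big_n


def small_word_table_size() -> int:
    """Word table size for -small: the base divided by 4."""
    return BASE_WORD_TABLE // 4


print(gscope_flags(8))                         # 64 (the default)
print(gscope_flags(10))                        # 80
print(word_table_size(5) // (1024 * 1024))     # 8, i.e. the 8M default
print(small_word_table_size() // 1024)         # 64, i.e. 64K
```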

H. Special indexing modes

Option Explanation
-future_dates_ok Allow indexing of documents with dates in the future (instead of assuming the current date in these cases).
-paidads If set, documents known to contain paid ads will be flagged specially (with the DOC_HAS_PAID_ADS flag).
-nz<num> Adjusts special processing for Māori, specifically handling of macrons. Valid <num> values are 0 (default - no special processing), 1 (some processing) and 2 (all processing).
-iolap Overlap reading of bundles with processing them.
-utf8input Assume all input files whose charset is not specified are UTF-8 encoded.
-isoinput Assume all input files whose charset is not specified are ISO_8859-1 encoded (This is the default).
-force_iso Forcibly assume all input files are ISO_8859-1 encoded.
-URLP<str> When storing document URLs, prepend <str>. (This is only used if the document does not indicate its own URL with a BASE HREF element.)
-lmd HTTP LastModified date takes priority over metadata dates.
-DT<str> Interpret <str> as the start of a new document within a bundle (not a regular expression). Note that there is a separate mechanism for XML.
-annie[<exec>] After normal indexing is complete, build an ANNIE (annotation) index using the specified executable. Default executable is annie-i
-bigweb Space saving option for bundled large crawl indexes. Roughly equivalent to: -nomdsf -big7 -MWIPD2000 -W2000 -SORTSIG16 -chamb64 -RSTXT2000 -mule128 -noaltanx -nosrcanx -nometa -quiet . A broader definition is taken of link nepotism. You can add e.g. -Axxx.com to cut anchor processing time. (Don't forget to make dupredrex.txt in index directory.)

I. Miscellaneous options

Option Explanation
-O<name> <name> is the name of this organisation.
-T<path> Specify a large temporary filespace for use by the indexer.

S. Security options

Option Explanation
-check_url_exclusion=<on/off> URLs matching url_exclusion pattern will not be searchable. (Default on.)
-url_exclusion_pattern=<regex> Exclusion pattern to use if URLs are vetted. (Default 'file://$SEARCH_HOME/')

Default value

indexer_options=

That is, no additional options.

Examples

To allocate 80 (10 x 8) gscopes and index no more than 1000 documents:

indexer_options=-GSB10 -maxdocs1000
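
Because group A options usually should not appear in indexer_options, it may be worth checking a value before saving it. The Python sketch below splits an indexer_options line into individual options and reports any that belong to group A. It is a hypothetical pre-flight helper, not part of Funnelback. It assumes options are separated by whitespace and that no option value contains spaces.

```python
# Group A options as listed on this page; -termdeb takes a term suffix.
GROUP_A_EXACT = {"-V", "-ixform", "-help", "-debug", "-show", "-quiet", "-ankdebug"}
GROUP_A_PREFIXES = ("-termdeb",)


def parse_indexer_options(line: str) -> list[str]:
    """Split an 'indexer_options=...' line into its individual options."""
    key, sep, value = line.partition("=")  # splits on the first '=' only
    if not sep or key.strip() != "indexer_options":
        raise ValueError("not an indexer_options line")
    return value.split()


def group_a_options(options: list[str]) -> list[str]:
    """Return the options that belong to group A (informational/debugging)."""
    return [
        opt for opt in options
        if opt in GROUP_A_EXACT or opt.startswith(GROUP_A_PREFIXES)
    ]


opts = parse_indexer_options("indexer_options=-GSB10 -maxdocs1000 -quiet")
print(opts)                   # ['-GSB10', '-maxdocs1000', '-quiet']
print(group_a_options(opts))  # ['-quiet']
```

A check like this only catches group A options. It does not establish that the remaining options are valid for the indexer.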
