Padre Indexer options

Indexer options control how the Funnelback search index is built. They are supplied to the indexer as a space-separated list via the indexer_options data source configuration option, and can be specified in any order.

For example, set the maximum number of documents to index to 1000, and increase the size of an indexed metadata field to 1500 characters:

indexer_options= -maxdocs1000 -mdsfml1500

If you change or add any indexer options, you need to rebuild your index for the changes to take effect. For a push data source this requires a vacuum to be executed via the push API. For other data source types you need to run an advanced update to rebuild the live index. (A normal update will also apply the changes, but takes longer to run.)

Controlling what is indexed

-nometa

Don’t index any metadata except t, d and k (titles, dates and links).

-nomdsfconcat

Don’t concatenate strings in the mdsf file. Record first only. (Others are still indexed.)

-diwimuu

Don’t index words in made-up URLs (those constructed from filepath).

-dias

Don’t index link anchor source as part of source documents (<a> only).

-ibd

Index all documents even if they appear to be binary.

-ixcom

Index words in HTML and XML comments.

-select<num1>,<num2>

Index every <num1>th file/bundle, starting from the <num2>th (counting from zero).

-select-doc-in-bundle=<interval>,<offset>

Index every <interval>th document within a bundle, starting from <offset> (counting from zero). Only works with a WARC store.
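To make the interval/offset wording of -select and -select-doc-in-bundle concrete, here is a minimal Python sketch. It assumes the selection is a plain stride over zero-based positions, which is one reading of the descriptions above rather than a statement of how PADRE implements it. `selected_positions` is a hypothetical helper name.

```python
def selected_positions(total, interval, offset):
    """Return the zero-based positions that would be indexed.

    Assumption: every `interval`-th item is taken, starting at `offset`.
    """
    if interval < 1 or offset < 0:
        raise ValueError("interval must be >= 1 and offset must be >= 0")
    return list(range(offset, total, interval))


# Under this reading, -select3,1 applied to 10 files/bundles picks these positions.
print(selected_positions(10, 3, 1))
```

Sampling this way can be handy for building a small test index from a large data source before committing to a full rebuild.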

-tarpat<regex>

Filenames in a .tar file being indexed must match regex. Default is to match everything.

-check_url_exclusion=<on|off>

URLs matching the URL exclusion pattern (see -url_exclusion_pattern) will not be searchable. (Default: on.)

-url_exclusion_pattern=<regex>

Exclusion pattern to use if URLs are vetted. (Default: file://$SEARCH_HOME/)

-filepath_exclusion_pattern=<regex>

Exclusion pattern to use if files are to be excluded from indexing on the basis of the filepath. If applicable, this is more efficient than excluding by URL because the URL can’t be finally determined until the content has been scanned. (Default: not set)

-index_subversion_dirs

Normally the .svn directories created by the subversion version control system are not indexed. Override this default.

Controlling how things are indexed

-noax

Don’t conflate accents.

-unimap=<mapname>

Specify a Unicode mapping to be applied when indexing and when query processing. Supported values: tosimplified and totraditional. (Chinese only.)

-deutsch=<i>

How much extra processing is done for umlaut and sz.

  • 0: none. München is indexed as München and Munchen

  • 1: München is indexed as München, Muenchen and Munchen (Dflt)

  • 2: As for 1 but also Muenchen is indexed as München, Muenchen and Munchen

    NOTE: As a side-effect to allow for compounds, `SORT_SIGNIF` is increased to `40`
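The three -deutsch levels can be illustrated with a short Python sketch that reproduces the München example from the list above. This is an approximation for illustration only: the function names are hypothetical, the handling of sz (ß) is left out, and the digraph contraction used for level 2 is deliberately naive. It is not PADRE's actual algorithm.

```python
import unicodedata

# Umlaut -> digraph expansions (escapes keep the keys as precomposed characters).
EXPANSIONS = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
}


def strip_accents(word):
    """Remove combining marks, e.g. M\u00fcnchen -> Munchen."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_umlauts(word):
    """Replace each umlaut with its digraph, e.g. M\u00fcnchen -> Muenchen."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in word)


def contract_digraphs(word):
    """Naively turn digraphs back into umlauts, e.g. Muenchen -> M\u00fcnchen."""
    for umlaut, digraph in EXPANSIONS.items():
        word = word.replace(digraph, umlaut)
    return word


def indexed_forms(word, level):
    """Return the forms a word would be indexed as at a given -deutsch level."""
    word = unicodedata.normalize("NFC", word)
    base = contract_digraphs(word) if level == 2 else word
    forms = [base]
    if level >= 1:
        forms.append(expand_umlauts(base))
    forms.append(strip_accents(base))
    return list(dict.fromkeys(forms))  # de-duplicate, keeping order


for level in (0, 1, 2):
    print(level, indexed_forms("M\u00fcnchen", level))
```

A real implementation would need a smarter rule than blanket digraph replacement, since words such as Mauer contain "ue" without it standing for an umlaut.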

-nz=<i>

How much extra processing is done for Māori.

  • 0: none. Māori and Mäori are indexed as Māori or Mäori respectively and Maori (Dflt)

  • 1: Māori is indexed as Māori, Maaori and Maori; and Mäori is indexed as Mäori, Māori, Maaori and Maori

-no_cjkt_grams

Suppress the indexing of bigrams/unigrams in CJKT text. It is assumed that said text has been pre-segmented into words, and that normal word-based indexing is needed.

-QL_depth=<i>

Activate quick links on default pages of up to depth <i>. Use internal quick links defaults. (Dflt 0 = Off)

-QL_config=<f>

Activate quicklinks. Read quicklinks configuration options from file <f>.

-docscan_depth=<i>

When trying to determine document type and charset, the indexer will look up to <i> chars into the document. (Dflt 20480)

-forcexml

Use the XML parser on all documents.

-case

Store case information in postings. Currently unsupported. Note that setting this reduces the approximate max number of unique terms from ~950M to ~240M.

-SORTSIG<num>

How many UTF-8 characters in a word are significant. Default: 30.

-dilw

Don’t index words or use words in summaries that are longer than what is set by -SORTSIG.

Controlling metadata indexing

-xml-config=<file>

<file> specifies a file defining XML indexing configurations in json format.

-MM=<file>

<file> specifies a file defining metadata mappings for both HTML and XML documents.

-ifb

Index a special word $++ at the start and end of each metadata field (on by default).

-noifb

Do not index a special word $++ at the start and end of each metadata field.

-facet_item_sepchars=<string>

Which chars are used to separate metadata facet items. (Dflt '|')

-map[<f>]

Map anchor text in source file to metadata class <f>. If <f> is absent, outgoing anchor text is un-fielded content. (dflt <f> absent)

-EM<file>

<file> is a file of external metadata. Multiple files may be supplied by setting this option multiple times.

-externalMetaErrorThreshold=<num>

<num> is the percentage error rate allowed when processing the external metadata file(s). It ranges from 0 to 100. Default: 10.

-NIM

Ignore explicitly specified internal metadata.

-collfield=<f>

Index the name of the data source as metadata in each document and assign to field <f>.

-collection_name=

Set the name of the data source being indexed.

-metadata_topk_capacity=<I>

Sets the maximum number of metadata names or XML paths PADRE will keep track of for counting the most frequent metadata or xpath that could be mapped.

-metadata_topk_k=<I>

Sets the number of the most frequent metadata names or XML paths PADRE should report on after indexing.

-noank_record

Don’t extract, record or index anchor text. .anchors.gz file not processed. No link counts possible.

-noank_index

Extract and record but don’t index anchor text. The .anchors.gz file can be post-processed by annie-a.

-nocanon

Don’t canonicalize URLs when storing URLs or matching anchortext.

-canon.anchor_collapse=<on|off>

Controls whether PADRE should canonicalise URLs with fragments (anchors).

  • on: PADRE will drop the anchor symbol (#) and any characters that follow it.

  • off: PADRE will treat URLs with anchors as unique URLs.
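As an illustration of the two settings, the following Python sketch uses the standard library's `urldefrag` to mimic fragment collapsing. It only models the behaviour described above; `canonicalise_fragment` is a hypothetical name and the example URL is made up.

```python
from urllib.parse import urldefrag


def canonicalise_fragment(url, anchor_collapse=True):
    """Drop the #fragment when anchor_collapse is on; otherwise keep the URL as is."""
    return urldefrag(url).url if anchor_collapse else url


url = "https://example.com/guide#install"
print(canonicalise_fragment(url, anchor_collapse=True))
print(canonicalise_fragment(url, anchor_collapse=False))
```

With the option on, several in-page anchors of the same document collapse to a single URL; with it off, each anchored URL is kept distinct.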

-dpdf

Produce but don’t process the anchors distilled file.

-nep_action=<0|1|2>

Action to take for nepotistic links:

  • 0: treat the same as other links.

  • 1: ignore links of types greater than nep_limit.

  • 2: limit the number of repetitions of links of types greater than nep_limit. (dflt)

-nep_limit=<0|1|2|3>

Ignore nepotistic links of types greater than the limit:

  • 0: unaffiliated links from outside the target domain.

  • 1: links from a different host.

  • 2: links from the same or a closely affiliated host.

  • 3: dynamically generated links from such a host.

-nep_cachebits=<i>

Don’t let the low-value link cache grow above 2^i.

-noaltanx

Don’t index image alt as anchor text when an image is an anchor.

-nosrcanx

Don’t index image src as anchor text when an image is an anchor.

-BL<f>

<f> is a file of source URL patterns from which links should be ignored or treated with suspicion (blacklist).

-AD<f>

<f> is a file of SECD (single entity controlled domain) affiliations, e.g.

griffith.edu.au --> gu.edu.au

Links to an affiliated SECD are classified as within-domain.

-RP<f>

<f> is a file of CGI parameters which should be removed from source and target URLs. Padre generates a regular expression from the lines in <f>. If <f> is conf_file the regex is taken from crawler.remove_parameters in the data source configuration.
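The idea of generating a regular expression from a list of CGI parameters and stripping them from URLs can be sketched in Python as follows. The source does not specify the file format or the generated regex, so this assumes one literal parameter name per line. The helper names and the example URL are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_param_regex(lines):
    """Build one alternation regex from parameter names (assumed: one per line)."""
    names = [line.strip() for line in lines if line.strip()]
    return re.compile("|".join(re.escape(name) for name in names))


def remove_parameters(url, pattern):
    """Return the URL with every query parameter whose name matches the pattern removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not pattern.fullmatch(key)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


pattern = build_param_regex(["sessionid", "utm_source"])
print(remove_parameters("https://example.com/a?id=7&sessionid=abc&page=2", pattern))
```

Removing session-style parameters from both source and target URLs lets links that differ only in those parameters be matched to the same document.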

-A<pat>

<pat> is an acceptable link target pattern. URLs not matching <pat> will not be stored in the anchors.gz file. If <pat> is conf_file, the link pattern will be sourced from the include_patterns in the data source configuration.

-F<file>

<file> defines an additional anchor text file.

-FN<file>

Like -F but source URLs need not be looked up.

-RD<dir>

<dir> is a directory in which to look for redirects and duplicates files.

-igmaf

Ignore the main anchors file.

-mule<n>

Discard links to URL targets longer than <n> chars. Default is no limit.

-rmat

Record targets of failed anchor lookups via stdout.

-create_phrase_metadata_terms=<b>

Enables the creation of phrase terms like $++ foo bar $++ in the dictionary for metadata. These phrase terms can be used to speed up queries like a:$++ foo bar $++. Phrases will only be created if indexing of field boundaries is enabled (-ifb), which it is by default. Disabling may reduce indexing time and index size.

Controlling which index files are generated

-nomdsf

Suppress generation of the .mdsf file.

-nolex

Suppress generation of the .lex file.

-noqicf

Suppress computation of QIC features and .qicf file.

-nohostf

Suppress computation of host features and .ghosts file.

-cleanup

Remove superfluous files from the index directory after index has completed.

Setting size limits

-GSB<n>

How many gscope bytes to allow for. Default: 8, Min: 2.

This setting is no longer required as gscope bits are now auto-sized.

-big<N>

Multiply word table sizes by 2^N from base of 256K. Default table size is 8M (i.e. -big5).

-small

Divide word table sizes by 4 from base of 256K (i.e. use 64K).
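The table-size arithmetic behind -big and -small can be checked with a few lines of Python. This sketch assumes the -big multiplier is 2 to the power N, which is consistent with the stated default of 8M for -big5. `word_table_size` is a hypothetical helper.

```python
BASE_TABLE_SIZE = 256 * 1024  # the 256K base mentioned above


def word_table_size(big=None, small=False):
    """Return the word table size implied by -big<N> or -small."""
    if small:
        return BASE_TABLE_SIZE // 4  # -small: 256K / 4
    n = 5 if big is None else big  # the default corresponds to -big5
    return BASE_TABLE_SIZE * 2 ** n


for label, size in [
    ("-small", word_table_size(small=True)),
    ("default", word_table_size()),
    ("-big8", word_table_size(big=8)),
]:
    print(label, size // 1024, "K")
```

Each increment of N doubles the table size, so -big8 (as used by -bigweb) is eight times the default.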

-chamb<num>

Set decompression chamber size to <num> MB. Default 32

-RSDTF<num>

Set maximum characters in description and title fields in .results to <num>. Default 256.

-RSTAG<num>

Set number of bytes to reserve for tags in .results to <num>. Default 0.

-RSTXT<num>

Set maximum characters in summarizable text per doc in .results to <num>. Default 50000, maximum of 10000000.

-W<num>

Index-writing window will be <num> MB. Larger windows mean faster indexing at the expense of using more RAM. Default for a 64-bit system is 2800.

-MWIPD<num>

Maximum words indexed per document (excluding anchors). By default all words are indexed

-maxdocs<num>

Maximum number of documents to index. Others are ignored.

-mdsfml<n>

Set the maximum length, in bytes, of metadata summary fields. Fields longer than this will be truncated. Default is 2048.

-lock_string_mod_mode=<legacy|raw>

Sets how PADRE modifies the lock string before it is stored. legacy mode removes some characters, replaces unquoted commas with new lines and removes consecutive new lines. raw mode stores the lock string as is, up to the first null.

-99%

Limit on how full the word hash table can get.

Special indexing modes

-duplicate_urls=<flag|ignore>

(Default is flag.) Documents whose URL checksum is identical to that of another document are normally flagged and suppressed from results.

-urlchecksums=<case_sensitive|case_insensitive>

(Default is case_insensitive).
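How -duplicate_urls=flag and -urlchecksums interact can be pictured with a small Python sketch. The checksum function PADRE uses is not stated in the source, so MD5 here is an arbitrary stand-in, and both function names are hypothetical.

```python
import hashlib


def url_checksum(url, case_sensitive=False):
    """Checksum a URL, folding case first unless case_sensitive is set."""
    key = url if case_sensitive else url.lower()
    return hashlib.md5(key.encode("utf-8")).hexdigest()  # stand-in hash, an assumption


def flag_duplicates(urls, case_sensitive=False):
    """Return one boolean per URL: True if an earlier URL had the same checksum."""
    seen = set()
    flags = []
    for url in urls:
        checksum = url_checksum(url, case_sensitive)
        flags.append(checksum in seen)
        seen.add(checksum)
    return flags


urls = ["http://example.com/About", "http://example.com/about"]
print(flag_duplicates(urls))
print(flag_duplicates(urls, case_sensitive=True))
```

With case-insensitive checksums, URLs that differ only in letter case are treated as the same document; with case-sensitive checksums they are kept apart.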

-paidads

If set, documents known to contain paid ads will be flagged specially (with the DOC_HAS_PAID_ADS flag).

-doc_feature_regex=<Regex>

Documents matching the supplied pattern will be flagged as DOC_MATCHES_REGEX. The presence or absence of this feature can be used in the ranking function, controlled by the -cool.29 and -cool.30 query processor options.

-iolap

Overlap reading of bundles with processing them.

-utf8input

Assume all input files whose charset is not specified are UTF-8 encoded. (Default is WINDOWS-1252.)

-isoinput

Assume all input files whose charset is not specified are ISO_8859-1 encoded.

-force_iso

Forcibly assume all input files are ISO_8859-1 encoded.

-URLP<str>

When storing document URLs, prepend <str>. (This is only used if the document does not indicate its own URL with a BASE HREF element, such as in local collections.)

-lmd

HTTP LastModified date takes priority over metadata dates.

-lmd_never

Completely ignore HTTP LastModified dates.

-ignore_link_rel_canonical

Ignore canonical URL declarations in HTML link elements.

-ignore_noindex

Ignore robots noindex meta element.

-DT<str>

Interpret <str> as start of new doc within bundle. (Not a regular expression). Note that there is a separate mechanism for XML.

-annie[<exec>]

After normal indexing is complete, attempt to build an annotation index (annie) and spelling suggestion index. The default executables are annie-a and build-spelling-index, located in the directory from which padre-iw was run.

-speller[<exec>]

Allows the explicit specification of a spelling_index builder to run after annie-a.

-spelleroff

Turns off spelling-index building even if annie-a runs.

-spelling_threshold<i>

Annotations with fewer than <i> occurrences will not be considered as spelling suggestions. (dflt 1)

-bigweb

Space saving option for bundled large crawl indexes. Roughly equivalent to:

-nomdsf -big8 -MWIPD2000 -W6000 -SORTSIG16 -nep_action=2 -nep_limit=2 -nep_cachebits=20 -chamb64 -RSTXT2000 -mule128 -noaltanx -nosrcanx -nometa -quiet

Notes:

  • A shorter average word length is assumed.

  • You can add e.g. `-Axxx.com` to cut anchor processing time.

  • (Don’t forget to make dupredrex.txt in index directory.)

Miscellaneous options

-O<name>

<name> is the name of this organisation.

-T<path>

Specify a large temporary file space for use by the indexer.

-redis_host=<str>

Hostname/IP of a Redis server where progress status should be written

-redis_port=<i>

Port of the Redis server. Default is 6379

Security options

-security_level=<i>

Any non-zero value requires every document to have at least one lock. If set to 1, documents without locks will be excluded; if set to greater than 1, indexing will stop.

-security_mindocs=<i>

There must be at least this number of documents with at least one lock.
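The -security_level rules can be summarised as a small decision procedure. The Python sketch below models documents as dictionaries with a `locks` list, which is purely an illustrative assumption. `filter_by_security_level` and `IndexingStopped` are hypothetical names, and -security_mindocs is not modelled.

```python
class IndexingStopped(Exception):
    """Raised when a lockless document is met at a security level above 1."""


def filter_by_security_level(docs, security_level):
    """Apply the lock requirement described for -security_level."""
    if security_level < 0:
        raise ValueError("negative levels are not described in the source text")
    kept = []
    for doc in docs:
        if security_level == 0 or doc["locks"]:
            kept.append(doc)  # no requirement, or the document has a lock
        elif security_level == 1:
            continue  # level 1: exclude the lockless document
        else:
            raise IndexingStopped(doc["url"])  # level > 1: stop indexing
    return kept


docs = [
    {"url": "doc-a", "locks": ["staff"]},
    {"url": "doc-b", "locks": []},
]
print([d["url"] for d in filter_by_security_level(docs, 1)])
```

Level 1 suits cases where leaving out unprotected documents is acceptable, while a higher level treats any unprotected document as a reason to halt.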

See also: URL exclusion options in Controlling what is indexed above.

See also