Padre Indexer options
Indexer options control how the Funnelback search index is built. Supply them to the indexer as a space-separated list via the
indexer_options data source configuration option. The options can be specified in any order.
For example, set the maximum number of documents to index to 1000, and increase the size of an indexed metadata field to 1500 characters:
indexer_options= -maxdocs1000 -mdsfml1500
|If you change or add any indexer options, you need to rebuild your index for the changes to take effect. For a push data source, this requires a vacuum to be executed via the push API. For other data source types, run an advanced update to rebuild the live index. (A normal update will also apply the changes, but takes longer to run.)|
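As a minimal sketch of the format described above, the following composes the setting line from individual options. The helper name build_indexer_options is hypothetical, and the leading-hyphen check is an assumption based on the options shown on this page; only -maxdocs1000 and -mdsfml1500 come from the example.

```python
def build_indexer_options(options):
    """Join individual indexer options into one space-separated setting line."""
    # Drop empty entries and surrounding whitespace.
    cleaned = [opt.strip() for opt in options if opt.strip()]
    for opt in cleaned:
        # Assumption: every indexer option starts with a hyphen.
        if not opt.startswith("-"):
            raise ValueError(f"indexer option must start with '-': {opt!r}")
    # Order does not matter to the indexer, so the input order is kept as given.
    return "indexer_options=" + " ".join(cleaned)


print(build_indexer_options(["-maxdocs1000", "-mdsfml1500"]))
```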
Controlling what is indexed
Don’t index any metadata except
k (titles, dates and links).
Don’t concatenate strings in the mdsf file. Record first only. (Others are still indexed.)
Don’t index words in made-up URLs (those constructed from filepath).
Don’t index link anchor source as part of source documents (
Index all documents even if they appear to be binary.
Index words in HTML and XML comments.
Index every num1th file/bundle starting from num2th (from zero).
<interval> document within a bundle starting from
<offset> (which starts at zero). Only works with warc store.
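The interval and offset wording above can be read as a simple stride over zero-based positions. The sketch below shows that reading; the function name sampled_positions is hypothetical, and the exact selection rule used by the indexer is an assumption.

```python
def sampled_positions(total, interval, offset=0):
    """Zero-based positions kept when taking every `interval`-th item from `offset`."""
    if interval < 1 or offset < 0:
        raise ValueError("interval must be >= 1 and offset must be >= 0")
    # Assumption: selection is offset, offset + interval, offset + 2 * interval, ...
    return list(range(offset, total, interval))


print(sampled_positions(10, 3, 1))  # [1, 4, 7]
```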
Filenames in a
.tar file being indexed must match regex. Default is to match everything.
URLs matching url_exclusion pattern will not be searchable. (Default:
exclusion pattern to use if URLs are vetted. (Default
exclusion pattern to use if files are to be excluded from indexing on the basis of the filepath. If applicable, this is more efficient than excluding by URL because the URL can’t be finally determined until the content has been scanned. (Default: not set)
.svn directories created by the Subversion version control system are not indexed by default. This option overrides that default.
Controlling how things are indexed
Don’t conflate accents.
specify a Unicode mapping to be applied when indexing and when query processing. Supported values:
totraditional. (Chinese only.)
How much extra processing is done for umlaut and sz.
0: none. München is indexed as München and Munchen
1: München is indexed as München, Muenchen and Munchen (Dflt)
2: As for 1 but also Muenchen is indexed as München, Muenchen and Munchen
NOTE: As a side-effect to allow for compounds, `SORT_SIGNIF` is increased to `40`
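The difference between levels 0 and 1 can be illustrated with a small sketch. This is not the indexer's implementation: the function names and the expansion table are assumptions, sz handling is not shown, and level 2 (which also maps Muenchen back to München) is not modelled.

```python
import unicodedata

# Assumed expansion table for umlauts only.
UMLAUT_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def strip_accents(word):
    """Remove combining marks, e.g. München -> Munchen."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_forms(word, level=1):
    """Return the forms a word would be indexed under at level 0 or 1."""
    word = unicodedata.normalize("NFC", word)
    forms = [word]
    if level >= 1:
        # Level 1 adds the two-letter expansion (ü -> ue).
        expanded = "".join(UMLAUT_EXPANSIONS.get(ch, ch) for ch in word)
        if expanded not in forms:
            forms.append(expanded)
    # Both levels include the accent-free form.
    plain = strip_accents(word)
    if plain not in forms:
        forms.append(plain)
    return forms


print(index_forms("M\u00fcnchen", 0))
print(index_forms("M\u00fcnchen", 1))
```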
How much extra processing is done for Māori.
0: none. Māori and Mäori are indexed as Māori or Mäori respectively and Maori (Dflt)
1: Māori is indexed as Māori, Maaori and Maori; and Mäori is indexed as Mäori, Māori, Maaori and Maori
Suppress the indexing of bigrams/unigrams in CJKT text. It is assumed that said text has been pre-segmented into words, and that normal word-based indexing is needed.
Activate quick links on default pages of up to depth
<i>. Use internal quick links defaults. (Dflt 0 = Off)
Activate quicklinks. Read quicklinks configuration options from file
When trying to determine document type and charset, the indexer will look up to
<i> chars into the document. (Dflt 20480)
Use the XML parser on all documents.
Store case information in postings. Currently unsupported. Note that setting this reduces the approximate max number of unique terms from ~950M to ~240M.
How many UTF-8 characters in a word are significant. Default 30
Don’t index words or use words in summaries that are longer than what is set by
Controlling metadata indexing
<file> specifies a file defining XML indexing configurations in JSON format.
<file> specifies a file defining metadata mappings for both HTML and XML documents.
Index a special word
$++ at the start and end of each metadata field (on by default).
Do not index a special word
$++ at the start and end of each metadata field.
Which chars are used to separate metadata facet items. (Dflt '|')
Map anchor text in source file to metadata class
<f> is absent, outgoing anchor text is un-fielded content. (dflt
<file> is a file of external metadata. Multiple files may be supplied by setting this option multiple times.
Ignore explicitly specified internal metadata.
Index the name of the data source as metadata in each document and assign to field
Set the name of the data source being indexed.
Sets the maximum number of metadata names or XML paths PADRE will keep track of for counting the most frequent metadata or xpath that could be mapped.
Sets the number of the most frequent metadata names or XML paths PADRE should report on after indexing.
Controlling link and anchor text handling
Don’t extract, record or index anchor text.
.anchors.gz file not processed. No link counts possible.
Extract and record but don’t index anchor text.
.anchors.gz file can be post-processed by annie-a
Don’t canonicalize URLs when storing URLs or matching anchortext.
Controls whether PADRE should canonicalise URLs with fragments (anchors):
on: PADRE drops the anchor symbol (#) and any characters after it.
off: PADRE treats URLs with anchors as unique URLs.
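The effect of the two settings can be sketched with Python's standard library. This illustrates fragment removal in general, not the indexer's own code; the function name canonicalise and the example URLs are hypothetical.

```python
from urllib.parse import urldefrag


def canonicalise(url, drop_fragment=True):
    """Return the URL with its fragment removed when drop_fragment is on."""
    if drop_fragment:
        # urldefrag splits off the '#...' part and returns the remaining URL.
        return urldefrag(url).url
    return url


a = "https://example.com/guide.html#install"
b = "https://example.com/guide.html#usage"

# on: both URLs collapse to the same document URL.
print(canonicalise(a) == canonicalise(b))
# off: the URLs stay distinct.
print(canonicalise(a, drop_fragment=False) == canonicalise(b, drop_fragment=False))
```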
Produce but don’t process the anchors distilled file.
Action to take for nepotistic links:
0: treat the same as other links.
1: ignore links of types greater than
2: limit the number of repetitions of links of types greater than
Ignore nepotistic links of types greater than the limit:
0: unaffiliated links from outside the target domain.
1: links from a different host.
2: links from the same or a closely affiliated host.
3: dynamically generated links from such a host.
Don’t let the low-value link cache grow above 2^i
Don’t index image alt as anchor text when an image is an anchor.
Don’t index image src as anchor text when an image is an anchor.
<f> is a file of source URL patterns from which links should be ignored or treated with suspicion (blacklist).
<f> is a file of SECD (single entity controlled domain) affiliations, e.g.
griffith.edu.au --> gu.edu.au
Links to an affiliated SECD are classified as within-domain.
<f> is a file of CGI parameters which should be removed from source and target URLs. Padre generates a regular expression from the lines in
conf_file the regex is taken from
crawler.remove_parameters in the data source configuration.
<pat> is an acceptable link target pattern. URLs not matching
<pat> will not be stored in the
conf_file, the link pattern will be sourced from the
include_patterns in the data source configuration.
<file> defines an additional anchor text file.
-F but source URLs need not be looked up.
<dir> is a directory in which to look for redirects and duplicates files.
Ignore the main anchors file.
Discard links to URL targets longer than
<n> chars. Default is no limit.
Record targets of failed anchor lookups via stdout.
Enables the creation of phrase terms like
$++ foo bar $++ in the dictionary for metadata. These phrase terms can be used to speed up queries like
a:$++ foo bar $++. Phrases will only be created if indexing of field boundaries is enabled (
-ifb), which it is by default. Disabling may reduce indexing time and index size.
Controlling which index files are generated
Suppress generation of the
Suppress generation of the
Suppress computation of QIC features and
Suppress computation of host features and
Remove superfluous files from the index directory after index has completed.
Setting size limits
How many gscope bytes to allow for. Default: 8, Min: 2.
This setting is no longer required as gscope bits are now auto-sized.
Multiply word table sizes by 2^N from base of 256K. Default table size is 8M (i.e.
Divide word table sizes by 4 from base of 256K (i.e. use 64K).
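The sizes quoted above are consistent with reading the multiplier as a power of two applied to the 256K base. The sketch below checks that arithmetic; the function name scaled_table_size is hypothetical, and the value N = 5 for the default is inferred from the 8M figure rather than stated in the text.

```python
BASE_WORD_TABLE = 256 * 1024  # 256K entries


def scaled_table_size(n):
    """Word table size after multiplying the 256K base by 2 to the power n."""
    return BASE_WORD_TABLE * 2 ** n


# 256K * 2^5 = 8M, matching the stated default table size.
print(scaled_table_size(5) == 8 * 1024 * 1024)
# Dividing the base by 4 gives 64K.
print(BASE_WORD_TABLE // 4 == 64 * 1024)
```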
Set decompression chamber size to
<num> MB. Default 32.
Set maximum characters in description and title fields in
<num>. Default 256.
Set number of bytes to reserve for tags in
<num>. Default 0.
Set maximum characters in summarizable text per doc in
<num>. Default 50000, maximum of 10000000.
Index-writing window will be
<num> MB. Larger windows mean faster indexing at the expense of using more RAM. Default for a 64-bit system is 2800.
Maximum words indexed per document (excluding anchors). By default all words are indexed
Maximum number of documents to index. Others are ignored.
Set the number of bytes used for metadata summary field maximum lengths. Fields larger than this number will be truncated. Default is 2048.
Sets how PADRE should modify the lock string before it is stored:
legacy mode removes some characters, replaces unquoted commas with new lines and removes consecutive new lines.
raw mode stores the lock string as is, up to the first null.
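A rough sketch of the two modes follows, under stated assumptions: double quotes are taken to be the quoting character, "removes consecutive new lines" is read as collapsing runs of new lines to one, and the unspecified character removal of legacy mode is not modelled. The function names are hypothetical.

```python
import re


def raw_lock_string(value):
    """Raw mode: keep the string as is, up to the first null character."""
    return value.split("\0", 1)[0]


def legacy_lock_string(value):
    """Legacy mode (partial): unquoted commas become new lines, runs of new lines collapse."""
    out = []
    in_quotes = False
    for ch in value:
        if ch == '"':
            # Assumption: double quotes toggle the quoted state.
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == "," and not in_quotes:
            out.append("\n")
        else:
            out.append(ch)
    # Collapse consecutive new lines into a single new line.
    return re.sub(r"\n{2,}", "\n", "".join(out))


print(repr(raw_lock_string("group-a,group-b\0ignored")))
print(repr(legacy_lock_string('group-a,,group-b,"team,x"')))
```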
Limit on how full the word hash table can get.
Special indexing modes
flag.) Documents whose URL checksum is identical to that of another document are normally flagged and suppressed from results.
If set, documents known to contain paid ads will be flagged specially (with the
Documents matching the supplied pattern will be flagged as
DOC_MATCHES_REGEX. The presence or absence of this feature can be used in the ranking function, controlled by the
-cool.30 query processor option.
Overlap reading of bundles with processing them.
Assume all input files whose charset is not specified are UTF-8 encoded. (Default is
Assume all input files whose charset is not specified are
Forcibly assume all input files are
When storing document URLs, prepend
<str>. (This is only used if the document does not indicate its own URL with a BASE HREF element, such as in local collections.)
HTTP LastModified date takes priority over metadata dates.
Completely ignore HTTP
Ignore canonical URL declarations in HTML link elements.
Ignore robots noindex meta element.
<str> as start of new doc within bundle. (Not a regular expression.) Note that there is a separate mechanism for XML.
After normal indexing is complete, attempt to build an annotation index (annie) and spelling suggestion index. Default executables are
Allows the explicit specification of a
spelling_indexbuilder to run after
turns off spelling-index building even if
Annotations with fewer than
<i> occurrences will not be considered as spelling suggestions. (dflt 1)
Space saving option for bundled large crawl indexes. Roughly equivalent to:
-nomdsf -big8 -MWIPD2000 -W6000 -SORTSIG16 -nep_action=2 -nep_limit=2 -nep_cachebits=20 -chamb64 -RSTXT2000 -mule128 -noaltanx -nosrcanx -nometa -quiet
A shorter average word length is assumed.
You can add e.g. `-Axxx.com` to cut anchor processing time.
(Don’t forget to make
dupredrex.txt in the index directory.)
<name> is the name of this organisation.
Specify a large temporary file space for use by the indexer.
Hostname/IP of a Redis server where progress status should be written
Port of the Redis server. Default is 6379
Any non-zero value requires every document to have at least one lock. If set to 1, documents without locks will be excluded; if set to greater than 1, indexing will stop.
-security_mindocs=<i>: There must be at least this number of documents with at least one lock.
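The lock rules above amount to a small decision table, sketched below. The function and parameter names are hypothetical (the name of the first option is not given in this text), and the comparison used for the minimum-documents check is an assumption based on the phrase "at least".

```python
def document_action(lock_count, level):
    """Decide what happens to a document given its lock count and the security level."""
    if level == 0 or lock_count > 0:
        return "index"
    # The document has no locks and a non-zero level is set.
    return "exclude" if level == 1 else "stop"


def enough_locked_documents(locked_docs, mindocs):
    """True if at least `mindocs` documents carry one or more locks."""
    return locked_docs >= mindocs


print(document_action(0, 0))  # index
print(document_action(0, 1))  # exclude
print(document_action(0, 2))  # stop
print(document_action(3, 2))  # index
```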
|See also: URL exclusion options in Controlling what is indexed above.|