Optional regular expression to specify content that should not be indexed.
Can be set in: collection.cfg
This parameter defines an optional regular expression that specifies content that should not be indexed. If it is defined and non-empty, the web crawler uses the expression to insert noindex tags around matching content in the copy of the document source that the crawler stores. This has the following effects:
- The content that matches the expression will be ignored when deciding if two files are duplicates based on their extracted text during a web crawl. This can be used to exclude dynamic content on a page which may hinder duplicate detection.
- The PADRE indexer will take note of the noindex directives (see "controlling indexable content in PADRE" for details).
Any links inside the matching content will still be extracted and followed during the crawl (assuming they pass the include/exclude rules).
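As a rough illustration of the wrapping step described above, the following Python sketch surrounds each match with noindex markers. The marker strings, the helper name wrap_noindex, the DOTALL flag and the use of Python's re module are all assumptions made for illustration; the crawler's actual implementation and regular expression engine may differ.

```python
import re

# Assumed marker strings, for illustration only.
NOINDEX_OPEN = "<!--noindex-->"
NOINDEX_CLOSE = "<!--endnoindex-->"


def wrap_noindex(html: str, expression: str) -> str:
    """Wrap every match of `expression` in noindex markers (hypothetical helper)."""
    if not expression:
        # An undefined or empty expression leaves the document untouched.
        return html
    # DOTALL lets '.' span line breaks; whether the crawler does this is an assumption.
    pattern = re.compile(expression, re.DOTALL)
    return pattern.sub(
        lambda m: f"{NOINDEX_OPEN}{m.group(0)}{NOINDEX_CLOSE}", html
    )


page = '<p>Body</p><div id="footer"><a href="/about">About</a></div>'
wrapped = wrap_noindex(page, r'<div id="footer".*?</div>')
print(wrapped)
```

In this sketch the link inside the footer is still present in the wrapped source, which is consistent with links in matching content remaining available for extraction.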
Ignore some "breadcrumb" navigation elements in a page:
Ignore an HTML footer:
Ignore a set of different <div> elements using a single regular expression:
noindex_expression=(<div id=\"nav.*?<\/div>|<div id=\"skip.*?<\/div>|<div id=\"header.*?<\/div>)
The expression above uses the | character as an OR (alternation) operator, i.e. it matches (pattern 1 | pattern 2 | pattern 3).
You will probably want to use a non-greedy match (see the ? following .* in the pattern above) to ensure that the regular expression doesn't match, and therefore exclude from indexing, more than you need.
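The difference between greedy and non-greedy matching, and the effect of alternation, can be checked with a small Python sketch. The sample HTML is invented for illustration, and Python's re module is used here as a stand-in; the crawler's regular expression engine may behave differently in edge cases.

```python
import re

# Invented sample page: two <div> elements to ignore around real content.
html = (
    '<div id="nav">Menu</div>'
    '<p>Main content</p>'
    '<div id="header">Banner</div>'
)

greedy = r'<div id="nav.*</div>'       # .* runs to the LAST </div>
non_greedy = r'<div id="nav.*?</div>'  # .*? stops at the FIRST </div>

greedy_match = re.search(greedy, html).group(0)
non_greedy_match = re.search(non_greedy, html).group(0)

print("greedy:    ", greedy_match)
print("non-greedy:", non_greedy_match)

# Alternation: one expression covering several <div> ids.
combined = r'(<div id="nav.*?</div>|<div id="skip.*?</div>|<div id="header.*?</div>)'
combined_matches = re.findall(combined, html)
print(combined_matches)
```

Here the greedy pattern swallows the main content along with both <div> elements, while the non-greedy pattern matches only the navigation element. Note that a non-greedy match ends at the first closing </div> it finds, so a <div> nested inside the target element would end the match early.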
Note: You should test that the expression you have written does not have a performance impact on the crawl. For example, if some of your content has badly formed HTML, the regular expression may match more than you need, or may cause the parser to time out so that the page is not downloaded at all. If the latter occurs you may see "Parser timed out" messages in the url_errors.log file in the collection's log directory.
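One way to get an early warning before running a crawl is to try the expression against saved sample pages. The sketch below is a minimal Python approximation: the helper name time_expression and the sample pages are hypothetical, and timings from Python's re module only hint at how the crawler's own parser will behave.

```python
import re
import time


def time_expression(expression: str, html: str) -> tuple[int, float]:
    """Return (characters matched, seconds taken) for one pass over html."""
    pattern = re.compile(expression, re.DOTALL)
    start = time.perf_counter()
    matched = sum(len(m.group(0)) for m in pattern.finditer(html))
    elapsed = time.perf_counter() - start
    return matched, elapsed


filler = "<p>text</p>" * 1000
# Well-formed sample: the footer <div> is closed straight away.
well_formed = '<div id="footer">Copyright</div>' + filler
# Badly formed sample: the footer <div> is never closed, so the match
# runs on until the closing tag of an unrelated element.
broken = '<div id="footer">Copyright' + filler + '<div id="main">Important</div>'

expression = r'<div id="footer.*?</div>'
for label, sample in (("well formed", well_formed), ("unclosed div", broken)):
    matched, elapsed = time_expression(expression, sample)
    print(f"{label}: matched {matched} of {len(sample)} chars in {elapsed:.6f}s")
```

With the badly formed sample the match covers the whole page, including the unrelated element at the end, which is the "matches more than you need" case described in the note.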