Built-in filters - Inject No Index tags (InjectNoIndexFilterProvider)

The inject no index tags filter hides content from the search indexer by wrapping specific HTML elements in noindex comments (<!--noindex--> and <!--endnoindex-->). Documents are selected for processing by their URL, and the elements to wrap are designated using Jsoup CSS-like selectors.

To configure this filter, define settings in your data source configuration using the prefix filter.noindex.[N], where [N] is a unique integer that identifies the rule. Each setting contains one URL pattern (a standard Java regular expression) followed by the CSS selectors to apply to matching documents. The URL pattern and the selectors are separated by a space.

  • Ensure that the selected regions are not nested inside one another. Noindex regions do not nest, so content that follows an inner region is indexed even though it sits inside the outer region. See the final example below for further explanation.

  • URL regular expressions containing spaces must have any spaces URL encoded (%20).
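
The rule format described above can be sketched as follows. This is an illustration only, not the filter's actual implementation: the helper name parse_noindex_rule is hypothetical, Python's re module stands in for Java regular expressions (the two agree for simple patterns like these), and the sketch assumes the value is split at the first space.

```python
import re


def parse_noindex_rule(value):
    """Split a filter.noindex.[N] value into (compiled URL pattern, selectors).

    Assumption: the first space separates the URL pattern from the selectors.
    """
    url_pattern, sep, selectors = value.partition(" ")
    if not sep or not selectors.strip():
        raise ValueError("expected '<url-regex> <css-selectors>': %r" % value)
    return re.compile(url_pattern), selectors.strip()


pattern, selectors = parse_noindex_rule(r"example\.com div.navigation,#footer")
print(pattern.pattern, "|", selectors)   # example\.com | div.navigation,#footer

# An unencoded space ends the URL pattern early; the rest leaks into the selectors.
pattern, selectors = parse_noindex_rule(r"/my folder/.* div.hidden")
print(pattern.pattern, "|", selectors)   # /my | folder/.* div.hidden
```

Under this assumption, the second call shows why a space inside the URL pattern has to be written as %20.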

Enabling

To enable the filter, add InjectNoIndexFilterProvider to the filter chain (filter.classes).

A full update of the data source is required for this filter to take effect.

Example

The following lines have been added to the data source configuration:

filter.classes=TikaFilterProvider,ExternalFilterProvider:InjectNoIndexFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
filter.noindex.1=.* header,footer
filter.noindex.2=example\.com div.navigation,#footer
filter.noindex.3=page\?type=resource div.hidden
filter.noindex.4=https://example\.com/.*/long%20name/.* input[type=text]

This configuration will:

  • wrap the <header> and <footer> elements of all documents in noindex comments.

  • wrap every <div class="navigation"> and the element with id="footer" in noindex comments, for all documents with example.com in their URL.

  • wrap every <div class="hidden"> in noindex comments, for all documents named page whose URL has the value resource for the type parameter.

  • wrap all inputs of type text in noindex comments, for all documents on https://example.com/ whose URL contains a folder name with spaces.
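
As a rough sketch of how rules are selected by URL, the snippet below checks the first three rules from the configuration above against two of the example URLs. It is not the filter's own code: the names RULES and selectors_for are hypothetical, Python's re module stands in for Java regular expressions, and it assumes a pattern may match anywhere in the URL, which is what a rule as short as the second one suggests.

```python
import re

# The first three rules from the example configuration: (URL regex, selectors).
RULES = [
    (r".*", "header,footer"),
    (r"example\.com", "div.navigation,#footer"),
    (r"page\?type=resource", "div.hidden"),
]


def selectors_for(url):
    """Return the selectors of every rule whose pattern is found in the URL.

    Assumption: partial matches count (re.search), not only full-URL matches.
    """
    return [selectors for pattern, selectors in RULES if re.search(pattern, url)]


print(selectors_for("http://example.com/home"))
# ['header,footer', 'div.navigation,#footer']
print(selectors_for("http://example.com/path/to/page?type=resource"))
# ['header,footer', 'div.navigation,#footer', 'div.hidden']
```

Several rules can apply to the same document, which is how nested regions can arise when one selected element contains another.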

Example input

http://example.com/home

...
<div class="navigation">
  <p>Navigation lives here.</p>
</div>
...
<footer id="footer">
  <p>Footer lives here.</p>
</footer>
...

http://example.com/path/to/page?type=resource

...
<div class="hidden">
  <p>Secret hidden text lives here.</p>
</div>
...
<span class="hidden special">
  <p>Secret special hidden text lives here too.</p>
</span>
...

https://example.com/path/to/folder/long%20name/page

...
<input type="text" name="example" id="example" />
...

http://example.com/navexample

...
<footer id="footer">
  <div class="navigation">
    <p>Navigation lives here.</p>
  </div>

  <p>Footer lives here.</p>
</footer>
...

Example output

http://example.com/home

...
<!--noindex-->
<div class="navigation">
  <p>Navigation lives here.</p>
</div>
<!--endnoindex-->
...
<!--noindex-->
<footer id="footer">
  <p>Footer lives here.</p>
</footer>
<!--endnoindex-->
...

http://example.com/path/to/page?type=resource

...
<!--noindex-->
<div class="hidden">
  <p>Secret hidden text lives here.</p>
</div>
<!--endnoindex-->
...
<span class="hidden special">
  <p>Secret special hidden text lives here too.</p>
</span>
...

https://example.com/path/to/folder/long%20name/page

...
<!--noindex-->
<input type="text" name="example" id="example" />
<!--endnoindex-->
...

http://example.com/navexample

This example illustrates why nested noindex regions do not work correctly. The noindex and endnoindex comments operate like switches rather than nested pairs: indexing stops when a <!--noindex--> is encountered and resumes at the next <!--endnoindex-->. In this example everything after the first <!--endnoindex--> is indexed, so "Footer lives here." is indexed even though the whole footer was meant to be hidden.
...
<!--noindex-->
<footer id="footer">
<!--noindex-->
  <div class="navigation">
    <p>Navigation lives here.</p>
  </div>
<!--endnoindex-->

  <p>Footer lives here.</p>
</footer>
<!--endnoindex-->
...
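
The switch behaviour described for the nested example can be modelled with a small sketch. This is a deliberately simplified stand-in for an indexer, not the search engine's actual parser: the function name indexed_text is hypothetical, and tags are stripped with a basic regular expression rather than a real HTML parser.

```python
import re


def indexed_text(html):
    """Return the text chunks kept when the comments act as on/off switches."""
    indexing = True
    kept = []
    # Split into noindex comments, other tags, and the text between them.
    for token in re.split(r"(<!--noindex-->|<!--endnoindex-->|<[^>]+>)", html):
        if token == "<!--noindex-->":
            indexing = False          # switch off; no nesting depth is tracked
        elif token == "<!--endnoindex-->":
            indexing = True           # switch back on, even inside an outer region
        elif token.startswith("<"):
            continue                  # ordinary tag, carries no text
        elif indexing and token.strip():
            kept.append(token.strip())
    return kept


nested = """
<!--noindex-->
<footer id="footer">
<!--noindex-->
  <div class="navigation">
    <p>Navigation lives here.</p>
  </div>
<!--endnoindex-->

  <p>Footer lives here.</p>
</footer>
<!--endnoindex-->
"""
print(indexed_text(nested))  # ['Footer lives here.']
```

Because no nesting depth is tracked, the inner <!--endnoindex--> switches indexing back on, and the footer paragraph is kept.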