Plugin: Exclude by content

Purpose

Use this plugin to exclude documents from the index based on their content. A data source update expects at least one document to be found, so the update will fail if every document is filtered out.

Usage

Enable the plugin

  1. Select Plugins from the side navigation pane and click on the Exclude by content tile.

  2. From the Location section, use the Select a data source select list to choose the data source on which you would like to enable this plugin.

The plugin takes effect once the setup steps are complete and an advanced > full update of the data source has finished.

Configuration settings

The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.

The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.

Exclude selector

Configuration key    plugin.exclude-by-content.config.exclude.element_selector
Data type            string
Required             This setting is required

Defines the element whose content is used as the basis for the exclude pattern match. The selector must be a valid Jsoup selector (e.g. 'book > title') or XPath expression (e.g. '/book/title' or '//title'). Note that Jsoup supports both syntaxes for XML documents, so 'book > title' and '//book/title' are equivalent.

If the selector returns more than one element, the document is filtered if any one of those elements matches.

Exclude pattern

Configuration key    plugin.exclude-by-content.config.exclude.pattern
Data type            string
Required             This setting is required

This key defines the pattern that is compared to the content returned by the exclude selector. If the pattern matches, the document is excluded from the index. The exclude match type setting determines how this value is matched against the source data.

Exclude match type

Configuration key    plugin.exclude-by-content.config.exclude.match_type
Data type            string
Default value        EXACT_MATCH
Allowed values       EXACT_MATCH, CONTAINS, REGEX
Required             This setting is required

This controls how the exclude pattern is matched against the source data. Non-regex match types are case-insensitive.

Available choices are:

  • EXACT_MATCH: The exclude pattern must match all the content of the field value. This match is case-insensitive.

  • CONTAINS: The exclude pattern must be present in the content of the field value. This match is case-insensitive.

  • REGEX: The exclude pattern is a Java regular expression applied to the content of the field value. This match is case-sensitive unless you include the case-insensitive flag (?i) in your regular expression.
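
The three match types can be sketched in a few lines of Python. This is an illustration of the rules described above, not the plugin's implementation: the function name pattern_matches is hypothetical, Python's re module stands in for Java regular expressions, and the sketch assumes a regular expression may match anywhere in the value (the plugin may instead require the whole value to match).

```python
import re


def pattern_matches(pattern: str, value: str, match_type: str = "EXACT_MATCH") -> bool:
    """Return True if value satisfies pattern under the given match type."""
    if match_type == "EXACT_MATCH":
        # Case-insensitive comparison of the whole value.
        return pattern.casefold() == value.casefold()
    if match_type == "CONTAINS":
        # Case-insensitive substring test.
        return pattern.casefold() in value.casefold()
    if match_type == "REGEX":
        # Assumption: the expression may match anywhere in the value.
        # Case-sensitive unless the pattern itself carries (?i).
        return re.search(pattern, value) is not None
    raise ValueError(f"Unknown match type: {match_type}")


print(pattern_matches("everyday english", "Everyday English"))         # True
print(pattern_matches("BACK", "Funnelback Introduction", "CONTAINS"))  # True
print(pattern_matches("everyday", "Everyday English", "REGEX"))        # False
print(pattern_matches("(?i)everyday", "Everyday English", "REGEX"))    # True
```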

Exclude match source

Configuration key    plugin.exclude-by-content.config.exclude.match_source
Data type            string
Default value        CONTENT
Allowed values       CONTENT, ATTRIBUTE
Required             This setting is required

This controls whether the pattern is matched against the element content or an attribute value.

Available choices are:

  • CONTENT: The match will be compared to the element content.

  • ATTRIBUTE: The match will be compared to the element attribute value defined in the 'Exclude match source attribute' key.

Exclude match source attribute

Configuration key    plugin.exclude-by-content.config.exclude.match_source_attribute
Data type            string
Required             This setting is optional

This specifies the attribute name that should be used for comparison when the 'Exclude match source' is set to ATTRIBUTE.
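
As a rough illustration of the difference between the two match sources, the following Python sketch picks the value that would be compared against the exclude pattern. The helper name match_source_value is hypothetical, the standard library's ElementTree stands in for Jsoup, and treating CONTENT as all text inside the element is an assumption.

```python
import xml.etree.ElementTree as ET


def match_source_value(element, match_source="CONTENT", attribute=None):
    """Return the string that the exclude pattern would be compared against."""
    if match_source == "CONTENT":
        # Assumption: content means all text found inside the element.
        return "".join(element.itertext())
    if match_source == "ATTRIBUTE":
        if not attribute:
            raise ValueError("ATTRIBUTE needs a match source attribute name")
        # Returns None when the element does not carry the attribute.
        return element.get(attribute)
    raise ValueError(f"Unknown match source: {match_source}")


book = ET.fromstring('<book id="bk102"><title>Midnight Rain</title></book>')

print(match_source_value(book.find("title")))       # Midnight Rain
print(match_source_value(book, "ATTRIBUTE", "id"))  # bk102
```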

Negate the match?

Configuration key    plugin.exclude-by-content.config.exclude.negate_match
Data type            boolean
Default value        false
Required             This setting is optional

If set to true, the match result is inverted. For example, with the EXACT_MATCH match type, the document is excluded when the selected content does not exactly match the exclude pattern.

Filter chain configuration

This plugin uses filters, which apply transformations to the gathered content.

The filters run in sequence and need to be set in an order that makes sense. The plugin supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.

Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.

Filter classes

This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugin.excludebycontent.ExcludeByContentStringFilter

Drag the com.funnelback.plugin.excludebycontent.ExcludeByContentStringFilter plugin filter to where you wish it to run in the filter chain sequence.

Examples

XML examples

Given the following XML document:

<?xml version="1.0" encoding="UTF-8" ?>
<bookstore>
    <book>
        <title>Everyday English</title>
    </book>
    <book>
        <title>Funnelback Introduction</title>
    </book>
</bookstore>

Each of the following configuration examples will exclude the example XML document.

Example: Exact match

Configuration key name    Value
Exclude selector          //book/title
Exclude pattern           Everyday English
Exclude match source      CONTENT
Exclude match type        EXACT_MATCH
Negate the match?         false

Example: Contains

Configuration key name    Value
Exclude selector          //book/title
Exclude pattern           back
Exclude match source      CONTENT
Exclude match type        CONTAINS
Negate the match?         false

Example: regular expression

Configuration key name    Value
Exclude selector          //book/title
Exclude pattern           (?i)^every(\s*)day(\s*)english$
Exclude match source      CONTENT
Exclude match type        REGEX
Negate the match?         false
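
To see which titles the expression above accepts, it can be tried outside the plugin. The sketch below uses Python's re module as a stand-in for Java regular expressions (the two agree for this particular pattern); the helper name title_matches and the extra sample titles are made up for illustration.

```python
import re

# The exclude pattern from the example, unchanged.
TITLE_PATTERN = re.compile(r"(?i)^every(\s*)day(\s*)english$")


def title_matches(title: str) -> bool:
    """Return True if the title would satisfy the example pattern."""
    return TITLE_PATTERN.search(title) is not None


print(title_matches("Everyday English"))          # True
print(title_matches("EVERY DAY  ENGLISH"))        # True
print(title_matches("Everyday English Grammar"))  # False
print(title_matches("Funnelback Introduction"))   # False
```

The (?i) flag makes the match case-insensitive, the \s* groups allow optional whitespace between the words, and the ^ and $ anchors stop the pattern from matching longer titles.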

The whole XML document is excluded as long as the rule matches any one of the elements returned by the selector.

For example, the following configuration will exclude the example document:

Configuration key name    Value
Exclude selector          //book/title
Exclude pattern           Introduction
Exclude match source      CONTENT
Exclude match type        CONTAINS
Negate the match?         true

The title of the first record, Everyday English, does not contain Introduction, so the negated rule matches and the whole document is filtered, even though the title of the second record does contain Introduction.

If you wish for each <book> to be separately processed then you should split your XML document using the split XML or HTML plugin before applying this plugin.
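
The behaviour of the negated example can be reproduced with a short Python sketch. It is only a model of the rule as described in this section: the function name is_excluded is hypothetical, ElementTree paths stand in for the plugin's selectors, and negation is applied to each selected element in turn, which is what the explanation above implies.

```python
import xml.etree.ElementTree as ET

BOOKSTORE = """<bookstore>
    <book><title>Everyday English</title></book>
    <book><title>Funnelback Introduction</title></book>
</bookstore>"""


def is_excluded(xml_text, path, pattern, negate=False):
    """CONTAINS match against every selected element; any hit excludes the document."""
    root = ET.fromstring(xml_text)
    needle = pattern.casefold()
    for element in root.iterfind(path):
        matched = needle in "".join(element.itertext()).casefold()
        # With negate set, an element that does NOT contain the pattern counts as a hit.
        if matched != negate:
            return True
    return False


# Negated rule: "Everyday English" lacks "Introduction", so the document is excluded.
print(is_excluded(BOOKSTORE, "./book/title", "Introduction", negate=True))  # True
# Negated rule with a pattern every title contains: nothing is excluded.
print(is_excluded(BOOKSTORE, "./book/title", "e", negate=True))             # False
```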

Example: Multi-record XML document with attributes

Consider the following XML document.

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in
      detail, with attention to XML DOM interfaces, XSLT processing,
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are
      integrated into a comprehensive development
      environment.</description>
   </book>
</catalog>

We wish to index the book records of this XML as separate documents, but exclude the bk103 and bk109 records.

To achieve this you first need to enable and configure the Split XML or HTML plugin to split your XML:

Configuration key name                       Value
Default XPath for splitting XML documents    /catalog/book

You can then enable and configure the exclude by content plugin (this plugin) with the following:

Configuration key name            Value
Exclude selector                  /book
Exclude pattern                   bk103|bk109
Exclude match source              ATTRIBUTE
Exclude match source attribute    id
Exclude match type                REGEX
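
The combined effect of splitting and then excluding by attribute can be approximated in Python. This is a sketch under stated assumptions rather than the plugins' actual behaviour: the catalog is trimmed to four records, findall("book") stands in for splitting on /catalog/book, the function name split_and_filter is hypothetical, and the regular expression is assumed to match anywhere in the attribute value.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed version of the catalog above, keeping only what the rule inspects.
CATALOG = """<catalog>
   <book id="bk102"><title>Midnight Rain</title></book>
   <book id="bk103"><title>Maeve Ascendant</title></book>
   <book id="bk109"><title>Paradox Lost</title></book>
   <book id="bk110"><title>Microsoft .NET: The Programming Bible</title></book>
</catalog>"""

EXCLUDE_IDS = re.compile(r"bk103|bk109")


def split_and_filter(xml_text):
    """Split the catalog into book records and drop those whose id matches."""
    catalog = ET.fromstring(xml_text)
    kept = []
    for book in catalog.findall("book"):  # one record per /catalog/book
        if EXCLUDE_IDS.search(book.get("id", "")):
            continue  # excluded by the id attribute
        kept.append(book)
    return kept


kept_ids = [book.get("id") for book in split_and_filter(CATALOG)]
print(kept_ids)  # ['bk102', 'bk110']
```

If the expression is matched anywhere in the value, as assumed here, an unanchored pattern such as bk103|bk109 would also match a longer id like bk1030. Writing it as ^(bk103|bk109)$ avoids that.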

HTML examples

Given the following example HTML document:

<!DOCTYPE html>
<html>
    <head>
        <title>Example html document</title>
        <meta name="status" content="expired" />
    </head>
    <body>
        <h1>Example doc</h1>
        <p>Doc content</p>
    </body>
</html>

Each of the following examples will exclude the example HTML document:

Example: Tag contains content

Exclude pages containing the word example in an <h1> element.

Configuration key name    Value
Exclude selector          h1
Exclude pattern           Example
Exclude match source      CONTENT
Exclude match type        CONTAINS
Negate the match?         false

Example: Metadata field value is not equal to

Exclude items where the metadata status is not set to active.

i.e. documents containing a <meta name="status"> field where the content attribute is not set to active.

Configuration key name            Value
Exclude selector                  meta[name=status]
Exclude pattern                   active
Exclude match source              ATTRIBUTE
Exclude match source attribute    content
Negate the match?                 true
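
No match type is listed in this example, so the default EXACT_MATCH applies. The following Python sketch models the rule using only the standard library; the class MetaStatusParser and the function is_excluded are hypothetical names, and html.parser stands in for the Jsoup selector meta[name=status].

```python
from html.parser import HTMLParser


class MetaStatusParser(HTMLParser):
    """Collect the content attribute of every <meta name="status"> element."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "status":
            self.values.append(attributes.get("content") or "")


def is_excluded(html_text, pattern="active", negate=True):
    """Case-insensitive exact match on the content attribute, then negated."""
    parser = MetaStatusParser()
    parser.feed(html_text)
    return any(
        (value.casefold() == pattern.casefold()) != negate
        for value in parser.values
    )


expired = '<html><head><meta name="status" content="expired" /></head></html>'
active = '<html><head><meta name="status" content="Active" /></head></html>'

print(is_excluded(expired))  # True: status is not active, so the page is excluded
print(is_excluded(active))   # False: status is active, so the page is kept
```

In this sketch a page with no status element yields no values and is therefore kept; how the plugin treats that case is not stated here.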

Change log