Plugin: Exclude by content
Purpose
Use this plugin to exclude a document based on its content. A data source update expects at least one document to be found, so the update will fail if every document is filtered out.
Usage
Enable the plugin
- Select Plugins from the side navigation pane and click on the Exclude by content tile.
- From the Location section, select the data source for which you would like to enable this plugin from the Select a data source select list.
The plugin will take effect after the setup steps and an advanced > full update of the data source have completed.
Configuration settings
The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.
The configuration key names below are only used if you are configuring this plugin manually. The keys are set in the data source configuration. When setting them manually, type in (or copy and paste) the key name and value.
Exclude selector
Configuration key | |
Data type | string |
Required | This setting is required |
Defines the element whose content is used as the basis of the exclude pattern match. The selector must be a valid Jsoup selector (e.g. 'book > title') or XPath (e.g. '/book/title' or '//title'). Note that Jsoup supports both syntaxes for XML documents, so 'book > title' and '//book/title' are equivalent.
If the selector returns more than one element, the document is filtered out if any of those elements matches the requirement.
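The "any element" rule can be illustrated with a minimal Python sketch. It uses the standard library's ElementTree and its limited XPath support as a stand-in for Jsoup, so the selector syntax and the function name `any_element_matches` are assumptions made for illustration, not the plugin's implementation.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
  <book><title>Everyday English</title></book>
  <book><title>Funnelback Introduction</title></book>
</bookstore>"""


def any_element_matches(xml_text, selector, pattern):
    """Return True if any selected element's text equals the pattern (case-insensitive)."""
    root = ET.fromstring(xml_text)
    # ElementTree paths are relative to the root element, so './/title'
    # plays the role of the '//title' XPath mentioned above.
    elements = root.findall(selector)
    return any((el.text or "").strip().lower() == pattern.lower() for el in elements)


# Two <title> elements are selected; one match is enough to exclude the document.
print(any_element_matches(XML, ".//title", "everyday english"))  # True
print(any_element_matches(XML, ".//title", "Missing Title"))     # False
```

Because a single matching element is enough, a selector that returns many elements makes exclusion more likely, not less.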
Exclude pattern
Configuration key | |
Data type | string |
Required | This setting is required |
This key defines the pattern that is compared to the exclude selector content. If the pattern matches, the document is excluded from the index. The exclude match type setting determines how this value is matched against the source data.
Exclude match type
Configuration key | |
Data type | string |
Default value | |
Allowed values | EXACT_MATCH,CONTAINS,REGEX |
Required | This setting is required |
This controls how the exclude pattern is matched against the source data. Non-regex match types are case-insensitive.
Available choices are:
- EXACT_MATCH: The exclude pattern must match all the content of the field value. This match is case-insensitive.
- CONTAINS: The exclude pattern must be present in the content of the field value. This match is case-insensitive.
- REGEX_MATCH: The exclude pattern is a Java regular expression applied to the content of the field value. This match is case-sensitive unless you include the case-insensitive flag (?i) in your regular expression.
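The three match types can be sketched in Python as follows. This is an illustration under stated assumptions rather than the plugin's code: Python's `re` module stands in for Java regular expressions, the regex is assumed to match anywhere in the content, and the helper name `pattern_matches` is hypothetical. Both `REGEX` and `REGEX_MATCH` are accepted because both spellings appear in the text above.

```python
import re


def pattern_matches(content, pattern, match_type):
    """Return True if `content` satisfies `pattern` under the given match type."""
    if match_type == "EXACT_MATCH":
        # The whole value must equal the pattern, ignoring case.
        return content.lower() == pattern.lower()
    if match_type == "CONTAINS":
        # The pattern must appear somewhere in the value, ignoring case.
        return pattern.lower() in content.lower()
    if match_type in ("REGEX", "REGEX_MATCH"):
        # Assumption: the regex may match anywhere in the content.
        # Case-sensitive unless the pattern starts with (?i).
        return re.search(pattern, content) is not None
    raise ValueError(f"Unknown match type: {match_type}")


print(pattern_matches("Everyday English", "everyday english", "EXACT_MATCH"))  # True
print(pattern_matches("Everyday English", "english", "CONTAINS"))              # True
print(pattern_matches("Everyday English", "^everyday", "REGEX"))               # False
print(pattern_matches("Everyday English", "(?i)^everyday", "REGEX"))           # True
```

The last two calls show the practical difference: the same regular expression only matches once the (?i) flag is added.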
Exclude match source
Configuration key | |
Data type | string |
Default value | |
Allowed values | CONTENT,ATTRIBUTE |
Required | This setting is required |
This controls whether the pattern is matched against the element content or an attribute value.
Available choices are:
- CONTENT: The pattern is compared to the element content.
- ATTRIBUTE: The pattern is compared to the value of the element attribute defined in the 'Exclude match source attribute' key.
Exclude match source attribute
Configuration key | |
Data type | string |
Required | This setting is optional |
This specifies the name of the attribute that is used for comparison when 'Exclude match source' is set to ATTRIBUTE.
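A short Python sketch of how the two match sources differ. It uses ElementTree in place of Jsoup, and the helper name `match_source_value` is hypothetical; it only shows which string would be handed to the pattern match in each case.

```python
import xml.etree.ElementTree as ET


def match_source_value(element, match_source, attribute_name=None):
    """Return the string that the exclude pattern would be compared against."""
    if match_source == "CONTENT":
        # Use the text inside the element (including nested text).
        return "".join(element.itertext()).strip()
    if match_source == "ATTRIBUTE":
        if not attribute_name:
            raise ValueError("ATTRIBUTE requires an attribute name")
        # Use the named attribute's value; empty string if it is absent.
        return element.get(attribute_name, "")
    raise ValueError(f"Unknown match source: {match_source}")


book = ET.fromstring('<book id="bk102"><title>Midnight Rain</title></book>')
print(match_source_value(book.find("title"), "CONTENT"))  # Midnight Rain
print(match_source_value(book, "ATTRIBUTE", "id"))        # bk102
```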
Negate the match?
Configuration key | |
Data type | boolean |
Default value | |
Required | This setting is optional |
If set to true, the match result is inverted. For example, if you have chosen an exact match type, the document is excluded when the content does not exactly match the exclude pattern.
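The effect of negation on an exact match can be sketched as below. The function name `should_exclude` and its parameters are hypothetical; the sketch only models the inversion described above.

```python
def should_exclude(content, pattern, negate=False):
    """Exact, case-insensitive match whose result is inverted when negate is True."""
    matched = content.lower() == pattern.lower()
    # With negate=True the document is excluded when the match FAILS.
    return (not matched) if negate else matched


print(should_exclude("active", "active"))                # True
print(should_exclude("active", "active", negate=True))   # False
print(should_exclude("expired", "active", negate=True))  # True
```

Negation is useful for "keep only" rules: a pattern of `active` with negation excludes everything whose value is not `active`.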
Filter chain configuration
This plugin uses filters to apply transformations to the gathered content.
The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.
Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.
Filter classes
This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugin.excludebycontent.ExcludeByContentStringFilter
Drag the com.funnelback.plugin.excludebycontent.ExcludeByContentStringFilter plugin filter to where you wish it to run in the filter chain sequence.
Examples
XML examples
Given the following XML document:
<?xml version="1.0" encoding="UTF-8" ?>
<bookstore>
<book>
<title>Everyday English</title>
</book>
<book>
<title>Funnelback Introduction</title>
</book>
</bookstore>
Each of the following configuration examples will exclude the example XML record.
Example: Exact match
Configuration key name | Value |
---|---|
Exclude selector | |
Exclude pattern | |
Exclude match source | |
Exclude match type | |
Negate the match? | |
Example: Contains
Configuration key name | Value |
---|---|
Exclude selector | |
Exclude pattern | |
Exclude match source | |
Exclude match type | |
Negate the match? | |
Example: regular expression
Configuration key name | Value |
---|---|
Exclude selector | |
Exclude pattern | |
Exclude match source | |
Exclude match type | |
Negate the match? | |
The filter will be applied to your XML document as long as the rule matches any part of your record. For example, the following configuration will remove the result.
As the first record - If you wish for each |
Example: Multi-record XML document with attributes
Consider the following XML document.
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
</book>
</catalog>
We wish to index the book records of this XML as separate documents, but exclude the bk103 and bk109 records.
To achieve this you first need to enable and configure the Split XML or HTML plugin to split your XML:
Configuration key name | Value |
---|---|
Default XPath for splitting XML documents | |
You can then enable and configure the exclude by content plugin (this plugin) with the following:
Configuration key name | Value |
---|---|
Exclude selector | |
Exclude pattern | |
Exclude match source | |
Exclude match source attribute | |
Exclude match type | |
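The combined effect of splitting and then excluding by attribute can be sketched in Python. The values used here are illustrative assumptions, not the documented settings: records are split on each `book` element, the `id` attribute is the match source, and the regular expression `^(bk103|bk109)$` is the exclude pattern. The catalog is shortened to four records and the function name `kept_book_ids` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

CATALOG = """<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
  <book id="bk103"><title>Maeve Ascendant</title></book>
  <book id="bk109"><title>Paradox Lost</title></book>
  <book id="bk110"><title>Microsoft .NET: The Programming Bible</title></book>
</catalog>"""

# Assumed exclude pattern: matches exactly the two ids to be dropped.
EXCLUDE_PATTERN = r"^(bk103|bk109)$"


def kept_book_ids(xml_text):
    """Split the catalog into book records and drop those whose id matches."""
    root = ET.fromstring(xml_text)
    kept = []
    for book in root.findall("book"):  # one record per <book> element
        book_id = book.get("id", "")   # ATTRIBUTE match source, attribute 'id'
        if re.search(EXCLUDE_PATTERN, book_id):
            continue                   # excluded from the index
        kept.append(book_id)
    return kept


print(kept_book_ids(CATALOG))  # ['bk101', 'bk110']
```

Splitting first matters: without it the whole catalog is one document, and a single matching book would exclude every record.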
HTML examples
Given the following example HTML document:
<!DOCTYPE html>
<html>
<head>
<title>Example html document</title>
<meta name="status" content="expired" />
</head>
<body>
<h1>Example doc</h1>
<p>Doc content</p>
</body>
</html>
The following examples will delete the HTML record:
Example: Tag contains content
Exclude pages containing the word example in an <h1> element.
Configuration key name | Value |
---|---|
Exclude selector | |
Exclude pattern | |
Exclude match source | |
Exclude match type | |
Negate the match? | |
Example: Metadata field value is not equal to
Exclude items where the metadata status is not set to active, i.e. documents containing a <meta name="status"> field where the content attribute is not set to active.
Configuration key name | Value |
---|---|
Exclude selector | |
Exclude pattern | |
Exclude match source | |
Exclude match source attribute | |
Negate the match? | |
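The "not equal to" rule can be sketched in Python with the standard library's `html.parser`. This is an illustration, not the plugin's implementation: the class and function names are hypothetical, and because the text does not say how documents without a status field are treated, the sketch assumes they are kept.

```python
from html.parser import HTMLParser

HTML_DOC = """<!DOCTYPE html>
<html>
<head>
<title>Example html document</title>
<meta name="status" content="expired" />
</head>
<body><h1>Example doc</h1><p>Doc content</p></body>
</html>"""


class MetaContentParser(HTMLParser):
    """Collect the content attribute of every <meta> tag with a given name."""

    def __init__(self, meta_name):
        super().__init__()
        self.meta_name = meta_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("name") == self.meta_name:
            self.values.append(attr_map.get("content") or "")


def exclude_unless_status(html_text, wanted="active"):
    """Negated exact match: exclude when the status value is not `wanted`."""
    parser = MetaContentParser("status")
    parser.feed(html_text)
    if not parser.values:
        return False  # assumption: no status field means the document is kept
    matched = any(value.lower() == wanted.lower() for value in parser.values)
    return not matched  # negation inverts the exact-match result


print(exclude_unless_status(HTML_DOC))                                # True
print(exclude_unless_status(HTML_DOC.replace("expired", "active")))  # False
```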