Plugin: Clone documents

Purpose

Use this plugin when you need to return an HTML document multiple times in the search results.

The main use case for this plugin is to facilitate an events search where you have events that may span multiple dates. Cloning the item for each applicable date allows for an events search to be built with the event being returned for each matching date.

Usage

Enable the plugin

  1. Select Plugins from the side navigation pane and click on the Clone documents tile.

  2. From the Location section, decide if you wish to enable this plugin on a data source or a results page and select the corresponding radio button.

  3. Select the data source or results page on which you would like to enable this plugin from the drop-down menu.

If enabled on a data source, the plugin takes effect once the setup steps are completed and an advanced > full update of the data source has finished. If enabled on a results page, the plugin takes effect as soon as the setup steps are completed.

Configuration settings

The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.

The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source or results page configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.

Include selector

Configuration key

plugin.clone-documents.config.applyTo.selector

Data type

string

Required

This setting is required

Any document containing elements that match this Jsoup selector will be cloned by the plugin. For example, meta[content=events] selects meta tags with a content attribute of events.
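
Because meta[content=events] matches any meta tag whose content is events, whatever its name, a narrower selector may be preferable. The fragment below is a sketch only, written as a key=value pair for readability. It assumes event pages carry the page-type field used in the example later on this page. The # line is an annotation for the reader, not part of the value.

```
# Assumption: event pages contain <meta name="page-type" content="events">
plugin.clone-documents.config.applyTo.selector=meta[name=page-type][content=events]
```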

Clone selector

Configuration key

plugin.clone-documents.config.cloneOn.selector

Data type

string

Required

This setting is required

The number of elements in the document that match this Jsoup selector determines the number of times the document is cloned.

For example, if a date metadata field is repeated with 5 different date values, the document will be cloned 5 times.

Cloned URL suffix

Configuration key

plugin.clone-documents.config.clone.suffix

Data type

string

Required

This setting is optional

Specifies a suffix that will be attached to the end of the URL for the cloned records.

Add metadata field name

Configuration key

plugin.clone-documents.config.add.*.metatag_name

Data type

string

Required

This setting is optional

Specifies a metadata field to add to the cloned records. 'Parameter 1' specifies an ID that must match the 'Parameter 1' of a corresponding 'Add metadata field content' setting.

Add metadata field content

Configuration key

plugin.clone-documents.config.add.*.metatag_content

Data type

string

Required

This setting is optional

Specifies a metadata value to insert into the cloned records. Parameter 1 must be a unique value within your data source configuration.
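
The pairing between the two add keys can be sketched as follows, assuming the * in each key name is where the 'Parameter 1' ID goes. The ID source-tag, the field name record-type and the value cloned-event are hypothetical and chosen only for illustration. The # lines are annotations for the reader.

```
# Hypothetical ID "source-tag" links the field name to its content
plugin.clone-documents.config.add.source-tag.metatag_name=record-type
plugin.clone-documents.config.add.source-tag.metatag_content=cloned-event
```

With a pair like this, each cloned record would be expected to gain a meta tag named record-type with the content cloned-event.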

Remove selector

Configuration key

plugin.clone-documents.config.remove.*.selector

Data type

string

Required

This setting is optional

Elements that match this Jsoup selector will be removed from the cloned pages. 'Parameter 1' is a unique identifier used to enable multiple remove selector fields to be defined.
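
As a sketch of how several remove selectors might sit side by side, the fragment below defines two, each with its own 'Parameter 1' ID in place of the * in the key name. The IDs internal-flag and robots-tag and the selectors themselves are hypothetical. The # lines are annotations for the reader.

```
# Hypothetical IDs; each one must be unique
plugin.clone-documents.config.remove.internal-flag.selector=meta[name="internal-use"]
plugin.clone-documents.config.remove.robots-tag.selector=meta[name="robots"]
```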

Additional configuration settings

The originalUrl metadata class must be added to the summary fields (-SF) option of the query processor options for the plugin to function correctly.

This is done by editing your results page configuration and editing (or adding) the query_processor_options key. Edit (or add) the -SF value to include the originalUrl field.

Configuration key name     Value

query_processor_options    <OTHER QUERY PROCESSOR OPTIONS> -SF=[originalUrl,<OTHER METADATA CLASSES>]

Tracking the original URL

The original URL of the page is added to each cloned document as two additional metadata fields: original-url and fb-original-url.

For example:

<meta name="original-url" content="ORIGINAL-URL">
<meta name="fb-original-url" content="ORIGINAL-URL">

The meta tag fb-original-url is added to the metadata class originalUrl for use within results pages. If you need the original URL for any additional filters, the original-url metadata field should be used.

Canonical URLs

To prevent Funnelback's default handling of canonical links from being applied to the clones, all canonical links in the cloned document are removed during the index phase.

Filter chain configuration

This plugin uses filters which are used to apply transformations to the gathered content.

The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.

Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.

Filter classes

This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugin.clonedocuments.CloneDocumentsStringFilter

Drag the com.funnelback.plugin.clonedocuments.CloneDocumentsStringFilter plugin filter to where you wish it to run in the filter chain sequence.

Examples

Clone event pages

In this example we have HTML event pages that contain metadata outlining the days on which the event is running, with some events spanning multiple dates. We wish to create an events search that shows events that run on specific dates.

Consider the following HTML page with the URL:

http://www.example.com/events/new-event
<html>
    <head>
        <title>Example event page</title>
        <meta name="page-type" content="events" >
        <meta name="event-date" content="2023-01-01">
        <meta name="event-date" content="2023-01-21">
        <meta name="internal-use" content="true">
        ...
    </head>
    ...
</html>

We wish to clone this event page for each occurrence of the event-date metadata field. This can be achieved with the following configuration:

Configuration key name        Parameter 1            Value

Include selector              (none)                 meta[content=events]

Clone selector                (none)                 meta[name="event-date"]

Add metadata field name       collapsing-metadata    collapsing

Add metadata field content    collapsing-metadata    recurring-event

Remove selector               extra-metadata         meta[name="internal-use"]

Cloned URL suffix             (none)                 fb-recurring-event/

In addition, originalUrl is added to the -SF value of the query processor options in the results page configuration.
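
If the plugin is configured manually, the same settings might look like the fragment below. It is a sketch that assumes the * in each wildcard key name is replaced by the 'Parameter 1' value. The key=value layout and the # annotation lines are for readability only, so enter each key name and value in the form your configuration editor expects. The angle-bracket placeholders are kept as-is and stand for whatever options you already have.

```
# Data source or results page configuration (wherever the plugin is enabled)
plugin.clone-documents.config.applyTo.selector=meta[content=events]
plugin.clone-documents.config.cloneOn.selector=meta[name="event-date"]
plugin.clone-documents.config.add.collapsing-metadata.metatag_name=collapsing
plugin.clone-documents.config.add.collapsing-metadata.metatag_content=recurring-event
plugin.clone-documents.config.remove.extra-metadata.selector=meta[name="internal-use"]
plugin.clone-documents.config.clone.suffix=fb-recurring-event/

# Results page configuration
query_processor_options=<OTHER QUERY PROCESSOR OPTIONS> -SF=[originalUrl,<OTHER METADATA CLASSES>]
```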

This results in the following two HTML documents being included in the index.

http://www.example.com/events/new-event/fb-recurring-event/1
<html>
    <head>
        <title>Example event page</title>
        <meta name="page-type" content="events" >
        <meta name="event-date" content="2023-01-01">
        <meta name="collapsing" content="recurring-event">
        <meta name="original-url" content="http://www.example.com/events/new-event">
        <meta name="fb-original-url" content="http://www.example.com/events/new-event">
        ...
    </head>
    ...
</html>
http://www.example.com/events/new-event/fb-recurring-event/2
<html>
    <head>
        <title>Example event page</title>
        <meta name="page-type" content="events" >
        <meta name="event-date" content="2023-01-21">
        <meta name="collapsing" content="recurring-event">
        <meta name="original-url" content="http://www.example.com/events/new-event">
        <meta name="fb-original-url" content="http://www.example.com/events/new-event">
        ...
    </head>
    ...
</html>

When a search is run, the results page returns two records, each with a title of Example event page and with liveUrl and displayUrl set to http://www.example.com/events/new-event.

Change log

[1.1.0]

Changed

  • Updated to the latest version of the plugin framework (Funnelback shared v16.20) to enable integration with the new plugin management dashboard.