Plugin: Clone documents

Purpose

Use this plugin when you need to return an HTML document multiple times in the search results.

The main use case for this plugin is an events search where events may span multiple dates. Cloning the document once for each applicable date allows the event to be returned for every date it matches.

Usage

Enable the plugin

Select Plugins from the side navigation pane and click on the Clone documents tile.
From the Location section, choose whether to enable this plugin on a data source or a results page and select the corresponding radio button.
Select the data source or results page on which you would like to enable this plugin from the drop-down menu.

If enabled on a data source, the plugin will take effect as soon as the setup steps are completed and an advanced > full update of the data source has completed. If enabled on a results page, the plugin will take effect as soon as the setup steps are completed.

Configuration settings

The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.

The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source or results page configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.

Choose how your document will be cloned

Configuration key	plugin.clone-documents.config.cloneMode
Data type	string
Default value	Repeated fields
Allowed values	Repeated fields, Repeated values in a field
Required	This setting is required

This option controls whether the document is cloned based on a repeated field within your document, or repeated values within a specified field.

Include selector

Configuration key	plugin.clone-documents.config.applyTo.selector
Data type	string
Required	This setting is required

Any document containing elements that match this Jsoup selector will be cloned by the plugin. For example, meta[content=events] selects meta tags with a content attribute of events.

Multi-value element delimiter

Configuration key	plugin.clone-documents.config.elementDelimiter
Data type	string
Default value	|
Required	This setting is optional

Specifies a delimiter to split the selected element content on if the element contains multiple values.
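
The splitting behaviour described above can be illustrated with a minimal Python sketch. This is not the plugin's implementation: the function name split_values is hypothetical, and trimming whitespace around each value is an assumption based on the example values shown elsewhere on this page.

```python
def split_values(content: str, delimiter: str = "|") -> list[str]:
    """Split a multi-value field on the delimiter.

    Assumption: surrounding whitespace is trimmed and empty values are dropped.
    """
    return [part.strip() for part in content.split(delimiter) if part.strip()]


# A single event-dates field holding two dates yields two values,
# so the document would be cloned twice.
print(split_values("20230101 | 20230121"))
```

Each value returned corresponds to one clone of the document when the Repeated values in a field mode is used.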

Clone selector

Configuration key	plugin.clone-documents.config.cloneOn.selector
Data type	string
Required	This setting is required

The number of elements in the document that match this Jsoup selector determines the number of times that the document is cloned.

For example, if you have a date metadata field repeated with 5 different date values, the document will be cloned 5 times.
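
As a rough illustration of how the match count drives the number of clones, the following Python sketch counts meta tags with a given name. It uses Python's standard html.parser as a stand-in for Jsoup selector matching, and the names MetaValueCollector and matching_values are hypothetical.

```python
from html.parser import HTMLParser


class MetaValueCollector(HTMLParser):
    """Collect the content of every <meta> tag with a given name attribute."""

    def __init__(self, meta_name: str):
        super().__init__()
        self.meta_name = meta_name
        self.values: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == self.meta_name:
            self.values.append(attributes.get("content") or "")


def matching_values(html: str, meta_name: str) -> list[str]:
    """Return one value per matching element; one clone is made per value."""
    collector = MetaValueCollector(meta_name)
    collector.feed(html)
    return collector.values


page = """
<html><head>
  <meta name="page-type" content="events">
  <meta name="event-dates" content="2023-01-01">
  <meta name="event-dates" content="2023-01-21">
</head></html>
"""

values = matching_values(page, "event-dates")
print(len(values), values)
```

Here two elements match, which corresponds to the document being cloned twice.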

Cloned URL suffix

Configuration key	plugin.clone-documents.config.clone.suffix
Data type	string
Required	This setting is optional

Specifies a suffix that will be attached to the end of the URL for the cloned records.
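
The following Python sketch shows how a cloned URL might be assembled. The pattern (original URL, a slash, the suffix, then a 1-based clone number) is inferred from the cloned page output example on this page and is an assumption, not a description of the plugin's code. The function name cloned_url is hypothetical, and URLs containing a query string are deliberately not handled.

```python
def cloned_url(original_url: str, suffix: str, clone_number: int) -> str:
    """Build a clone URL as <original>/<suffix><n>.

    Assumption: the suffix is joined to the original URL with a single slash
    and followed by a 1-based clone number. Query strings are not handled.
    """
    return f"{original_url.rstrip('/')}/{suffix}{clone_number}"


for n in (1, 2):
    print(cloned_url("http://www.example.com/events/new-event", "fb-recurring-event/", n))
```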

Add metadata field name

Configuration key	plugin.clone-documents.config.add.*.metatag_name
Data type	string
Required	This setting is optional

Specifies the name of a metadata field to add to the cloned records. 'Parameter 1' specifies an ID that must match the ID of the corresponding 'Add metadata field content' setting.

Add metadata field content

Configuration key	plugin.clone-documents.config.add.*.metatag_content
Data type	string
Required	This setting is optional

Specifies a metadata value to insert into the cloned records. Parameter 1 must be a unique value within your data source configuration.

Remove selector

Configuration key	plugin.clone-documents.config.remove.*.selector
Data type	string
Required	This setting is optional

Elements that match this Jsoup selector will be removed from the cloned pages. 'Parameter 1' is a unique identifier that allows multiple remove selectors to be defined.
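
When configuring the plugin manually, the keys containing a * need a concrete key name. The Python sketch below assumes that the * is replaced with the 'Parameter 1' value; that substitution rule, and the helper name wildcard_key, are assumptions made for illustration. The parameter IDs and values are borrowed from the example later on this page.

```python
def wildcard_key(template: str, parameter_1: str) -> str:
    """Replace the first * in a key template with the Parameter 1 value (assumed rule)."""
    return template.replace("*", parameter_1, 1)


settings = {
    wildcard_key("plugin.clone-documents.config.add.*.metatag_name", "collapsing-metadata"): "collapsing",
    wildcard_key("plugin.clone-documents.config.add.*.metatag_content", "collapsing-metadata"): "recurring-event",
    wildcard_key("plugin.clone-documents.config.remove.*.selector", "extra-metadata"): 'meta[name="internal-use"]',
}

for key, value in settings.items():
    print(f"{key}={value}")
```

Note that the two add keys share the same ID (collapsing-metadata), which is how a field name is paired with its content.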

Additional configuration settings

The originalUrl metadata class must be added to the summary fields (-SF) option of the query processor options for the plugin to function correctly.

This is done by editing your results page configuration, and editing (or adding) the query_processor_options key. Edit (or add) the -SF value to include the originalUrl field.

Configuration key name	Value
query_processor_options	`<OTHER QUERY PROCESSOR OPTIONS> -SF=[originalUrl,<OTHER METADATA CLASSES>]`
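
If you manage the query processor options programmatically, the edit described above can be sketched in Python as follows. The helper name ensure_summary_field is hypothetical, the -num_ranks=10 option is only an illustrative placeholder for your other options, and the sketch assumes the -SF value uses the bracketed, comma-separated form shown in the table.

```python
import re


def ensure_summary_field(options: str, field: str = "originalUrl") -> str:
    """Return the options string with the field present in the -SF=[...] list."""
    match = re.search(r"-SF=\[([^\]]*)\]", options)
    if match is None:
        # No -SF option yet: append one containing only the field.
        return f"{options} -SF=[{field}]".strip()
    fields = [name for name in match.group(1).split(",") if name]
    if field in fields:
        return options
    fields.insert(0, field)
    return options[:match.start()] + f"-SF=[{','.join(fields)}]" + options[match.end():]


print(ensure_summary_field("-num_ranks=10 -SF=[title,d]"))
print(ensure_summary_field("-num_ranks=10"))
```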

Tracking the original URL

The plugin records the original URL of the page as document metadata by adding two metadata fields: original-url and fb-original-url.

e.g.

<meta name="original-url" content="ORIGINAL-URL">
<meta name="fb-original-url" content="ORIGINAL-URL">

The meta tag fb-original-url is added to the metadata class originalUrl for use within results pages. If you need the original URL for any additional filters, the original-url metadata field should be used.
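
As an illustration only, the two tags could be generated as in the Python sketch below. The function name original_url_meta_tags is hypothetical, and HTML-escaping the URL is an assumption rather than documented plugin behaviour.

```python
from html import escape


def original_url_meta_tags(original_url: str) -> list[str]:
    """Build the two tracking meta tags for a cloned document."""
    value = escape(original_url, quote=True)  # assumption: the URL is HTML-escaped
    return [
        f'<meta name="original-url" content="{value}">',
        f'<meta name="fb-original-url" content="{value}">',
    ]


for tag in original_url_meta_tags("http://www.example.com/events/new-event"):
    print(tag)
```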

Canonical URLs

To prevent Funnelback's default canonical link handling from being applied to the cloned documents, all canonical links in a cloned document are removed during the index phase.

Facilitating grouping of split items

It is often desirable to search your index as if the items had not been split, so that duplicates are not shown in the search results. The main use case is an events search with two views: one that lists results by event date (where repeated event items make sense) and one that returns a single result for each matching event, showing all the dates on which the event occurs.

Ensure result collapsing signatures are generated for the original URL

To achieve this you should configure result collapsing using the originalUrl metadata class.

e.g. on the data source where you have configured the plugin, add the configuration key:

indexing.collapse_fields=[$],[originalUrl]

This will configure Funnelback to generate a result collapsing signature based on the originalUrl field value.
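
Conceptually, collapsing on this signature groups results that share the same originalUrl value. The Python sketch below illustrates the idea on a plain list of dictionaries. It is not how Funnelback implements collapsing, and the function name and record layout are hypothetical.

```python
def collapse_by_original_url(results: list[dict]) -> list[dict]:
    """Keep the first result for each originalUrl and count the rest as collapsed."""
    collapsed: dict[str, dict] = {}
    for result in results:
        key = result["originalUrl"]
        if key not in collapsed:
            collapsed[key] = {"result": result, "collapsed_count": 0}
        else:
            collapsed[key]["collapsed_count"] += 1
    return list(collapsed.values())


results = [
    {"url": "http://www.example.com/events/new-event/fb-recurring-event/1",
     "originalUrl": "http://www.example.com/events/new-event"},
    {"url": "http://www.example.com/events/new-event/fb-recurring-event/2",
     "originalUrl": "http://www.example.com/events/new-event"},
    {"url": "http://www.example.com/events/other-event/fb-recurring-event/1",
     "originalUrl": "http://www.example.com/events/other-event"},
]

for entry in collapse_by_original_url(results):
    print(entry["result"]["originalUrl"], entry["collapsed_count"])
```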

Applying the result collapsing

To apply the grouping, enable result collapsing with the [originalUrl] collapsing signature on your results page. Supply the following two settings for any query that you wish to collapse; this removes the duplicated (cloned) results.

e.g. as part of the configuration key query_processor_options

query_processor_options= -collapsing=on -collapsing_sig=[originalUrl]

or as URL query parameters

https://example-search.funnelback.squiz.cloud/s/search.html?collection=example&query=example&collapsing=on&collapsing_sig=[originalUrl]
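
If the search URL is built in code, the same parameters can be added as in the Python sketch below. The function name collapsed_search_url is hypothetical. Note that urlencode percent-encodes the square brackets, which is the standard encoded form of the same parameter value.

```python
from urllib.parse import urlencode


def collapsed_search_url(base_url: str, collection: str, query: str) -> str:
    """Build a search URL with result collapsing on the originalUrl signature."""
    params = {
        "collection": collection,
        "query": query,
        "collapsing": "on",
        "collapsing_sig": "[originalUrl]",
    }
    return f"{base_url}?{urlencode(params)}"


print(collapsed_search_url(
    "https://example-search.funnelback.squiz.cloud/s/search.html", "example", "example"))
```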

See: Result collapsing

Filter chain configuration

This plugin uses filters which are used to apply transformations to the gathered content.

The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.

Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.

Filter classes

This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugin.clonedocuments.CloneDocumentsStringFilter

Drag the com.funnelback.plugin.clonedocuments.CloneDocumentsStringFilter plugin filter to where you wish it to run in the filter chain sequence.

Examples

Clone event pages

In this example we have HTML event pages that contain metadata outlining the days that the event is running, with some events spanning multiple dates. We wish to create an events search, that shows events that run on specific dates.

Page with events in separate fields

Consider the following HTML page with the URL:

http://www.example.com/events/new-event

<html>
    <head>
        <title>Example Event Page</title>
        <meta name="page-type" content="events" >
        <meta name="event-dates" content="2023-01-01">
        <meta name="event-dates" content="2023-01-21">
        <meta name="internal-use" content="true">
        ...
    </head>
    ...
</html>

We wish to clone this event page for each occurrence of the event-dates metadata field, add a new metadata field called collapsing with the content recurring-event, and remove the internal-use metadata field. This can be achieved with the following configuration:

Configuration key name	Parameter 1	Value
Choose how your document will be cloned		Repeated fields
Include selector		`meta[content=events]`
Clone selector		`meta[name="event-dates"]`
Add metadata field name	`collapsing-metadata`	`collapsing`
Add metadata field content	`collapsing-metadata`	`recurring-event`
Remove selector	`extra-metadata`	`meta[name="internal-use"]`
Cloned URL suffix		`fb-recurring-event/`

Additional data source configuration

Setting the following in your data source configuration enables you to apply result collapsing, so that the cloned items can be collapsed into a single result if you need to change the result listing.

indexing.collapse_fields=[$],[originalUrl]

Ensure you also configure your metadata mappings. For an events search you will normally wish to map the field containing the event date to either the d (date) metadata class, or a numeric metadata class. If you are converting a legacy events search you will probably have the data mapped to the O metadata class.

Page with events in a single field

Consider the following HTML page, which holds all of its event dates in a single field, as is common for event pages that were created for Funnelback’s legacy events mode. The page has the URL:

http://www.example.com/events/new-event

<html>
    <head>
        <title>Example Event Page</title>
        <meta name="page-type" content="events" >
        <meta name="event-dates" content="20230101 | 20230121">
        <meta name="internal-use" content="true">
        ...
    </head>
    ...
</html>

We wish to clone this event page for each value contained in the event-dates metadata field. This can be achieved with the following configuration:

Configuration key name	Parameter 1	Value
Choose how your document will be cloned		Repeated values in a field
Include selector		`meta[content=events]`
Clone selector		`meta[name="event-dates"]`
Multi-value element delimiter		`|`
Remove selector	`extra-metadata`	`meta[name="internal-use"]`
Cloned URL suffix		`fb-recurring-event/`

Cloned page output

Each of the above configurations results in the following two HTML documents being included in the index.

http://www.example.com/events/new-event/fb-recurring-event/1

<html>
    <head>
        <title>Example Event Page</title>
        <meta name="page-type" content="events" >
        <meta name="event-dates" content="2023-01-01"> (1)
        <meta name="collapsing" content="recurring-event">
        <meta name="original-url" content="http://www.example.com/events/new-event">
        <meta name="fb-original-url" content="http://www.example.com/events/new-event">
        ...
    </head>
    ...
</html>

1	The value shown here is for the first configuration; the second configuration sets event-dates to `20230101`.

http://www.example.com/events/new-event/fb-recurring-event/2

<html>
    <head>
        <title>Example Event Page</title>
        <meta name="page-type" content="events" >
        <meta name="event-dates" content="2023-01-21"> (1)
        <meta name="collapsing" content="recurring-event">
        <meta name="original-url" content="http://www.example.com/events/new-event">
        <meta name="fb-original-url" content="http://www.example.com/events/new-event">
        ...
    </head>
    ...
</html>

1	The value shown here is for the first configuration; the second configuration sets `event-dates` to `20230121`.

When running a search, the results page will return two records with a title of Example Event Page and with liveUrl and displayUrl set to http://www.example.com/events/new-event.

Configure your events listing

The events search will work similarly to any other search with the following differences:

Because you have cloned your events, your results will look like they contain duplicates. For an events listing you will normally want to sort by the date metadata field (descending) and have some logic in your template that inserts a heading containing the event date into your results listing. In FreeMarker this would look something like:

<#assign curdate="0">

<@s.Results>
  <#if s.result.class.simpleName != "TierBar">
    <#-- Print a date heading whenever the event date changes -->
    <#if s.result.listMetadata["O"]?first != curdate> (1)
      <#assign curdate = s.result.listMetadata["O"]?first>
      <h3>Events occurring on ${curdate?date("yyyyMMdd")}</h3>
    </#if>

    <li data-fb-result="${s.result.indexUrl}" class="result<#if !s.result.documentVisibleToUser>-undisclosed</#if>">
      ...
    </li>
  </#if>
</@s.Results>

1	In this example the event date is mapped to the `O` metadata class.

If you have a search of your events, or you are including your events in a general mixed search, then you may wish to collapse your events so that only a single result is returned for a specific event. This is accomplished by enabling result collapsing on that search. This can be done via a separate results page, or by adding additional request parameters to toggle on and apply the result collapsing signature.

If you are converting a legacy events search to use this plugin, you will be able to match most of the functionality that you previously had. The exception is complex queries that returned events on a specific date combined with date ranges where you were not specifying a >= type search (e.g. a search like music % O=20160415 O>20160415 O<20160630 O=20160505 O=20160510). However, mixed queries like this were very uncommon.
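
The date-heading logic from the FreeMarker snippet above can also be expressed as a short Python sketch, which may help when prototyping the listing outside a template. The function name and record layout are hypothetical, and the results are assumed to be already sorted by date, with dates in yyyyMMdd form.

```python
from datetime import datetime


def render_event_listing(results: list[dict]) -> list[str]:
    """Return listing lines, adding a heading each time the event date changes."""
    lines = []
    current_date = None
    for result in results:
        event_date = result["date"]
        if event_date != current_date:
            current_date = event_date
            parsed = datetime.strptime(event_date, "%Y%m%d").date()
            lines.append(f"Events occurring on {parsed.isoformat()}")
        lines.append(f"  {result['title']}")
    return lines


results = [
    {"title": "Example Event Page", "date": "20230121"},
    {"title": "Another Event", "date": "20230121"},
    {"title": "Example Event Page", "date": "20230101"},
]

for line in render_event_listing(results):
    print(line)
```

The cloned event appears once under each date heading, which is the intended behaviour for a by-date listing.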

Change log

[1.2.2]

Changed

Upgraded the Jsoup dependency from v1.17.1 to v1.19.1.

[1.2.1]

Fixed

Fixed the cloned URL syntax when the original URL contains a query string.

[1.2.0]

Added

Added the Repeated values in a field clone mode, which allows documents to be cloned based on repeated values in a field.

[1.1.0]

Changed

Updated to the latest version of the plugin framework (Funnelback shared v16.20) to enable integration with the new plugin management dashboard.
