Plugin: Clone documents

Other versions of this plugin may exist. Please ensure you are viewing the documentation for the version that you are currently using. If you are not running the latest version of your plugin, we recommend upgrading. See: list of all available versions of this plugin.

Purpose

This plugin clones a specified HTML document into multiple documents within the Funnelback index.

One use case for this plugin is to facilitate an events search where you have events that may span multiple dates. Cloning the item for each applicable date allows for an events search to be built with the event being returned for each matching date.

Usage

Enable the plugin

Enable the clone-documents plugin on both your results page and your data source from the extensions screen in the administration dashboard, or add the following configuration to both the results page and the data source.

The plugin should be enabled on the data source where your document is gathered, and on the results page(s) where you are searching for the cloned documents.

plugin.clone-documents.enabled=true
plugin.clone-documents.version=1.0.0

Data source

Add com.funnelback.plugin.clonedocuments.CloneDocumentsStringFilter to the filter chain:

filter.classes=<OTHER-FILTERS>:com.funnelback.plugin.clonedocuments.CloneDocumentsStringFilter:<OTHER-FILTERS>

The clone documents filter should be inserted at an appropriate point in the filter chain.

Results page

The following option is required in the results page configuration so that the originalUrl metadata class is returned with each result:

query_processor_options=<OTHER PARAMETERS> -SF=[originalUrl,<OTHER PARAMETERS>]

Plugin configuration settings

The following options can be set in the data source configuration to configure the plugin:

  • plugin.clone-documents.config.applyTo.selector: (Required) Documents containing an element that matches this jsoup selector will be cloned by the plugin, e.g. meta[content=events]

  • plugin.clone-documents.config.cloneOn.selector: (Required) This jsoup selector determines the number of times that the document is cloned. A clone of the document will be created for each element that matches the selector. For example, if a date metadata field is repeated with 5 different values, you will end up with 5 copies of the document in the Funnelback index.

  • plugin.clone-documents.config.clone.suffix: (Optional) Specifies a suffix that will be attached to the end of the URL for the cloned records.

  • plugin.clone-documents.config.add.<id>.metatag_name and plugin.clone-documents.config.add.<id>.metatag_content: (Optional) Specifies a metadata field and value to insert into the cloned records. <id> must be a unique value within your data source configuration.

  • plugin.clone-documents.config.remove.<id>.selector: (Optional) Removes any elements matching this selector from the cloned page. <id> must be a unique value within your data source configuration.

The plugin will take effect after a full update of the data source.

To prevent Funnelback's default handling of canonical links, all canonical links in the cloned documents are removed during the index phase.
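The required and optional settings listed above can be checked before running an update. The following is a minimal sketch in Python, not part of the plugin: the helper names parse_properties and check_clone_config are hypothetical, and the rule that each add.<id> entry needs both metatag_name and metatag_content is an assumption drawn from the way the two keys are described as a pair.

```python
REQUIRED_KEYS = (
    "plugin.clone-documents.config.applyTo.selector",
    "plugin.clone-documents.config.cloneOn.selector",
)
ADD_PREFIX = "plugin.clone-documents.config.add."


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, because selectors may contain "=".
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_clone_config(settings):
    """Return a list of problems found in the settings (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if not settings.get(key):
            problems.append("missing required setting: " + key)

    # Group add.<id>.metatag_name and add.<id>.metatag_content by <id>.
    add_fields = {}
    for key in settings:
        if key.startswith(ADD_PREFIX):
            ident, _, field = key[len(ADD_PREFIX):].rpartition(".")
            add_fields.setdefault(ident, set()).add(field)

    # Assumption: both halves of an add.<id> pair should be present.
    for ident, fields in sorted(add_fields.items()):
        for field in ("metatag_name", "metatag_content"):
            if field not in fields:
                problems.append("add.%s is missing %s" % (ident, field))
    return problems


example = """
plugin.clone-documents.config.applyTo.selector=meta[content=events]
plugin.clone-documents.config.add.collapsing-metadata.metatag_name=collapsing
"""
for problem in check_clone_config(parse_properties(example)):
    print(problem)
```

For the incomplete example configuration in the sketch, two problems are reported: the missing cloneOn.selector setting and the missing metatag_content for the collapsing-metadata entry.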

Tracking the original URL

The original URL of the page is added to each cloned document as two additional metadata fields, original-url and fb-original-url. For example:

<meta name="original-url" content="<ORIGINAL_URL>">
<meta name="fb-original-url" content="<ORIGINAL_URL>">

The meta tag fb-original-url is added to the metadata class originalUrl for use within results pages. If you need the original URL for any additional filters, the original-url metadata field should be used.
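As an illustration of reading the original-url field from a cloned document, here is a minimal Python sketch that uses only the standard library. It is not a Funnelback filter, and the names MetaCollector and original_url are hypothetical. It only shows that the value can be recovered from the meta tag shown above.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            # Keep every value, because a metadata name may be repeated.
            self.meta.setdefault(name, []).append(attributes.get("content"))


def original_url(html):
    """Return the original-url value of a cloned document, or None."""
    collector = MetaCollector()
    collector.feed(html)
    values = collector.meta.get("original-url")
    return values[0] if values else None


cloned = (
    '<html><head>'
    '<meta name="original-url" content="http://www.example.com/events/new-event">'
    '<meta name="fb-original-url" content="http://www.example.com/events/new-event">'
    '</head></html>'
)
print(original_url(cloned))
```

A document that has not been cloned carries no original-url field, in which case the sketch returns None.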

Example

Consider the following HTML page with the URL: http://www.example.com/events/new-event

<html>
    <head>
        <title>Example event page</title>
        <meta name="page-type" content="events" >
        <meta name="event-date" content="2023-01-01">
        <meta name="event-date" content="2023-01-21">
        <meta name="internal-use" content="true">
        ....
    </head>
    .....
</html>

We wish to clone this event page for each occurrence of the event-date metadata field. This can be achieved with the following configuration:

plugin.clone-documents.config.applyTo.selector=meta[content=events]
plugin.clone-documents.config.cloneOn.selector=meta[name="event-date"]
plugin.clone-documents.config.add.collapsing-metadata.metatag_name=collapsing
plugin.clone-documents.config.add.collapsing-metadata.metatag_content=recurring-event
plugin.clone-documents.config.remove.extra-metadata.selector=meta[name="internal-use"]
plugin.clone-documents.config.clone.suffix=fb-recurring-event/

This results in the following two HTML documents being included in the index.

<html>
    <head>
        <title>Example event page</title>
        <meta name="page-type" content="events" >
        <meta name="event-date" content="2023-01-01">
        <meta name="collapsing" content="recurring-event">
        <meta name="original-url" content="http://www.example.com/events/new-event">
        <meta name="fb-original-url" content="http://www.example.com/events/new-event">
        ...
    </head>
    ...
</html>

with URL: http://www.example.com/events/new-event/fb-recurring-event/1, and

<html>
    <head>
        <title>Example event page</title>
        <meta name="page-type" content="events" >
        <meta name="event-date" content="2023-01-21">
        <meta name="collapsing" content="recurring-event">
        <meta name="original-url" content="http://www.example.com/events/new-event">
        <meta name="fb-original-url" content="http://www.example.com/events/new-event">
        ...
    </head>
    ...
</html>

with URL: http://www.example.com/events/new-event/fb-recurring-event/2.

When running a search, the results page will return two records, each with a title of Example event page and with liveUrl and displayUrl set to http://www.example.com/events/new-event.
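The transformation in this example can be modelled roughly as follows. This is a simplified Python sketch and not the plugin's implementation. The function clone_document is hypothetical, the document head is reduced to a list of (name, content) pairs, and jsoup selectors are replaced by plain name and content comparisons. The clone URL format (original URL, a slash, the suffix, then a counter starting at 1) is an assumption inferred from the two URLs shown above.

```python
def clone_document(url, meta, apply_to_content, clone_on,
                   add=(), remove=(), suffix=""):
    """Return a list of (clone_url, clone_meta) pairs.

    meta is a list of (name, content) pairs standing in for <meta> tags.
    One clone is produced per pair whose name equals clone_on.
    """
    # Stand-in for applyTo.selector: only documents with a matching
    # content value are cloned. Returning [] means "no clones produced".
    if not any(content == apply_to_content for _, content in meta):
        return []

    # Stand-in for cloneOn.selector: each matching value yields a clone.
    matches = [content for name, content in meta if name == clone_on]

    # Fields shared by every clone: everything except the cloneOn fields
    # and the fields named in remove (stand-in for remove.<id>.selector).
    shared = [(n, c) for n, c in meta if n != clone_on and n not in remove]

    clones = []
    for index, value in enumerate(matches, start=1):
        clone_meta = list(shared)
        clone_meta.append((clone_on, value))      # this clone's own value
        clone_meta.extend(add)                    # add.<id> metadata
        clone_meta.append(("original-url", url))
        clone_meta.append(("fb-original-url", url))
        # Assumed URL format, inferred from the example URLs only.
        clone_url = url.rstrip("/") + "/" + suffix + str(index)
        clones.append((clone_url, clone_meta))
    return clones


clones = clone_document(
    "http://www.example.com/events/new-event",
    [
        ("page-type", "events"),
        ("event-date", "2023-01-01"),
        ("event-date", "2023-01-21"),
        ("internal-use", "true"),
    ],
    apply_to_content="events",
    clone_on="event-date",
    add=[("collapsing", "recurring-event")],
    remove=["internal-use"],
    suffix="fb-recurring-event/",
)
for clone_url, clone_meta in clones:
    print(clone_url, dict(clone_meta)["event-date"])
```

Each clone in the sketch keeps a single event-date value, gains the collapsing and original URL fields, and loses the internal-use field, mirroring the two documents listed in the example.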
