Plugin: Vector document storage

Purpose

Stores documents in an S3 bucket for vector chunking.

Usage

Enable the plugin

  1. Select Plugins from the side navigation pane and click on the Vector document storage tile.

  2. In the Location section, use the Select a data source list to choose the data source on which to enable this plugin.

The plugin takes effect once the setup steps are complete and an advanced > full update of the data source has finished.

Configuration settings

The configuration settings section is where you do most of the configuration for your plugin. The settings enable you to control how the plugin behaves.

The configuration key names below are only used if you are configuring this plugin manually. The configuration keys are set in the data source configuration to configure the plugin. When setting the keys manually you need to type in (or copy and paste) the key name and value.

Document URL format

Configuration key: plugin.vector-document-storage.config.key-format-url

Data type: boolean

Default value: false

Required: This setting is optional

If true, the URL-encoded document URL is used as the S3 object key. Otherwise, the key is the Base64 encoding of the document URL.

Fail on error

Configuration key: plugin.vector-document-storage.config.fail-on-error

Data type: boolean

Default value: true

Required: This setting is optional

Defines whether the update fails with an error or only logs a warning when a document is not successfully sent to storage.

Possible values:

  • true: The update will fail with an error. (default)

  • false: A warning will be logged, but the update will continue.

Additional configuration settings

Required Global Configuration

This plugin requires two configuration keys to be set in the global configuration file /conf/collection.cfg so that they are inherited by all data sources:

S3 Bucket Configuration

plugin.vector-document-storage.config.bucket-name

The name of the S3 bucket where documents will be stored for vector chunking processing.

This setting is required and must be set globally in /conf/collection.cfg so that every data source can access the S3 bucket without individual configuration.

plugin.vector-document-storage.config.bucket-region

The AWS region where the S3 bucket is located (e.g., us-east-1, eu-west-1, ap-southeast-2).

This setting is required for the proof-of-concept (POC) implementation. The bucket region must be specified explicitly because the plugin needs the exact region to establish the S3 connection. It must be configured globally in /conf/collection.cfg.

Configuration Example

Add the following lines to your /conf/collection.cfg file:

# S3 Configuration for Vector Document Storage Plugin
plugin.vector-document-storage.config.bucket-name=your-s3-bucket-name
plugin.vector-document-storage.config.bucket-region=ap-southeast-2

Important Notes

  • Both configuration keys are mandatory and must be set in the global configuration file

  • These settings will be inherited by all data sources automatically

  • The bucket-region is required for the POC implementation to ensure proper S3 connectivity
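Because both keys are mandatory, it can be useful to check the global file before starting an update. The following is a minimal sketch and not part of the plugin: parse_cfg and missing_required are hypothetical helper names, and the parser assumes plain key=value lines with # comments, as in the configuration example.

```python
# Hypothetical pre-flight check; not shipped with the plugin.
REQUIRED_KEYS = (
    "plugin.vector-document-storage.config.bucket-name",
    "plugin.vector-document-storage.config.bucket-region",
)


def parse_cfg(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no "="
            settings[key.strip()] = value.strip()
    return settings


def missing_required(settings):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


sample = """\
# S3 Configuration for Vector Document Storage Plugin
plugin.vector-document-storage.config.bucket-name=your-s3-bucket-name
plugin.vector-document-storage.config.bucket-region=ap-southeast-2
"""
print(missing_required(parse_cfg(sample)))  # prints []
```

An empty list means both keys are present. Any key names returned still need to be added to /conf/collection.cfg.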

Filter chain configuration

This plugin uses filters, which apply transformations to the gathered content.

The filters run in sequence and need to be set in an order that makes sense. The plugin-supplied filter(s) (as indicated in the listing) should be re-ordered to an appropriate point in the sequence.

Changes to the filter order affect the way the data source processes gathered documents. See: document filters documentation.

Filter classes

This plugin supplies a filter that runs in the main document filter chain: com.funnelback.plugin.vectordocumentstorage.VectorDocumentStorageStringFilter

Drag the com.funnelback.plugin.vectordocumentstorage.VectorDocumentStorageStringFilter plugin filter to where you wish it to run in the filter chain sequence.

Examples

S3 key format

Using Base64 encoding (default)

By default, the plugin uses Base64 encoding for S3 object keys. This is the recommended approach for most use cases:

Configuration key name    Value
Document URL format       false

With this configuration, a document URL like https://example.com/page?param=value will be stored in S3 with a Base64-encoded key.

Using URL encoding

For better readability and debugging, you can use URL encoding for S3 object keys:

Configuration key name    Value
Document URL format       true

With this configuration, a document URL like https://example.com/page?param=value will be stored in S3 with a URL-encoded key that is more human-readable.
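To make the difference between the two key formats concrete, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not the plugin's implementation: s3_key_for is a hypothetical helper, and this document does not specify which Base64 alphabet the plugin uses or which characters it percent-encodes.

```python
import base64
from urllib.parse import quote


def s3_key_for(url, key_format_url=False):
    """Hypothetical helper: derive an object key from a document URL."""
    if key_format_url:
        # Assumption: every reserved character is percent-encoded,
        # including ":" and "/".
        return quote(url, safe="")
    # Assumption: standard Base64 over the UTF-8 bytes of the URL.
    return base64.b64encode(url.encode("utf-8")).decode("ascii")


url = "https://example.com/page?param=value"

print(s3_key_for(url, key_format_url=True))
# prints https%3A%2F%2Fexample.com%2Fpage%3Fparam%3Dvalue

key = s3_key_for(url)  # Base64 form (the default)
print(base64.b64decode(key).decode("utf-8") == url)  # prints True
```

The URL-encoded key can be read directly when browsing the bucket, while the Base64 key has to be decoded to recover the original URL.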

Error handling

Fail on error (default)

By default, the plugin will fail the entire update process if any document fails to upload to S3:

Configuration key name    Value
Fail on error             true

This is the recommended setting for production environments where data integrity is critical.

Continue on error

For development, or when you want the update to continue even if some documents fail to upload:

Configuration key name    Value
Fail on error             false

With this configuration, if a document fails to upload to S3, a warning will be logged but the update process will continue with the remaining documents.
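The two error-handling modes can be sketched as follows. This is a simplified illustration of the behaviour described in this section, not the plugin's code: store_all, StorageError, and the upload callback are hypothetical names introduced for the example.

```python
import logging

logger = logging.getLogger("vector-document-storage-sketch")


class StorageError(Exception):
    """Hypothetical error raised when an upload fails."""


def store_all(documents, upload, fail_on_error=True):
    """Upload each (url, content) pair; return the URLs that failed."""
    failed = []
    for url, content in documents:
        try:
            upload(url, content)
        except StorageError as exc:
            if fail_on_error:
                raise  # abort the whole run, like a failed update
            logger.warning("Could not store %s: %s", url, exc)
            failed.append(url)  # remember the failure and carry on
    return failed


def flaky_upload(url, content):
    """Stand-in uploader that rejects one document."""
    if url == "https://example.com/b":
        raise StorageError("simulated upload failure")


docs = [
    ("https://example.com/a", "first"),
    ("https://example.com/b", "second"),
    ("https://example.com/c", "third"),
]
print(store_all(docs, flaky_upload, fail_on_error=False))
# prints ['https://example.com/b']
```

With fail_on_error=True, the same call raises StorageError at the second document and the third is never attempted.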

Change log

See also