Direct integration with a push index

Push data sources provide an API that can be used to control the push data source.

This includes programmatically maintaining the content within the push index.

Content can be added, retrieved, updated and removed via the API. Additional metadata can be associated with content if necessary (via HTTP headers or multi-part HTTP requests). Each content item has a unique identifier, a Key, which needs to be a valid URL. This key is used to retrieve, update or delete the content, and is used as the content URL when displaying search results.

Content is made available for search after the pending changes on a data source are committed. A commit can be triggered manually via the API, and will also happen automatically after a configured timeout or after a number of changes have been made.

Filters can also be run when submitting content via the API.

Update process model

Push indexes use a transactional update model that is serviced using the push API.

Content is added, updated and removed using the corresponding API calls and then committed to the index.

Push API

The API is documented using Swagger, which can also be used to interact with the API. To access the Swagger UI, go to the administration home page and, under the System drop-down menu, click View API UI. This opens the Swagger UI; then click push-api.

Or you can go directly to:

https://FUNNELBACK-SERVER:ADMIN-PORT/search/admin/api-ui/#!/push-api

Getting started

You will first need to create a push index data source from the admin home page. Once the data source is created you will not need to start the data source as it will be automatically started on the next API request.

Let's add a document with the URL http://myfirstdocument/ to your push data source. To do this, send a PUT request to the following URL, supplying the document in the body of the request.

PUT https://host_name:admin_port/push-api/v1/collections/collection_name/documents?key=http%3A%2F%2Fmyfirstdocument%2F

The easiest way to try this out is to use the Swagger UI, under the section:

PUT /push-api/v2/collections/{collection}/documents

Set the data source to the name of the data source you created, then set the key to:

http://myfirstdocument/

and set the content to:

The quick brown fox jumps over the lazy dog

Leave the values of all other fields unchanged, then click Try it out!

By default Push will make documents searchable as soon as possible. Perform a search for fox over your data source and you should see a single result with the URL http://myfirstdocument/.

If you wish to update the document, PUT it again with the same key and different content. PUT requests always replace existing content at the given key. Refer to the Swagger UI documentation for more information.

Push also supports deleting documents. To delete the previously added document we would use the following DELETE call:

DELETE https://host_name:admin_port/push-api/v1/collections/collection_name/documents?key=http%3A%2F%2Fmyfirstdocument%2F

You can try this from the Swagger UI, under the section:

DELETE /push-api/v2/collections/{collection}/documents

Set the data source to the name of the data source you created, then set the key to:

http://myfirstdocument/

then click Try it out!

If you perform a search for fox over your data source the document will no longer be returned.
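The PUT and DELETE calls above can also be issued from code. The following is a minimal sketch, assuming Java 11 or later and the JDK's built-in java.net.http classes rather than the bundled Funnelback client. The class name PushDocumentRequests, its method names and the baseUrl parameter are hypothetical and chosen only for illustration. It uses the v2 path shown in the Swagger sections above (the sample URLs above show v1), percent-encodes the key for the query string, and rejects keys that have no scheme. Authentication is not shown.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Funnelback: builds document requests for the Push API.
final class PushDocumentRequests {

    // Percent-encodes a key for use as the "key" query parameter.
    // Throws IllegalArgumentException if the key is not a URI or has no scheme.
    static String encodeKey(String key) {
        if (URI.create(key).getScheme() == null) {
            throw new IllegalArgumentException("Key needs a scheme such as http: " + key);
        }
        return URLEncoder.encode(key, StandardCharsets.UTF_8);
    }

    // baseUrl is assumed to look like https://server:admin-port with no trailing slash.
    static URI documentUri(String baseUrl, String collection, String key) {
        return URI.create(baseUrl + "/push-api/v2/collections/" + collection
                + "/documents?key=" + encodeKey(key));
    }

    // Adds the document, or replaces whatever is already stored at the same key.
    static HttpRequest putDocument(String baseUrl, String collection, String key,
                                   String content, String contentType) {
        return HttpRequest.newBuilder(documentUri(baseUrl, collection, key))
                .header("Content-Type", contentType)
                .PUT(HttpRequest.BodyPublishers.ofString(content, StandardCharsets.UTF_8))
                .build();
    }

    // Removes the document stored at the given key.
    static HttpRequest deleteDocument(String baseUrl, String collection, String key) {
        return HttpRequest.newBuilder(documentUri(baseUrl, collection, key))
                .DELETE()
                .build();
    }
}
```

Either request can then be sent with java.net.http.HttpClient, and the response status checked before treating the change as accepted.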

Authentication

The Push API accepts both basic authentication and token based authentication. Although basic authentication is simpler to use, it is less efficient because authentication is re-performed on the server for every request. Token based authentication is preferable where possible. See API token authentication for information about getting and supplying tokens.
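For quick experiments with basic authentication, the credentials are sent on every request in a standard HTTP Authorization header. The sketch below shows only that standard header format; the class name BasicAuthHeader is hypothetical, UTF-8 encoding of the credentials is an assumption, and the token flow is not shown because it is described under API token authentication.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: builds the value of a standard HTTP Basic Authorization header.
final class BasicAuthHeader {

    // Returns "Basic " followed by base64(userName:password).
    // Assumption: credentials are encoded as UTF-8 before base64 encoding.
    static String value(String userName, String password) {
        String credentials = userName + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }
}
```

The returned string would be supplied as the Authorization header of each request.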

Writing for a push index

When you write code to interact with a push index you must build error handling into your code.

This means handling every error returned by the push API. For example, queue all add, update and remove operations and retry any that fail; if errors persist, exit with an appropriate error that is fed back into the system that launched the request. If you fail to handle push API errors, the push index can get out of sync with your content: content may be missing from the index, or something that was deleted may remain in it.
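One way to structure the retry behaviour described above is a small wrapper around each push operation. This is a sketch under stated assumptions, not a prescribed approach: the class name PushRetry, the attempt limit and the doubling delay are all illustrative choices, and the operation passed in is expected to throw when the push API reports an error.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries a push operation a bounded number of times.
final class PushRetry {

    // Runs the operation until it succeeds or maxAttempts is reached.
    // The wait between attempts starts at initialDelayMillis and doubles each time.
    static <T> T withRetries(Callable<T> operation, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (InterruptedException e) {
                // Do not retry if the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        // Surface the failure so the calling system knows the index may be out of sync.
        throw new IllegalStateException(
                "Push operation failed after " + maxAttempts + " attempts", last);
    }
}
```

The final exception carries the last underlying error as its cause, so the system that launched the request can log it or re-queue the change.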

Workflow

When changes to a push data source are made such as PUT-ing or DELETE-ing a document or PUT-ing a redirect, they are first stored in a staging area. These changes are not made live until they are committed.

Commits can be triggered through the API or automatically from a timeout or after a number of changes have been made to the push data source.

Internally push data sources will need to merge generations created from commits. This is required to stop performance from degrading. Push will automatically trigger merges as required. Merging on large data sources can be costly and may impact query performance if run on a machine with insufficient CPUs or memory.

Index configuration changes require a Vacuum

Re-indexing a push data source is required when changing index configuration settings, such as indexer_options, GScopes, metadata mappings, etc.

To re-index a push data source, use the vacuum API call and set vacuum-type to RE_INDEX. This can be a long operation on large data sources as the complete content will be re-indexed.

Keys

Push expects all keys to be valid URIs, and expects that all keys have a scheme name such as http, ftp, local, etc. Push will also canonicalise keys - you can check the returned JSON for the key used by Push. In general, Push will:

Metadata

Extra metadata may be added to a document when it is PUT into Push. For the endpoint shown above, Push uses HTTP headers that start with X-Funnelback-Push-Metadata-. The rest of the header name is used as the name of the metadata, and the header value is added to the document’s metadata. For example, setting the following HTTP header:

X-Funnelback-Push-Metadata-author: William Shakespeare

would set the 'author' metadata to 'William Shakespeare'. You will need to define a metadata mapping to access the metadata from the indexer and query processor. As HTTP headers are case insensitive in the HTTP specification but not in the Java Servlet specification, metadata keys may be converted to lower-case in some environments. If case is important, use the multi-part endpoint instead: /v2/collections/{collection}/documents/content-and-metadata.

The multi-part endpoint should also be used when the metadata contains anything that HTTP headers cannot reliably carry, such as:

  • Non-ASCII characters

  • Metadata values containing line breaks

A GET request for a document will return the metadata that is set using the HTTP headers, in the metadata part of the returned JSON. The metadata part of the returned JSON will not contain metadata that the indexer has extracted from the document or added with external_metadata.
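The header naming rule and the cases that call for the multi-part endpoint can be expressed as a small check. This is an illustrative sketch only: the class name PushMetadataHeaders and its methods are hypothetical, and the check covers just the two cases listed above (non-ASCII characters and line breaks).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: maps metadata names to Push header names and checks values.
final class PushMetadataHeaders {

    static final String PREFIX = "X-Funnelback-Push-Metadata-";

    // The metadata name is appended to the fixed prefix to form the header name.
    static String headerName(String metadataName) {
        return PREFIX + metadataName;
    }

    // Returns true when the value is plain ASCII with no line breaks.
    // When this returns false, the multi-part endpoint is the better choice.
    static boolean fitsInHeader(String value) {
        boolean ascii = StandardCharsets.US_ASCII.newEncoder().canEncode(value);
        boolean singleLine = value.indexOf('\n') < 0 && value.indexOf('\r') < 0;
        return ascii && singleLine;
    }
}
```

Because header names may be lower-cased in some environments, a caller that cares about the case of metadata names may prefer the multi-part endpoint even when fitsInHeader returns true.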

Time stamp metadata

Push (since version 15.6) will set the metadata X-Funnelback-Push-Received-Time to the time the document was PUT into Push. The date is a 19-character UTC time in the form: yyyyMMddHHmmss.SSSZ.
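Code that reads this value needs to parse the compact form. The sketch below assumes the trailing Z is a literal character marking UTC, which is inferred from the stated length of 19 characters (14 digits, a dot, 3 digits and the Z); the class name PushReceivedTime is hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: converts X-Funnelback-Push-Received-Time values to and from Instant.
final class PushReceivedTime {

    // Assumption: the final Z is a literal UTC marker, not a numeric zone offset.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuuMMddHHmmss.SSS'Z'");

    // Parses a value such as 20240102030405.678Z as a UTC instant.
    static Instant parse(String value) {
        return LocalDateTime.parse(value, FORMAT).toInstant(ZoneOffset.UTC);
    }

    // Formats an instant in the same 19-character UTC form.
    static String format(Instant instant) {
        return FORMAT.format(LocalDateTime.ofInstant(instant, ZoneOffset.UTC));
    }
}
```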

Anchor text and click logs

Anchor text and click logs are good sources of evidence that Funnelback can use to improve search results. To take advantage of these sources, add

-anniemode=3

to the query_processor_options via the 'edit data source configuration' screen. This option is on by default; however, it will need to be set when Push is part of a meta data source.

Click logs need to be processed before they can influence ranking. By default, click logs are processed every hour. You can also trigger click log processing from the RESTful API:

POST /push-api/v1/collections/{collection}/click-logs

See the Swagger UI for more details.

Resource requirements

Push’s memory requirements are related to the number and size of the Keys in the largest push data source on the Funnelback server. In general, the sum of the lengths of all Keys in the push data source is the minimum amount of memory, in bytes, that Push requires to run that data source. Push runs under Jetty, so it competes for memory with other parts of Funnelback such as the Modern UI. Take this into consideration when setting up Funnelback, especially when planning to work with data sources containing more than 1 million URLs.
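The key-length rule above gives a quick lower bound that can be computed before loading a data source. This is a rough sketch: the class name PushKeyMemory is hypothetical, and measuring each key as UTF-8 bytes is an assumption, since the text only says the sum of key lengths.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: estimates the lower bound on memory Push needs for a set of keys.
final class PushKeyMemory {

    // Sums the length of every key.
    // Assumption: each key is measured as its UTF-8 encoded size in bytes.
    static long minimumBytes(List<String> keys) {
        long total = 0;
        for (String key : keys) {
            total += key.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }
}
```

As an illustration of scale, one million keys averaging 80 bytes each would give a lower bound of about 80 MB, before any memory for search indexes is counted.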

On top of Push’s memory requirements you will need to allocate memory for search indexes. Push’s indexes are not as efficient as those of other data sources and can require up to twice as much memory. Setting a lower value for push.scheduler.killed-percentage-for-reindex can reduce the memory overhead; however, it will increase the CPU requirements.

Push is able to take advantage of multiple cores. For example, it may be serving multiple API requests while committing and also merging. Although Push can run with a single core, it will work better with more. If searches are being performed on the same machine on which Push is indexing and merging, you should provide at least 2 cores, and preferably 4 or more depending on load. You can set the push.worker-thread-count option in global.cfg to change the number of threads Push will use for merges and commits.

Backing up

Push data sources cannot be backed up just by copying the files within the data folder of the push data source; doing so is likely to produce a corrupt backup. To back up the data source, first make a snapshot of it. A snapshot can be made via the RESTful API (see the Swagger UI for more details):

PUT /push-api/v1/collections/{collection}/snapshot/{name}

The snapshot will usually be created under:

$SEARCH_HOME/data/{collection}/snapshot/{name}

or, on Windows:

%SEARCH_HOME%\data\{collection}\snapshot\{name}

The snapshot is created with hard links, so creating it is fast; however, the snapshot must not be edited until it has been copied.

Restoring a backup

If your push data source has failed or your disk has failed, you can easily restore a push data source from a backup by following these steps:

  • If the data source does not exist yet, then recreate the data source with exactly the same name.

  • Confirm that the push data source’s state is STOPPED by calling the RESTful API with

GET /push-api/v1/collections/{collection}/state

  • If the push data source is not stopped, change its state to STOPPED by calling the RESTful API with

POST /push-api/v1/collections/{collection}/state/stop

  • Now that the push data source has stopped, copy all files within your backup over the top of the existing live files, for example:

cp -rv $SEARCH_HOME/data/{collection}/snapshot/{name}/* $SEARCH_HOME/data/{collection}/live/

  • The push data source can now be used and searched.
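Where the copy step needs to run from code rather than a shell, the behaviour of the cp command above can be approximated as follows. This is a sketch only: the class name SnapshotRestore is hypothetical, it overwrites files that exist in both places and leaves other live files alone, and it assumes the data source has already been confirmed as STOPPED.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Hypothetical helper: copies a snapshot directory over the top of a live directory.
final class SnapshotRestore {

    // Recursively copies every file under snapshotDir into liveDir,
    // replacing files with the same relative path.
    static void copyOver(Path snapshotDir, Path liveDir) throws IOException {
        try (Stream<Path> paths = Files.walk(snapshotDir)) {
            paths.forEach(source -> {
                Path target = liveDir.resolve(snapshotDir.relativize(source).toString());
                try {
                    if (Files.isDirectory(source)) {
                        Files.createDirectories(target);
                    } else {
                        Files.createDirectories(target.getParent());
                        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                    }
                } catch (IOException e) {
                    // Rethrow unchecked because forEach cannot throw checked exceptions.
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```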

Multiple query processors support

Push data sources can be replicated to multiple query processors. To set this up, see Push multiple query processors.

Limits

Push data sources do not impose any specific limit on the size of each individual document; however, some subsystems do, and practical limits apply in a number of areas. The limits are listed below, and the smallest applicable value applies:

  • In the SXC and Funnelback SaaS a file upload limit of 20MB applies.

  • The Jetty web server imposes a maximum file size limit of 50MB. The default value can be altered by a system administrator by editing the $SEARCH_HOME/web/conf/contexts-https/funnelback-push-api.xml file. Please note that any changes to this file will be overwritten when upgrading between Funnelback versions.

URLs of up to 2000 characters are supported. URLs longer than 2000 characters are likely to cause problems in client or proxy systems and should be avoided.
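A client can check these limits before submitting content, which avoids a failed request. The sketch below is illustrative only: the class name PushLimits is hypothetical, the 50MB figure is the default Jetty value mentioned above and may have been changed by an administrator, and treating 1MB as 1024 * 1024 bytes is an assumption.

```java
// Hypothetical helper: applies the documented size limits before a document is sent.
final class PushLimits {

    static final int MAX_KEY_LENGTH = 2000;
    // Assumption: 1MB is taken to mean 1024 * 1024 bytes.
    static final long JETTY_MAX_BYTES = 50L * 1024 * 1024;
    static final long HOSTED_MAX_BYTES = 20L * 1024 * 1024;

    // The smallest applicable limit wins; hosted means the SXC or Funnelback SaaS.
    static long maxContentBytes(boolean hosted) {
        return hosted ? Math.min(HOSTED_MAX_BYTES, JETTY_MAX_BYTES) : JETTY_MAX_BYTES;
    }

    // Returns true when both the key length and the content size are within limits.
    static boolean withinLimits(String key, long contentBytes, boolean hosted) {
        return key.length() <= MAX_KEY_LENGTH && contentBytes <= maxContentBytes(hosted);
    }
}
```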

Push does not support the following XML special configuration options:

Push does not support update workflow (i.e. pre/post commands).

Push does not support kill configuration (kill_exact.cfg and kill_partial.cfg) as the push API includes native support for this functionality.

Security

A user will require access to the data source as well as sec.continuous.rest to be able to use the Push API.

Included Java API client

Funnelback comes with a Java API client. To use it you will need the following JAR files, which are available under a Funnelback installation:

  • share/funnelback-push-api-client-shaded.jar

  • lib/java/all/funnelback-api-client-core.jar

Alternatively, you can depend on everything in lib/java/all.

Here is an example of using the Push API client:

import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import com.funnelback.api.core.*;
import com.funnelback.push.client.PushContentV2Client;
import com.funnelback.push.client.responses.v2.AddedDocumentsResponseV2;


public class ApiExample {

    public void exampleAddAndDelete() throws IOException, APIException {
        URL url = new URL("https://<server domain>:<admin port>/push-api/");
        String userName = "user";
        String password = "complexPassword";
        String collectionId = "push-collection";

        Rest rest = new DefaultRest(userName, password, url);
        //Uncomment if you do not have a valid SSL certificate
        //new APIUtils().trustAllCerts(rest);

        PushContentV2Client client = new PushContentV2Client(rest);

        //public APIResponse<AddedDocumentsResponseV2> add(String collectionId, String key, byte[] content,
        //Map<String, Collection<String>> metadata, String contentType, String filterChain) throws IOException, APIException{

        String key = "http://example.com/";

        //Content must be converted to bytes, ensure you use the correct charset, if possible always convert to
        //UTF-8
        byte[] content = "<html><p>Hello</p></html>".getBytes(StandardCharsets.UTF_8);

        //Set some metadata for the document
        Map<String, Collection<String>> metadata = new HashMap<>();
        metadata.put("authors", Arrays.asList("Barry", "Bob"));

        String contentType = "text/html; charset=UTF-8";

        //Filters can be run on the given document. This value should be set in the same
        //way as filter.classes in collection.cfg.
        String filterChain = "";

        //Add the document.
        APIResponse<AddedDocumentsResponseV2> response = client.add(collectionId, key, content, metadata, contentType, filterChain);
        System.out.println(response.getResponseBody().getMessage());
        System.out.println("Stored keys: " + response.getResponseBody().getStoredKeys());

        //Commit the changes, set true to wait for the commit to complete.
        client.commit(collectionId, true);

        //Now delete the document.
        client.delete(collectionId, key);
    }
}

The API client will use token based authentication, and will re-fetch the token using the given username and password when it receives a 401. If the username and password change you will need to re-create the API client objects.

Troubleshooting

Log files

Push will log all errors for all data sources into the push log file located at:

$SEARCH_HOME/web/log/push.log
%SEARCH_HOME%\web\log\push.log

Push will log the output of index pipeline steps, for example padre-iw, under:

$SEARCH_HOME/data/collection_name/live/log/generation_number-/
%SEARCH_HOME%\data\collection_name\live\log\generation_number-\

where generation_number is the generation number of the generation being created.

Merge failures

Push limits the number of generations, so it continually merges multiple generations into a single new generation to make room for new commits. If Push refuses to commit any more generations, merging may have failed. This can occur because incorrect options or badly formatted files have been supplied to the indexer or other binaries in the index pipeline. When this happens, merge errors will appear in the push log file. Because Push is constantly changing, a merge error may only occur sporadically. To make such issues possible to debug, you can set the following option in the data source configuration:

push.create-snapshot-on-merge-failure=true

This creates a snapshot of the push data source whenever a merge fails. The snapshot will appear in the snapshot directory, like other snapshots, and will be named

FunnelbackInternal-merge-failed-generation_number

The snapshot can be restored to live, on another machine, for debugging.

Obtaining metrics and viewing threads

It is possible to view some other metrics on how push is performing as well as view the state of all threads by going to:

https://host_name:admin_port/push-api/monitor/

Click Metrics to view the metrics and click Threads to get a thread dump.