Push data source

Push data sources differ from other data sources, such as web or database data sources, in that they do not gather (crawl) content. Instead, a third party service must push content into them using the Push RESTful API.

Overview

Push data sources are often used with XML content, although other content types such as HTML are supported. Content is pushed in as individual documents (as opposed to a single XML file containing multiple document nodes), with one API request per document.

Content can be added, retrieved, updated and removed via the API. Additional metadata can be associated with content if necessary (via HTTP headers, or multi-part HTTP requests). Each document has a unique identifier, a key, which must be a valid URL. This key is used to retrieve, update or delete the content, and is used as the content URL when displaying search results.

Content is made available for search once the pending changes on a data source are committed. Commits can be triggered manually via the API, and also happen automatically after a configured timeout or after a number of changes have been made.

Push data sources use the same configuration system as other data sources, but some index-related changes require a complete re-index of the data source (via the vacuum API call) to take effect.

RESTful API

The API is documented using Swagger, which can also be used to interact with the API. To access the Swagger API user interface, go to the administration home page, then under the System drop-down menu click View API UI. This will open the Swagger UI; then click push-api.

Or you can go directly to:

https://host_name:admin_port/search/admin/api-ui/#!/push-api

Getting started

You will first need to create a push data source from the admin home page. Once the data source is created you will not need to start it, as it will be started automatically on the next API request.

Let's add a document with the URL http://myfirstdocument/ to your Push data source. You will need to perform a PUT request to add the document to your push data source, using the following URL and supplying the document in the body of the request.

PUT https://host_name:admin_port/push-api/v2/collections/collection_name/documents?key=http%3A%2F%2Fmyfirstdocument%2F

The easiest way to try this out is to use the Swagger UI, under the section:

PUT /push-api/v2/collections/{collection}/documents

Set the data source to the name of the data source you created, then set the key to:

http://myfirstdocument/

and set the content to:

The quick brown fox jumps over the lazy dog

Don't change the values of any other fields, then click Try it out!

By default, Push makes documents searchable as soon as possible. Perform a search for fox over your data source and you should see a single result with the URL http://myfirstdocument/.
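If you prefer to script this step rather than use the Swagger UI, the same PUT request can be made with Java's built-in HTTP client. The following is only a minimal sketch, assuming the v2 endpoint shown above, basic authentication, and placeholder server details and credentials (search.example.com, port 8443, user/password):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PushPutExample {

    public static void main(String[] args) throws Exception {
        //Replace with your Funnelback server name and admin port.
        String base = "https://search.example.com:8443";
        //The key must be URL-encoded when passed as a query parameter.
        String key = URLEncoder.encode("http://myfirstdocument/", StandardCharsets.UTF_8);
        //Placeholder basic authentication credentials for an account with Push API access.
        String auth = Base64.getEncoder()
                .encodeToString("user:password".getBytes(StandardCharsets.UTF_8));

        //PUT the document content at the given key.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(base + "/push-api/v2/collections/collection_name/documents?key=" + key))
                .header("Authorization", "Basic " + auth)
                .header("Content-Type", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString("The quick brown fox jumps over the lazy dog"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
    }
}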

If you wish to update the document, you can PUT the document in with the same key and different content. PUT requests always replace existing content at the given key. Refer to the Swagger UI documentation for more information.

Push also supports deleting documents. To delete the previously added document we would use the following DELETE call:

DELETE https://host_name:admin_port/push-api/v2/collections/collection_name/documents?key=http%3A%2F%2Fmyfirstdocument%2F

You can try this from the Swagger UI under the section:

DELETE /push-api/v2/collections/{collection}/documents

Set the data source to the name of the data source you created, then set the key to:

http://myfirstdocument/

then click Try it out!

If you perform a search for fox over your data source, the document will no longer be returned.
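The delete can be scripted in the same way as the PUT above; a minimal sketch under the same assumptions (placeholder server and credentials):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PushDeleteExample {

    public static void main(String[] args) throws Exception {
        //Replace with your Funnelback server name and admin port.
        String base = "https://search.example.com:8443";
        String key = URLEncoder.encode("http://myfirstdocument/", StandardCharsets.UTF_8);
        String auth = Base64.getEncoder()
                .encodeToString("user:password".getBytes(StandardCharsets.UTF_8));

        //DELETE the document stored at the given key.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(base + "/push-api/v2/collections/collection_name/documents?key=" + key))
                .header("Authorization", "Basic " + auth)
                .DELETE()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
    }
}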

Authentication

The Push API accepts both basic authentication and token based authentication. Although basic authentication is simpler to use, it is less efficient, as authentication is re-performed server side for every request. Prefer token based authentication where possible. See API token authentication for information about getting and supplying tokens.

Workflow

When changes are made to a push data source, such as PUT-ing or DELETE-ing a document or PUT-ing a redirect, they are first stored in a staging area. These changes are not made live until they are committed.

Commits can be triggered through the API, or automatically after a timeout or after a number of changes have been made to the push data source.
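For example, using the included Java API client (described later in this document), a manual commit can be triggered, with true meaning the call waits for the commit to complete:

client.commit(collectionId, true);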

Internally, push data sources need to merge the generations created by commits; this is required to stop performance from degrading. Push automatically triggers merges as required. Merging on large data sources can be costly and may impact query performance if run on a machine with insufficient CPUs or memory.

Index configuration changes require a Vacuum

Re-indexing a push data source is required when changing index configuration settings, such as indexer_options, GScopes, metadata mappings, etc.

To re-index a push data source, use the vacuum API call and set vacuum-type to RE_INDEX. This can be a long operation on large data sources as the complete content will be re-indexed.
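In the Swagger UI this appears as a call of roughly the following shape (the exact path and parameter names may differ between versions; check the Swagger UI):

POST /push-api/v1/collections/{collection}/vacuum?vacuum-type=RE_INDEX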

Keys

Push expects all keys to be valid URIs, and expects that all keys have a scheme name such as http, ftp, local, etc. Push will also canonicalise keys; you can check the returned JSON for the key used by Push.

Metadata

Extra metadata may be added to a document when it is PUT into Push. For the endpoint shown above, Push uses HTTP headers that start with X-Funnelback-Push-Metadata-; the rest of the header name is used as the name of the metadata, and the value is what is added to the document's metadata. For example, setting the following HTTP header:

X-Funnelback-Push-Metadata-author: William Shakespeare

would set the 'author' metadata to 'William Shakespeare'. You will need to define a metadata mapping to access the metadata from the indexer and query processor. HTTP headers are case insensitive in the HTTP specification but not in the Java Servlet specification, so metadata keys may be converted to lower-case in some environments. If case is important, the multi-part endpoint /v2/collections/{collection}/documents/content-and-metadata should be used.

The multi-part endpoint should also be used when the metadata exceeds what HTTP headers can store:

  • Non-ASCII characters

  • Metadata values containing line breaks
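For metadata that does fit in HTTP headers, the header is simply added to the PUT request. A minimal sketch extending the HTTP client example from the getting started section (the request is only built and printed here, not sent; the server details are placeholders as before):

import java.net.URI;
import java.net.http.HttpRequest;

public class PushMetadataHeaderExample {

    public static void main(String[] args) {
        //One X-Funnelback-Push-Metadata-* header is set per metadata value;
        //the suffix after the prefix ('author' here) becomes the metadata name.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://search.example.com:8443/push-api/v2/collections/collection_name/documents?key=http%3A%2F%2Fmyfirstdocument%2F"))
                .header("X-Funnelback-Push-Metadata-author", "William Shakespeare")
                .PUT(HttpRequest.BodyPublishers.ofString("The quick brown fox jumps over the lazy dog"))
                .build();
        System.out.println(request.headers().map());
    }
}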

A GET request for a document will return the metadata that was set using the HTTP headers, in the metadata part of the returned JSON. The metadata part of the returned JSON will not contain metadata that the indexer has extracted from the document or that was added with external_metadata.

Time stamp metadata

Push (since version 15.6) will set the metadata X-Funnelback-Push-Received-Time to the time the document was PUT into Push. The date is a 19 character UTC time in the form: yyyyMMddHHmmss.SSSZ.

Anchor text and click logs

Anchor text and click logs are a good source of evidence that Funnelback can use to improve search results. To take advantage of these sources, add

-anniemode=3

to the query_processor_options via the 'edit data source configuration' screen. By default this option is on; however, it will need to be set explicitly when Push is part of a meta data source.
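For example, assuming no other query processor options are set on the data source, the configuration would contain the line:

query_processor_options=-anniemode=3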

Click logs need to be processed before they can influence ranking; by default, click logs are processed every hour. You may trigger processing of click logs via the RESTful API:

POST /push-api/v1/collections/{collection}/click-logs

See the Swagger UI for more details.

Resource requirements

Push's memory requirements are related to the number and size of the keys in the largest Push data source on the Funnelback server. In general, the sum of the lengths of all keys in a Push data source is the minimum amount of memory, in bytes, that Push requires to run that data source; for example, 1 million keys averaging 100 characters each need at least about 100 MB. Push runs under Jetty, so it competes for memory with other parts of Funnelback such as the Modern UI. Take this into consideration when setting up Funnelback, especially when planning to work with data sources of more than 1 million URLs.

On top of Push's memory requirements, you will need to allocate memory for search indexes. Push's indexes are not as efficient as other data sources' indexes and can require up to twice as much memory. Setting a lower value for push.scheduler.killed-percentage-for-reindex can reduce the memory overhead, but will increase the CPU requirements.

Push is able to take advantage of multiple cores; for example, it may serve multiple API requests while committing and also merging. Although Push can run with a single core, it works better with more. If searches are performed on the same machine where Push is indexing and merging, provide at least 2 cores, preferably 4 or more depending on load. You can set the push.worker-thread-count option in global.cfg to change the number of threads Push uses for merges and commits.
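For example, to give Push four worker threads for merges and commits, global.cfg could contain the following line (the value 4 is just an illustration; tune it to your hardware):

push.worker-thread-count=4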

Backing up

Push data sources cannot be backed up just by copying the files within the data folder of the push data source; doing so will likely produce a corrupt backup. To back up the data source, first make a snapshot of the push data source. A snapshot can be made via the RESTful API (see the Swagger UI for more details).

PUT /push-api/v1/collections/{collection}/snapshot/{name}

The snapshot will usually be created under:

$SEARCH_HOME/data/{collection}/snapshot/{name}

or, on Windows:

%SEARCH_HOME%\data\{collection}\snapshot\{name}

The snapshot is created with hard links, so creating a snapshot is fast; however, the snapshot must not be edited until it has been copied.
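For example, assuming the endpoint above, a snapshot named mybackup of the data source collection_name could be created with:

PUT https://host_name:admin_port/push-api/v1/collections/collection_name/snapshot/mybackup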

Restoring a backup

If your push data source or its disk has failed, you can restore the data source from a backup by following these steps:

  • If the data source does not exist yet, then recreate the data source with exactly the same name.

  • Confirm that the push data source's state is STOPPED by calling the RESTful API with:

GET /push-api/v1/collections/{collection}/state
  • If the push data source is not stopped, you must change the state of the data source to STOPPED by calling the RESTful API with:

POST /push-api/v1/collections/{collection}/state/stop
  • Now that the push data source has stopped, you must copy all files within your backup over the top of the existing live files, for example:

cp -rv $SEARCH_HOME/data/{collection}/snapshot/{name}/* $SEARCH_HOME/data/{collection}/live/
  • The push data source can now be used and searched.

Multiple query processors support

Push data sources can be replicated to multiple query processors. To set this up, see Push multiple query processors.

Limits

Push data sources do not impose any specific limit on the size of each individual document; however, some subsystems do, and practical limits apply in a number of areas. The different limits are presented below, and the smallest applicable value applies:

  • In the SXC and Funnelback shared and dedicated hosted environments a file upload limit of 20MB applies.

  • The Jetty web server imposes a maximum file size limit of 50MB. The default value can be altered by a system administrator by editing the $SEARCH_HOME/web/conf/contexts-https/funnelback-push-api.xml file. Please note that any changes to this file will be overwritten when upgrading between Funnelback versions.

URLs of up to 2000 characters are supported. URLs longer than 2000 characters are likely to cause problems in client or proxy systems and should be avoided.

Push does not support some of the XML special configuration options.

Push does not support update workflow (i.e. pre/post commands).

Push does not support kill configuration (kill_exact.cfg and kill_partial.cfg), as the push API includes native support for this functionality.

Security

A user will require access to the data source, as well as sec.continuous.rest, to be able to use the Push API.

Included Java API client

Funnelback comes with a Java API client. To use it, you will need the following jars, available under a Funnelback installation:

  • share/funnelback-push-api-client-shaded.jar

  • lib/java/all/funnelback-api-client-core.jar

Alternatively, you may instead just depend on all of lib/java/all.

Here is an example of using the Push API client:

import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import com.funnelback.api.core.*;
import com.funnelback.push.client.PushContentV2Client;
import com.funnelback.push.client.responses.v2.AddedDocumentsResponseV2;


public class ApiExample {

    public void exampleAddAndDelete() throws IOException, APIException {
        URL url = new URL("https://<server domain>:<admin port>/push-api/");
        String userName = "user";
        String password = "complexPassword";
        String collectionId = "push-collection";

        Rest rest = new DefaultRest(userName, password, url);
        //Uncomment if you do not have a valid SSL certificate
        //new APIUtils().trustAllCerts(rest);

        PushContentV2Client client = new PushContentV2Client(rest);

        //The add method used below has the signature:
        //add(String collectionId, String key, byte[] content,
        //    Map<String, Collection<String>> metadata, String contentType, String filterChain)

        String key = "http://example.com/";

        //Content must be converted to bytes; ensure you use the correct charset.
        //If possible, always convert to UTF-8.
        byte[] content = "<html><p>Hello</p></html>".getBytes(StandardCharsets.UTF_8);

        //Set some metadata for the document
        Map<String, Collection<String>> metadata = new HashMap<>();
        metadata.put("authors", Arrays.asList("Barry", "Bob"));

        String contentType = "text/html; charset=UTF-8";

        //Filters can be run on the given document; this value should be set in the
        //same way as filter.classes in collection.cfg.
        String filterChain = "";

        //Add the document.
        APIResponse<AddedDocumentsResponseV2> response = client.add(collectionId, key, content, metadata, contentType, filterChain);
        System.out.println(response.getResponseBody().getMessage());
        System.out.println("Stored keys: " + response.getResponseBody().getstoredKeys());

        //Commit the changes, set true to wait for the commit to complete.
        client.commit(collectionId, true);

        //Now delete the document.
        client.delete(collectionId, key);
    }
}

The API client uses token based authentication, and will re-fetch the token using the given username and password when it receives a 401. If the username and password change, you will need to re-create the API client objects.

Troubleshooting

Log files

Push will log all errors for all data sources into the push log file located at:

$SEARCH_HOME/web/log/push.log
%SEARCH_HOME%\web\log\push.log

Push will log the output of index pipeline steps, for example padre-iw, under:

$SEARCH_HOME/data/collection_name/live/log/generation_number-/
%SEARCH_HOME%\data\collection_name\live\log\generation_number-\

where generation_number is the number of the generation being created.

Merge failures

Push has a limit on the number of generations, so it constantly merges multiple generations into a single new generation, allowing new generations to be committed. If Push refuses to commit any more generations, it is possible that merging has failed. This can occur when incorrect options or badly formatted files have been supplied to the indexer or other binaries in the index pipeline. When this happens you will be able to see merge errors in the push log file. As Push is constantly moving, a merge error may only happen sporadically; to ensure these issues can be debugged, you may set the following option in the data source configuration:

push.create-snapshot-on-merge-failure=true

This will create a snapshot of the push data source if a merge fails. The snapshot will appear in the snapshot directory, like other snapshots, and will be named:

FunnelbackInternal-merge-failed-generation_number

The snapshot can be restored to live, on another machine, for debugging.

File issues on Windows

On Windows, if a file is open it cannot be replaced. This can be an issue when an external process opens files under a Push data source's live view, likely causing commit or merge failures. Examples of programs which might cause this problem are anti-virus software and the Windows search service. Funnelback administrators should ensure that no external program reads files created by Push under the live directory of a Push data source.

If problems arise and you are unable to determine which program is opening files, Push includes support for running Handle when an issue occurs, which can report which process has a file open. To enable this you will need to download Handle.exe and execute Handle as the user Funnelback runs as. Typically this can be done using PsExec with a command line similar to:

PsExec.exe -i -s Handle.exe

By doing this you will be able to read and accept Handle's license, a step which would otherwise cause the process to become stuck waiting for confirmation. It is recommended to run the above command twice to confirm that you are not asked to read and accept the license on the second attempt.

Once the license is accepted, this additional debugging information can be enabled by setting Handle.exe's location in executables.cfg:

handle=C:\foo\bar\Handle.exe

With this setting configured, Push will execute Handle when it cannot replace a file, and will include the information Handle produces within any exceptions which are logged or returned from API calls. This should help identify which program is opening a Push data source's live files.

Obtaining metrics and viewing threads

It is possible to view some other metrics on how Push is performing, as well as view the state of all threads, by going to:

https://host_name:admin_port/push-api/monitor/

Click Metrics to view the metrics and click Threads to get a thread dump.
