Restricting search results access by IP address/domain

Access to search results can be restricted to a set of IP addresses or domains.

If your documents can generally be classified as either internal or external, and you do not require more fine-grained control than that, then restricting access by IP address is appropriate for your needs.

Internal documents are those for use within your organization only (not for public consumption). To index internal documents while still providing a public index, it is necessary to build two indexes:

Internal index: covering all available documents, including internal documents.

External index: covering only documents available externally (not internal documents).

To build these two indexes (or more complex configurations), you will need techniques for access restriction and controlling search scope.

Results page access restriction using an access token

Access to search results can be restricted to requests that supply a token in the request headers.

Token-based access to search results pages is provided by the access restriction to search results plugin.
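
As a minimal sketch, assuming the plugin is configured to check a header named X-Search-Token (the header name, token value, and URL here are illustrative, not the plugin's actual defaults), a client request might look like:

    curl -H "X-Search-Token: example-secret-token" \
      "https://search.example.com/s/search.html?collection=example&query=policy"

Requests that omit the token, or that supply an incorrect value, would be denied access to the results page.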

Results page access restriction via hostname and IP address

Hostname suffix and IP address access restrictions can be applied to a results page by setting the access_restriction option. When this option denies access, the access_alternate option can be used to automatically redirect the user to an alternate results page.
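
For example, the following settings (all values are illustrative) grant access only to clients in the 10.* address range or with hostnames ending in example.com, and redirect everyone else to a public results page:

    access_restriction=10.,example.com
    access_alternate=public-search

Note that hostname suffix matching typically relies on a reverse DNS lookup of the client's IP address.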

Controlling search scope

The internal and external indexes have different scopes (they index different document sets). This section describes three ways of controlling scope: rule-based crawl, proxy crawl, and password crawl.

A rule-based crawl simply involves setting appropriate include_patterns and exclude_patterns for each data source. This approach is only recommended where the appearance of new internal-only areas is actively monitored, since each time a new internal-only area appears the external index's patterns must be reviewed and updated.
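
As an illustrative sketch (the hostnames and patterns are hypothetical), the two data sources might be configured as follows in their respective collection.cfg files:

    # Internal index: include everything on the organization's servers
    include_patterns=example.edu

    # External index: include everything except the known internal-only areas
    include_patterns=example.edu
    exclude_patterns=intranet.example.edu,example.edu/internal/

Whenever a new internal-only area appears, the external index's exclude_patterns must be extended to match it.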

A proxy crawl is more robust than a rule-based approach, but relies on the availability of an externally addressed proxy. Consider a university as an example: a rule-based external index of the university may be difficult to maintain, since there are a large number of volatile servers and new internal-only documents may be published without warning.

Instead, the university could set up a proxy with a non-university address. The external index is configured to crawl through this proxy (using the http_proxy options in the collection.cfg file), allowing an "external" crawl to be performed from an internal machine.
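
As a sketch, the external index's collection.cfg might contain settings along these lines (the proxy hostname and port are hypothetical):

    http_proxy=proxy.external-example.com
    http_proxy_port=3128

Because the proxy has a non-university address, the web servers treat the crawler as an external client, so only externally visible documents are fetched.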

A password crawl differs from the other two approaches: it is used where the internal documents are protected by a password rather than by an IP/hostname restriction. In that case, the external index can be a plain crawl of the available server(s), while the internal index uses the http_user and http_passwd options to retrieve the internal documents.
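
As a sketch, the internal index's collection.cfg might carry credentials along these lines (the username and password are placeholders):

    http_user=crawlerbot
    http_passwd=example-password

The external index needs no credentials, so its crawl naturally excludes the password-protected internal documents.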