Restricting search results access by IP address/domain
Access to search results can be restricted to a set of IP addresses or domains.
If your documents can generally be classified as internal or external, and you do not need finer-grained control than that, restricting access by IP address is appropriate.
Internal documents are those for use within your organisation only (not for public consumption). To index internal documents while still offering a public index, build two indexes:
Internal index: covering all available documents, including internal documents.
External index: covering only documents available externally (not internal documents).
To build these two indexes (or more complex configurations), you will need techniques for access restriction and controlling search scope.
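The access restriction step can be sketched as a check on the client address that decides which index a search request is served from. This is a minimal illustration, not the product's own mechanism: the network ranges and the function name index_for_client are hypothetical.

```python
import ipaddress

# Hypothetical internal ranges; replace with your organisation's networks.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def index_for_client(client_ip: str) -> str:
    """Return which index a client may search, based on its IP address."""
    addr = ipaddress.ip_address(client_ip)
    # Clients inside any internal range get the internal index.
    if any(addr in net for net in INTERNAL_NETWORKS):
        return "internal"
    # Everyone else is limited to the external index.
    return "external"
```

A request from 10.1.2.3 would be routed to the internal index, while any address outside the listed ranges falls back to the external one.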
The internal and external indexes have different scopes (they index different document sets). This section describes three ways of controlling scope: rule-based crawl, proxy crawl and password crawl.
A rule-based crawl simply involves setting appropriate exclude_patterns for each data source. This approach is only recommended where the appearance of new internal-only areas is monitored, since the external index’s rules should be reviewed each time one appears.
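The effect of a rule-based scope can be sketched as a filter over candidate URLs. This is an illustration only: it assumes each exclude pattern is a plain substring of the URL, which may differ from how the crawler actually evaluates exclude_patterns, and the URLs and the function name split_by_scope are hypothetical.

```python
def split_by_scope(urls, exclude_patterns):
    """Return (internal_index, external_index) URL lists.

    The internal index covers every URL; the external index drops any URL
    containing one of the exclude patterns (substring match assumed).
    """
    internal_index = list(urls)
    external_index = [
        url for url in urls
        if not any(pattern in url for pattern in exclude_patterns)
    ]
    return internal_index, external_index


# Hypothetical example data.
urls = [
    "http://www.example.edu/courses/",
    "http://www.example.edu/staff-only/payroll.html",
    "http://intranet.example.edu/minutes.html",
]
internal, external = split_by_scope(urls, ["/staff-only/", "intranet.example.edu"])
```

A new internal-only area that matches none of the patterns would pass straight into the external list, which is why the rules need reviewing whenever such an area appears.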
A proxy crawl is more foolproof than a rule-based approach, but relies on the availability of an external proxy. Consider a university as an example. A rule-based external index of the university may be difficult to maintain, since there are a large number of volatile servers and new internal-only documents may be published without warning.
Instead, the university could set up a proxy with a non-university address. The external index is configured to crawl through this proxy (using the http_proxy options in the collection.cfg file), allowing an "external crawl" on an internal machine.
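To check what an external visitor can actually reach, it can help to fetch a few URLs through the same proxy by hand. The sketch below uses Python's standard urllib rather than the crawler itself; the proxy address and the function name build_proxy_opener are hypothetical.

```python
import urllib.request


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends HTTP and HTTPS requests via proxy_url."""
    proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(proxy)


# Hypothetical proxy with a non-university address.
opener = build_proxy_opener("http://proxy.example.net:3128")
# opener.open("http://www.example.edu/") would now be seen by the server
# as coming from the proxy's address, not from the internal machine.
```

Pages that return an error or a login prompt through this opener are the ones an IP-restricted server treats as internal, so they should not appear in the external index.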
A password crawl differs from the other two. It is used where the internal documents are protected by password rather than by IP/hostname restriction. In that case, the external index can be a plain crawl of the available server(s), and the internal index can use the http_passwd options to fetch the internal documents.
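As a rough illustration of what distinguishes the two crawls, the sketch below builds the credential header a client would send, assuming the servers use HTTP Basic authentication (the source does not say which scheme is in use). The function name basic_auth_header and the credentials are hypothetical.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Return the value of an HTTP Basic 'Authorization' header."""
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


# The internal crawl sends this header; the external crawl sends none,
# so password-protected documents are never fetched for the external index.
header = basic_auth_header("user", "pass")
```

Because the external crawl carries no credentials, its scope is limited by the servers themselves and no exclude rules need to be maintained.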