gscopes.cfg (configuration file)
gscopes.cfg defines a set of generalized scopes that are applied based on URL patterns.
Background
Generalized scopes (gscopes) can be used in numerous ways to narrow down searches to particular sub-parts of a collection. The gscopes.cfg file is the standard place to store mappings from gscopes to the URL patterns that they should be applied to.
Format
A text file, with one gscope definition per line.
GSCOPE-ID REGULAR-EXPRESSION
GSCOPE-ID - An alphanumeric ASCII string no longer than 64 characters. White space and punctuation are not permitted. GSCOPE-ID values starting with Fun in any upper or lower case form are reserved for internal use only.
REGULAR-EXPRESSION - A Perl5-compatible regular expression that is matched against the URL.
- GSCOPE-ID values can be used more than once with different regular expressions. The resulting gscope within the index will include the documents that match any of the supplied regular expressions.
- REGULAR-EXPRESSION values can be used more than once with different GSCOPE-ID values. Any document within the index will be tagged with all matching GSCOPE-ID values.
- GSCOPE-ID values specified in query-gscopes.cfg are combined with any URL pattern based entries from gscopes.cfg. The resulting gscope within the index will include all documents that either have a URL matching a regular expression defined in gscopes.cfg or match a query defined in query-gscopes.cfg.
- Invalid GSCOPE-ID values will be skipped when an update runs and the matching rule will be excluded from the index. This will only raise a warning in the Step-SetGsopes.log.
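The rules above can be sketched in a few lines of Python. This is an illustration only, not the product's own implementation: the function names are hypothetical, Python's re module stands in for a Perl5-compatible engine (the two agree for simple patterns like these), and the intranet.company.com rule is an invented example. It is assumed here that a pattern may match anywhere in the URL and that a malformed line is skipped in the same way as an invalid GSCOPE-ID.

```python
import re

# Assumed reading of the ID rule: ASCII letters and digits only, 1 to 64 characters.
ID_PATTERN = re.compile(r"[A-Za-z0-9]{1,64}")


def is_valid_gscope_id(gscope_id):
    """Return True if the ID is alphanumeric, at most 64 chars, and not reserved."""
    if ID_PATTERN.fullmatch(gscope_id) is None:
        return False
    # IDs starting with "Fun" in any letter case are reserved for internal use.
    return not gscope_id.lower().startswith("fun")


def parse_gscopes(text):
    """Parse 'GSCOPE-ID REGULAR-EXPRESSION' lines into (id, compiled regex) pairs."""
    rules, skipped = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 1)  # split on the first run of white space
        if len(parts) != 2 or not is_valid_gscope_id(parts[0]):
            skipped.append(line)  # invalid rules are skipped, not fatal
            continue
        rules.append((parts[0], re.compile(parts[1])))
    return rules, skipped


def gscopes_for_url(rules, url):
    """Return every gscope ID with at least one pattern matching the URL."""
    return {gscope_id for gscope_id, pattern in rules if pattern.search(url)}


# Sample configuration; the third rule is hypothetical and reuses an existing ID.
SAMPLE = r"""
documents www\.company\.com/documents/
importantWordDocuments www\.company\.com/documents/important/.*\.doc
documents intranet\.company\.com/docs/
bad-id www\.company\.com/other/
FunInternal www\.company\.com/
"""

rules, skipped = parse_gscopes(SAMPLE)
print(sorted(gscopes_for_url(rules, "http://www.company.com/documents/important/plan.doc")))
print(skipped)
```

Because the result is a set of IDs, a document is tagged with every matching gscope, and reusing an ID across several lines gives "match any of the patterns" behaviour without extra bookkeeping.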
Examples
Maps government websites to different gscopes based on state:
act \.act\.gov\.au/
qld \.qld\.gov\.au/
tas \.tas\.gov\.au/
nsw \.nsw\.gov\.au/
Maps the 'documents' section of a website to the gscope documents. Additionally gives '.doc' files in the important subdirectory the gscope importantWordDocuments:
documents www\.company\.com/documents/
importantWordDocuments www\.company\.com/documents/important/.*\.doc
Prefix the regular expression with the (?i) directive to use case-insensitive matching:
documents (?i)www\.company\.com/documents/
This will match URLs containing "Documents", "DOCUMENTS", "DoCuments", etc.
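The effect of the (?i) directive can be checked with a short Python sketch. Python's re module is used here as a stand-in for a Perl5-compatible engine, the sample URLs are invented for illustration, and matching_urls is a hypothetical helper.

```python
import re

# The same pattern with and without the leading (?i) directive.
case_sensitive = re.compile(r"www\.company\.com/documents/")
case_insensitive = re.compile(r"(?i)www\.company\.com/documents/")

# Hypothetical URLs differing only in letter case.
urls = [
    "http://www.company.com/documents/a.html",
    "http://www.company.com/Documents/a.html",
    "http://WWW.COMPANY.COM/DOCUMENTS/a.html",
]


def matching_urls(pattern, candidates):
    """Return the candidate URLs in which the pattern matches anywhere."""
    return [url for url in candidates if pattern.search(url)]


sensitive_hits = matching_urls(case_sensitive, urls)
insensitive_hits = matching_urls(case_insensitive, urls)
print(len(sensitive_hits), len(insensitive_hits))
```

A leading (?i) applies to the whole expression, so the host name is also matched without regard to case, not only the path.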