Implementer training - Configuring URL sets (generalized scopes)
The generalized scopes mechanism in Funnelback allows an administrator to group documents that match a set of URL patterns (e.g. */publications/*), or that are returned by a specified query (e.g. author:shakespeare).
Once defined, these groupings can be used for:
-
Scoped searches (providing a search that only looks within a particular set of documents).
-
Creating additional services (providing a search service with separate templates, analytics and configuration that is limited to a particular set of documents).
-
Faceted navigation categories (counting the number of documents in the result set that match this grouping).
The patterns used to match against the URLs are Perl regular expressions, which allow very complex matching rules to be defined. If you are not familiar with regular expressions, don’t worry: simple substring matching will also work.
The specified query can be anything that is definable using the Funnelback query language.
Generalized scopes are a good way of adding some structure to an index that lacks any metadata, by either making use of the URL structure, or by creating groupings based on pre-defined queries.
Metadata should always be used in preference to generalized scopes where possible, as generalized scopes (gscopes) carry a much higher maintenance overhead.
URLs can be grouped into multiple sets by having additional patterns defined within the configuration file.
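As a rough illustration of how pattern lines group URLs into sets, the following Python sketch imitates the matching step. It is not Funnelback's implementation: the helper names parse_gscopes and gscopes_for are invented for this example, Python's re module stands in for Perl regular expressions (the two behave the same for simple patterns like these), the allfilms ID is hypothetical, and the example.com URL is a made-up page that matches nothing. Each configuration line is assumed to be a gscope ID, whitespace, then a pattern, as in the gscopes.cfg examples used in this training.

```python
import re

# Illustrative configuration. "charlie" and "buster" come from this training;
# "allfilms" is a hypothetical extra ID showing a URL landing in several sets.
GSCOPES_CFG = """\
charlie /details/CC_
buster archive.org/details/Cops1922
buster archive.org/details/Neighbors1920
allfilms archive.org/details/
"""


def parse_gscopes(cfg_text):
    """Return (gscope_id, compiled_pattern) pairs, one per non-blank line."""
    rules = []
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumed line format: ID, whitespace, then the pattern.
        gscope_id, pattern = line.split(None, 1)
        rules.append((gscope_id, re.compile(pattern)))
    return rules


def gscopes_for(url, rules):
    """Collect every gscope ID whose pattern is found anywhere in the URL."""
    return {gscope_id for gscope_id, pattern in rules if pattern.search(url)}


RULES = parse_gscopes(GSCOPES_CFG)

for url in (
    "https://archive.org/details/CC_1916_05_15_TheFloorwalker",
    "https://archive.org/details/Cops1922",
    "https://example.com/about",  # hypothetical URL that matches no pattern
):
    print(url, sorted(gscopes_for(url, RULES)))
```

Because the search looks for the pattern anywhere in the URL, a plain substring such as /details/CC_ works without any regular expression syntax, and a single URL can end up in more than one set.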
Tutorial: Configuring URL sets that match a URL pattern
The process for creating configuration for generalized scopes is very similar to that for external metadata.
-
Log in to the search dashboard where you are doing your training.
See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
-
Navigate to the manage data source screen for the silent films - website data source.
-
Select manage data source configuration files from the settings panel.
-
Create a gscopes.cfg by clicking the add new button, then selecting gscopes.cfg from the file type menu, then clicking the save button.
-
Click on the gscopes.cfg item in the file listing. A blank file editor screen will load. We will define a URL grouping that groups together a set of pages about Charlie Chaplin.
When defining gscopes there are often many different ways of achieving the same result.
The following pattern tells Funnelback to create a set of URLs with a gscope ID of charlie that is made up of any URL containing the substring /details/CC_:
charlie /details/CC_
The following would probably also achieve the same result. This tells Funnelback to tag the listed URLs with a gscope ID of charlie. The match is still a substring, but this time the match is much more precise, so each item is likely to match only a single URL. Observe also that it is possible to assign the same gscope ID to many patterns:
charlie https://archive.org/details/CC_1916_05_15_TheFloorwalker
charlie https://archive.org/details/CC_1916_07_10_TheVagabond
charlie https://archive.org/details/CC_1914_03_26_CruelCruelLove
charlie https://archive.org/details/CC_1914_02_02_MakingALiving
charlie https://archive.org/details/CC_1914_09_07_TheRounders
charlie https://archive.org/details/CC_1914_05_07_ABusyDay
charlie https://archive.org/details/CC_1914_07_09_LaffingGas
charlie https://archive.org/details/CC_1916_09_04_TheCount
charlie https://archive.org/details/CC_1915_02_01_HisNewJob
charlie https://archive.org/details/CC_1914_06_13_MabelsBusyDay
charlie https://archive.org/details/CC_1914_11_07_MusicalTramps
charlie https://archive.org/details/CC_1916_12_04_TheRink
charlie https://archive.org/details/CC_1914_12_05_AFairExchange
charlie https://archive.org/details/CC_1914_06_01_TheFatalMallet
charlie https://archive.org/details/CC_1914_06_11_TheKnockout
charlie https://archive.org/details/CC_1914_03_02_FilmJohnny
charlie https://archive.org/details/CC_1914_04_27_CaughtinaCaberet
charlie https://archive.org/details/CC_1914_10_10_TheRivalMashers
charlie https://archive.org/details/CC_1914_11_09_HisTrystingPlace
charlie https://archive.org/details/CC_1914_08_27_TheMasquerader
charlie https://archive.org/details/CC_1916_05_27_Police
charlie https://archive.org/details/CC_1916_10_02_ThePawnshop
charlie https://archive.org/details/CC_1915_10_04_CharlieShanghaied
charlie https://archive.org/details/CC_1916_06_12_TheFireman
charlie https://archive.org/details/CC_1914_02_28_BetweenShowers
charlie https://archive.org/details/CC_1918_09_29_TheBond
charlie https://archive.org/details/CC_1918_xx_xx_TripleTrouble
charlie https://archive.org/details/CC_1914_08_31_TheGoodforNothing
charlie https://archive.org/details/CC_1914_04_20_TwentyMinutesofLove
charlie https://archive.org/details/CC_1914_03_16_HisFavoritePasttime
charlie https://archive.org/details/CC_1917_10_22_TheAdventurer
charlie https://archive.org/details/CC_1914_06_20_CharlottEtLeMannequin
charlie https://archive.org/details/CC_1917_06_17_TheImmigrant
charlie https://archive.org/details/CC_1916_11_13_BehindtheScreen
charlie https://archive.org/details/CC_1914_08_10_FaceOnTheBarroomFloor
charlie https://archive.org/details/CC_1914_10_29_CharlottMabelAuxCourses
charlie https://archive.org/details/CC_1914_10_26_DoughandDynamite
charlie https://archive.org/details/CC_1914_12_07_HisPrehistoricpast
charlie https://archive.org/details/CC_1914_02_09_MabelsStrangePredicament
charlie https://archive.org/details/CC_1914_11_14_TilliesPuncturedRomance
charlie https://archive.org/details/CC_1915_12_18_ABurlesqueOnCarmen
charlie https://archive.org/details/CC_1914_08_01_CharolotGargonDeTheater
charlie https://archive.org/details/CC_1917_04_16_TheCure
charlie https://archive.org/details/CC_1916_08_07_One_A_M
charlie https://archive.org/details/CC_1914_08_13_CharliesRecreation
charlie https://archive.org/details/CC_1914_02_07_KidsAutoRaceAtVenice
charlie https://archive.org/details/CC_1914_04_04_TheLandladysPet
Finally, the following regular expression would also achieve the same result.
charlie archive.org/details/CC_.*$
This may seem a bit confusing, but keep in mind that the defined pattern can be as general or as specific as you like; the trade-off is in what will match. The pattern needs to be specific enough to match the items you want while excluding those that shouldn’t be matched.
Copy and paste the following into your gscopes.cfg and click the save button. This will set up two URL sets - the first (charlie) matching a subset of pages about Charlie Chaplin and the second (buster) matching a set of pages about Buster Keaton.
charlie /details/CC_
buster archive.org/details/Cops1922
buster archive.org/details/Neighbors1920
buster archive.org/details/DayDreams1922
buster archive.org/details/OneWeek1920
buster archive.org/details/Convict13_201409
buster archive.org/details/HardLuck_201401
buster archive.org/details/ThePlayHouse1921
buster archive.org/details/College_201405
buster archive.org/details/TheScarecrow1920
buster archive.org/details/MyWifesRelations1922
buster archive.org/details/TheHighSign_201502
buster archive.org/details/CutTheGoat1921
buster archive.org/details/TheFrozenNorth1922
buster archive.org/details/BusterKeatonsThePaleface
-
Rebuild the index (Select start advanced update from the update panel, then select reapply gscopes to live view and click the update button) to apply these generalized scopes to the index.
-
Confirm that the gscopes are applied. Run a search for day dreams and view the JSON/XML data model. Locate the results and observe the values of the gscopesSet field. Items that match one of the Buster Keaton films listed above should have a value of buster set.
If you see gscopes that look like FUN followed by a random string of letters and numbers, these are gscopes that are defined by Funnelback when you create faceted navigation based on queries.
-
Use gscopes to scope the search. Run a search for !showeverything. Add &gscope1=charlie to the URL and press enter. Observe that all the results are now restricted to films featuring Charlie Chaplin (and more specifically all the URLs contain /details/CC_ as a substring). Change the URL to have &gscope1=buster and rerun the search. This time all the results returned should be links to films featuring Buster Keaton. Advanced scoping that combines gscopes is also possible using reverse Polish notation when configuring query processor options. See the documentation above for more information.
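To give a feel for how a reverse Polish expression combines gscopes, here is a minimal Python sketch of a stack-based evaluator. The operator symbols are an assumption made for this illustration (+ for AND, | for OR, ! for NOT, with commas separating adjacent gscope IDs); check the Funnelback gscope expressions documentation for the exact syntax before relying on it. The function name matches_gscope_expression is hypothetical and is not part of Funnelback.

```python
import re


def matches_gscope_expression(expression, doc_gscopes):
    """Evaluate a reverse Polish gscope expression for one document.

    doc_gscopes is the set of gscope IDs set on the document.
    Assumed operators: "+" is AND, "|" is OR, "!" is NOT.
    """
    stack = []
    # Operands are gscope IDs; commas only separate adjacent operands,
    # so the tokenizer simply skips them.
    for token in re.findall(r"[A-Za-z0-9_]+|[+|!]", expression):
        if token == "!":
            # Unary operator: negate the most recent value.
            stack.append(not stack.pop())
        elif token in ("+", "|"):
            # Binary operator: combine the two most recent values.
            right = stack.pop()
            left = stack.pop()
            stack.append((left and right) if token == "+" else (left or right))
        else:
            # Operand: true when the document carries this gscope ID.
            stack.append(token in doc_gscopes)
    if len(stack) != 1:
        raise ValueError(f"malformed expression: {expression!r}")
    return stack[0]


# charlie OR buster
print(matches_gscope_expression("charlie,buster|", {"buster"}))
# charlie AND NOT buster
print(matches_gscope_expression("charlie,buster!+", {"charlie", "buster"}))
```

In reverse Polish notation the operands come first and the operator follows, so no brackets are needed: each operator applies to the values most recently pushed onto the stack.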
Tutorial: Configuring URL sets that match a Funnelback query
-
Log in to the search dashboard where you are doing your training.
See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
-
Navigate to the manage data source screen for the silent films - website data source.
-
Select manage data source configuration files from the settings panel.
-
Create a query-gscopes.cfg by clicking the add new button, then selecting query-gscopes.cfg from the file type menu, then clicking the save button.
-
A blank file editor screen will load. We will define a URL set containing all silent movies about Christmas.
The following pattern tells Funnelback to create a set of URLs with a gscope ID of XMAS that is made up of the set of URLs returned when searching for christmas:
XMAS christmas
The query is specified using Funnelback’s query language and supports any advanced operators that can be passed in via the search box.
-
Rebuild the index (Select start advanced update from the update panel, then select reapply gscopes to live view and click the update button) to apply these generalized scopes to the index.
-
Confirm that the gscopes are applied. Run a search for christmas and view the JSON/XML data model. Locate the results and observe the values of the gscopesSet field. The returned items should have a value of XMAS set.
-
Use gscopes to scope the search. Run a search for !showeverything. Add &gscope1=XMAS to the URL and press enter. Observe that all the results are now restricted to the films about Christmas. Replace gscope1=XMAS with gscope1=xmas and observe that the gscope value is case-sensitive.
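The scoped search URLs used in both tutorials can also be assembled programmatically. The Python sketch below builds one with the standard library. The host name, the /s/search.html path, the collection parameter name and the example-collection value are assumptions for illustration only; substitute the values from your own search URL. Only the query and gscope1 parameters are taken from the steps above.

```python
from urllib.parse import urlencode

# Hypothetical values: replace with the search endpoint and collection
# shown in your own browser address bar.
BASE_URL = "https://search.example.com/s/search.html"
COLLECTION = "example-collection"


def scoped_search_url(base_url, collection, query, gscope_id):
    """Build a search URL restricted to one gscope via gscope1."""
    params = {
        "collection": collection,  # assumed parameter name
        "query": query,
        "gscope1": gscope_id,  # passed through unchanged: IDs are case-sensitive
    }
    # urlencode percent-encodes special characters such as the "!" in
    # !showeverything.
    return base_url + "?" + urlencode(params)


print(scoped_search_url(BASE_URL, COLLECTION, "!showeverything", "XMAS"))
print(scoped_search_url(BASE_URL, COLLECTION, "!showeverything", "buster"))
```

The gscope ID is deliberately not normalized: because the value is case-sensitive, XMAS and xmas produce different URLs and refer to different URL sets.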
Extended exercises and questions: URL sets (gscopes)
-
Redo the first gscopes exercise, but with the alternate pattern sets defined in step 4 of the exercise. Compare the results and observe that a similar result is achieved with the three different pattern sets.
-
Create a generalized scope that contains all documents where the director is Alfred Hitchcock.
-
Why is using gscopes to apply keywords higher maintenance than using a metadata field?
-
Construct a reverse-polish gscope expression that includes charlie OR christmas but not buster. Hint: Gscope expressions