Implementer training - Configuring URL sets (generalized scopes)

The generalized scopes mechanism in Funnelback allows an administrator to group sets of documents that match a set of URL patterns (e.g. */publications/*), or all the URLs returned by a specified query (e.g. author:shakespeare).

Once defined, these groupings can be used for:

  • Scoped searches (providing a search that only looks within a particular set of documents).

  • Creating additional services (providing a search service with separate templates, analytics and configuration that is limited to a particular set of documents).

  • Faceted navigation categories (counting the number of documents in the result set that match this grouping).

The patterns used to match against the URLs are Perl regular expressions, which allows very complex matching rules to be defined. If you don’t know what a regular expression is, don’t worry: simple substring matching also works.
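
To illustrate the difference between substring and regular expression matching, here is a minimal sketch in Python. It assumes Python's re module as a stand-in for Perl regular expressions (the simple constructs used here behave the same in both), and the matches helper is a hypothetical name, not part of Funnelback.

```python
import re

def matches(pattern: str, url: str) -> bool:
    """Return True if the pattern is found anywhere in the URL."""
    # re.search scans the whole string, so a pattern without anchors or
    # metacharacters behaves like a plain substring test.
    return re.search(pattern, url) is not None

url = "https://archive.org/details/CC_1916_05_15_TheFloorwalker"

print(matches("/details/CC_", url))              # plain substring
print(matches(r"/details/CC_19(14|16)_", url))   # regular expression with alternation
print(matches("/publications/", url))            # substring that is not present
```

The first two calls match and the third does not. A pattern only needs regular expression syntax when a substring cannot express the rule, such as matching several years at once.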

The specified query can be anything that is definable using the Funnelback query language.

Generalized scopes are a good way of adding some structure to an index that lacks any metadata, by either making use of the URL structure, or by creating groupings based on pre-defined queries.

Metadata should always be used in preference to generalized scopes where possible, as generalized scopes (gscopes) carry a much higher maintenance overhead.

URLs can be grouped into multiple sets by having additional patterns defined within the configuration file.

Tutorial: Configuring URL sets that match a URL pattern

The process for creating configuration for generalized scopes is very similar to that for external metadata.

  1. Log in to the search dashboard where you are doing your training.

    See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
  2. Navigate to the manage data source screen for the silent films - website data source.

  3. Select manage data source configuration files from the settings panel.

  4. Create a gscopes.cfg by clicking the add new button, then selecting gscopes.cfg from the file type menu, then clicking the save button.

  5. Click on the gscopes.cfg item in the file listing. A blank file editor screen will load. We will define a URL grouping that groups together a set of pages about Charlie Chaplin.

    When defining gscopes there are often many different ways of achieving the same result.

    The following pattern tells Funnelback to create a set of URLs with a gscope ID of charlie that is made up of any URL containing the substring /details/CC_:

    charlie /details/CC_

    The following would probably achieve the same result. It tells Funnelback to tag the listed URLs with a gscope ID of charlie. The match is still a substring, but this time it is much more precise, so each pattern is likely to match only a single URL. Observe also that the same gscope ID can be assigned to many patterns:
    charlie https://archive.org/details/CC_1916_05_15_TheFloorwalker
    charlie https://archive.org/details/CC_1916_07_10_TheVagabond
    charlie https://archive.org/details/CC_1914_03_26_CruelCruelLove
    charlie https://archive.org/details/CC_1914_02_02_MakingALiving
    charlie https://archive.org/details/CC_1914_09_07_TheRounders
    charlie https://archive.org/details/CC_1914_05_07_ABusyDay
    charlie https://archive.org/details/CC_1914_07_09_LaffingGas
    charlie https://archive.org/details/CC_1916_09_04_TheCount
    charlie https://archive.org/details/CC_1915_02_01_HisNewJob
    charlie https://archive.org/details/CC_1914_06_13_MabelsBusyDay
    charlie https://archive.org/details/CC_1914_11_07_MusicalTramps
    charlie https://archive.org/details/CC_1916_12_04_TheRink
    charlie https://archive.org/details/CC_1914_12_05_AFairExchange
    charlie https://archive.org/details/CC_1914_06_01_TheFatalMallet
    charlie https://archive.org/details/CC_1914_06_11_TheKnockout
    charlie https://archive.org/details/CC_1914_03_02_FilmJohnny
    charlie https://archive.org/details/CC_1914_04_27_CaughtinaCaberet
    charlie https://archive.org/details/CC_1914_10_10_TheRivalMashers
    charlie https://archive.org/details/CC_1914_11_09_HisTrystingPlace
    charlie https://archive.org/details/CC_1914_08_27_TheMasquerader
    charlie https://archive.org/details/CC_1916_05_27_Police
    charlie https://archive.org/details/CC_1916_10_02_ThePawnshop
    charlie https://archive.org/details/CC_1915_10_04_CharlieShanghaied
    charlie https://archive.org/details/CC_1916_06_12_TheFireman
    charlie https://archive.org/details/CC_1914_02_28_BetweenShowers
    charlie https://archive.org/details/CC_1918_09_29_TheBond
    charlie https://archive.org/details/CC_1918_xx_xx_TripleTrouble
    charlie https://archive.org/details/CC_1914_08_31_TheGoodforNothing
    charlie https://archive.org/details/CC_1914_04_20_TwentyMinutesofLove
    charlie https://archive.org/details/CC_1914_03_16_HisFavoritePasttime
    charlie https://archive.org/details/CC_1917_10_22_TheAdventurer
    charlie https://archive.org/details/CC_1914_06_20_CharlottEtLeMannequin
    charlie https://archive.org/details/CC_1917_06_17_TheImmigrant
    charlie https://archive.org/details/CC_1916_11_13_BehindtheScreen
    charlie https://archive.org/details/CC_1914_08_10_FaceOnTheBarroomFloor
    charlie https://archive.org/details/CC_1914_10_29_CharlottMabelAuxCourses
    charlie https://archive.org/details/CC_1914_10_26_DoughandDynamite
    charlie https://archive.org/details/CC_1914_12_07_HisPrehistoricpast
    charlie https://archive.org/details/CC_1914_02_09_MabelsStrangePredicament
    charlie https://archive.org/details/CC_1914_11_14_TilliesPuncturedRomance
    charlie https://archive.org/details/CC_1915_12_18_ABurlesqueOnCarmen
    charlie https://archive.org/details/CC_1914_08_01_CharolotGargonDeTheater
    charlie https://archive.org/details/CC_1917_04_16_TheCure
    charlie https://archive.org/details/CC_1916_08_07_One_A_M
    charlie https://archive.org/details/CC_1914_08_13_CharliesRecreation
    charlie https://archive.org/details/CC_1914_02_07_KidsAutoRaceAtVenice
    charlie https://archive.org/details/CC_1914_04_04_TheLandladysPet

    Finally, the following regular expression would also achieve the same result.

    charlie archive.org/details/CC_.*$

    This may seem a bit confusing, but keep in mind that the defined pattern can be as general or as specific as you like; the trade-off is in what it will match. The pattern needs to be specific enough to match the items you want while excluding those that shouldn’t be matched.

    Copy and paste the following into your gscopes.cfg and click the save button. This will set up two URL sets - the first (charlie) matching a subset of pages about Charlie Chaplin and the second (buster) matching a set of pages about Buster Keaton.

    charlie /details/CC_
    buster archive.org/details/Cops1922
    buster archive.org/details/Neighbors1920
    buster archive.org/details/DayDreams1922
    buster archive.org/details/OneWeek1920
    buster archive.org/details/Convict13_201409
    buster archive.org/details/HardLuck_201401
    buster archive.org/details/ThePlayHouse1921
    buster archive.org/details/College_201405
    buster archive.org/details/TheScarecrow1920
    buster archive.org/details/MyWifesRelations1922
    buster archive.org/details/TheHighSign_201502
    buster archive.org/details/CutTheGoat1921
    buster archive.org/details/TheFrozenNorth1922
    buster archive.org/details/BusterKeatonsThePaleface
  6. Rebuild the index (Select start advanced update from the update panel, then select reapply gscopes to live view and click the update button) to apply these generalized scopes to the index.

  7. Confirm that the gscopes are applied. Run a search for day dreams and view the JSON/XML data model. Locate the results and observe the values of the gscopesSet field. Items that match one of the Buster Keaton films listed above should have a value of buster set.

    If you see gscopes that look like FUN followed by a random string of letters and numbers, these are gscopes that Funnelback defines when you create faceted navigation based on queries.
  8. Use gscopes to scope the search. Run a search for !showeverything. Add &gscope1=charlie to the URL and press enter. Observe that all the results are now restricted to films featuring Charlie Chaplin (and more specifically, all the URLs contain /details/CC_ as a substring). Change the URL to have &gscope1=buster and rerun the search. This time all the results returned should be links to films featuring Buster Keaton. Advanced scoping that combines gscopes is also possible using reverse Polish notation when configuring query processor options. See the documentation for more information.
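
If you want to check a pattern set before rebuilding the index, the matching described in step 5 can be approximated offline. The following is a rough sketch only: it assumes each gscopes.cfg line is a gscope ID followed by whitespace and a pattern, uses Python's re module in place of Perl regular expressions, and the helper names (parse_gscopes, gscopes_for, tagged) are hypothetical. The explicit variant is abbreviated to two of the listed URLs, so the comparison only holds for this small sample.

```python
import re

def parse_gscopes(text):
    """Parse lines of the form '<gscope ID> <pattern>' into rule pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the ID and the pattern are separated by the first run of whitespace.
        gscope_id, pattern = line.split(None, 1)
        rules.append((gscope_id, re.compile(pattern)))
    return rules

def gscopes_for(url, rules):
    """Return the set of gscope IDs whose pattern occurs in the URL."""
    return {gscope_id for gscope_id, pattern in rules if pattern.search(url)}

# The three alternative pattern sets from step 5 (explicit list abbreviated).
SUBSTRING = "charlie /details/CC_"
EXPLICIT = """
charlie https://archive.org/details/CC_1916_05_15_TheFloorwalker
charlie https://archive.org/details/CC_1916_07_10_TheVagabond
"""
REGEX = "charlie archive.org/details/CC_.*$"

urls = [
    "https://archive.org/details/CC_1916_05_15_TheFloorwalker",
    "https://archive.org/details/CC_1916_07_10_TheVagabond",
    "https://archive.org/details/Cops1922",
]

def tagged(config, gscope_id):
    """List the sample URLs that a config would tag with the given gscope ID."""
    rules = parse_gscopes(config)
    return [u for u in urls if gscope_id in gscopes_for(u, rules)]

for config in (SUBSTRING, EXPLICIT, REGEX):
    print(tagged(config, "charlie"))
```

All three variants tag the same two Chaplin URLs and leave the Buster Keaton URL untagged. Because gscopes_for returns a set, the same approach also shows when a URL falls into more than one grouping.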

Tutorial: Configuring URL sets that match a Funnelback query

  1. Log in to the search dashboard where you are doing your training.

    See: Training - search dashboard access information if you’re not sure how to access the training. Ignore this step if you’re treating this as a non-interactive tutorial.
  2. Navigate to the manage data source screen for the silent films - website data source.

  3. Select manage data source configuration files from the settings panel.

  4. Create a query-gscopes.cfg by clicking the add new button, then selecting query-gscopes.cfg from the file type menu, then clicking the save button.

  5. A blank file editor screen will load. We will define a URL set containing all silent movies about Christmas.

    The following pattern tells Funnelback to create a set of URLs with a gscope ID of XMAS that is made up of the set of URLs returned when searching for christmas:

    XMAS christmas

    The query is specified using Funnelback’s query language and supports any advanced operators that can be passed in via the search box.

  6. Rebuild the index (Select start advanced update from the update panel, then select reapply gscopes to live view and click the update button) to apply these generalized scopes to the index.

  7. Confirm that the gscopes are applied. Run a search for christmas and view the JSON/XML data model. Locate the results and observe the values of the gscopesSet field. The returned items should have a value of XMAS set.

  8. Use gscopes to scope the search. Run a search for !showeverything. Add &gscope1=XMAS to the URL and press enter. Observe that all the results are now restricted to the films about Christmas. Replace gscope1=XMAS with gscope1=xmas and observe that the gscope value is case-sensitive.
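
The scoped search URLs used in step 8 can also be built programmatically. This is a sketch under stated assumptions: the base URL, the collection and query parameter names, and the silent-films collection ID are placeholders for illustration, so substitute the values from your own search URL. Only the gscope1 parameter is taken from the steps above.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; replace with the URL of your own search.
BASE_URL = "https://search.example.com/s/search.html"

def scoped_search_url(query, gscope, collection="silent-films"):
    """Build a search URL restricted to a single gscope."""
    # 'collection' and 'query' are assumed parameter names; 'gscope1' is
    # the scoping parameter used in the tutorial.
    params = {"collection": collection, "query": query, "gscope1": gscope}
    return f"{BASE_URL}?{urlencode(params)}"

print(scoped_search_url("!showeverything", "XMAS"))
print(scoped_search_url("!showeverything", "xmas"))  # a different gscope value: IDs are case-sensitive
```

urlencode percent-encodes the leading ! of the query, which is equivalent to typing it into the browser address bar.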

Extended exercises and questions: URL sets (gscopes)
  • Redo the first gscopes exercise, but with the alternative pattern sets defined in step 5 of the exercise. Compare the results and observe that a similar result is achieved with the three different pattern sets.

  • Create a generalized scope that contains all documents where the director is Alfred Hitchcock.

  • Why is using gscopes to apply keywords higher maintenance than using a metadata field?

  • Construct a reverse Polish gscope expression that includes charlie OR XMAS but not buster. Hint: Gscope expressions
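
If reverse Polish notation is unfamiliar, the following sketch shows how such an expression is evaluated: operands are pushed onto a stack and each operator consumes the values before it. The word operators AND, OR and NOT are placeholders chosen for readability and are not Funnelback's syntax (the Gscope expressions documentation lists the actual operator symbols). The document IDs and the eval_rpn name are also made up for illustration.

```python
def eval_rpn(tokens, memberships, universe):
    """Evaluate a reverse Polish set expression over gscope memberships."""
    stack = []
    for tok in tokens:
        if tok == "OR":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)                 # union of the two operands
        elif tok == "AND":
            b, a = stack.pop(), stack.pop()
            stack.append(a & b)                 # intersection of the two operands
        elif tok == "NOT":
            stack.append(universe - stack.pop())  # complement of one operand
        else:
            # Any other token is a gscope ID; unknown IDs match nothing.
            stack.append(memberships.get(tok, set()))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]

# Hypothetical documents and their gscope memberships.
universe = {"doc1", "doc2", "doc3", "doc4", "doc5"}
memberships = {
    "charlie": {"doc1", "doc2"},
    "XMAS": {"doc2", "doc3"},
    "buster": {"doc3", "doc4"},
}

# (charlie OR XMAS) AND (NOT buster), written in reverse Polish order.
expression = ["charlie", "XMAS", "OR", "buster", "NOT", "AND"]
print(sorted(eval_rpn(expression, memberships, universe)))
```

With this sample data the expression selects doc1 and doc2: doc3 is excluded because it is also in buster, and doc5 is excluded because it is in neither charlie nor XMAS.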