Gather and index update flows
Getting content indexed by Funnelback requires a number of steps, which are executed sequentially.
There are two basic update models employed by Funnelback, depending on the type of index technology that backs the data source.
A data source that is backed by a standard index follows an update cycle that must run through a series of chained steps in their entirety, where content is pulled into the index by a gatherer that runs on a schedule.
A new index is built in isolation from the currently available search index and is made available for search at the end of a successful update by swapping the existing live index for the newly built index.
This is the model used by the majority of Funnelback data sources.
Every phase of the update must complete for the update to succeed. If something goes wrong, an error is raised and the update fails.
An update of a standard index has the following phases:
Gather: fetches content from the source repository.
Filter: transforms and analyzes the content.
Index: produces a searchable index of the content.
Reporting: creates data and broken link reports for web data sources.
Swap: performs a sanity check on the index by comparing the number of documents in the new and previous indexes. If the check passes, the new index is published and replaces the live index.
Meta dependencies: performs a number of tasks to ensure that the search packages and results pages which use this data source are updated. For example, it builds spelling and auto-completion indexes for all the results pages using the updated data.
Archive: archives log files.
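The chained nature of these phases can be illustrated with a minimal sketch. This is not Funnelback code: `run_update`, `UpdateFailed` and the placeholder phase functions are hypothetical names, used only to show that phases run in order and that the first failure aborts the whole update.

```python
from typing import Callable, List, Tuple


class UpdateFailed(Exception):
    """Raised when any phase of the update fails (hypothetical name)."""


def run_update(phases: List[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run each phase in order and stop at the first failure."""
    completed = []
    for name, phase in phases:
        try:
            phase()
        except Exception as exc:
            # Later phases never run, so the live index is left untouched.
            raise UpdateFailed(f"{name} phase failed: {exc}") from exc
        completed.append(name)
    return completed


def ok() -> None:
    """Placeholder for a phase that succeeds."""


def broken() -> None:
    """Placeholder for a phase that raises an error."""
    raise RuntimeError("index could not be built")


PHASE_NAMES = ["gather", "filter", "index", "reporting",
               "swap", "meta dependencies", "archive"]

# A successful update runs every phase.
completed = run_update([(name, ok) for name in PHASE_NAMES])

# A failure in the index phase means the swap phase is never reached.
try:
    run_update([("gather", ok), ("filter", ok), ("index", broken), ("swap", ok)])
    failure = None
except UpdateFailed as exc:
    failure = str(exc)
```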
The gather phase covers the set of processes involved in retrieving the content from the content’s source repository.
The gather process needs to implement any logic required to connect to the data source and fetch the content. The gatherer is tailored for the type of content that is being gathered.
The overall scope of what to gather also needs to be considered.
The gatherer must implement:
Authentication with the content’s source repository.
Logic required to select what content should be included and excluded.
Logic required to fetch the content from the data source.
Logic required to place limits on the gatherer run (such as an overall gather timeout).
The gatherer should also:
Implement logic required to add and remove content from an existing index (so that it doesn’t need to gather everything from scratch on each run). Note: this won’t apply to all content sources but should be implemented if it makes sense.
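The responsibilities listed above could be expressed as an interface along the following lines. This is only a sketch under stated assumptions: the `Gatherer` class, its method names and the in-memory repository are hypothetical and do not correspond to any Funnelback API.

```python
import time
from abc import ABC, abstractmethod
from typing import Dict, Iterable


class Gatherer(ABC):
    """Hypothetical interface covering the gatherer responsibilities."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds

    @abstractmethod
    def authenticate(self) -> None:
        """Authenticate with the source repository."""

    @abstractmethod
    def list_items(self) -> Iterable[str]:
        """Enumerate candidate items in the repository."""

    @abstractmethod
    def should_include(self, item_id: str) -> bool:
        """Apply include and exclude rules."""

    @abstractmethod
    def fetch(self, item_id: str) -> str:
        """Fetch the content of a single item."""

    def run(self) -> Dict[str, str]:
        # The overall gather timeout places a limit on the run.
        deadline = time.monotonic() + self.timeout_seconds
        self.authenticate()
        gathered = {}
        for item_id in self.list_items():
            if time.monotonic() > deadline:
                break
            if self.should_include(item_id):
                gathered[item_id] = self.fetch(item_id)
        return gathered


class InMemoryGatherer(Gatherer):
    """Toy gatherer over a dictionary standing in for a repository."""

    REPOSITORY = {"report.txt": "Annual report",
                  "draft.tmp": "scratch",
                  "faq.txt": "Questions"}

    def authenticate(self) -> None:
        self.authenticated = True

    def list_items(self) -> Iterable[str]:
        return sorted(self.REPOSITORY)

    def should_include(self, item_id: str) -> bool:
        return item_id.endswith(".txt")

    def fetch(self, item_id: str) -> str:
        return self.REPOSITORY[item_id]


gathered = InMemoryGatherer(timeout_seconds=60).run()
```

Incremental gathering (adding and removing content from an existing index) is left out of the sketch to keep it short.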
For a web data source the process of gathering is performed by a web crawler. The web crawler starts by fetching a seed URL (or set of URLs), which may require authentication. The crawler then parses the downloaded HTML content and extracts all the links contained within the file. These links are added to a list of URLs (known as the crawl frontier) that the crawler needs to process.
Each URL in the frontier is processed in turn. The crawler needs to decide if this URL should be included in the search - this includes checking a set of include / exclude patterns, robots.txt rules, file type and other attributes about the page. If all the checks are passed, the crawler fetches the URL and stores the document locally. Links contained within the HTML are extracted and the process continues until the crawl frontier is empty, or a pre-determined timeout is reached.
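The frontier loop described above can be sketched as follows. This is an illustration rather than the Funnelback crawler: the `SITE` dictionary stands in for real HTTP fetches, the include and exclude patterns are made up, a page limit stands in for the crawl timeout, and robots.txt and file type checks are omitted.

```python
import re
from collections import deque
from typing import Dict, List

# Hypothetical in-memory "website": URL -> links found on that page.
SITE: Dict[str, List[str]] = {
    "https://example.com/": ["https://example.com/a",
                             "https://example.com/private/x",
                             "https://other.example.org/"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": [],
    "https://example.com/private/x": ["https://example.com/secret"],
}

INCLUDE = re.compile(r"^https://example\.com/")  # assumed include pattern
EXCLUDE = re.compile(r"/private/")               # assumed exclude pattern


def accept(url: str) -> bool:
    """Decide whether a URL should be included in the search."""
    return bool(INCLUDE.search(url)) and not EXCLUDE.search(url)


def crawl(seeds: List[str], max_pages: int = 100) -> List[str]:
    frontier = deque(seeds)  # URLs still to be processed
    seen = set(seeds)        # avoids queueing the same URL twice
    stored = []              # URLs whose content was fetched and stored
    while frontier and len(stored) < max_pages:
        url = frontier.popleft()
        if not accept(url) or url not in SITE:
            continue
        stored.append(url)
        # Extract links from the fetched page and add new ones to the frontier.
        for link in SITE[url]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return stored


stored = crawl(["https://example.com/"])
```

Because the excluded page is never fetched, links that appear only on that page are never discovered.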
The logic implemented by the web crawler includes a lot of additional features designed to optimize the crawl. On subsequent updates this includes the ability to decide if a URL has changed since the last visit by the crawler.
Filtering is the process of transforming and analyzing the downloaded content into text suitable for indexing by Funnelback.
This can cover a number of different scenarios including:
File format conversion - converting binary file formats such as PDF and Word documents into text.
Text mining and entity extraction.
Metadata generation and extraction.
The transformed version of the content is stored on disk at the conclusion of the filtering phase.
For example, the title cleaning filter, which is implemented using a Funnelback plugin, modifies a document’s title so that when it is indexed the title does not include superfluous information such as a website name.
The filter parses the content prior to it being stored and modifies the title field based on a number of rules. This is then updated in the content and stored on disk for indexing.
Note: filtering only modifies the content that Funnelback has fetched; it doesn’t modify the document within its source repository.
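A rule-based title cleaning step could look something like the sketch below. It is not the Funnelback plugin or its API: the rules in `TITLE_RULES` and the function names are hypothetical, and a regular expression is used in place of a proper HTML parser for brevity.

```python
import re

# Hypothetical rules: text matching each pattern is removed from the title.
TITLE_RULES = [
    re.compile(r"\s*[|\-]\s*Example University$"),  # trailing site name
    re.compile(r"^Welcome to\s+"),                  # leading boilerplate
]


def clean_title(title: str) -> str:
    """Apply every rule in turn to the title text."""
    for rule in TITLE_RULES:
        title = rule.sub("", title)
    return title.strip()


def filter_document(html: str) -> str:
    """Rewrite the <title> element of the fetched copy of a document."""
    def replace(match: "re.Match[str]") -> str:
        return f"<title>{clean_title(match.group(1))}</title>"

    return re.sub(r"<title>(.*?)</title>", replace, html,
                  count=1, flags=re.IGNORECASE | re.DOTALL)


original = "<html><head><title>Courses | Example University</title></head></html>"
cleaned = filter_document(original)
```

Only the local string is changed; `original`, standing in for the document in its source repository, is left as it was.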
The indexing phase creates a searchable index from the set of filtered documents downloaded by Funnelback.
The main search index is made up of an index of all words found in the filtered content and where the words occur. Additional indexes are built containing other data pertaining to the documents. These indexes include document metadata, link information, auto-completion and other document attributes (such as the modified date, file size and file type).
Once the index is built it can be queried and does not require the source data used to generate the index.
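The idea of an index of words and where they occur can be shown with a small positional inverted index. This is a generic sketch and says nothing about Funnelback's actual on-disk index format; `build_index` and `search` are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Dict, List


def build_index(documents: Dict[str, str]) -> Dict[str, Dict[str, List[int]]]:
    """Map each word to the documents and word positions where it occurs."""
    index: Dict[str, Dict[str, List[int]]] = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][doc_id].append(position)
    return index


def search(index: Dict[str, Dict[str, List[int]]], term: str) -> List[str]:
    """Return the documents containing the term, using the index only."""
    return sorted(index.get(term.lower(), {}))


DOCS = {
    "doc1": "Funnelback builds a searchable index",
    "doc2": "The index is swapped at the end of the update",
}
index = build_index(DOCS)

# The source documents are no longer needed to answer queries.
del DOCS
matches = search(index, "Index")
```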
The reporting phase only applies to web data sources and is used to generate legacy reporting.
The swap views phase serves two functions - it provides a sanity check on the index size and performs the operation of making the updated indexes live.
When using standard Funnelback indexes, two copies of the search index known as the live view and the offline view are retained.
When the update is run, all of the processes operate on the offline view. This offline view is used to store all the content from the new update and build the indexes. Once the indexes are built they are compared to what is currently in the live view - the set of indexes that are currently in a live state and available for querying.
The index sizes (in terms of document numbers) are compared. If Funnelback finds that the index has shrunk below a configurable threshold (e.g. 50%), the update fails. This sanity check means that an update won’t succeed if the website was unavailable for a significant part of the crawl.
An administrator can override this if the size reduction is expected, for example when a new site has been launched that is a fraction of the size of the old site.
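The size check and view swap can be sketched as below. The sketch assumes the threshold is a minimum ratio of new document count to live document count, which is one reading of the 50% example; the `Views` class, `passes_size_check` and the `force` flag are hypothetical names, not Funnelback settings.

```python
from typing import List


def passes_size_check(live_count: int, offline_count: int,
                      min_ratio: float = 0.5) -> bool:
    """Return True if the new index is large enough to go live."""
    if live_count == 0:
        return True  # nothing to compare against on a first update
    return offline_count / live_count >= min_ratio


class Views:
    """Holds the live view and the offline view of a data source."""

    def __init__(self, live: List[str], offline: List[str]):
        self.live = list(live)
        self.offline = list(offline)

    def swap(self, force: bool = False) -> None:
        # force models an administrator overriding the check.
        if not force and not passes_size_check(len(self.live), len(self.offline)):
            raise RuntimeError("update failed: new index is too small")
        self.live, self.offline = self.offline, self.live


views = Views(live=[f"doc{i}" for i in range(10)],
              offline=["doc0", "doc1", "doc2", "doc3"])

# 4 documents against 10 live documents is below the 50% threshold.
try:
    views.swap()
    outcome = "swapped"
except RuntimeError:
    outcome = "failed"

# The reduction is expected, so the administrator overrides the check.
views.swap(force=True)
```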
The meta dependencies phase is responsible for updating search package (and results page) indexes.
When a data source is included as part of a search package there are a number of additional indexes that need to be generated that span the full set of data that is included in a results page. This involves updating some shared search package indexes as well as results page specific indexes and includes:
Update of spelling indexes
Update of results page auto-completion indexes
A data source that is backed by a push index follows a transactional update cycle, where the content of an item is pushed to the index via the push API before being committed.
Content is added to an index in small batches that run through a similar set of steps to update the items in the searchable index. Content becomes available in push indexes in near real-time.
The push API defaults to a listening state and the update flow will depend on which API call is received.
An update of a push index has the following process flows:
When content is added (or updated), the item is placed into a processing queue. Upon commit (manual or automatic) the following processes run, creating a new push index generation:
Filter: transforms and analyzes the content. This is the same process that occurs with a standard index.
Index: produces a searchable index of the content. This is the same process that occurs with a standard index, although push indexes are structured a little differently.
Cleanup: performs a number of maintenance operations such as flagging items for removal in older index generations.
When an item is added to, updated in or removed from a push index, the item needs to be killed from previous index generations to prevent duplicates appearing within the search results.
This operation happens automatically and ensures the integrity of the index.
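The commit and kill behaviour described above can be modelled roughly as follows. This is a conceptual sketch, not the push API: `PushIndex`, `Generation` and their methods are hypothetical, and filtering and indexing are reduced to storing the content in a dictionary.

```python
from typing import Dict, List, Optional, Set


class Generation:
    """One index generation with a set of killed (hidden) keys."""

    def __init__(self, docs: Dict[str, str]):
        self.docs = dict(docs)
        self.killed: Set[str] = set()

    def visible(self) -> Dict[str, str]:
        return {k: v for k, v in self.docs.items() if k not in self.killed}


class PushIndex:
    def __init__(self) -> None:
        # Pending changes: key -> content, or None for a delete.
        self.queue: Dict[str, Optional[str]] = {}
        self.generations: List[Generation] = []

    def add(self, key: str, content: str) -> None:
        self.queue[key] = content

    def delete(self, key: str) -> None:
        self.queue[key] = None

    def commit(self) -> None:
        if not self.queue:
            return
        # Kill changed items in older generations to prevent duplicates.
        for generation in self.generations:
            for key in self.queue:
                if key in generation.docs:
                    generation.killed.add(key)
        # Added or updated items form a new generation.
        new_docs = {k: v for k, v in self.queue.items() if v is not None}
        self.generations.append(Generation(new_docs))
        self.queue = {}

    def search_keys(self) -> List[str]:
        keys: List[str] = []
        for generation in self.generations:
            keys.extend(generation.visible())
        return sorted(keys)


push = PushIndex()
push.add("a", "first version of a")
push.add("b", "content of b")
push.commit()

push.add("a", "second version of a")  # update
push.delete("b")                      # removal
push.commit()

results = push.search_keys()
```

After the second commit the old copies of both items are still stored in the first generation, but they are hidden from search.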
When an index vacuum is requested, all push index generations are merged into a single index generation.
This produces a clean index where all content flagged for removal in previous generations is physically removed (rather than just hidden) from the search index.
The index vacuum also applies any metadata mapping and gscope definition changes to all indexed content.
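A vacuum can be pictured as a merge that keeps only items not flagged for removal. In this sketch each generation is a plain dictionary with hypothetical `docs` and `killed` entries; re-applying metadata mappings and gscope definitions is not modelled.

```python
from typing import Any, Dict, List


def vacuum(generations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge all generations into one, dropping killed items for good."""
    merged: Dict[str, str] = {}
    for generation in generations:
        for key, content in generation["docs"].items():
            if key not in generation["killed"]:
                merged[key] = content
    return [{"docs": merged, "killed": set()}]


generations = [
    # Oldest generation: "a" and "b" were flagged for removal by later commits.
    {"docs": {"a": "old a", "b": "old b", "c": "c"}, "killed": {"a", "b"}},
    # Newer generation holding the updated version of "a".
    {"docs": {"a": "new a"}, "killed": set()},
]

vacuumed = vacuum(generations)
```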