Creating an index of database-like content
This article describes some techniques that can be used to index database content without the need for Funnelback to directly access the database.
This is often required if database content is to be included in a public facing search where it is not desirable to provide Funnelback with direct access to the database for security reasons.
Export the database
One approach is to set up a process that periodically exports the table of results as XML (for the same query that would be run if Funnelback was connecting directly to the database) to a location that Funnelback can access (e.g. a file hosted on a web server).
This export process could run periodically via an internal process, or could even be triggered at the time of update by writing a web service that Funnelback connects to that then triggers the export.
Although Funnelback can index JSON and CSV it is preferable to provide the data in XML format as Funnelback will convert JSON and CSV to XML internally).
Once an export is set up and the XML file(s) made available from a web location then Funnelback can be configured using a web data source to retrieve and index the XML.
Example: Export of a university courses database
The XML below represents some records converted from a courses database.
There would need to be a single record for each course and unit containing all the fields that are relevant to the item.
Data can be provided in a single XML file (quicker to download and update) or as individual XML files.
<courses>
<course>
<course_name>Bachelor of Science</course_name>
<faculty>Faculty of Science and Engineering</faculty>
<course_code>SCI001</course_code>
<campuses>
<site>Campus 1</site>
<site>Campus 2</site>
</campuses>
<study_areas>
<area>Physics</area>
<area>Chemistry</area>
</study_areas>
<entrance_mark>73.5</entrance_mark>
<course_type>Undergraduate</course_type>
<course_schedule>Full time</course_schedule>
...
</course>
<course>
<course_name>Bachelor of Engineering</course_name>
<faculty>Faculty of Science and Engineering</faculty>
<course_code>ENG001</course_code>
<campuses>
<site>Campus 1</site>
<site>Campus 2</site>
</campuses>
<study_areas>
<area>Physics</area>
<area>Chemistry</area>
</study_areas>
<entrance_mark>89.5</entrance_mark>
<course_type>Undergraduate</course_type>
<course_schedule>Full time</course_schedule>
...
</course>
</courses>
<units>
<unit>
<unit_name>Physics 101</unit_name>
<faculty>Faculty of Science and Engineering</faculty>
<unit_code>SCI001</unit_code>
<campuses>
<site>Campus 1</site>
<site>Campus 2</site>
</campuses>
<degrees>
<degree>Bacheclor of Science</degree>
<degree>Bacheclor of Science, Bachelor of Engineering</degree>
<degree>Bachelor of Engineeringe</degree>
</degrees>
<study_areas>
<area>Physics</area>
</study_areas>
<points>40</point>
<course_type>Undergraduate</course_type>
...
</unit>
<unit>
<unit_name>Physics 101</unit_name>
<faculty>Faculty of Science and Engineering</faculty>
<unit_code>SCI001</unit_code>
<campuses>
<site>Campus 1</site>
<site>Campus 2</site>
</campuses>
<degrees>
<degree>Bacheclor of Science</degree>
<degree>Bacheclor of Science, Bachelor of Engineering</degree>
<degree>Bachelor of Engineeringe</degree>
</degrees>
<study_areas>
<area>Physics</area>
</study_areas>
<points>40</point>
<course_type>Undergraduate</course_type>
...
</unit>
</units>
Provide API access to the database
Access to a database could be provided using a set of REST API calls. A custom gatherer plugin could then be written to perform the required API calls to retrieve the database records.
Embed the database content in HTML pages
Database content is often exposed via web pages. These pages can be crawled using a web data source.
The generated pages should include metadata fields to make the fielded information available within Funnelback - producing metadata fields for each database field.
If the record level pages are unlinked they should all be listed in a sitemap.xml
file that is linked from the website’s robots.txt
, and the website crawler with crawler.use_sitemap_xml=true
.
Example: University courses pages
The following is an example for what a University customer may need to provide in order to index a set of course records sourced from a website.
Funnelback crawls course and unit pages using the web crawler, downloads and indexes the HTML pages for each course and unit.
Funnelback would need a way of identifying the course pages, or be provided with a list of all the URLs of the pages to crawl or an index page that links to all of the course and unit pages.
For an effective course finder rich metadata would need to be added to the course pages for anything that should be treated as a 'field' - e.g. for presentation in the search results, or for use in faceted navigation and rich query completion.
<meta name="course_name" content="Bachelor of Science" />
<meta name="faculty" content="Faculty of Science and Engineering" />
<meta name="course_code" content="SCI001" />
<meta name="campuses" content="Campus 1|Campus 2"/>
<meta name="study_areas" content="Physics|Chemistry"/>
<meta name="entrance_mark" content="73.5" />
<meta name="course_type" content="Undergraduate"/>
<meta name="course_schedule" content="Full time"/>
Any metadata fields that contain multiple values should be delimited with a vertical bar character.
e.g. <meta name="course_schedule" content="Part time|Full time" />
for a course that is offered as both full time and part time.