Choosing a data source
Background
This article details the recommended data source types to use for various types of content.
API generated content
API generated content is content returned by accessing an API. The content is usually returned in a structured format such as JSON or XML via a REST-style web call.
Always try to use a web data source to gather this content before exploring other options.
A web data source is suitable if:
- the content is fully returned via HTTP/HTTPS
- the content is returned with no authentication, or with basic HTTP, SAML, or form-based authentication.
Paginated API responses, and links returned in non-standard fields, can be followed by altering the web crawler’s link extraction rules.
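These rules are typically regular expressions run over the fetched content to discover further URLs (check the crawler settings for your Funnelback version). A minimal sketch of the idea, assuming the API returns the next page’s URL in a JSON next field (the field name is an assumption):

```python
import re

# Sample API response containing a pagination link in a non-standard
# field (the "next" field name is assumed for illustration).
body = '{"items": [{"id": 1}, {"id": 2}], "next": "https://api.example.com/items?page=2"}'

# A link extraction pattern of the kind a crawler rule would use:
# capture the URL held in the "next" field so the crawler follows it.
pattern = re.compile(r'"next"\s*:\s*"(https?://[^"]+)"')

for match in pattern.finditer(body):
    print(match.group(1))  # https://api.example.com/items?page=2
```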
The built-in JSONtoXML filter should be enabled when the source data is JSON.
Using Squiz Connect or a Funnelback custom data source type is suitable if:
- custom logic is required to access the content, such as multi-layered API requests or multiple API requests that need to be aggregated (see the sketch after this list)
- the repository has authentication requirements that are beyond what a web data source supports.
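As an illustration of the kind of custom logic involved, the following sketch aggregates a list endpoint with a per-item detail request. The endpoints, field names and credentials are hypothetical; real logic of this kind would live in your Squiz Connect flow or custom data source gatherer:

```python
import json
import urllib.request

API = "https://api.example.com"              # hypothetical API base URL
HEADERS = {"Authorization": "Bearer TOKEN"}  # placeholder credentials

def get_json(url):
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Step 1: fetch the list of item identifiers.
items = get_json(f"{API}/items")["items"]

# Step 2: fetch the detail record for each item and aggregate the
# results -- the multi-layered request pattern a web crawler cannot express.
records = [get_json(f"{API}/items/{item['id']}") for item in items]

print(json.dumps(records, indent=2))
```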
CSV
CSV data, if accessible via an HTTP/HTTPS URL, should be gathered using a web data source and the built-in CSVtoXML filter to convert the CSV to XML appropriate for indexing by Funnelback. A web data source is suitable if you have one or more URLs containing CSV data that can be added to the web data source start URLs list.
A custom data source or Squiz Connect can be used to gather the CSV data if:
- the CSV data is retrieved via a multi-step process such as API requests.
- access to the CSV data is protected by authentication that is not supported by the web crawler.
Regardless of which method is used to fetch the CSV data, the built-in CSVtoXML filter should be used to convert the CSV to XML appropriate for indexing by Funnelback. Further modifications can then be made using the filter framework, operating on the XML output from the CSVtoXML filter.
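To make the conversion concrete, the sketch below shows a rough equivalent of what a CSV-to-XML step produces: one XML record per CSV row, with one element per column. The element names are illustrative only; the actual schema produced by the built-in CSVtoXML filter is described in the Funnelback documentation:

```python
import csv
import io
import xml.etree.ElementTree as ET

csv_data = "title,url\nAnnual report,https://example.com/report\nContact us,https://example.com/contact"

# Illustrative element names only -- not the CSVtoXML filter's actual schema.
root = ET.Element("items")
for row in csv.DictReader(io.StringIO(csv_data)):
    item = ET.SubElement(root, "item")
    for column, value in row.items():
        ET.SubElement(item, column).text = value

print(ET.tostring(root, encoding="unicode"))
# <items><item><title>Annual report</title><url>https://example.com/report</url></item>...</items>
```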
JSON
JSON data, if accessible via an HTTP/HTTPS URL, should be gathered using a web data source and the built-in JSONtoXML filter to convert the JSON to XML appropriate for indexing by Funnelback.
A custom data source or Squiz Connect should be used if:
- the JSON data is retrieved via a multi-step process such as API requests.
- access to the JSON data is protected by authentication that is not supported by the web crawler.
Regardless of which method is used to fetch the JSON data, the built-in JSONtoXML filter should be used to convert the JSON to XML appropriate for indexing by Funnelback. Further modifications can then be made using the filter framework, operating on the XML output from the JSONtoXML filter.
The modify JSON data plugin can be used to make modifications and transformations to the JSON data. If used, this filter must be applied in the filter chain before the JSONtoXML filter, which converts the JSON data into XML for indexing.
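The kind of transformation this covers might look like the following sketch, which renames a key and flattens a nested object so that the subsequent JSON-to-XML conversion produces clean, indexable elements. This illustrates the idea only; it is not the plugin’s configuration, and the field names are assumed:

```python
import json

raw = '{"doc_title": "Annual report", "meta": {"author": "J. Smith", "year": 2024}}'
record = json.loads(raw)

# Rename a key and flatten the nested "meta" object before the
# JSON-to-XML conversion step runs (illustrative transformation only).
transformed = {"title": record["doc_title"], **record["meta"]}

print(json.dumps(transformed))
# {"title": "Annual report", "author": "J. Smith", "year": 2024}
```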
SQL database
The rows returned by an SQL database query can be indexed as individual result items.
If the SQL database is accessible from the Funnelback server (and there is an appropriate JDBC driver) then a database data source should be used.
If the SQL database is not accessible (e.g. because of security requirements), a web data source can be used if there is a web-accessible location from which the results can be fetched. This could be via an export process that runs the SQL query and converts the results into an XML file (ideally), or via a simple web script, accessible remotely by Funnelback, that queries the database and returns the results dynamically, ideally as XML. XML is the preferred delivery format as it doesn’t require any further conversion by Funnelback. JSON and CSV would also work, but would require extra configuration to convert them to XML.
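A minimal sketch of such an export step, using an in-memory SQLite database as a stand-in for the real database (the table and column names are hypothetical). The same serialisation logic could sit behind a simple web script that Funnelback fetches:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Stand-in for the real database; the staff table is hypothetical.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE staff (name TEXT, role TEXT)")
connection.execute("INSERT INTO staff VALUES ('Ada', 'Engineer'), ('Grace', 'Scientist')")

cursor = connection.execute("SELECT name, role FROM staff")
columns = [description[0] for description in cursor.description]

# Serialise each row as an XML record so no further conversion is
# needed on the Funnelback side.
root = ET.Element("records")
for row in cursor:
    record = ET.SubElement(root, "record")
    for column, value in zip(columns, row):
        ET.SubElement(record, column).text = str(value)

print(ET.tostring(root, encoding="unicode"))
```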
Websites
A web data source should be used:
- For any website that is not authenticated. This includes intranet sites that are delivered via a content management system.
- For any website requiring basic HTTP authentication, NTLM authentication or form-based authentication, if the page content is not personalised for the user and no document-level security is required.
For websites (mostly intranets) that require document-level security, it is likely that a custom connector will be required to index the site.
Recommended data source type: web
XML
XML data, if accessible via an HTTP/HTTPS URL, should be gathered using a web data source.
If the XML needs to be split into individual records:
- If no transformation of the XML is required, and the XML fields to extract can be addressed with simple XPaths, then use the built-in XML splitting that can be configured via the XML processing options.
- If the XML needs to be transformed, then use the filter framework to split the XML (use the SplitXml filter, available from GitHub: see Split XML filter) chained with a custom filter that makes the required modifications to the XML. See the section on filtering best practices below, and the splitting sketch after this list.
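The sketch below shows the splitting idea itself: each element matched by an XPath becomes an individual record to index. Funnelback’s built-in splitting is driven by the XML processing options rather than code like this, so treat it purely as an illustration (the element names are assumed):

```python
import xml.etree.ElementTree as ET

# A feed of records to be split; element names are assumed.
feed = """
<export>
  <record><title>First item</title></record>
  <record><title>Second item</title></record>
</export>
"""

root = ET.fromstring(feed)

# Split at the //record XPath: each match becomes its own document.
for index, record in enumerate(root.findall(".//record"), start=1):
    print(f"document {index}:", ET.tostring(record, encoding="unicode").strip())
```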
If the XML is unavailable via HTTP/HTTPS and there is another appropriate data source type for accessing it, then use this to gather the XML and the filter framework to make the required modifications to the XML.
If using a custom data source to gather the XML data (e.g. from an API), then use the filter framework to make the required modifications to the XML.
Recommended data source type: web or custom