Built-in filters: CSV to XML filter
The CSVToXML filter converts a CSV document into multiple XML documents: each record in the CSV results in one XML document.
To enable the filter, add CSVToXML to the filter chain. Documents will only be filtered if the document has the MIME type text/tab-separated-values. You may need to write a custom filter to alter the document type.
If all documents being gathered are CSV documents, you can place the ForceCSVMime filter before this filter (e.g. ForceCSVMime:CSVToXML) to set the MIME type to CSV so that all documents are filtered as CSV.
To add the csv-to-xml conversion to the existing filter chain:
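As a minimal sketch, assuming the setting lives in collection.cfg and that filters are chained with a colon (as in the ForceCSVMime:CSVToXML example above), the entry might look like this:

```ini
# Assumption: CSVToXML is appended to the end of the existing chain.
# <default_filter_chain> is a placeholder, not a literal value.
filter.classes=<default_filter_chain>:CSVToXML
```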
<default_filter_chain> is the default value for filter.classes.
To force Funnelback to treat all documents as CSV and convert all entries to XML:
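A sketch of the corresponding entry, assuming the same filter.classes key in collection.cfg. Here the two filters make up the whole chain, so any other filters the collection still needs would have to be added back:

```ini
# ForceCSVMime runs first so that every document reaches CSVToXML marked as CSV.
filter.classes=ForceCSVMime:CSVToXML
```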
Downloading CSV on web collections
To allow the web crawler to download CSV documents, you may need to add csv to the crawler.non_html option.
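For illustration, assuming crawler.non_html takes a comma-separated list of file extensions, the entry could be extended as follows. <existing_extensions> is a placeholder for whatever the collection already lists, not a literal value:

```ini
# Assumption: extensions are comma-separated; keep the existing ones and append csv.
crawler.non_html=<existing_extensions>,csv
```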
Configuring the filter
The filter generally assumes the RFC 4180 format (skipping blank lines). If your CSV is in another format, you can set filter.csv-to-xml.format.
You can instruct Funnelback to read the headers and use them as element names in the resulting XML by enabling filter.csv-to-xml.has-header in collection.cfg.
It is also possible to set a custom header by defining the element names in collection.cfg using filter.csv-to-xml.custom-header.
If a custom header is intended to overwrite an existing header, filter.csv-to-xml.has-header should also be enabled.
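A minimal sketch of these two settings together, assuming boolean options take true/false and that the custom header is written as a comma-separated list. The element names are borrowed from the conversion example later on this page, and the exact value syntax should be checked against the option reference:

```ini
# Tell the filter that the first row of the CSV is a header row.
filter.csv-to-xml.has-header=true
# Assumed syntax: comma-separated element names that replace the names read from the file.
filter.csv-to-xml.custom-header=First_Name,Last_Name,Role,Home_Page
```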
Tip: Take note of the case of the element names in the header when mapping metadata classes, as they are case sensitive.
URL template for resulting XML documents
You can change the template for the URLs used in the resulting XML documents by setting filter.csv-to-xml.url-template.
CSV to XML conversion example
The CSVToXML filter uses the field names from the CSV file when generating the XML for indexing. Any non-word characters found in the field names are converted to an underscore when generating the XML field names. The XML fields preserve the case of the CSV field names.
For example the CSV file:
"First Name","Last Name","Role","Home Page"
"John","Smith","Plumber","http://directory/smith_john.html"
"Joe","Bloggs","Consultant","http://directory/bloggs_joe.html"
"Fred","Nerk","Teacher","http://directory/nerk_fred.html"
is converted to three XML documents that have the following form (first document shown):
<csvFields>
  <First_Name>John</First_Name>
  <Last_Name>Smith</Last_Name>
  <Role>Plumber</Role>
  <Home_Page>http://directory/smith_john.html</Home_Page>
</csvFields>
The fields can be mapped to metadata using the normal rules for XML field mapping.
For example the following XML configuration could be used to index the CSV fields:
The document URL is optional: documents are assigned a URL automatically when the file is split.
Metadata class configuration:
| Metadata class name | Metadata class type | Source fields | Source type |
The following additional XML special configuration can optionally be set if one of the fields contains a URL that should be the target URL when the result for a row is clicked.
See: Document URL