Xml Cfg

Introduction

Name

xml.cfg

Location

~/conf/collection/

Description

Defines mappings of elements and attributes of XML documents to a metadata class

Format

The xml.cfg file consists of mappings, one per line, in the following format:

class,content,display-name,element-path

where:

class

is the Funnelback metadata class, or a special value of '+' or '-' (see below).

content

is a "1" or "0" flag to indicate whether the metadata is also to be indexed as part of the document's content or not, or one of the additional content flag values (see below).

display-name

is the display name for the class (not used at present).

element-path

is the XML path to the element (see below).

Lines beginning with a hash (#) are treated as comments and ignored.
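The four-field layout described above can be sketched in Python. This is only an illustration of the line format, not the parser that Padre uses; the names Mapping and parse_mapping_line are hypothetical, and the sketch assumes that only the element path could ever contain a comma.

```python
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    metadata_class: str  # Funnelback metadata class, or '+' / '-'
    content: str         # "1", "0", or another content flag value
    display_name: str    # not used at present
    element_path: str    # XML path to the element


def parse_mapping_line(line: str) -> Optional[Mapping]:
    """Return a Mapping for a four-field line, or None for anything else."""
    stripped = line.strip()
    # Blank lines and comment lines carry no mapping.
    if not stripped or stripped.startswith("#"):
        return None
    # Split on the first three commas only (assumption: the path is last).
    fields = stripped.split(",", 3)
    if len(fields) != 4:
        # Two-field directives such as 'document,...' are not mappings.
        return None
    return Mapping(*fields)
```

For instance, parse_mapping_line("title,1,,//eventTitle") yields a Mapping whose display name is empty, while a comment line yields None.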

Entire document

Additionally, it is possible to specify an element within the XML document to index rather than the entire document. To do so, add the following line to xml.cfg:

document,element-path

where:

document

is the literal string document.

element-path

is the XML path to the element that should be indexed (see below).

This directive tells the indexer to terminate the current document and start a new one. It is usually used to split an XML file into multiple indexed documents.
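To picture what splitting on an element means, the following Python sketch cuts one XML file into one record per matching element. It uses the standard library's ElementTree rather than the indexer itself, handles absolute paths only, and the function name split_documents and the sample data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def split_documents(xml_text: str, element_path: str) -> list:
    """Return one Element per match of an absolute path such as /events/event."""
    segments = element_path.strip("/").split("/")
    root = ET.fromstring(xml_text)
    # An absolute path must start at the root element.
    if root.tag != segments[0]:
        return []
    if len(segments) == 1:
        return [root]
    # The remaining segments are resolved relative to the root.
    return root.findall("/".join(segments[1:]))


sample = """<events>
  <event id="1"><eventTitle>First</eventTitle></event>
  <event id="2"><eventTitle>Second</eventTitle></event>
</events>"""

docs = split_documents(sample, "/events/event")
titles = [d.findtext("eventTitle") for d in docs]
```

Each entry in docs corresponds to what a document,/events/event line would treat as a separate indexed document.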

URL

It is also possible to associate each XML document with a given URL within the index, using data from the XML document itself. To do so, add the following line to xml.cfg:

docurl,element-path

where:

docurl

is the literal string 'docurl'.

element-path

is the absolute element path to an element containing the URL that should be associated with the document (see below). Please note that setting the docurl to an XML attribute is not supported.

File type

If the XML documents describe binary files, the file type of the original file can be specified using data from the XML document. To do so, add the following line to xml.cfg:

docfiletype,element-path

where:

docfiletype

is the literal string docfiletype.

element-path

is the absolute element path to an element containing the file type that should be associated with the document (see below). The file type should be a file extension without the leading dot (e.g. doc, ppt, xls).

Special field characters

Two special field characters, + and -, can be used in place of the class.

+,,,element-path
-,,,element-path

+

Signals that the element contains an inner document. The inner document should be XML encoded or enclosed in CDATA, to ensure the outer XML is correctly formatted. Padre will index the inner document as though it is a part of the outer document; however, Padre may not use the XML parser. Padre will guess the inner document's type and use the appropriate parser: for example, if the inner document starts with a root <html> tag, Padre will use the HTML parser. If the inner document is well-formed XML and starts with an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>, Padre will use the XML parser.

Note that if the inner document contains HTML entities but doesn't start with a root <html> tag, it will not be detected as HTML. The inner document in the example below will not be detected as HTML:

<root>
  <name>Example</name>
  <inner>This is &lt;strong&gt;an example&lt;/strong&gt;</inner>
</root>

This one will:

<root>
  <name>Example</name>
  <inner>&lt;html&gt;This is &lt;strong&gt;an example&lt;/strong&gt;&lt;/html&gt;</inner>
</root>
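The detection rule described above can be approximated with a short Python sketch. It is a simplified model of the two cases named in this section, not Padre's actual detection logic; the function name guess_inner_type and the 'unknown' fallback are assumptions.

```python
import xml.etree.ElementTree as ET


def guess_inner_type(outer_xml: str, inner_tag: str) -> str:
    """Classify the text of the first <inner_tag> element as html, xml or unknown."""
    root = ET.fromstring(outer_xml)
    # ElementTree decodes entities such as &lt;, so this is the raw inner document.
    inner = (root.findtext(f".//{inner_tag}") or "").lstrip()
    if inner.lower().startswith("<html"):
        return "html"
    if inner.startswith("<?xml"):
        return "xml"
    # Assumption: anything else is left unclassified in this sketch.
    return "unknown"


plain = (
    "<root><name>Example</name>"
    "<inner>This is &lt;strong&gt;an example&lt;/strong&gt;</inner></root>"
)
wrapped = (
    "<root><name>Example</name>"
    "<inner>&lt;html&gt;This is &lt;strong&gt;an example&lt;/strong&gt;"
    "&lt;/html&gt;</inner></root>"
)
```

Here guess_inner_type(plain, "inner") gives "unknown" and guess_inner_type(wrapped, "inner") gives "html", mirroring the two examples above.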

-

Signals that the element should be indexed as the document content, and that any other unmapped elements won't be indexed at all.

element path

The element path is a pattern that matches the pseudo-path to an element or attribute.

  • If the path begins with / then the path is absolute (it matches from the top of the XML structure).
  • If it begins with // it is unanchored (it can be located anywhere in the XML structure).

XML attributes can be used by adding @attribute to the end of the path.

Attribute values are not supported in element path definitions.

Example element paths

  • /items/item - VALID
  • //item/keywords/keyword - VALID
  • //keyword - VALID
  • //image@url - VALID
  • /items/item[@type=value] - NOT VALID
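A quick syntax check for element paths can be sketched in Python from the rules and examples above. The regular expression is an assumption derived from those examples (including which characters may appear in a name); it is not Padre's grammar, and looks_valid is a hypothetical helper name.

```python
import re

# Assumed name syntax: letters, digits, underscore, dot, hyphen and colon.
_NAME = r"[A-Za-z_][\w.\-:]*"

# One or two leading slashes, slash-separated names, optional @attribute suffix.
_PATH = re.compile(rf"^//?{_NAME}(?:/{_NAME})*(?:@{_NAME})?$")


def looks_valid(path: str) -> bool:
    """Return True if the path has the shape of a supported element path."""
    return _PATH.match(path) is not None


examples = [
    "/items/item",
    "//item/keywords/keyword",
    "//keyword",
    "//image@url",
    "/items/item[@type=value]",
]
results = {path: looks_valid(path) for path in examples}
```

Only the last example, which uses an attribute value predicate, is rejected.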

Caveats

  • Element paths are case-sensitive
  • Each element may only be mapped to a single class (i.e. it is not possible to map the same element to both the x and y classes)

Additional content flags

In addition to the "1" or "0" flag for the content column described above, there are three advanced modes that can be defined for a metadata class.

Interpretation of field content

  • Fields that contain reserved characters should either be wrapped in CDATA tags or have those characters HTML encoded.
  • Fields containing multiple values should be delimited with a vertical bar character, or the field repeated with a single value in each repeated field.

For example, the indexed values of //keywords/keyword and //subject below would be identical.

<keywords>
    <keyword>keyword 1</keyword>
    <keyword>keyword 2</keyword>
    <keyword>keyword 3</keyword>
</keywords>

<subject>keyword 1|keyword 2|keyword 3</subject>
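The equivalence of the two forms can be illustrated with a small Python sketch. It only models the idea that both layouts produce the same list of values; how the indexer stores them internally is not shown here, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def values_from_repeated(xml_text: str, tag: str) -> list:
    """Collect one value per repeated element."""
    return [el.text for el in ET.fromstring(xml_text).iter(tag)]


def values_from_delimited(xml_text: str) -> list:
    """Split a single element's text on the vertical bar character."""
    return ET.fromstring(xml_text).text.split("|")


repeated = """<keywords>
    <keyword>keyword 1</keyword>
    <keyword>keyword 2</keyword>
    <keyword>keyword 3</keyword>
</keywords>"""

delimited = "<subject>keyword 1|keyword 2|keyword 3</subject>"

same = values_from_repeated(repeated, "keyword") == values_from_delimited(delimited)
```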

Troubleshooting

No items are returned in search results after successful indexing

  • If the XML files are indexed as a local collection, or have URLs that match $SEARCH_HOME, the check_url_exclusion indexer option may need to be set to off, or a different exclusion path defined.

Certain mapped fields are not indexed

  • Check that the element-paths are valid and supported (see above)
  • Ensure that the element-path matches exactly (element-paths are case sensitive)
  • Ensure that the element-path is only mapped to a single metadata class.
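Case mismatches are easy to miss by eye, so a small Python sketch like the following may help when checking the points above against a sample file. It is a hypothetical helper, not a Funnelback tool: it compares only the final element name of each path, and it ignores namespace prefixes and attributes.

```python
import xml.etree.ElementTree as ET


def case_mismatches(xml_text: str, element_paths: list) -> dict:
    """Map each path whose final element name matches a tag only when case is ignored."""
    tags = {el.tag for el in ET.fromstring(xml_text).iter()}
    lowered = {tag.lower(): tag for tag in tags}
    report = {}
    for path in element_paths:
        # Drop any @attribute suffix, then take the last path segment.
        name = path.split("@")[0].rstrip("/").split("/")[-1]
        if name not in tags and name.lower() in lowered:
            report[path] = lowered[name.lower()]
    return report


sample = "<events><event><longdescription>x</longdescription></event></events>"
report = case_mismatches(sample, ["//longDescription", "//event", "//missing"])
```

In this run the report contains a single entry, pointing //longDescription at the tag longdescription that actually appears in the file.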

XML files that contain multiple items

  • If only a single item is detected check that the element path used for the document value is correct, and check the Step-Index.log to see if the file was detected as XML. If the file was not detected as XML it may be necessary to set the -forcexml indexer option.
  • If not all the items from the file are detected it may be necessary to increase the size of the indexer's decompression chamber by setting the -chamb indexer option.

Examples

Example 1: Multiple records per XML file

In this example, we have several fictional events provided in a single XML file, and we want to index the events' titles, descriptions and venue name before displaying them alongside the event's start and end dates and times.

<?xml version="1.0" encoding="UTF-8"?>
  <events>
    <event id="38016">
      <eventId>38016</eventId>
      <eventTitle>Free CPR and board rescue training for surfers</eventTitle>
      <eventUrl>http://example.com/events/surf-rescue-saturday</eventUrl>
      <eventCostType></eventCostType>
      <eventDateSubmitted>27 Apr 2015</eventDateSubmitted>
      <longdescription><![CDATA[Where: Surf Life Saving Club House Cost: Free!
        Bookings are essential. Come to a free training session and learn how to assist in an
        ocean emergency. You will learn CPR and board rescue skills that could save a life! All
        surfers are welcome to participate, from the recreational
        grassroots boardriders to professional surfers.]]></longdescription>
      <venueName></venueName>
      <venueAddress1></venueAddress1>
      <venueSuburb></venueSuburb>
      <venuePostcode></venuePostcode>
      <venuephone></venuephone>
      <categoryTitle></categoryTitle>
      <price></price>
      <date_start>18th Apr 2015</date_start>
      <date_start_time>09:00 am</date_start_time>
      <date_start_doorsOpenTime>09:00 am</date_start_doorsOpenTime>
      <date_end_date>18th Apr 2015</date_end_date>
      <date_end_time>12:00 pm</date_end_time>
    </event>
    <event id="38017">
      <eventId>38017</eventId>
      <eventTitle>Motorcycle training course for beginners</eventTitle>
      <eventUrl>http://example.com/events/motorcycle-training-course-novice-riders</eventUrl>
      <eventCostType></eventCostType>
      <eventDateSubmitted>27 Apr 2015</eventDateSubmitted>
      <longdescription><![CDATA[Where: Council car park, corner of Jones and
        Smith Streets Cost: $100 Build
        on your motorcycle skills and improve your riding safety at this training
        course. The course is suitable for learners and people who have provisional
        motorcycle licences.]]></longdescription>
      <venueName></venueName>
      <venueAddress1></venueAddress1>
      <venueSuburb></venueSuburb>
      <venuePostcode></venuePostcode>
      <venuephone></venuephone>
      <categoryTitle></categoryTitle>
      <price></price>
      <date_start>2nd May 2015</date_start>
      <date_start_time>09:30 am</date_start_time>
      <date_start_doorsOpenTime>09:30 am</date_start_doorsOpenTime>
      <date_end_date>2nd May 2015</date_end_date>
      <date_end_time>03:30 pm</date_end_time>
    </event>
    ...
  </events>

The xml.cfg for this collection is:

PADRE XML Mapping Version: 2
# Define which element to split documents on
document,/events/event
# Define which element from a document should be used as a URL
docurl,//eventUrl

# Indexed fields
title,1,,//eventTitle
description,1,,//longdescription

# Non-indexed fields
datestart,0,,//date_start
dateend,0,,//date_end_date
d,0,,//eventDateSubmitted

Note that the collection.cfg was changed to set the following query processor options:

query_processor_options=-SM=both -SF=[title,description,datestart,dateend,d]

That is, the result summary mode (SM) uses both snippets and metadata, and the summary fields (SF) in the results include the classes title, description, datestart, dateend and d.

Example 2: One record per XML file

This example uses the SVG icons from the Tango icon set. Each SVG icon is an individual XML file, and we want to index the icons' title, author, license and keywords (metadata classes t, a, z and s).

A sample icon follows:

  <metadata id="metadata4">
    <rdf:RDF>
      <cc:Work rdf:about="">
        <dc:format>image/svg+xml</dc:format>
        <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
        <dc:title>Address Book - New</dc:title>
        <dc:creator>
          <cc:Agent>
            <dc:title>Julius Caesar</dc:title>
          </cc:Agent>
        </dc:creator>
        <dc:source>http://funnelback.com</dc:source>
        <dc:subject>
          <rdf:Bag>
            <rdf:li>address</rdf:li>
            <rdf:li>contact</rdf:li>
            <rdf:li>book</rdf:li>
          </rdf:Bag>
        </dc:subject>
        <cc:license
           rdf:resource="http://creativecommons.org/licenses/by-sa/2.0/" />
      </cc:Work>
   ...

The xml.cfg for this collection is:

PADRE XML Mapping Version: 2
# Define author mapping
a,1,,//dc:creator/cc:Agent/dc:title
# Define keywords mapping
s,1,,//dc:subject/rdf:Bag/rdf:li
# Define title mapping
t,1,,//dc:title
# Define license mapping
z,0,,//cc:license@rdf:resource

Note that the collection.cfg was changed to set the following query processor options:

query_processor_options=-SM=both -SF=[a,s,t,z]

That is, the result summary mode (SM) uses both snippets and metadata, and the summary fields (SF) in the results include the classes a, s, t and z.

Spelling suggestions

To include the content of fields in spelling suggestions (and also in organic auto-completion), add the fields to the spelling.suggestion_sources configuration.
