Character set support

Introduction

Funnelback can handle documents and queries encoded in any of the character sets supported by the iconv library. These include UTF-8, ASCII, all of the ISO 8859 variants and many Windows code pages. This allows indexing and querying of pages in European languages, languages written in Cyrillic, and some Asian languages such as Vietnamese. It also provides support for "CJKT" languages (see below).

When indexing, Funnelback looks at the head of each document to determine the document's encoding, for example:

     <meta http-equiv="content-type" content="text/html; charset=UTF-8">

If no encoding is specified, ISO 8859-1 is currently assumed, although the default is likely to become UTF-8 in future. If the specified encoding is not supported, no conversion occurs.
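The detection and fallback rules above can be sketched roughly as follows. This is only an illustration, not Funnelback's implementation: Python's codec registry stands in for iconv, and the names META_CHARSET, detect_encoding and to_utf8 are hypothetical.

```python
import codecs
import re

# Hypothetical pattern: find a charset declaration near the start of the document.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def detect_encoding(raw: bytes, default: str = "ISO-8859-1") -> str:
    """Return the declared charset, or the assumed default if none is declared."""
    match = META_CHARSET.search(raw[:2048])
    return match.group(1).decode("ascii") if match else default


def to_utf8(raw: bytes) -> bytes:
    """Convert a document to UTF-8, leaving it untouched if the charset is unsupported."""
    name = detect_encoding(raw)
    try:
        codecs.lookup(name)  # Python's codec registry stands in for iconv here
    except LookupError:
        return raw  # unsupported encoding: no conversion occurs
    return raw.decode(name, errors="replace").encode("utf-8")
```

For example, a document with no declaration is treated as ISO 8859-1, so the byte 0xE9 is converted to the two-byte UTF-8 sequence for "é".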

If a Funnelback search form is embedded in a page encoded in something other than UTF-8, the relevant encoding must be supplied as a CGI parameter, for example:

     <input type="hidden" name="enc" value="WINDOWS-1258">

This is not necessary for Funnelback search pages, which are encoded in UTF-8 by default.

Internally, Funnelback converts documents and queries into Unicode (encoded as UTF-8), a universal character set capable of representing virtually any character in any language. This means that a single collection may comprise documents in a mixture of different encodings and may be searched with a query expressed in a completely different one.
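A small sketch of why this normalisation makes mixed-encoding collections searchable: the same word stored in three different Cyrillic encodings yields identical UTF-8 bytes once converted. The function name normalise is hypothetical, and Python's built-in codecs are used in place of iconv.

```python
def normalise(raw: bytes, encoding: str) -> bytes:
    """Convert bytes in the given source encoding to UTF-8."""
    return raw.decode(encoding).encode("utf-8")


word = "Москва"

# The same word as it might appear in two documents and one query,
# each using a different legacy encoding.
doc_a = word.encode("koi8_r")
doc_b = word.encode("cp1251")
query = word.encode("iso8859_5")

# The raw bytes all differ, but the normalised forms are identical.
same = normalise(doc_a, "koi8_r") == normalise(doc_b, "cp1251") == normalise(query, "iso8859_5")
```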

CJKT (Chinese/Japanese/Korean/Thai) indexing and querying

Why are CJKT languages different?

CJKT languages do not mark word boundaries explicitly, which is a problem because search engines such as Funnelback and Google are inherently word-based. Algorithms that automatically segment a string of Chinese characters into 'words' are difficult to implement and prone to error. A simpler alternative, bigram indexing, has been implemented in Funnelback and is described below.

Extent of Funnelback support for CJKT

Funnelback support for languages continues to improve. However, the search and insights dashboards are currently available only in English. Furthermore, query operators such as negation and phrases are not supported in CJKT queries.

The level of support for CJKT is believed to be adequate for the following situations:

  1. CJKT documents occurring within a large multilingual website, such as those published by universities and government agencies.

  2. Small CJKT-only installations where the administrator speaks sufficient English to read documentation and use the search dashboard.

Simplified and Traditional Chinese

Funnelback includes some basic support to perform character mappings between simplified and traditional Chinese.

The indexer option -unimap can be used to specify mappings between characters. The tosimplified option converts all traditional Chinese characters in documents into their simplified equivalents. The fact that this mapping has occurred is communicated to the query processor which applies the same transformation to queries. It is also possible to use totraditional to transform in the reverse direction.

For example, in the People's Republic of China the majority of written Chinese online uses simplified characters. By specifying -unimap=tosimplified, the administrator can convert any traditional characters in documents and queries into their simplified equivalents to improve query matching. In this case, all titles, summaries and metadata for search results will appear in Simplified Chinese.
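The effect of applying one mapping to both documents and queries can be illustrated with a toy sketch. The four-character table and the name to_simplified are assumptions made for the example; the real mapping used by the indexer is much larger and is not reproduced here.

```python
# Toy traditional-to-simplified table (illustrative subset only).
TO_SIMPLIFIED = str.maketrans({
    "國": "国",
    "學": "学",
    "漢": "汉",
    "語": "语",
})


def to_simplified(text: str) -> str:
    """Map traditional characters to simplified ones; other characters pass through."""
    return text.translate(TO_SIMPLIFIED)


# The document is mapped at index time and the query at query time,
# so a traditional-character query matches a simplified-character document.
indexed_document = to_simplified("学习汉语")
mapped_query = to_simplified("漢語")
query_matches = mapped_query in indexed_document
```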

Bigram/unigram indexing

The indexing technique which has been shown to work best in research evaluations is to consider words to be either individual characters or overlapping bigrams (adjacent pairs of characters). This is the technique used in Funnelback.

Using uppercase letters to represent CJKT characters, the string of characters 'ABCDE' would be indexed as A, AB, B, BC, C, CD, D, DE, E.
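A minimal sketch of this term generation, using the same convention of uppercase letters standing in for CJKT characters. The function name index_terms is hypothetical.

```python
def index_terms(run: str) -> list[str]:
    """Return each character (unigram) followed by the bigram that starts at it."""
    terms = []
    for i, ch in enumerate(run):
        terms.append(ch)
        if i + 1 < len(run):
            terms.append(run[i:i + 2])
    return terms


print(", ".join(index_terms("ABCDE")))
# prints: A, AB, B, BC, C, CD, D, DE, E
```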

Many CJKT documents contain non-CJKT text, such as references to names of products, places and people. Similarly, non-CJKT documents sometimes contain CJKT text. Funnelback indexes the non-CJKT parts normally and switches in and out of bigram/unigram mode as necessary. It is able to do this because it converts all incoming character encodings (such as Big5 for Chinese and Shift-JIS for Japanese) into Unicode (actually UTF-8) prior to indexing, and knows which Unicode code positions are CJKT.
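The mode switching can be sketched as below. The code point ranges are an illustrative subset of the relevant Unicode blocks (Thai, Hiragana and Katakana, CJK Unified Ideographs, Hangul Syllables), not the table Funnelback actually uses, and the names CJKT_RANGES, is_cjkt and tokenize are hypothetical.

```python
import re
from itertools import groupby

# Illustrative subset of Unicode blocks treated as CJKT in this sketch.
CJKT_RANGES = [
    (0x0E00, 0x0E7F),  # Thai
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def is_cjkt(ch: str) -> bool:
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in CJKT_RANGES)


def tokenize(text: str) -> list[str]:
    """Index CJKT runs as unigrams and bigrams, and everything else as ordinary words."""
    terms = []
    for cjkt, group in groupby(text, key=is_cjkt):
        run = "".join(group)
        if cjkt:
            for i in range(len(run)):
                terms.append(run[i])
                if i + 1 < len(run):
                    terms.append(run[i:i + 2])
        else:
            terms.extend(re.findall(r"\w+", run.lower()))
    return terms


print(tokenize("Sony 東京都 camera"))
# prints: ['sony', '東', '東京', '京', '京都', '都', 'camera']
```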

CJKT query processing

The Funnelback query processor detects sections of the query which consist of CJKT characters and converts them to overlapping bigrams. Other parts of the query are treated normally. If 'BCD' were submitted as a query, it would be processed as BC CD, giving a reasonably high level of match against a document containing 'ABCDE'. If the query were just 'A', it would remain unchanged and the document containing 'ABCDE' would match.
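The two query cases above can be checked with a short sketch, again using uppercase letters in place of CJKT characters. The names index_terms and query_terms are hypothetical, and set membership is only a stand-in for the real ranking, which is not described here.

```python
def index_terms(run: str) -> set[str]:
    """Unigrams plus overlapping bigrams, as generated at index time."""
    unigrams = set(run)
    bigrams = {run[i:i + 2] for i in range(len(run) - 1)}
    return unigrams | bigrams


def query_terms(run: str) -> list[str]:
    """Overlapping bigrams for a CJKT query run; a single character is left unchanged."""
    if len(run) < 2:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


document = index_terms("ABCDE")

print(query_terms("BCD"))  # prints: ['BC', 'CD']
print(all(term in document for term in query_terms("BCD")))  # prints: True
print(all(term in document for term in query_terms("A")))  # prints: True
```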

Note that if the query submission form (search box) appears on a web page encoded in a character set other than UTF-8, the form must supply an enc CGI parameter so that the query can be correctly converted into UTF-8. For example:

     ...
     <form action="/s/search" method="GET">
       ...
       <input type="hidden" name="enc" value="shift-JIS"/>
       <input type="hidden" name="collection" value="COLLECTION"/>
       <input type="search" name="query"/>
       <input type="submit" value="Go" />
     </form>
     ...
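To show why the enc parameter matters, the following sketch decodes a submitted query string the way a server might. It is an assumption about the general mechanism, not a description of Funnelback's code: the function name decode_query is hypothetical, and Python's urllib.parse is used for the percent-decoding.

```python
from urllib.parse import parse_qs, quote


def decode_query(query_string: str) -> str:
    """Decode the 'query' parameter using the encoding named by 'enc' (UTF-8 if absent)."""
    # First pass: Latin-1 can decode any byte, so the ASCII 'enc' value is read safely.
    first_pass = parse_qs(query_string, encoding="latin-1")
    enc = first_pass.get("enc", ["UTF-8"])[0]
    # Second pass: decode the percent-escaped query bytes with the declared encoding.
    return parse_qs(query_string, encoding=enc)["query"][0]


# A browser on a Shift-JIS page submits the query as percent-escaped Shift-JIS bytes.
submitted = "enc=shift-JIS&collection=COLLECTION&query=" + quote("東京", encoding="shift_jis")
recovered = decode_query(submitted)
```

Without the enc parameter, the same bytes would be interpreted as UTF-8 and the query would not be recovered correctly.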