Language support

Multilingual facilities available without the need to specify language

Funnelback supports the following capabilities regardless of the query language, without specification of any language-specific options.

  • Query matching, as described above.

  • Query-biased summaries.

  • Query term highlighting in presented results.

  • Spelling suggestions (did you mean).

  • Query blending.

  • Synonyms.

  • Best bets.

  • Faceted navigation (metadata facets use complex queries that are incompatible with Chinese, Japanese, Korean and Thai (CJKT) languages).

  • Contextual navigation (unlikely to work well with CJKT languages).

Accent conflation

Because users may not be able to type accented letters (or may not know how), Funnelback by default makes unaccented query words match accented forms as well. For example, 'Universite de Neuchatel' will also match 'Université de Neuchâtel'. This behaviour can be disabled if required.
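The idea behind accent conflation can be illustrated with a minimal Python sketch. It is not Funnelback's implementation: it assumes that folding is done by Unicode decomposition and removal of combining marks, and the function names strip_accents and conflated_match are hypothetical. For simplicity it folds both the query and the document text, so an accented query would also match unaccented text.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining diacritical marks removed."""
    # NFD splits 'é' into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def conflated_match(query, document_text):
    """True if every query word occurs in the document after accent folding."""
    doc_words = set(strip_accents(document_text).lower().split())
    return all(strip_accents(word).lower() in doc_words for word in query.split())


print(conflated_match("Universite de Neuchatel", "Université de Neuchâtel"))  # True
```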

Capabilities specific to particular languages

Some search functions are specific to a language. Obvious examples are linguistic operations such as stemming (e.g. hiboux -> hibou in French), decompounding (e.g. gesundheitswissenschaft -> gesundheit + wissenschaft in German) and alternate orthographies for accented letters (e.g. Māori, Mäori, Maori and Maaori in New Zealand). Less obvious is the need for language-specific behaviour when sorting search results by title or metadata field (e.g. in Czech, accented letters are grouped with the unaccented form and 'ch' sorts between 'h' and 'i').
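To make the sorting example concrete, here is a small Python sketch of a sort key that follows only the two Czech rules mentioned above: accented letters are grouped with their unaccented form, and 'ch' is treated as a single letter between 'h' and 'i'. It is a deliberately simplified illustration, not a full Czech collation and not how Funnelback sorts (Funnelback relies on operating system locales); the name czech_like_key is hypothetical.

```python
import unicodedata

# Alphabet with the digraph 'ch' placed between 'h' and 'i'.
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: position for position, letter in enumerate(ALPHABET)}


def fold(text):
    """Lowercase the text and remove combining accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def czech_like_key(title):
    """Build a list of letter ranks, consuming 'ch' as one letter."""
    folded = fold(title)
    key = []
    i = 0
    while i < len(folded):
        if folded.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            # Characters outside the alphabet sort after all letters.
            key.append(RANK.get(folded[i], len(ALPHABET)))
            i += 1
    return key


titles = ["Idea", "Chata", "Hora", "Cena", "Čaj"]
print(sorted(titles, key=czech_like_key))
# ['Čaj', 'Cena', 'Hora', 'Chata', 'Idea']
```

A plain code point sort would instead place 'Chata' before 'Hora' and 'Čaj' after every unaccented title.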

Funnelback supports language (or country) specific sorting using the locale mechanisms supplied by the operating system. You specify the language or locale to be used by means of a 'lang=' option or CGI parameter. For example, 'lang=cs' or 'lang=cs_CZ' will activate the Czech sorting rules.
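As a rough illustration, the 'lang=' CGI parameter can be appended to a search URL as sketched below in Python. Only the 'lang' parameter comes from the text above; the host name, path and the other parameter names ('query', 'sort') are placeholders chosen for the example and should be replaced with the values used by your own installation.

```python
from urllib.parse import urlencode


def with_lang(base_url, params, lang):
    """Return a search URL whose query string includes lang=<lang>."""
    merged = dict(params)   # copy so the caller's dict is not modified
    merged["lang"] = lang
    return base_url + "?" + urlencode(merged)


# Hypothetical endpoint and parameters, for illustration only.
url = with_lang(
    "https://search.example.com/s/search.html",
    {"query": "knihovna", "sort": "title"},
    "cs_CZ",
)
print(url)
# https://search.example.com/s/search.html?query=knihovna&sort=title&lang=cs_CZ
```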

Note that use of a particular locale may require prior installation of a specific language pack on the server(s) to be used.

If a 'lang=' parameter is specified, then a language-specific stemmer and stop word list will be used, where available, whenever stemming and stop word elimination are invoked. Language-specific stemmers and stop word lists are available for:

  • Czech

  • English

  • German

  • Finnish

  • French

  • Italian

  • Polish

  • Portuguese

  • Russian

  • Spanish

  • Swedish

Funnelback also includes stop word lists (but not yet stemmers) for:

  • Arabic

  • Bulgarian

  • Bengali

  • Farsi

  • Hindi

  • Hungarian

  • Marathi

  • Romanian

Additional language-specific resources have been and will be developed in response to client demand.

Note that stemming generally does not substantially improve result quality and sometimes causes harm. Funnelback generally performs well in languages (such as Norwegian, Gaelic, Hungarian, Turkish etc.) for which we do not have a specific stemmer.

Chinese, Japanese, Korean and Thai (CJKT)

Funnelback indexes are generally word-based, which causes a potential problem for languages that have no explicit word boundaries. The most prominent examples of these are the CJKT languages.

Funnelback’s indexer and query processor address this problem by detecting text in character sets associated with the CJKT languages and indexing it as unigrams and overlapping bigrams. If A-E represent CJKT characters and the sequence ABCDE is encountered in a document or in a query, PADRE indexes A, B, C, D, E (unigrams) and AB, BC, CD, DE (overlapping bigrams). Because of this approach, certain operators are not supported for CJKT queries, and special arrangements are made to support query term highlighting in summaries etc.
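The unigram and overlapping bigram scheme can be sketched in a few lines of Python. This only reproduces the ABCDE example above; it does not show how PADRE detects CJKT text or stores the terms, and the function name cjkt_ngrams is hypothetical.

```python
def cjkt_ngrams(run):
    """Return the unigrams, then the overlapping bigrams, of a character run."""
    unigrams = list(run)
    bigrams = [run[i:i + 2] for i in range(len(run) - 1)]
    return unigrams + bigrams


indexed_terms = cjkt_ngrams("ABCDE")
print(indexed_terms)
# ['A', 'B', 'C', 'D', 'E', 'AB', 'BC', 'CD', 'DE']

# A query is split the same way, so the query 'BCD' produces terms
# that are all present among the terms indexed for 'ABCDE'.
print(set(cjkt_ngrams("BCD")) <= set(indexed_terms))  # True
```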

Please note that certain complex query operators, including some used to implement metadata based facets, are not supported in bigram mode.

Chinese: simplified and traditional

When indexing Chinese content it is advisable to use the indexer option -unimap=tosimplified (or -unimap=totraditional) to convert all content into either simplified or traditional characters. Queries submitted to an index in which characters have been mapped will automatically be mapped in the same way.
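The effect of mapping both content and queries to one script can be sketched in Python as follows. The three-character table is a tiny illustrative sample, not the mapping Funnelback uses, and the name to_simplified is hypothetical.

```python
# Illustrative sample only: traditional -> simplified for three characters.
TO_SIMPLIFIED = str.maketrans({"圖": "图", "書": "书", "館": "馆"})


def to_simplified(text):
    """Map the traditional characters in the sample table to simplified forms."""
    return text.translate(TO_SIMPLIFIED)


indexed = to_simplified("圖書館")            # document written in traditional characters
query_simplified = to_simplified("图书馆")    # query typed in simplified characters
query_traditional = to_simplified("圖書館")   # query typed in traditional characters

# Both query forms now match the mapped document text.
print(indexed == query_simplified == query_traditional)  # True
```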

Localized search results pages

Implementation of search template localization is the recommended way of delivering a search results page in multiple languages.

Configuration files which specify locale-dependent versions of messages such as 'search' and 'xxx fully matching' are named for the locale. When results are presented, localised messages are extracted from the configuration file nominated by your browser’s 'Accept-Language' header or by an explicit 'lang=' parameter in the search URL.
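The selection of a locale can be sketched in Python as below. This is an assumption-laden illustration rather than Funnelback's actual logic: it assumes that an explicit 'lang=' value takes precedence over the browser header, that a country variant falls back to its base language, and that 'en' is the default. The function names and the set of available locales are hypothetical.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0                      # default weight when no q-value is given
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        # 'fr-CH' becomes 'fr_CH' to match locale-style names.
        weighted.append((-quality, position, tag.strip().replace("-", "_")))
    return [tag for _, _, tag in sorted(weighted)]


def choose_locale(available, accept_language="", lang=None, default="en"):
    """Pick a locale: explicit lang first (assumed), else the browser preference."""
    candidates = [lang] if lang else parse_accept_language(accept_language)
    for candidate in candidates:
        if candidate in available:
            return candidate
        base = candidate.split("_")[0]     # fall back from 'fr_CH' to 'fr'
        if base in available:
            return base
    return default


available = {"en", "fr", "de", "zh_CN"}
print(choose_locale(available, "fr-CH, fr;q=0.9, en;q=0.8"))   # fr
print(choose_locale(available, "fr-CH, en;q=0.8", lang="de"))  # de
```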

If you only need to provide a template in a specific language the simplest method is to just edit the template file and update all of the labels to reflect the language of choice.

Administration and reporting interfaces

Funnelback’s administration and reporting interface is only available in English.

No cross-language capabilities

Funnelback does not provide any specific facilities to support the retrieval of documents in one language X when the query is expressed in another Y.

In some cases this may happen anyway, because a single document is written in X but contains sufficient content or metadata in Y to match the query, or is referenced by anchor text or other annotations in language Y.

Table of specific language capabilities

Supported in all listed languages: Crawling, Indexing, Querying, Synonyms, Display, Spelling Suggestions, Blending, Faceted Navigation based on URLs, Dates and Gscopes.

Language                ConNav  Sort  Stem  Locale  Best Bets  Metadata Faceted Navigation
English                 y       y     y     en      y          y
French                  0.5     y     y     fr      y          y
Italian                 0.5     y     y     it      y          y
German                  0.5     y     y     de      y          y
Korean                  no      y     n/a   ko      y          no
Japanese                no      y     n/a   jp      y          no
Chinese (traditional)   no      y     n/a   zh_TW   y          no
Chinese (simplified)    no      y     n/a   zh_CN   y          no
Malaysian               0.5     y     no    ms      y          y
Spanish                 0.5     y     y     es      y          y
Latin America Spanish   0.5     y     y     es      y          y
Portuguese              0.5     y     y     pt      y          y
Bahasa Indonesia        0.5     y     no    id      y          y
Vietnamese              0.5     y     no    vi      y          y
Russian                 0.5     y     y     ru      y          y
Dutch                   0.5     y     no    nl      y          y
Thai                    no      y     n/a   th      y          no

Notes:

  • Contextual navigation (ConNav) relies on a number of English-specific heuristics. Comparable heuristics have not been developed for other languages. However, we have observed that with other European languages the contextual navigation suggestions are sometimes useful and could potentially be made more useful with configuration. Because CJKT languages lack word boundaries, the same is not true of them. Accordingly, the European languages have been rated '0.5' and the CJKT languages 'no'.

  • Funnelback does not yet have extensive experience with some of the languages in the list and the rating of 'y' is based on the generality of our underlying multilingual capability rather than on thorough testing.

  • Funnelback uses locales defined by the underlying operating system. In our experience the POSIX locale mechanism supported on Linux and related operating systems has broader coverage, and we recommend its use. As a last resort it allows us to define new locales and modify existing ones if required. Generally speaking, there is little need for the search interface to distinguish between country variants of a language. For example, en_US and en_AU have their differences, but their stemming and sorting rules are basically the same.

© 2015- Squiz Pty Ltd