Multilingual facilities available without the need to specify language
Out-of-the-box Funnelback supports the following capabilities regardless of the query language, without specification of any language-specific options.
- Query matching, as described above.
- Query-biased summaries.
- Query term highlighting in presented results.
- Spelling suggestions (did you mean).
- Query blending.
- Best bets.
- Faceted navigation.
- Contextual navigation (unlikely to work well with CJKT languages).
Because users may be unable to type accented letters (or may not know how), Funnelback by default makes unaccented query words match accented forms as well. For example, 'Universite de Neuchatel' will also match 'Université de Neuchâtel'. This behaviour can be disabled if required.
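The accent-insensitive matching described above can be illustrated with a short Python sketch. This is not Funnelback's implementation; the function names fold_accents and accent_insensitive_equal are hypothetical, and the sketch assumes that folding both sides to their unaccented form is an acceptable approximation of the behaviour.

```python
import unicodedata


def fold_accents(text):
    """Remove combining marks so 'é' becomes 'e', 'â' becomes 'a', and so on."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_equal(query, indexed_text):
    """Hypothetical helper: compare two strings ignoring accents and case."""
    return fold_accents(query).casefold() == fold_accents(indexed_text).casefold()


print(fold_accents("Université de Neuchâtel"))
print(accent_insensitive_equal("Universite de Neuchatel", "Université de Neuchâtel"))
```

Note that this sketch folds both the query and the document text, so an accented query would also match unaccented text; whether Funnelback behaves that way is not stated here.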
Capabilities specific to particular languages
Some search functions are specific to a language. Obvious examples are linguistic operations such as stemming (e.g. hiboux -> hibou in French), decompounding (e.g. Gesundheitswissenschaft -> Gesundheit + Wissenschaft in German) and alternate orthographies for accented letters (e.g. Māori, Mäori, Maori and Maaori in New Zealand). Less obvious is the need for language-specific behaviour when sorting search results by title or metadata field (e.g. in Czech, accented letters are grouped with the unaccented form and 'ch' sorts between 'h' and 'i').
Funnelback supports language (or country) specific sorting using the locale mechanisms supplied by the operating system. You specify the language or locale to be used by means of a 'lang=' option or CGI parameter. For example, 'lang=cs' or 'lang=cs_CZ' will activate the Czech sorting rules.
Note that use of a particular locale may require prior installation of a specific language pack on the server(s) to be used.
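To make the Czech example concrete, the following Python sketch builds a sort key that applies only the two rules mentioned above: accents are ignored for ordering, and 'ch' is treated as a single letter between 'h' and 'i'. It is a simplified illustration, not the operating system's Czech locale (which Funnelback relies on and which has more detail); the names ALPHABET and czech_sort_key are hypothetical.

```python
import unicodedata

# Simplified alphabet: the digraph 'ch' is one letter placed between 'h' and 'i'.
ALPHABET = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k", "l",
            "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]
RANK = {letter: position for position, letter in enumerate(ALPHABET)}


def czech_sort_key(word):
    """Return a tuple of letter ranks usable as a key for sorted()."""
    # Group accented letters with their unaccented form.
    decomposed = unicodedata.normalize("NFD", word.casefold())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    key = []
    i = 0
    while i < len(folded):
        if folded.startswith("ch", i):
            key.append(RANK["ch"])  # consume the digraph as one letter
            i += 2
        else:
            # Characters outside the alphabet sort after all letters.
            key.append(RANK.get(folded[i], len(ALPHABET)))
            i += 1
    return tuple(key)


titles = ["ideál", "chata", "hora", "cena"]
print(sorted(titles))                      # plain code-point order
print(sorted(titles, key=czech_sort_key))  # 'chata' now follows 'hora'
```

In practice the 'lang=' option delegates this work to the installed locale, so no such table needs to be maintained by hand.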
If a 'lang=' parameter is specified and stemming or stopword elimination is invoked, a language-specific stemmer and stopword list will be used where available. In Funnelback version 12.0, language-specific stemmers and stopword lists are available for:
Funnelback also includes stopword lists (but not yet stemmers) for
Additional language-specific resources have been and will be developed in response to client demand.
Note that stemming generally does not substantially improve result quality and sometimes causes harm. Funnelback generally performs well in languages (such as Norwegian, Gaelic, Hungarian, Turkish etc.) for which we do not have a specific stemmer.
CJKT - Chinese, Japanese, Korean and Thai
Funnelback indexes are generally word-based, which causes a potential problem for languages that have no explicit word boundaries. The most prominent examples of these are the CJKT languages.
Funnelback's indexer and query processor (PADRE) addresses this problem by detecting text in character sets associated with the CJKT languages and indexing this as unigrams and overlapping bigrams. If A-E represent CJKT characters and the sequence ABCDE is encountered in a document or in a query, PADRE indexes A, B, C, D, E (unigrams), and AB, BC, CD, DE (overlapping bigrams). Because of this approach, certain operators are not supported for CJKT queries and special arrangements are made to support query term highlighting in summaries etc.
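The unigram and overlapping-bigram scheme can be sketched in a few lines of Python. This only illustrates the term generation described above for a run of CJKT characters; it is not PADRE's code, and the function name cjkt_index_terms is hypothetical.

```python
def cjkt_index_terms(run):
    """Return the unigrams followed by the overlapping bigrams of a character run."""
    unigrams = list(run)
    # Each bigram starts one character after the previous one, so they overlap.
    bigrams = [run[i:i + 2] for i in range(len(run) - 1)]
    return unigrams + bigrams


print(cjkt_index_terms("ABCDE"))
# ['A', 'B', 'C', 'D', 'E', 'AB', 'BC', 'CD', 'DE']
```

Because the same decomposition is applied to documents and to queries, a query for BCD would be looked up via B, C, D, BC and CD, all of which are present in the index for ABCDE.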
Note that in version 12.0 the -cjkt and -bigrams switches required in previous versions are no longer needed (and are no longer permitted).
Chinese: Simplified and Traditional
When indexing Chinese content it is advisable to use the indexer option -unimap=tosimplified (or -unimap=totraditional) to convert all content into either simplified or traditional characters. Queries submitted to an index in which characters have been mapped will automatically be mapped in the same way.
Non-English User Interfaces
Language-specific access to a multilingual search facility can be achieved by defining language-specific interface templates. For example, example.com could provide a Japanese interface to search by defining a 'japanese' template (a relatively simple process) and adding '&form=japanese' to the search URL.
As of Funnelback version 12.0, an alternative mechanism is provided which allows localisation of a single template file. See Modern UI localisation guidelines for details.
Configuration files which specify locale-dependent versions of messages such as 'search' and 'xxx fully matching' are named for the locale. When results are presented, localised messages are extracted from the configuration file nominated by your browser's 'accept_language' string or by an explicit 'lang=' parameter in the search URL.
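The selection of a locale from the browser's language preferences or an explicit 'lang=' parameter might look roughly like the Python sketch below. This is an assumption-laden illustration rather than Funnelback's actual logic: the function names, the rule that an explicit 'lang=' value takes precedence, the fallback from a country variant to its base language, and the default of 'en' are all hypothetical.

```python
def parse_accept_language(header):
    """Return the language tags from an Accept-Language style string, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # a tag without an explicit q-value has full weight
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        # Sort by descending quality, then by original position.
        weighted.append((-quality, position, tag.strip()))
    return [tag for _, _, tag in sorted(weighted)]


def choose_locale(accept_language, lang_param=None, available=("en",), default="en"):
    """Pick the locale whose message file should be used (hypothetical rules)."""
    # Assumption: an explicit lang= parameter overrides the browser preference.
    candidates = [lang_param] if lang_param else parse_accept_language(accept_language)
    for tag in candidates:
        normalised = tag.replace("-", "_")
        if normalised in available:
            return normalised
        base = normalised.split("_")[0]  # e.g. fr_CH falls back to fr
        if base in available:
            return base
    return default


print(choose_locale("fr-CH, fr;q=0.9, en;q=0.8", available=("en", "fr")))
print(choose_locale("en", lang_param="cs_CZ", available=("en", "cs")))
```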
Administration and Reporting Interfaces
Currently, Funnelback's administration and reporting interface is only available in English.
No Cross-Language Capabilities
Currently Funnelback does not provide any specific facilities to support the retrieval of documents in one language X when the query is expressed in another Y.
Cross-language matches may nevertheless occur in some cases, because a document written in X contains sufficient content or metadata in Y to match the query, or is referenced by anchor text or other annotations in language Y.
Table of specific language capabilities
|Latin America Spanish||y||y||y||y||y||y||y||y||0.5||y||y||es||y|
- Contextual navigation relies on a number of English-specific heuristics. Comparable heuristics have not been developed for other languages. However, we have observed that for other European languages the ConNav suggestions are sometimes useful and could potentially be made more useful with configuration. Due to the lack of word boundaries, the same is not true of CJKT languages. Accordingly, the European languages have been rated '0.5' and CJKT as 'no'.
- Funnelback does not yet have extensive experience with some of the languages in the list and the rating of 'y' is based on the generality of our underlying multilingual capability rather than on thorough testing.
- Funnelback uses locales defined by the underlying operating system. In our experience the POSIX locale mechanism supported on Linux and related operating systems has broader coverage and we would recommend its use. As a last resort, it allows us to define new locales and modify existing ones if required. Generally speaking, there is little need for the search interface to distinguish between country variants of a language. For example, en_US and en_AU have their differences, but stemming and sorting rules are basically the same.