The objective of our cross-language information retrieval (CLIR) research is to develop robust algorithms for search across many languages. The languages studied so far are Arabic, Chinese, French, German, Italian, Japanese, Russian, Spanish and, of course, English. We are participating in three major international evaluations of cross-language information retrieval: the TREC CLIR track (currently using English and French queries to search Arabic news document collections), the NTCIR Asian language retrieval workshop, and the Cross Language Evaluation Forum (CLEF) of the European Union.

Cross-language information retrieval rests upon two major foundations: robust monolingual retrieval algorithms and linguistic resources (bi-lingual dictionaries, machine translation software, etc.) for translating from the query language to the language of the document collection. Two levels of CLIR are distinguishable: bi-lingual retrieval between pairs of languages, and multi-lingual retrieval, where a topic in one language is run against document collections in several different languages and the goal is a single, language-independent ranked list of documents. Other factors that can affect performance are language-specific stemmers, morphology, phrase identification and translation, and decompounding of compound words (in German, for example).
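The two levels described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the dictionary entries, the function names translate_query and merge_ranked_lists, and the choice of min-max score normalization for merging are all hypothetical.

```python
# Toy English-to-French dictionary (hypothetical entries, for illustration only).
EN_TO_FR = {
    "water": ["eau"],
    "pollution": ["pollution", "contamination"],
}


def translate_query(terms, dictionary):
    """Replace each query term with all of its dictionary translations.

    Terms missing from the dictionary are passed through unchanged,
    which is a simple fallback for proper nouns.
    """
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term.lower(), [term]))
    return translated


def merge_ranked_lists(runs):
    """Merge per-language result lists into one ranked list.

    `runs` maps a language code to a list of (doc_id, score) pairs.
    Scores are min-max normalized within each run (an assumed merging
    strategy) so that runs from different collections are comparable.
    """
    merged = []
    for lang, results in runs.items():
        if not results:
            continue
        scores = [score for _, score in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
        for doc_id, score in results:
            merged.append((doc_id, lang, (score - lo) / span))
    # Highest normalized score first; ties keep their input order.
    return sorted(merged, key=lambda row: row[2], reverse=True)


french_query = translate_query(["Water", "pollution", "Rhine"], EN_TO_FR)
ranking = merge_ranked_lists({
    "fr": [("f1", 10.0), ("f2", 5.0)],
    "de": [("d1", 9.0), ("d2", 3.0), ("d3", 6.0)],
})
```

Keeping every translation of an ambiguous term, as done here, trades precision for recall; a real system would typically weight or filter the candidates.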

We have studied the effect of using parallel corpora to build bi-lingual lexicons, algorithms for word-boundary detection (segmentation) in Chinese and Japanese text, and the application of entry vocabulary technology to query expansion for domain-specific collections which have been manually indexed using controlled vocabularies and thesauri.
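To make the segmentation problem concrete, the sketch below uses forward maximum matching, a common dictionary-based baseline. It is not necessarily one of the algorithms studied in this project; the function name max_match and the tiny lexicon are assumptions made for illustration.

```python
def max_match(text, lexicon, max_len=4):
    """Segment `text` greedily, taking the longest lexicon word at each position.

    Characters that start no known word are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real one would hold many thousands of entries.
LEXICON = {"北京", "北京大学", "大学", "大学生", "学生"}
tokens = max_match("北京大学生", LEXICON)
```

On this input the greedy strategy yields "北京大学" followed by "生", although "北京" followed by "大学生" is also a valid reading. Ambiguities of this kind are why segmentation affects retrieval quality.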

Our current Multi-Lingual Information Retrieval (MULIR) entry vocabulary index maps from English Library of Congress Subject Headings (LCSH) to words and phrases in over 100 languages and vice versa. This prototype was created from over ten million records in the University of California MELVYL online library catalog.
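The general idea of such an index can be sketched as a table of co-occurrence counts between title words and the subject headings assigned to the same catalog record. The sketch below is an assumption-laden illustration, not the MULIR implementation: the records and heading strings are invented, the names build_index and suggest_headings are hypothetical, and raw counts stand in for whatever association measure the prototype actually uses.

```python
from collections import Counter, defaultdict

# Invented catalog records: (title words, assigned subject headings).
RECORDS = [
    (["histoire", "de", "la", "musique"], ["Music"]),
    (["musique", "baroque"], ["Music", "Baroque music"]),
    (["geschichte", "der", "musik"], ["Music"]),
]


def build_index(records):
    """Count how often each title word co-occurs with each heading."""
    index = defaultdict(Counter)
    for words, headings in records:
        for word in set(words):  # count each word once per record
            for heading in headings:
                index[word][heading] += 1
    return index


def suggest_headings(index, word, k=3):
    """Return up to k headings most often associated with `word`."""
    return [heading for heading, _ in index.get(word, Counter()).most_common(k)]


index = build_index(RECORDS)
french = suggest_headings(index, "musique")
german = suggest_headings(index, "musik")
```

Inverting the same counts, from heading to words, would give the reverse mapping from a subject heading to associated vocabulary in each language.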

