Cross-Language Information Retrieval (CLIR)
The objective of our cross-language information retrieval research
is to develop robust algorithms for searching across many languages.
The languages studied so far are Arabic, Chinese, French, German, Italian,
Japanese, Russian, Spanish, and, of course, English. We are participating in three
major international evaluations of cross-language information retrieval:
the TREC CLIR track (currently using English and French queries to search
Arabic news document collections), the NTCIR Asian language retrieval
workshop, and the Cross-Language Evaluation Forum (CLEF) of the European Union.
Cross-language information retrieval rests upon two major foundations:
robust monolingual retrieval algorithms and linguistic resources (bi-lingual
dictionaries, machine translation software, etc.) to translate from the
query language to the document collection's language. Two levels of
CLIR can be distinguished: bi-lingual retrieval between a pair of languages,
and multi-lingual retrieval, where a topic in one language is run
against document collections in several different languages to produce
a single language-independent ranked list of documents. Other
factors that can affect performance include language-specific stemmers, morphology,
phrase identification and translation, and decompounding of compound words
(in German, for example).
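
As a rough illustration of the two levels described above, the following Python
sketch translates a query word by word with a bilingual dictionary and then
merges per-language result lists into one ranking. It is a minimal sketch under
stated assumptions, not our actual system: the names TOY_DICTIONARY,
translate_query and merge_ranked_lists are hypothetical, the dictionary entries
are toy data, and min-max score normalization is only one of several possible
merging strategies.

```python
from typing import Dict, List, Tuple

# Hypothetical toy English-to-German dictionary; a real system would load a
# full bilingual lexicon or call machine translation software instead.
TOY_DICTIONARY: Dict[str, List[str]] = {
    "water": ["wasser"],
    "pollution": ["verschmutzung", "verunreinigung"],
}


def translate_query(query: str, dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each query word with all of its dictionary translations.

    Words missing from the dictionary are kept unchanged, which helps with
    proper names that are spelled the same in both languages.
    """
    translated: List[str] = []
    for word in query.lower().split():
        translated.extend(dictionary.get(word, [word]))
    return translated


def merge_ranked_lists(
    runs: Dict[str, List[Tuple[str, float]]]
) -> List[Tuple[str, float]]:
    """Merge per-language (doc_id, score) lists into one ranked list.

    Scores are min-max normalized within each list (an assumption made for
    this sketch) so that scores from different collections are roughly
    comparable before sorting.
    """
    merged: List[Tuple[str, float]] = []
    for results in runs.values():
        if not results:
            continue
        scores = [score for _, score in results]
        low, high = min(scores), max(scores)
        span = (high - low) or 1.0  # avoid division by zero for flat lists
        merged.extend((doc, (score - low) / span) for doc, score in results)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```

Keeping every translation of an ambiguous word trades precision for recall; a
fuller system would weight or disambiguate the alternatives.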
We have studied the effect of using parallel corpora for building bi-lingual
vocabulary lexicons, algorithms for word-boundary detection (segmentation)
in Chinese and Japanese text, and the application of entry vocabulary technology
to query expansion for domain-specific collections which have been manually
indexed using controlled vocabularies and thesauri.
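
To make the word-boundary detection problem concrete, here is a minimal Python
sketch of forward maximum matching, a simple dictionary-based segmentation
baseline. It is offered only as an illustration and is not necessarily one of
the algorithms we studied; the function name segment_max_match, the max_len
parameter and the tiny lexicon in the usage example are all hypothetical.

```python
from typing import List, Set


def segment_max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Segment unspaced text by greedily taking the longest lexicon match.

    At each position, try the longest candidate first and shrink it until a
    lexicon entry is found; a single character is always accepted as a
    fallback so the loop is guaranteed to advance.
    """
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Usage with a toy lexicon (illustrative data only).
toy_lexicon = {"跨语言", "语言", "信息", "检索"}
words = segment_max_match("跨语言信息检索", toy_lexicon)
```

Greedy matching can choose the wrong boundary when a long lexicon entry
overlaps the start of the following word, which is one reason segmentation
quality affects retrieval performance.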
Our current Multi-Lingual Information Retrieval (MULIR) entry vocabulary
index maps from English Library of Congress Subject Headings (LCSH) to
words and phrases in over 100 languages and vice versa. This prototype
was created from over ten million records of the University of California
MELVYL online library catalog.
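
The idea of an entry vocabulary index can be sketched in Python as follows. This
is a simplified illustration under stated assumptions, not the MULIR
implementation: it associates title words with subject headings using raw
co-occurrence counts, and the names Record, build_entry_vocabulary and
suggest_headings, as well as the sample records, are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# A catalog record reduced to (title text, assigned subject headings).
Record = Tuple[str, List[str]]


def build_entry_vocabulary(records: Iterable[Record]) -> Dict[str, Counter]:
    """Count how often each title word co-occurs with each subject heading."""
    index: Dict[str, Counter] = defaultdict(Counter)
    for title, headings in records:
        # Use a set so a word repeated in one title is counted once per record.
        for word in set(title.lower().split()):
            index[word].update(headings)
    return index


def suggest_headings(index: Dict[str, Counter], word: str, top_n: int = 3) -> List[str]:
    """Return the headings most often seen with the given word."""
    counts = index.get(word.lower(), Counter())
    return [heading for heading, _ in counts.most_common(top_n)]


# Usage with toy multilingual records (illustrative data only).
sample_records: List[Record] = [
    ("Wasserverschmutzung am Rhein", ["Water--Pollution"]),
    ("Pollution de l'eau", ["Water--Pollution"]),
    ("Pollution atmosphérique", ["Air--Pollution"]),
    ("Pollution de l'eau douce", ["Water--Pollution", "Fresh water"]),
]
entry_index = build_entry_vocabulary(sample_records)
top_heading = suggest_headings(entry_index, "pollution", top_n=1)
```

Because the records pair titles in many languages with English headings, the
same counts can be inverted to go from a heading back to the words and phrases
associated with it.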
Papers & Reports