Research Area

Construction of a Cross-Language Information Retrieval system for the Web

This project (code Fit-150500-2002-416) is funded by the Ministry of Science and Technology (Project PROFIT) and runs from July 2002 to December 2003. The participating organizations are:

	University of Alicante: Antonio Ferrández Rodríguez, Rafael Muñoz Guillena, Jesús Peral Cortés, José Luís Vicedo González, Andrés Montoyo Guijarro, Fernando Llopis Pascual, Rafael Romero Jaén, David Tomás Díaz, Julio Martínez Larrosa, José Francisco Navarro Martínez
	University of Jaén: Luís Alfonso Ureña López, Manuel García Vega, Fernando Martínez Santiago, María Teresa Martín Valdivia, Manuel Carlos Díaz Galiano, Víctor Rivas Santos
	University of Sevilla: José A. Troyano Jiménez, Víctor J. Díaz Madrigal, Vicente Carrillo Montero, Francisco José Galán Morillo, Luisa M. Romero Moreno, José Miguel Cañete Valdeón, Javier Barroso Tristán, Fernando Enríquez de Salamanca Ros

Description

The primary goal of the project is to build an information retrieval (IR) system that integrates a series of natural language processing tools. The system aims to improve on traditional IR systems that work on the Web in three respects:

First, the system will work across languages: regardless of the language in which the user poses the question, it will return a list of documents, which may themselves be in different languages, and it will do so transparently for the user.
Second, it will use kinds of knowledge that traditional IR systems do not consider, such as lexical and syntactic analysis, among others.
Finally, the quality of the returned information will improve, since the system will return only the text snippets that contain the information required by the user instead of whole documents (that is, a Question Answering application).
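
The third point, returning snippets rather than whole documents, can be illustrated with a minimal sketch. It assumes a very simple scoring scheme (the number of distinct query terms found in a window of consecutive sentences). The function names and the scoring are illustrative assumptions, not the method used by the project's own system.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def split_passages(text, size=2):
    """Split a text into overlapping passages of `size` consecutive sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= size:
        return [" ".join(sentences)]
    return [" ".join(sentences[i:i + size])
            for i in range(len(sentences) - size + 1)]

def best_passage(query, documents, size=2):
    """Return (score, passage) for the passage sharing most distinct terms with the query."""
    query_terms = set(tokenize(query))
    best = (0, "")
    for doc in documents:
        for passage in split_passages(doc, size):
            score = len(query_terms & set(tokenize(passage)))
            if score > best[0]:
                best = (score, passage)
    return best

# Hypothetical two-document collection, written only for this example.
documents = [
    "Alicante is a city in Spain. It hosts a university. The weather is mild.",
    "Jaén produces olive oil. The province is in Andalusia.",
]
score, passage = best_passage("university Alicante", documents)
```

Here the caller receives a two-sentence passage instead of the full first document. A real system would replace the overlap count with a proper ranking function.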

The main scientific and technological goal of the project lies in the research field of Cross-Language Information Retrieval. This field is an extension of traditional Information Retrieval, which works on a single language: the question and the documents in which the information is sought are in the same language. The multilingual extension means that the question and the documents need not be in the same language. The objective of this project is therefore to search a document collection that may be in different languages, regardless of the language in which the question is posed. Although we plan to develop technology that makes it easy to incorporate new languages in the future, we will initially focus on the languages of the European Economic Community, restricting the application of Natural Language Processing (NLP) techniques to English and Spanish. This field also extends to Question Answering applications, in which the result is not the complete document but the text snippet that contains the answer to the user's question. One of the objectives of the project falls within this field, although it will be applied only to English and Spanish, since this type of application requires NLP techniques that increase the degree of understanding of the texts being searched. Another scientific objective of the project lies within Computational Linguistics, specifically NLP: adding new sources of knowledge to the search process in order to improve the precision and quality of the returned results. The information we expect to incorporate includes lexical and syntactic analysis, the resolution of linguistic problems, and word sense disambiguation.
This kind of information is not considered by the traditional IR systems currently available, which usually rely solely on the occurrences of words in documents. For example, these systems discard pronouns as non-content words, so the information those pronouns refer to is discarded as well. By resolving this type of anaphora beforehand, we expect to improve the precision of the searches, because the referenced information is no longer discarded. The set of documents the system works on will not be restricted, although a later specialization to restricted domains is anticipated, where the precision of the system can reasonably be expected to improve. The input data set to be searched will consist of heterogeneous, unstructured documents, that is, documents in natural language, in addition to the multilingual capability described above.
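
The effect of discarding pronouns can be seen in a small sketch. It assumes a toy stop list and a hand-written antecedent map that stands in for the output of an anaphora resolver; neither is taken from the project's tools.

```python
import re
from collections import Counter

# Toy stop list in which pronouns count as non-content words
# (an assumption for illustration, not the list of any particular system).
STOPWORDS = {"it", "he", "she", "they", "the", "a", "is", "was", "in", "and"}

def index_terms(tokens):
    """Count the terms that survive stop-word removal (bag of words)."""
    return Counter(t for t in tokens if t not in STOPWORDS)

def resolve_pronouns(tokens, antecedents):
    """Replace the token at each given position by its antecedent.

    `antecedents` maps a token position to the antecedent word. It is
    written by hand here and stands in for a real anaphora resolver.
    """
    return [antecedents.get(i, t) for i, t in enumerate(tokens)]

text = "The parser was built in Alicante and it is used in the project"
tokens = re.findall(r"\w+", text.lower())

# Without resolution, "it" is dropped and its reference to "parser" is lost.
plain = index_terms(tokens)

# With resolution, position 7 ("it") is replaced by "parser" before indexing.
resolved = index_terms(resolve_pronouns(tokens, {7: "parser"}))
```

In `plain` the term "parser" is counted once, while in `resolved` it is counted twice, so a frequency-based ranking would give the sentence more weight for a query about the parser.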

Tools to be used in the project

GEOGRAPHIC LOCALIZER.

INFORMATION RETRIEVAL SYSTEM.

WORD SENSE DISAMBIGUATION USING SPECIFICATION MARKS METHOD.
This application uses the specification marks method for word sense disambiguation. It was created at the University of Alicante by Andrés Montoyo in collaboration with May Calle and Sonia Vázquez.
- Run the application

SLOT UNIFICATION PARSER FOR ANAPHORA RESOLUTION (SUPAR).
A natural language processing system that includes a POS tagger, a partial parser and automatic anaphora resolution.

TREE-TAGGER: A LANGUAGE INDEPENDENT PART-OF-SPEECH TAGGER.
The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It has been successfully used to tag German, English, French, Italian, Greek and old French texts and is easily adaptable to other languages if a lexicon and a manually tagged training corpus are available.
- Run the application
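
As an illustration of how part-of-speech and lemma annotations can be consumed downstream, the following sketch parses tagger output. It assumes a three-column, tab-separated layout with one token per line (token, tag, lemma). The sample text and the names `TaggedToken` and `parse_tagged_lines` are hypothetical, and the sample is hand-written rather than produced by running the tagger.

```python
from typing import List, NamedTuple

class TaggedToken(NamedTuple):
    token: str
    pos: str
    lemma: str

def parse_tagged_lines(output: str) -> List[TaggedToken]:
    """Parse lines of the form token<TAB>tag<TAB>lemma, skipping malformed lines."""
    tagged = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) == 3:
            tagged.append(TaggedToken(*fields))
    return tagged

# Hand-written sample in the assumed layout.
sample = (
    "The\tDT\tthe\n"
    "systems\tNNS\tsystem\n"
    "retrieve\tVBP\tretrieve\n"
    "documents\tNNS\tdocument\n"
)
tagged = parse_tagged_lines(sample)
lemmas = [t.lemma for t in tagged]
```

Indexing lemmas instead of surface forms lets a query for "document" match a text that contains "documents".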

Publications derived from the project

Llopis, F.; Vicedo, J.L.; Ferrández, A.
"IR-n system at CLEF-2002"
Working Notes for CLEF 2002. Lecture Notes in Computer Science. 2002
- PDF File

Fernando Llopis; José L. Vicedo; Antonio Ferrández; Manuel C. Díaz; Fernando Martínez.
"Universities of Alicante and Jaen at iCLEF"
Working Notes for CLEF 2002. Lecture Notes in Computer Science. 2002
- PDF File

Vicedo, J.L.; Llopis, F.; Ferrández, A.
"University of Alicante Experiments at TREC-2002"
Eleventh Text REtrieval Conference (TREC-11). Gaithersburg, Maryland (USA). November 2002
- PDF File

Montoyo A., Suarez A., Palomar M.
"Combining supervised-unsupervised methods for Word Sense Disambiguation"
Lecture Notes in Computer Science. Springer-Verlag. CICLING'02. Volume 2276, pp. 156-164. Mexico. 2002
- PDF File

Muñoz R., Montoyo A.
"Definite description resolution enrichment with Wordnet domain labels"
Lecture Notes in Artificial Intelligence. Springer-Verlag. IBERAMIA'02. Volume 2527, pp. 645-654. Sevilla. 2002
- PDF File

Montoyo A., Romero R., Vazquez S., Calle C., Soler S.
"The Role of WSD for Multilingual Natural Language Applications"
Lecture Notes in Artificial Intelligence. Springer-Verlag. TSD'02. Volume 2448, pp. 41-48. Czech Republic. 2002
- PDF File

Soler S., Montoyo A.
"A Proposal for WSD Using Semantic Similarity"
Lecture Notes in Computer Science. Springer-Verlag. CICLING'02. Volume 2276, pp. 165-167. Mexico. 2002
- PDF File

Muñoz R., Saíz-Noeda M., Montoyo A.
"Semantic Information in Anaphora Resolution"
Lecture Notes in Artificial Intelligence. Springer-Verlag. PORTAL'02. Volume 2389, pp. 63-70. Portugal. 2002
- PDF File

Peral, J.; Ferrández, A.
"IL MT System. Evaluation for Spanish-English Pronominal Anaphora Generation"
Mexican International Conference on Artificial Intelligence MICAI-2002. Lecture Notes in Artificial Intelligence 2313:146-155. Mérida, Yucatán (Mexico). 2002
- PDF File

Martínez Santiago, F.; Martín Valdivia, M.T.; Ureña López, L.A.
"SINAI on CLEF 2002: Experiments with Merging Strategies"
In Working Notes of the Cross Language Evaluation Forum (CLEF 2002). Rome, Italy. 2002
- PDF File

Martínez Santiago, F.; Ureña López, L.A.
"Propuesta de un Sistema de Recuperación de Información Multilingüe"
In Proceedings of the I Jornadas de Tratamiento y Recuperación de Información. pp. 141-148. Valencia. 2002
- PDF File

Martínez F., Martín M. T., Rivas V. M., Díaz M. C., Ureña L. A.
"Using Neural Networks for Multiword Recognition in IR"
In Proceedings of the Seventh International ISKO Conference. pp. 559-564. Granada. 2002
- PDF File

Martín Valdivia, M.T.; García Vega, M.; Ureña López, L.A.
"Resolución de la Ambigüedad Mediante Redes Neuronales"
Revista de procesamiento de lenguaje natural No. 28, pp. 215-222. 2002
- PDF File

Martínez F., Díaz M. C., Martín M. T., Rivas V. M., Ureña L. A.
"Aplicación de redes neuronales y redes bayesianas en la detección de multipalabras para tareas IR"
In Proceedings of the I Jornadas de Tratamiento y Recuperación de Información. pp. 89-96. Valencia. 2002
- PDF File

For any questions or suggestions, email Antonio Ferrández Rodríguez.

Last update: January 17th, 2002