Construction of a Cross-Language Information Retrieval
system for the Web
This project (code Fit-150500-2002-416) is funded
by the Ministry of Science and Technology
(Project PROFIT) and runs from July 2002 to December 2003. The
participant organizations are:
The primary goal of the project is to build an information retrieval (IR)
system that integrates a series of natural language processing tools.
This system aims to improve on the traditional IR systems that
work on the Web from three points of view:
First, the system will work across languages: whatever language the
user's question is written in, it will return a list of documents that
may themselves be in different languages, and it will do so transparently
for the user.
Second, it will use new kinds of knowledge that traditional IR systems
do not take into account, such as lexical, syntactic and other analysis.
Finally, the quality of the returned information will be improved:
the system will return only the text snippets that contain the information
the user asked for, instead of whole documents (that is to say,
a Question Answering application).
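
The third point, returning a snippet instead of a whole document, can be illustrated with a minimal Python sketch. It is not the project's method: it simply assumes that the best snippet is the window of consecutive sentences sharing the most words with the question. The function name `best_snippet` and the sample text are hypothetical.

```python
import re


def best_snippet(document: str, question: str, window: int = 1) -> str:
    """Return the run of `window` consecutive sentences that shares the
    most distinct words with the question (ties go to the earliest run)."""
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    query_terms = set(re.findall(r"\w+", question.lower()))
    best, best_score = "", -1
    for start in range(max(1, len(sentences) - window + 1)):
        passage = " ".join(sentences[start:start + window])
        passage_terms = set(re.findall(r"\w+", passage.lower()))
        score = len(query_terms & passage_terms)
        if score > best_score:
            best, best_score = passage, score
    return best


# Illustrative text only.
text = "The library opens at nine. The museum was built in 1900. Tickets are cheap."
print(best_snippet(text, "When was the museum built?"))
```

A real Question Answering system would go further and use the linguistic analysis described in this page; word overlap is only the simplest possible baseline.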
The main scientific and technological goal of the project lies in the
Cross-Language Information Retrieval research field. This field extends
traditional Information Retrieval, which works on a single language:
the question and the documents that are searched are in the same language.
The "multilingual" extension means that the question and the documents do not
need to be in the same language. For that reason, the objective of this
project is to search a document collection that can be in different
languages, independently of the language in which the question is asked.
Although the project plans to develop technology that makes it easy to
incorporate new languages in the future, initially we will focus
on the languages of the European Economic Community, restricting the application
of Natural Language Processing (NLP) techniques to English and Spanish.
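
One common way to let a question in one language reach documents in another is to translate the query terms with a bilingual dictionary before searching. The text above does not say which technique the project uses, so the Python sketch below is only an assumption made for illustration. The dictionary `ES_EN` and both function names are hypothetical.

```python
# Toy Spanish -> English dictionary; the entries are illustrative assumptions.
ES_EN = {
    "recuperación": ["retrieval", "recovery"],
    "información": ["information"],
    "lengua": ["language", "tongue"],
}


def translate_query(terms, dictionary):
    """Expand each source-language term into all of its dictionary translations.
    Terms missing from the dictionary (e.g. proper nouns) are kept unchanged."""
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term.lower(), [term]))
    return translated


def multilingual_query(terms, dictionary):
    """Combine original and translated terms so that a single query can match
    documents written in either language."""
    combined = list(terms) + translate_query(terms, dictionary)
    return list(dict.fromkeys(combined))  # de-duplicate, keep first-seen order


print(multilingual_query(["información", "Alicante"], ES_EN))
```

Keeping every translation of an ambiguous term ("retrieval" and "recovery") adds noise to the query, which is one reason word sense disambiguation is relevant to this kind of system.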
This research field also extends to Question Answering applications, in
which the result is not the complete document but the text snippet that
contains the answer to the user's question. One of the objectives of the
project falls within this field, although it will be applied only to
English and Spanish, since this type of application requires NLP
techniques that increase the degree of understanding of the texts being
searched.
Another scientific objective of the project belongs to Computational
Linguistics, specifically to NLP: the aim is to add new sources of
linguistic knowledge to the search process, which will improve the precision
and quality of the returned results. The information to be incorporated
includes lexical and syntactic analysis, the resolution of linguistic
problems, and word sense disambiguation. This kind of information
is not considered by the traditional IR systems currently available,
which are usually based solely on the occurrences of words in documents.
For example, these systems discard pronouns as non-content words, so the
information that these pronouns refer to is also discarded. By resolving
this type of anaphora beforehand, we can improve the precision of the
searches, because the referred information is no longer discarded.
The set of documents the system works on will not be restricted, although
a later specialization to restricted domains is planned, where the
precision of the system can be expected to improve. The input data in
which information will be searched consists of heterogeneous, unstructured
documents, that is to say, documents in natural language, in addition to
the multilingual capability described above.
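
The pronoun example above can be made concrete with a short Python sketch. It assumes a hypothetical anaphora resolver that outputs a mapping from token position to antecedent; the stop-word list, function names and sample sentence are also illustrative assumptions, not part of the project's tools.

```python
import re

# Small illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "a", "of", "in", "it", "he", "she", "they", "is", "was"}


def index_terms(text):
    """Content terms a bag-of-words index would keep after stop-word removal."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


def substitute_antecedents(tokens, resolved):
    """Replace each resolved pronoun by its antecedent.
    `resolved` maps token position -> antecedent (hypothetical resolver output)."""
    return [resolved.get(i, tok) for i, tok in enumerate(tokens)]


tokens = "The tagger reads text . It assigns tags .".split()
resolved = {5: "tagger"}  # assumption: "It" at position 5 refers to "tagger"

before = index_terms(" ".join(tokens))
after = index_terms(" ".join(substitute_antecedents(tokens, resolved)))
print(before.count("tagger"), after.count("tagger"))  # prints "1 2"
```

Without resolution the second sentence contributes nothing about the tagger to the index; with resolution the term is counted twice, so frequency-based weighting reflects what the text is actually about.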
Tools to be used in the project
GEOGRAPHIC LOCALIZER.
A natural language interface to a geographic database. The
database stores information about buildings, activities and departments
of the University of Alicante, namely the coordinates
of these locations on an aerial photo of the University.
The user can therefore ask for the location of a building in natural
language (in Spanish), and the system shows the aerial photo of the
university with the requested area marked by a grid.
INFORMATION RETRIEVAL SYSTEM.
An information retrieval system that takes as input either complete
natural language sentences or a set of keywords and returns a list of
documents ranked by their relevance to the query. It uses a collection
of 423 documents in English containing news from the Times newspaper.
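
Ranking documents by relevance to a query can be sketched in Python as follows. The description above does not state which weighting scheme the system uses; this sketch assumes TF-IDF weights with cosine similarity, a standard baseline. The function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def rank(query, documents):
    """Return (document index, score) pairs sorted by decreasing TF-IDF
    cosine similarity between the query and each document."""
    doc_tokens = [tokenize(d) for d in documents]
    n = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))

    def vector(tokens):
        # Smoothed IDF, so every term keeps a positive weight.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                for t, c in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    q = vector(tokenize(query))
    scores = [(i, cosine(q, vector(tokens))) for i, tokens in enumerate(doc_tokens)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = [
    "the election results were announced",
    "football match results",
    "weather forecast for tomorrow",
]
for index, score in rank("election results", docs):
    print(index, round(score, 3))
```

The query can be a full sentence or a bag of keywords; both go through the same tokenizer, which matches the two input modes mentioned above.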
WORD SENSE DISAMBIGUATION USING SPECIFICATION MARKS METHOD.
This application performs word sense disambiguation using the specification marks method.
It was created at the University of Alicante by Andrés Montoyo
with the collaboration of May Calle and Sonia Vázquez.
SLOT UNIFICATION PARSER FOR ANAPHORA RESOLUTION (SUPAR).
A Natural Language Processing system that includes a POS tagger, a partial parser and automatic anaphora resolution.
TREE-TAGGER: A LANGUAGE INDEPENDENT PART-OF-SPEECH TAGGER.
The TreeTagger is a tool for annotating text with part-of-speech and lemma information.
It has been successfully used to tag German, English, French, Italian, Greek and
Old French texts and is easily adaptable to other languages if a lexicon and
a manually tagged training corpus are available.
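
As an illustration of how the tagger's annotations could be consumed by other tools, the Python sketch below parses tab-separated lines of the form token, part-of-speech tag, lemma. It assumes the tagger was run so that both the token and the lemma are printed; the names `TaggedToken` and `parse_tagged_lines` and the sample lines are hypothetical.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    token: str
    pos: str
    lemma: str


def parse_tagged_lines(output: str) -> List[TaggedToken]:
    """Parse tab-separated token/POS/lemma lines.
    Lines that do not have exactly three fields (blank lines, markup) are skipped."""
    tagged = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) == 3:
            tagged.append(TaggedToken(*fields))
    return tagged


# Illustrative sample, not real tagger output.
sample = "The\tDT\tthe\ndocuments\tNNS\tdocument\n"
for item in parse_tagged_lines(sample):
    print(item.token, item.pos, item.lemma)
```

Indexing the lemma column instead of the surface token is one simple way for an IR system to match "documents" against a query containing "document".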
Fernando Llopis; José L. Vicedo; Antonio Ferrández; Manuel C. Díaz; Fernando Martínez.
"Universities of Alicante and Jaen at iCLEF"
Working Notes for CLEF 2002. Lecture Notes in Computer Science. 2002
Vicedo, J.L.; Llopis, F.; Ferrández, A.
"University of Alicante Experiments at TREC-2002"
Eleventh Text REtrieval Conference (TREC-11). Gaithersburg, Maryland (USA). November 2002
Montoyo, A., Suarez, A., Palomar, M.
"Combining supervised-unsupervised methods for Word Sense Disambiguation"
Lecture Notes in Computer Science. Springer-Verlag. CICLING'02. Volume: 2276. pp. 156-164. Mexico. 2002
Muñoz R., Montoyo A.
"Definite description resolution enrichment with Wordnet domain labels"
Lecture Notes in Artificial Intelligence. Springer-Verlag. IBERAMIA'02. Volume: 2527. pp. 645-654. Sevilla. 2002
Montoyo A., Romero R., Vazquez S., Calle C., Soler S.
"The Role of WSD for Multilingual Natural Language Applications"
Lecture Notes in Artificial Intelligence. Springer-Verlag. TSD'02. Volume: 2448. pp. 41-48. Czech Republic. 2002
Soler S., Montoyo A.
"A Proposal for WSD Using Semantic Similarity"
Lecture Notes in Computer Science. Springer-Verlag. CICLING'02. Volume: 2276. pp. 165-167. Mexico. 2002
Muñoz R., Saíz-Noeda M., Montoyo A.
"Semantic Information in Anaphora Resolution"
Lecture Notes in Artificial Intelligence. Springer-Verlag. PORTAL'02. Volume: 2389. pp. 63-70. Portugal. 2002
Peral, J.; Ferrández, A.
"IL MT System. Evaluation for Spanish-English Pronominal Anaphora Generation"
Mexican International Conference on Artificial Intelligence MICAI-2002. Lecture Notes in Artificial Intelligence 2313:146-155. Mérida, Yucatán (Mexico). 2002
Martínez Santiago, F.; Martín Valdivia, M.T.; Ureña López, L.A.
"SINAI on CLEF 2002: Experiments with Merging Strategies"
In Working Notes of the Cross Language Evaluation Forum (CLEF 2002). Rome, Italy. 2002
Martínez Santiago, F.; Ureña López, L.A.
"Propuesta de un Sistema de Recuperación de Información Multilingüe"
In Proceedings of the I Jornadas de Tratamiento y Recuperación de Información. pp. 141-148. Valencia. 2002
Martínez F., Martín M. T., Rivas V. M., Díaz M. C., Ureña L. A.
"Using Neural Networks for Multiword Recognition in IR"
In Proceedings of the Seventh International ISKO Conference. pp. 559-564. Granada. 2002
Martín Valdivia, M.T.; García Vega, M.; Ureña López, L.A.
"Resolución de la Ambigüedad Mediante Redes Neuronales"
Revista de procesamiento de lenguaje natural No. 28, pp. 215-222. 2002
Martínez F., Díaz M. C., Martín M. T., Rivas V. M., Ureña L. A.
"Aplicación de redes neuronales y redes bayesianas en la detección de multipalabras para tareas IR"
In Proceedings of the I Jornadas de Tratamiento y Recuperación de Información. pp. 89-96. Valencia. 2002