GoURMET parallel corpora for low-resource languages
Monolingual and parallel corpora for a number of
low-resource languages (such as Swahili, Turkish,
Amharic and Kyrgyz, among others) crawled as part of
the Universitat d'Alacant contribution to the GoURMET project
(Global Under-Resourced MEdia Translation), funded by
the European Union (grant agreement id 825299). The
corpora are available through the GoURMET webpage.
Download corpora
GoURMET translation models for low-resource language pairs
Neural machine translation models for
translation between English and a number of
low-resource languages (such as Swahili, Pashto and
Macedonian, among others) developed as part of the
Universitat d'Alacant contribution to the GoURMET project
(Global Under-Resourced MEdia Translation; grant
agreement id 825299). Dockerised translation models are
available through the GoURMET webpage.
Download translation models
Morphological segmentation using Apertium resources
Free/open-source tool for using Apertium resources for the
segmentation of texts. Useful as a pre-processing step
before using BPE for training neural machine
translation systems. Funded by the EU through the GoURMET
project (grant agreement id 825299).
Download
LinguaCrawl: Top-level domain crawler
Free/open-source tool implemented in Python 3 to
crawl a number of top-level domains to download any
text documents in the languages specified by the user.
Funded by the EU through the GoURMET project
(grant agreement id 825299).
Download
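As an illustration of the top-level-domain restriction described above, the following is a minimal sketch, not LinguaCrawl's actual code: the function names `in_allowed_tlds` and `filter_frontier` and the example URLs are hypothetical, and only the Python standard library is used. It keeps the candidate URLs whose host ends in one of the top-level domains chosen by the user.

```python
from urllib.parse import urlparse


def in_allowed_tlds(url, allowed_tlds):
    """Return True if the URL's host ends in one of the allowed TLDs.

    `allowed_tlds` is a set of lowercase TLDs without the leading dot.
    """
    host = urlparse(url).hostname  # None when the string has no network location
    if host is None:
        return False
    tld = host.rsplit(".", 1)[-1].lower()
    return tld in allowed_tlds


def filter_frontier(urls, allowed_tlds):
    """Keep only the URLs that belong to the requested top-level domains."""
    # Normalise user input such as ".ke" or "TZ" to "ke" and "tz".
    allowed = {t.lower().lstrip(".") for t in allowed_tlds}
    return [u for u in urls if in_allowed_tlds(u, allowed)]


if __name__ == "__main__":
    candidates = [
        "http://habari.example.ke/a",
        "https://example.com/b",
        "not a url",
    ]
    print(filter_frontier(candidates, [".ke", "tz"]))
```

A real crawler would additionally need language identification on the downloaded text, politeness delays and robots.txt handling, none of which are shown here.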
LASERtrain (language-agnostic sentence embeddings)
Free/open-source piece of software that reproduces
the architecture described by Artetxe and
Schwenk (2018, 2019) to train language-agnostic
sentence embeddings. Funded by the EU through the GoURMET
project (grant agreement id 825299).
Download
IMPACT-es diachronic corpus
Diachronic corpus of historical Spanish that
compiles over one hundred books (containing
approximately 8 million words), together with a
complementary lexicon which links more than 10 thousand
lemmas with attestations of the different variants
found in the documents. Released under an open Creative
Commons BY-NC-SA license.
Download : Related paper
ruLearn: toolkit for the automatic inference of shallow-transfer rules for MT
Free/open-source toolkit for the automatic inference
of rules for shallow-transfer MT from scarce parallel
corpora and morphological dictionaries. ruLearn makes it
possible to build machine translation systems for
under-resourced language pairs because it avoids the
need for human experts to handcraft transfer rules and
requires only a few hundred parallel sentences. The
inferred rules can be used for rule-based MT, or
together with a hybridisation strategy for integrating
linguistic resources into phrase-based statistical
machine translation (see Rule2Phrase).
Download : Read paper
Rule2Phrase: toolkit for integrating shallow-transfer rules into phrase-based SMT
Free/open-source toolkit to enrich a phrase-based
SMT system (Moses) with phrase
pairs generated from the linguistic resources of a
shallow-transfer rule-based MT system (Apertium). A system
built with this toolkit was not outperformed by any
other participant in the shared translation task of the
Sixth Workshop on Statistical Machine Translation (WMT
11) for the Spanish–English language pair.
Download : Read paper
Gamblr-CAT: word-level quality estimation in TM-based CAT
Free/open-source software to obtain binary quality
estimations at the level of words (also called
word-keeping recommendations) for translation
suggestions produced by a translation memory tool by
using either statistical word alignments or external
sources of bilingual information.
Download : Read paper
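To make the idea of word-keeping recommendations based on word alignments more concrete, here is a rough sketch under stated assumptions; it is not Gamblr-CAT's implementation. It assumes that we already know which source-side words of the translation memory suggestion match the segment being translated, and that word alignments are given as (source index, target index) pairs. The majority-vote rule and the name `word_keeping_recommendations` are illustrative choices only.

```python
def word_keeping_recommendations(matched_source, alignments, target_len):
    """Return one recommendation per target word of a TM suggestion.

    matched_source: set of source-word indices (in the TM source segment)
        that also match the new segment to translate (assumed given).
    alignments: iterable of (source_idx, target_idx) word-alignment pairs.
    target_len: number of words in the TM target segment.

    Each target word gets "keep", "change", or None (no evidence or a tie).
    """
    # Collect the source words aligned to each target word.
    aligned = [[] for _ in range(target_len)]
    for s, t in alignments:
        aligned[t].append(s)

    recommendations = []
    for sources in aligned:
        if not sources:
            recommendations.append(None)  # unaligned word: no recommendation
            continue
        matched = sum(1 for s in sources if s in matched_source)
        unmatched = len(sources) - matched
        if matched > unmatched:
            recommendations.append("keep")
        elif unmatched > matched:
            recommendations.append("change")
        else:
            recommendations.append(None)  # tie: abstain
    return recommendations


if __name__ == "__main__":
    recs = word_keeping_recommendations(
        matched_source={0, 1},
        alignments=[(0, 0), (1, 1), (2, 2), (1, 3), (2, 3)],
        target_len=5,
    )
    print(recs)
```

The sketch abstains when the evidence is missing or balanced; a practical system would have to decide how to treat such words.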
Gamblr-MT: word-level quality estimation in MT
Collection of free/open-source scripts to extract a
set of features for word-level MT quality
estimation using external sources of bilingual
information.
Download : Read paper
DocTrans: document translation retrieval based on SMT techniques
Free/open-source piece of software implementing a
method based on SMT techniques to retrieve documents
which are a plausible translation of a given source
text. The method provides the terms to use in a query
to retrieve the document translation of the source
document provided as input. In combination with a text
search engine like Apache Lucene it can
be used for translation document alignment. It relies
on the free/open-source SMT system Moses and was last
tested with revision 2281.
Download : Read paper
Apertium-tagger-training-tools: target-language-driven POS tagger trainer
Free/open-source package for the unsupervised
training of hidden-Markov-model-based POS taggers
involved in MT. It uses information not only from the
source language but also from the target language; to
this end, the Apertium MT platform is
used. After training, a file containing the
hidden-Markov-model parameters is produced; this file
can be directly used within the Apertium MT
platform.
Download : Read paper
Apertium-morph: using morphological information with Apache Lucene
Free/open-source package providing a set of tools
and Java classes that allow the Apache Lucene text
search engine to use morphological information to index
and search. To that end, the linguistic resources
developed for the Apertium MT platform are
used to extract morphological information while
indexing.
Download : Read paper