Automatic Translation of Nominal Compounds from English to Hindi
Language Technologies Research Centre
International Institute of Information Technology
Automatic Translation of English Nominal Compound in Hindi
Abstract
translation‟ and so on2. Rackow et al. (1992)
has rightly observed that the two main issues
correctly in the target language involves a)
correctness in the choice of the appropriate
target lexeme during lexical substitution and
b) correctness in the selection of the right
method comprises the following steps:
target construct type. The issue stated in (b)
parallel corpus of English and Hindi that we
English corpus (3) Finding the appropriate
found that English nominal compounds can
using WSD tool (4) Lexical substitution of
„Hindu texts‟ hindU SastroM, „milk
production‟ dugdha utpAdana
significantly improves the performance of
temperature‟ kamare ke tApamAn
distinct from all the previous works done
„nature cure‟ prAkrtik cikitsA, „hill
1.0 Introduction
The words prAkrtik and pahARI being
adjectives derived from prakriti and
frequently occurring expression in English1.
wax work mom par ciwroM „work
preceding noun the modifier as found in
„cow milk‟, „road condition‟, „machine
body pain SarIr meM dard „pain in body‟
1 Tanaka and Baldwin (2004) reports that the
BNC corpus (84 million words: Burnard (2000))
has 2.6% and the Reuters has (108M words:
2 A nominal compound may be constituted of a
Rose et al. (2002)) 3.9% of bigram nominal
more complex structure as „customer satisfaction
indices‟, „social service department‟ and so on.
Proceedings of ICON-2009: 7th International Conference on Natural Language Processing, Macmillan Publishers, India. Also accessible from http://ltrc.iiit.ac.in/proceedings/ICON-2009
one described in this paper follow a template based corpus search approach. However, the
Hand luggage haat meM le jaaye
present system distinctly differs from the
jaane vaale saamaan „luggage to be
aforementioned works for the analysis stage.
Our system, unlike others, attempts to select
the correct sense of nominal components by
However, no definite clue is available in the
running a WSD system on the SL data. As a
data that helps one in selecting the right
construction type of Hindi for translating a
translation candidates to be searched in the
reduced. Translation of nominal compound
system attempting to translate a corpus will
run across NCs with high frequency, but that
Selection from target language Hindi (2)
Extraction of NCs from English corpus (3)
occurring only once). The upshot of this for
Finding relevant sense of the components of
compounds are too varied to be precompiled
into an exhaustive list of translated
NN compounds. The system must be able to
deal with novel NN compounds on the fly.
The next section describes the data in some
Building an automatic translation system for
detail. In section 3, we review earlier works
that have followed similar approaches as the
language (SL) English to the target language
present work. Our approach is described in
(TL) Hindi thus becomes a very challenging
section 4. Finally the result and analysis is
could achieve an accuracy of 45% with the
same test data that we have used to evaluate
At the time of taking up the present project
Hindi. When an NC is translated in genitive
English-Hindi parallel corpora in order to
construction in Hindi, the translator could
identify the distribution of various construct
return the correct result 10% of cases. For
types which English NC are aligned to. We
other cases such as when NC translated as
took a parallel corpus of around 50,000
Adjective noun pair or as a single word, the
sentences in which we got 9246 sentences
performance of Google translator is poor.
(i.e. 21% cases of the whole corpus) that has
This paper presents the architecture of a “Nomin
various translations is given in Table 1.
that has been able to give an accuracy of
We have also come across some cases where
an NC corresponds to a paraphrase construct
test data. We limit our discussion to English
for which we have not given a count in this
two word nominal compounds in this paper.
table. There are 0.08% cases (see table 1)
The approach adopted to build the system
when an English NC becomes a single word
has a close resemblance to the approaches
described in Bungum and Oepen (2009) for
either be a simple word as in („cattle
dung‟ gobar) or a compounded word such
translation and Tanaka and Baldwin (2004)
as „blood pressure‟ raktacApa, „transition
(English to Japanese nominal compound and
vice versa). All these works including the
(Rackow et al. (1992)) and b) corpus search
based probabilistic approach (Bungum and
Oepen (2009) (henceforth B&O), Tanaka
and Baldwin (2004) (henceforth T&B)).
Rackow et al. tried to set a mapping between
the head noun of source language and target
language in terms of some grammatical and
selecting the right lexical item for the target
Table 1: Distribution of translations of
B&O and T&B have close similarity to ours
English NC from an English-Hindi parallel
as far as the template generation and the
procedure of corpus search is concerned.
The above table records major translation
represent various construct types of the
types. There are 1208 cases (approximately
templates in a huge corpus. The two works
is not translated but transliterated in Hindi.
differ in using different strategy for ranking
They are mostly technical terms, names of
of the possible translated candidates that are
found in the corpus. We have adopted the
T&B proposal for ranking. T&B suggests
The figure given in Table 1 is a report of the
ranking candidate translation based on target
empirical study performed on English-Hindi
essentially corpus frequency. They develop
translation templates that represent the
construct types of Hindi (as in table 1). In
(Corpus-based translation quality) metric”
which extracts frequency counts from the
templates are used for searching possible
target language corpus (for the details see
translation in Hindi raw corpus. From table
1, we come to know that the frequency of
both B&O and T&B disregard local contexts
and do not attempt to identify the sense of
second highest construction is the genitive
construct. In parallel, we have performed a
They have, on the other hand, taken into
study with Hindi informants to find out how
account all possible translations of the
corpus search. In this way the number of
syntactic genitive construct even when it can
have other more accurate translation. Our
experiment shows that a nominal compound
is well accepted as a genitive construct in
Hindi in 59% of cases. This is an interesting
compound in the given context, that is, the
finding which we have used in designing the
sentence in which it has occurred. In this
other works referred to in this section.
3.0 Related Works
While working on the automatic translation
4.0 Preparation of Data and Approach
of English nominal compound to Hindi, we
following stages: a) Preparation of data and
translated construct type of English NC in
template generation b) Determining sense of
the component nouns in the given context, c)
inspected and generalized into translation
templates. As shown in section 2, the two
dictionary, d) corpus search using translation
templates <E1 E2> <H1 H2> (see footnote 5) and <E1
templates and e) Ranking of the possible
E2> <H1 kA H2> (see footnote 6) are the most frequent
ones. The other interesting candidate is
Adjective noun phrase in Hindi. Hindi has a
4.1 Preparation of Source Language Data
formation. In this work we have identified
Two sets of language data are prepared for
the work. The first set is a parallel corpus of
4.3 Sense Selection for Source Language
The context determines the sense of a given
identified in the Hindi target language3. The
component nouns are taken independently,
they might represent more than one sense.
sentences of English on which we have run
For each sense the English word might be
tagger not only gives part of speech of the
words but also outputs the lemma for each
bilingual dictionary. Let me explain the
word. The lemma is required in the later
complexity of lexical substitution with data
stage for searching the word in the wordnet.
nominal compounds are strictly restricted to
be two consecutive noun construction type.
harm (deaths, injuries, and property damage)
resulting from crashes of road vehicles‟
processing. These sentences are manually
translated into Hindi, and half of them are used as
sentence (a) and (b) are „border area‟ and
„road safety‟ respectively. All four words
can be used in more than one sense as given
4.2 Generation of Translation Templates
One of the most important subtasks in this
3 In order to execute this task we have used a JAVA based interface “Sanchay” that has been
developed in-house. Using an interface to do
5 E stands for English and H stands for Hindi
this task helped us to maintain consistency in
6 kA is a genitive marker in Hindi. It has variants
kI and ke. Therefore <H1 kA H2>, <H1 ke H2>
tagging the corpus of 1.7M words. It gave an
and <H1 kI H2> form three translation
Table 2: Number of Senses Listed in Wordnet
wordnet. Since that was not available to us,
For each sense there exists a synset which
we have maintained the following strategy.
We first acquire all possible translations for
consider all words for all senses of the
possible dictionary resources. Then we take
component nouns and attempt to translate all
of them using a bilingual dictionary the
translations to all English words of a synset,
number of translation candidates will be
if there is one. For example, we got the
following translations for the two synsets
searching for those candidates that are not
<„road‟, „route‟> from bilingual dictionaries:
relevant for the English NC in the given
context. In order to avoid the proliferation
of data, we have chosen to use a WSD tool.
al.) on our data for the purpose. This tool
Table 4: Translation using a bilingual
specifies the wordnet sense id for each noun
dictionary
component within NC as shown in table 3:
From table 4, we find that maarg, saDak, raastaa are common translations for „road‟
and „route‟. Once the Hindi equivalents are
translation candidates which are searched in
equivalent(s) is not found for all member
common translation is available. The worst
Table 3: Output of WSD tool
members of „border‟ as well as „safety‟ we
The third column of table 3 presents the
synset associated with the sense selected by
the WSD tool. Once the synsets are acquired
translations of all synset members one by
in this process the translation for each word
one for generating the translation templates.
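The template instantiation step can be illustrated with a minimal sketch. It assumes the template set consists of plain juxtaposition plus the three genitive variants (kA, kI, ke) named in the footnotes; the function name, the template strings, and the sample inputs are hypothetical and serve only as illustration, not as the system's actual implementation.

```python
from itertools import product

# Assumed template set: <H1 H2> plus the genitive variants <H1 kA/kI/ke H2>.
TEMPLATES = ["{h1} {h2}", "{h1} kA {h2}", "{h1} kI {h2}", "{h1} ke {h2}"]


def generate_candidates(modifier_equivalents, head_equivalents, templates=TEMPLATES):
    """Instantiate every template with every pair of Hindi equivalents.

    modifier_equivalents / head_equivalents are the dictionary translations
    collected for the synset members of the two component nouns.
    """
    return [
        template.format(h1=h1, h2=h2)
        for h1, h2 in product(modifier_equivalents, head_equivalents)
        for template in templates
    ]


# Illustrative input only: one equivalent per component noun.
candidates = generate_candidates(["dugdha"], ["utpAdana"])
```

Each string produced this way is a translation candidate to be looked up in the Hindi corpus; with several equivalents per noun the list grows multiplicatively, which is why restricting the equivalents to the selected sense matters.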
in the synset is obtained from a bilingual
dictionary. Once we look into a bilingual
dictionary, again we may come across many
Translation Candidates
equivalents of a word which do not match
the sense id selected for that word. For
We have performed the corpus search on a
example, the word „border‟ (a member of
Hindi indexed corpus of 28 million words.
the synset of „border‟) has one equivalent
For ranking, a reference ranking based on
jhaalar in the bilingual dictionary that is
the frequency of occurrence of the translation
used in the domain of „decoration‟ and not
candidates in full in the TL corpora is taken
„location‟. We would like to discard such
as baseline. To improve on the baseline, a
equivalents. Otherwise the whole attempt of
stronger ranking measure is borrowed from
using WSD tool on the source language side
Baldwin and Tanaka (2004). It rates a given
will be lost. The ideal situation would have
translation candidate according to corpus
been to have a mapping from the synset id
translation and its parts in the context of the
measure is called interpolated CTQ metric
The motivation for this approach is twofold:
that extracts the frequency counts from the
a) a word occurs mostly in its default sense
which is listed as the first sense in any
lexicon; b) if the input word is not available
in a bilingual dictionary for substitution, a
$$\mathrm{CTQ}(w_1^H, w_2^H, t) = \alpha\, p(w_1^H, w_2^H, t) + \beta\, p(w_1^H, t)\, p(w_2^H, t)\, p(t)$$
synset gives us other equivalent words. This increases the robustness of the system. The
third method is the one we have adopted for
the present task – using a WSD tool on the
of occurrence of template t with w1 and w2
appropriate sense of the given word in that
as its instances and βp(w1H , t)p(w2H ,
context. The purpose of trying out various
t)p(t) is the probability of occurrence of
translation template t with w1 as its instance
at one time multiplied by the probability of
brings in any improvement to the overall
occurrence of translation template t with w2
performance of the translator tool. The table
as its instance at another time multiplied by
below shows that it does. The pre-processed
the occurrence of translation template t.
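The interpolated CTQ score can be sketched as follows. This is a minimal illustration, assuming that each probability is estimated as a relative frequency over a single normalising count; the weights alpha and beta, the function names, and the count layout are hypothetical choices, with alpha larger than beta only because the first term is given higher priority.

```python
def ctq(n_full, n_w1, n_w2, n_t, n_total, alpha=0.9, beta=0.1):
    """Interpolated CTQ score computed from raw corpus counts.

    n_full  : occurrences of template t with both w1 and w2 as its instances
    n_w1    : occurrences of t with w1 as an instance
    n_w2    : occurrences of t with w2 as an instance
    n_t     : occurrences of t with any instances
    n_total : normalising count (assumed here: all template matches in the corpus)
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p_full = n_full / n_total
    p_w1 = n_w1 / n_total
    p_w2 = n_w2 / n_total
    p_t = n_t / n_total
    # First term: the whole candidate; second term: back-off to its parts.
    return alpha * p_full + beta * p_w1 * p_w2 * p_t


def rank_candidates(counts, n_total, alpha=0.9, beta=0.1):
    """Return candidate names sorted by descending CTQ.

    counts maps a candidate name to the tuple (n_full, n_w1, n_w2, n_t).
    """
    scores = {
        name: ctq(*values, n_total, alpha=alpha, beta=beta)
        for name, values in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions, a candidate that never occurs in full in the corpus still receives a non-zero score when its parts occur with the template, which is what distinguishes this measure from the baseline frequency ranking.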
Naturally the first term will be given higher
substitution is not humanly analyzed data
priority than the second term. The result
but is actually obtained as the output of
presented in the next section will show that
the incorporation of frequency of occurrence
of βp(w1H , t)p(w2H , t) has distinctly
has produced 80% accurate case for nominal
compound disambiguation7. The results of corpus search of the translation candidates
are given in the following two tables. The
5.0 Result and Analysis
baseline frequency model performs in the
various experiments performed as part of
improvement in performance as we go from
baseline ranking method to CTQ method of
nominal compounds into Hindi equivalents
and the result obtained for each method is presented at table 1 and table 2. As part of
the first method we have not done any word
words of source language NC; on the contrary we have straightaway used the
Table 5: Ranking using Baseline Frequency
Hindi equivalents. For the second method,
7 It is interesting to note that the accuracy
been selected as default sense and all the
reported for the WordNet-SenseRelate output on
members of synset of the first sense have
general data is 58%. When we tested the tool for
been substituted using a bilingual dictionary.
nominal compound, it gave an accuracy of around 80% for the same.
6.0 Conclusion and Future Work
improved as shown in the following table:
This paper describes the architecture of a
translating English nominal compound into
translated into Hindi. However no clue is
available to determine which type of Hindi
have, therefore, adopted a corpus search
found out that adjectival templates are hard
from noun is a complex derivational process
in Hindi. It does not only involve attaching
an adjectival suffix on the noun but also
many a time requires a change in the vowel
The recall of this experiment was very low.
of the stem. In the present work, we have
includes the correct generation of adjectival
verify on the development data whether the
form from the modifier nouns so that correct
templates for „Adjective Noun‟ construct
corpus search can legitimately be translated
as a genitive construct. We found that the
approach is that a translation if it exists in
the corpus will never be missed. Therefore
Therefore we incorporated this as a default
accuracy of translation will depend largely
translation case for our system. Whenever a
corpus search for a translation candidate
searched for the translation candidates.
fails, we assign a genitive translation for that
nominal compound. This results in a steep
7.0 References
improvement in recall although the precision
falls down a little. We ran the experiment
substitution methods. The result is reported
Helmut Schmid. 1994. Probabilistic Part-of-
International Conference on New Methods in Language Processing. Manchester, UK.
Lou Burnard. 2000. User Reference Guide for the British National Corpus. Technical
Table 7: Ranking after inclusion of default translation (X kA Y, X kI Y, X ke Y as
Pierrette Bouillon, Katharina Boesefeldt,
templates)
nouns in a unification-based MT system. In Proc. of the 4th Conference on Applied Natural Language Processing (ANLP), Stuttgart, Germany.
Siddharth Patwardhan, Satanjeev Banerjee and
SenseRelate::TargetWord – A Generalized Framework for Word Sense Disambiguation. Proceedings of the ACL Interactive Poster and Demonstration Sessions, Ann Arbor, MI
Sparck Jones, K. 1983. "So what about parsing compound nouns?," in Automatic Natural Language Processing, K. Sparck Jones and Y. A. Wilks, eds., Ellis Horwood, Chichester, 164--168.
Su Nam Kim, Timothy Baldwin: Automatic Interpretation of Noun Compounds Using WordNet Similarity. IJCNLP 2005: 945-956
interpretation of nominal compounds. In Proc. of the 1st Conference on Artificial Intelligence (AAAI-80).
Timothy Baldwin and Takaaki Tanaka. 2004. Translation by Machine of Complex Nominals: Getting it right. In Proceedings of the ACL04 Workshop on Multiword Expressions:
Tanaka, Takaaki and Timothy Baldwin. 2003b. Translation Selection for Japanese-English
Proceedings of Machine Translation Summit IX, New Orleans, LA, USA.
Ulrike Rackow, Ido Dagan, Ulrike Schwall. 1992. Automatic Translation of Noun Compounds. COLING 1992, 1249-1253
Zouhair Maalej, English-Arabic Machine Translation of Nominal Compounds, in Proceedings of the Workshop on Compound Nouns: Multilingual Aspects of Nominal Composition. Geneva: ISSCO, pp. 135–146, 1994.