Paris 2011

Session 1 - Problems and Solutions on Baltic Shores / Problématiques et exemples sur les rivages de la Baltique

Daiva Šveikauskienė

Treebank of the Lithuanian Language


Abstract

A study of the state of Lithuanian machine translation shows that a statistical machine translation system cannot yet be built, because the required amount of parallel text is lacking. The situation of Lithuanian is thus similar to that of English at the beginning of the computer era: the available computational resources are insufficient. The best way forward is therefore the one taken for English 50 years ago: to create a rule-based machine translation system. The planned Treebank will support the development of a well-functioning automatic syntactic analyser, which rule-based machine translation requires.

Full text/Texte intégral

Introduction

I would like to begin by thanking the organizers of the Conference for the attention paid to the Baltic States. It is a pleasure that the state with the best machine translation in the world takes care of its small neighbouring states and organizes a conference that will help raise the level of computerization of the Baltic languages and thereby contribute to their survival. My sincere thanks to the organizers for this conference.

This paper describes the work scheduled for the near future at the Institute of the Lithuanian Language. We plan to create a Treebank of the Lithuanian language by annotating a text corpus of 2 million words in the field of journalism. We decided to make our Treebank freely available on the internet, and as a result we were forced to limit the range of text topics: the Treebank will not include fiction. In the future, however, we plan to have a syntactically annotated corpus of fiction too.

State of computerization of the Lithuanian language

According to data from the META-NET project, Lithuanian is among the least computerized languages of Europe. The book “The Lithuanian Language in the Digital Age” assesses 30 European languages on four criteria: speech processing, machine translation, text analysis, and language and text resources. On all four, Lithuanian is placed in the group of the least computerized languages [Vaišnienė, Zabarskaitė 2012: 74].

This chapter briefly describes the resources available for the Lithuanian language. Machine translation systems play a very important role in language computerization, because they are a means and a guarantee of communication with other countries. Machine translation is therefore described in more detail.

Available Lithuanian language resources

The available resources include a corpus of 140 million words collected at Vytautas Magnus University in Kaunas. 1 million words are morphologically annotated by hand and 140 million automatically. Unfortunately, the annotated corpus is not freely available; the reason for this restriction is copyright.

There is no syntactically annotated corpus for the Lithuanian language.

Vytautas Magnus University has parallel corpora of four types [1]:

  • English-Lithuanian (70 813 parallel sentences),

  • Lithuanian-English (1 614 parallel sentences),

  • Czech-Lithuanian (4 881 parallel sentences),

  • Lithuanian-Czech (693 parallel sentences).

The spoken language corpus consists of 10 hours of audio recordings.

The Institute of the Lithuanian Language has developed an electronic dictionary of the Lithuanian language, a geoinformation database [2], a database of neologisms, and other resources.

The website of Vytautas Magnus University provides a freely available automatic morphological annotator for the Lithuanian language. Software has also been created for determining text functions [3]. Six functions are distinguished: spontaneous expressiveness, narrative, directive, non-spontaneous expressiveness, appellative, and descriptiveness.

Two machine translation systems have been developed for Lithuanian. One is based on the statistical method and was created by TILDE IT; the other uses the rule-based method and was prepared at Vytautas Magnus University.

Statistical machine translation

The quality of the Lithuanian statistical machine translation system is unsatisfactory. The Lithuanian-English translation system was created by TILDE IT. It is difficult for Lithuanians to assess the quality of translations into English, because we do not know English well enough to see all the errors. Leščinskas [2012: 28] describes the method of TILDE IT and notes:

There are similarities with the Google translation system, both are based on the statistical models of translation.

Since the Google translation system uses the same method and translates in both directions, we can assess the results of its work instead. Its translation can be described in one word: “bad”. Albrektas [2010] notes:

Translation models can operate on phrases and on more complex structures (for example, syntax trees).

However, the translation is bad for short phrases and for sentences alike, whether short or long. For example, a phrase translated from English into Lithuanian gave the following result:

skambinti jam melagis

(to phone to him the liar)

The English source phrase was:

calling him liar

The example sentences in the rest of this article are taken from a website [4]. The translation of a short sentence was as follows:

Plutonas buvo nustatyta, kad tik vienas mažas orgamas tūkstančių gyventojų.

(The Pluto it was determined, that only one small organ of the thousands of inhabitants)

The source sentence was:

Pluto was found to be just one small body in a population of thousands.

Consequently, a larger number of words and more context do not improve the quality of the translation. Translating very long sentences gives similar results. For example, the translation of an English sentence of 39 words was no more understandable:

Šios schemos, kurios buvo pagrįstos, o ne į babiloniečių aritmetinis geometrijos, galiausiai užtemimas į babiloniečių teorijos, sudėtingumą ir išsamumą, bei paskyrų dauguma astronominių judėjimų pastebėtų žemės plika akimi.

(These schemes, which were based, but not into arithmetical of the geometry of Babylonians, ultimately the eclipse into the of theory of Babylonians, complexity and completeness, and majority of allocation of astronomical movements observed with the bald eye of the Earth.)

The English source sentence is:

These schemes, which were based on geometry rather than the arithmetic of the Babylonians, would eventually eclipse the Babylonians’ theories in complexity and comprehensiveness, and account for most of the astronomical movements observed from Earth with the naked eye.

Nor does the clarity of the translation increase when two adjacent sentences from a coherent text are translated together. For example, the Lithuanian translation was as follows:

Kai kurie iš jų, įskaitant Kvavaras, Sedna ir eris pranašauja populiariąją spaudą, kaip dešimtą planetoje, jei jo nėra, tačiau gauti plačiai mokslinį pripažinimą. Eris pranešimas 2005 objektas 27%masyvesnė už Plutoną, sukurta reikalingumą ir visuomenės noras su oficialiu vienos planetos apibrėžimo.

(Some of them, including Quaoar, Sedna and eris herald the popular press, as the tenth in the planet, if he is not, however to receive widespread the scientifical acknowledgment. The Eris the message 2005, the object 27% she is more massive than Pluto, it is created, the need, and the wish of the society with official of one planet of definition.)

This translation was derived from the following English sentences:

Some of them, including Quaoar, Sedna, and Eris were heralded in the popular press as the tenth planet, failing however to receive widespread scientific recognition. The announcement of Eris in 2005, an object 27% more massive than Pluto, created the necessity and public desire for an official definition of a planet.

It is clear that such translations cannot be used for serious work. Translations made by the Google system are best suited to a puzzle book with the task: who will be the first to guess which sentence was translated?

Rule-based machine translation

Vytautas Magnus University has a rule-based machine translation system. It was created for English-Russian translation and later adapted to the Lithuanian language. Its results are no better than those of statistical machine translation. Its performance can be illustrated with the same examples used in the previous section. This time the translation of the short sentence is as follows:

Buvo manyta, kad plutonas buvo tik vienas mažas kūnas tūkstančių gyventojuose.

(It was believed, that Pluto was only one small body in the inhabitants of thousands.)

This translation was obtained from the sentence:

Pluto was found to be just one small body in a population of thousands.

When translating the phrase

calling him liar

the result was as follows:

paklausimas jo melagis

(the inquiry his the liar)

The translation of the long, 39-word sentence was as follows:

Šitie planai, kurie buvo pagrįsti geometrija, o ne babiloniečių arithmetika, galų gale užtemdys babiloniečių teorijas sunkume ir visapusume, ir sudarys daugumą astronominių judėjimų, laikytų nuo žemės plika akimi.

(These plans, which were based on the geometry, and not on arithmetic of Babylonians, in the end it will obscure the theories of Babylonians in the heaviness and in the comprehensevity, and it will form the majority of astronomical movements, which were hold from Earth with the bald eye.)

For reference, the original sentence was:

These schemes, which were based on geometry rather than the arithmetic of the Babylonians, would eventually eclipse the Babylonians’ theories in complexity and comprehensiveness, and account for most of the astronomical movements observed from Earth with the naked eye.

It can thus be concluded that the rule-based machine translation system does not work satisfactorily. Here the main reason for the bad results is the lack of opportunities to upgrade the system. We have only two ways to improve the translation: expanding the main vocabulary and extending the special vocabulary. In principle we could change the rules of the translation system, but we do not have specialists who know the system well enough to change the rules and so improve the translation results.

Planned work

This review of the machine translation systems for Lithuanian leads to the conclusion that neither statistical nor rule-based machine translation gives satisfactory results. As for the statistical method, the situation of Lithuanian today is what it was for English when Weaver suggested the use of statistical methods: that approach was abandoned because of the dearth of machine-readable texts [Brown et al. 1990]. Today we lack parallel texts and must take the path chosen for English 50 years ago: to create a rule-based machine translation system. The existing rule-based system works badly. Better results could be obtained if a system were developed specifically for Lithuanian, rather than adapting to it systems designed for other languages. With a system developed in-house, we would be able to revise and improve it.

Treebank

The Institute of the Lithuanian Language plans to annotate a text corpus of 2 million words. The corpus will be annotated semi-automatically, using software for the automatic syntactic analysis of Lithuanian that was created at the Institute of Mathematics and Informatics. During annotation the software will be improved by taking its mistakes into account. The most important task now is to prepare a well-functioning automatic syntactic analyser for Lithuanian, which can be used in creating a rule-based machine translation system: in the transfer method, syntactic analysis is the second step. The syntactic annotation of the corpus can be expected to help improve the automatic syntactic analysis. In annotating the corpus, the intention is to provide as much information relevant to translation as possible. The following annotation format will be used:

  • Serial number of the word in the sentence;

  • The word form used in the sentence;

  • Morphological data about the word form (gender, number, tense and so on);

  • Lemma;

  • Lexical-semantic features that can determine the syntactic function of the word;

  • Syntactic function of the word;

  • Direct syntactic relations of the word;

  • Deep cases;

  • Restored parts of the sentence that were omitted (a subject expressed by a first- or second-person pronoun, and the copula of the predicate);

  • Substitute for the pronoun;

  • Restored missing words in elliptical sentences;

  • Stylistic information for words, which belong to a certain style;

  • Mistakes and the suggested correct version.
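As a rough illustration, the thirteen items of the annotation format can be pictured as one record per word. The following is a minimal Python sketch, not the project's actual storage format: the class name, the field names and types, and all sample values (including the syntactic labels) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class TokenAnnotation:
    """One annotated word; every field name here is hypothetical."""
    index: int                                  # serial number of the word in the sentence
    form: str                                   # word form used in the sentence
    morphology: dict                            # gender, number, tense and so on
    lemma: str
    lexical_semantics: list = field(default_factory=list)  # features that can determine the function
    function: Optional[str] = None              # syntactic function of the word
    relations: dict = field(default_factory=dict)  # index of related word -> relation type
    deep_case: Optional[str] = None
    restored_part: bool = False                 # omitted subject or copula that was restored
    pronoun_substitute: Optional[str] = None    # what a pronoun stands for
    restored_ellipsis: bool = False             # missing word restored in an elliptical sentence
    style: Optional[str] = None                 # stylistic information, if any
    correction: Optional[str] = None            # suggested correct version of a mistake


# Illustrative values for the word "kūnas" (body), the ninth word of one of
# the example sentences in this paper; the labels are assumptions.
token = TokenAnnotation(
    index=9,
    form="kūnas",
    morphology={"gender": "masculine", "number": "singular", "case": "nominative"},
    lemma="kūnas",
    function="predicative",
    relations={5: "predicative"},
)

print(asdict(token)["morphology"]["case"])
```

Keeping optional items as `None` or `False` by default means that most words only need the first few fields filled in, while restored words, pronouns and mistakes carry the extra information.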

All the information listed above will be stored electronically. The syntactically annotated corpus will be freely available on the internet. The syntactic structure will be displayed as a tree with the words in the nodes and their syntactic functions as node labels. The type of syntactic relation will be given as the label of the arrow. By clicking on a word, the user can see the information listed above.
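The display described here, with words in nodes, syntactic functions as node labels and relation types on the arrows, can be sketched as a small data structure. This Python sketch is only an assumption about how such a tree might be held in memory; the class names, the text rendering, and the labels given to the phrase “vienas mažas kūnas” (one small body) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    word: str
    function: str   # syntactic function, shown as the node label


@dataclass(frozen=True)
class Arc:
    head: Node
    dependent: Node
    relation: str   # type of syntactic relation, shown as the arrow label


def render(arcs):
    """Return one text line per arc: head --relation--> dependent."""
    return [
        f"{a.head.word} [{a.head.function}] --{a.relation}--> "
        f"{a.dependent.word} [{a.dependent.function}]"
        for a in arcs
    ]


# Hypothetical labels for the phrase "vienas mažas kūnas" (one small body).
body = Node("kūnas", "predicative")
arcs = [
    Arc(body, Node("vienas", "attribute"), "agreement"),
    Arc(body, Node("mažas", "attribute"), "agreement"),
]

lines = render(arcs)
for line in lines:
    print(line)
```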

Machine translation systems

An assessment of the current situation suggests that rule-based machine translation is the more promising approach for Lithuanian. Adapting machine translation systems created for other languages did not give satisfactory results. It is therefore necessary to create a rule-based machine translation system for the Lithuanian language. One possible way is to join a multilingual system, for example PLAIN [Hellwig 1988], and create the Lithuanian part of that system. The other way is to create a new rule-based machine translation system.

Language differences and problems

When a translated sentence is not understandable, the translation quality is unacceptable. But sometimes the resulting sentence is understandable and correct in meaning, yet not fluent. High-quality translation must take this into account, and we must see what can be done to improve translation quality in this respect. Even limited experience of translating sentences from English into Lithuanian reveals typical cases that leave the stamp of the foreign language on our sentences. Rimkutė [2008: 6], describing English-Lithuanian rule-based machine translation, notes:

When analysing machine translation errors and shortcomings, the influence of the source language on the target language is one of the most obtrusive things. Here it is the influence of English on Lithuanian. One example is the use of nominal phrases instead of verb phrases.

Other examples are as follows. The Lithuanian language is characterised by very frequent use of diminutives, which is not typical of other languages; in German, diminutives are very rare. A specific feature of German is the use of compounds made up of many words, for example “Schreibmaschinenpapierblatt”. In Lithuanian, and in English too, compounds of more than two words are uncommon. Hutchins and Somers [1982: 37] describe the translation of prepositions and verb government. German has “blicken auf” and “weisen auf” where English has “look at” and “point to”; German has “arbeiten an” and “glauben an” where English has “work at” and “believe in”. In Lithuanian, adjectives govern too: the adjective “panašus” (looks like) requires the preposition “į” + accusative as its complement, whereas the German adjective “ähnlich” requires the dative without a preposition.
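The verb government examples above suggest treating a verb and its preposition as a single translation unit. The following Python sketch illustrates the idea under stated assumptions: the pair table contains only the four German-English examples from the text, while the word-by-word tables (in particular “auf” → “on” and “an” → “at”) and the function names are hypothetical and serve only as a contrast.

```python
# Verb + preposition pairs taken from the examples in the text.
GOVERNMENT = {
    ("blicken", "auf"): ("look", "at"),
    ("weisen", "auf"): ("point", "to"),
    ("arbeiten", "an"): ("work", "at"),
    ("glauben", "an"): ("believe", "in"),
}

# Hypothetical single-word dictionaries, used only for contrast:
# each preposition gets one fixed translation regardless of the verb.
VERBS = {"blicken": "look", "weisen": "point", "arbeiten": "work", "glauben": "believe"}
PREPOSITIONS = {"auf": "on", "an": "at"}


def word_by_word(verb, prep):
    """Translate the verb and the preposition independently."""
    return VERBS[verb], PREPOSITIONS[prep]


def translate_pair(verb, prep):
    """Translate the pair as one unit; fall back to word-by-word if unknown."""
    return GOVERNMENT.get((verb, prep)) or word_by_word(verb, prep)


for verb, prep in GOVERNMENT:
    print(verb, prep, "->", word_by_word(verb, prep), "vs", translate_pair(verb, prep))
```

In this toy setting the word-by-word lookup happens to get “arbeiten an” right but yields “believe at” for “glauben an”, whereas the pair lookup returns the governed preposition in every case.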

The means to improve the quality of translation

Once a translation that is correct in meaning has been obtained, it would be very useful to adjust the translated sentence stylistically:

  • Nominal phrases that are uncharacteristic of Lithuanian must be replaced with verb phrases;

  • Several nouns that function syntactically as attributes must be replaced by a compound;

  • German compounds for which Lithuanian has no equivalent must be replaced by a sequence of several words acting as attributes;

  • Diminutive suffixes must be dropped when translating into German, because they are not characteristic of it;

  • Prepositions must be translated together with the verb;

  • It would be very useful to translate two nodes and one arc of the dependency tree together.

Conclusions

  • The amount of parallel text required for statistical machine translation of Lithuanian does not exist.

  • The situation of Lithuanian is the same as that of English at the beginning of the computer era: we do not have a sufficient amount of parallel texts. We must therefore do now what was done for English 50 years ago: create rule-based machine translation.

  • A rule-based machine translation system needs automatic syntactic analysis. Annotating the corpus will help create it, because the semi-automatic method makes it possible to improve the program by taking its mistakes into account.

  • Machine translation systems created for other languages and adapted to Lithuanian work badly, and we have no opportunity to improve them.

  • It is necessary to develop our own rule-based machine translation system, created for the Lithuanian language, so that we can improve it.


Bibliography

Albrektas, Tomas (2010), « Apie statistinį mašininį vertimą – About the statistical machine translation »: http://blog.lituanika.lt/2010/02/apie-statistini-masinini-vertima.html

Brown, Peter, Cocke, John, Della Pietra, Stephen (1990), « A Statistical Approach to Machine Translation » in Computational Linguistics, Volume 16, Number 2, June, pp. 79-85

Hellwig, Peter (1988), « Chart parsing according to the slot and filler principle » in COLING’88: Proceedings of the 12th Conference on Computational Linguistics, Volume 1, pp. 242-244

Hutchins, John, Somers, Harold (1982), « An Introduction to Machine Translation ». London: Academic Press.

Leščinskas, Liutauras (2012), « Maironio mašinos dar nevers, bet… – The machines will not translate Maironis yet, but… » in Verslo klasė, June, pp. 28-30.

Rimkutė, Erika (2008), « Mašininis vertimas: finišo tiesiosios dar nematyti – Machine translation : the last lap is still invisible »

Vaišnienė, Daiva, Zabarskaitė, Jolanta (2012), « The Lithuanian Language in the digital age » META-NET White paper series, Rehm, Georg, Uszkoreit, Hans (editors). Berlin, Heidelberg : Springer-Verlag.

Footnotes/Notes

1 http://donelaitis.vdu.lt/main.php?id=4&nr=1_2

2 http://lvvgdb.lki.lt/vietovardziai/Default.aspx?pid=1

3 http://donelaitis.vdu.lt/main.php?id=4&nr=8

4 http://en.wikipedia.org/wiki/Planet

To cite this document/Pour citer ce document

Daiva Šveikauskienė, «Treebank of the Lithuanian Language», Tralogy [Online], Tralogy II, Session 1 - Problems and Solutions on Baltic Shores / Problématiques et exemples sur les rivages de la Baltique, updated on: 21/05/2014, URL: http://lodel.irevues.inist.fr/tralogy/index.php?id=210

About the author: Daiva Šveikauskienė

Institute of Lithuanian Language
daiva.sveikauskiene@lki.lt