AN ELECTRONIC ARCHIVE OF THE BULGARIAN DIALECTS

(based on folklore sources)

    Aim of the project.

    The aim of the project is the creation of an electronic archive of Bulgarian dialectal texts from folklore sources. This archive will be used to study the dialectal riches of the Bulgarian language with the tools of computational linguistics and on different language levels: phonetic, morphological, syntactic and lexical. In Bulgaria, work on the creation of such archives is of relatively recent date and only covers the literary language. Because of its specific character (the spontaneous development of the dialectal system, together with a lack of conscious standardization), the Bulgarian dialectal material closely reflects phenomena typical of the historical development of the Bulgarian language: traces of a Balkan substratum, relation to the Old Bulgarian language, similarities with and differences from other Slavonic languages, effects of language contacts within the Balkan peninsula. This is particularly true of the language of folklore, which is to a very great degree preserved from literary influence. That is why the language of folklore is an important source of linguistic investigation, even though it has a supradialectal character, inasmuch as it reflects peculiarities of wider dialectal areas and of specific dialects rather than the specific peculiarities of the dialects of separate settlements (cf. Кочев Ив. Многоаспектност на проблема за диалектното. – Език и литература, 1979, кн. 1, 55-60). Folklore reflects those peculiarities of folk speech which are related to folk mytho-poetic consciousness. It is thus possible to find, in the language of folklore and especially folk poetry, words and forms which appeared in the processes of folk etymology. Due to the phonetic and semantic assimilation of words and forms, the texts preserve ancient forms and patterns which are unknown to common speech and specific variants which, if traced back, could give valuable information on the people's cultural history.

    As a result of the investigation of the dialectal material with the methods of computational linguistics, new data will be presented on major problems of Bulgarian linguistics – the development towards analyticity, the changes in the word order and syntactic structure of a language, the development of the category of definiteness, the development of the semantic categories in the system of the Bulgarian verb. The archive will also be useful for the study of problems of ethnolinguistics and folklore.

    Methods of work.

    Recent developments in the area of computational corpus linguistics are used for the storage and processing of the electronic archive of Bulgarian dialects. Two software products are used: the package of language processing tools “The Linguist’s Workbench” and the CLaRK system. The Workbench was developed at the Linguistic Modelling Laboratory and the Institute for the Bulgarian Language at the Bulgarian Academy of Sciences and funded by the Bulgarian National Science Foundation and the “Open Society” Foundation, Sofia.

     1. The Workbench consists of the following interrelated programmes:

     BUILD – an indexing programme allowing the creation of alphabetical and frequency word lists and supplying information on the statistical characteristics of a text.
     LEM/POS – a programme for semi-automatic lemmatization and tagging.
     CONC – a concordancer displaying the co-occurrence patterns of word-forms and lemmas with 10 environment positions accessible for alphabetic or frequency sorting.
     TREE – a programme supporting the manual construction of treebanks and their automatic querying.
     MIX – a programme for semi-automatic alignment of parallel texts allowing the alignment and viewing of translation equivalents or different text versions in the same language.
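
     To make the first and third of these functions concrete, here is a minimal Python sketch of a frequency word list and a keyword-in-context concordance. It is not the Workbench's own code, which is not shown here; the function names, the regex-based tokenisation and the placeholder English sentence are all assumptions made for illustration, as is the choice to split the environment into an equal number of positions on each side of the keyword.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def alphabetical_list(tokens):
    """Return the distinct word forms in alphabetical order."""
    return sorted(set(tokens))


def concordance(tokens, keyword, window=5):
    """Return (left context, keyword, right context) for every occurrence.

    `window` is the number of tokens kept on each side (hypothetical default).
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    # Placeholder sentence, not taken from the archive.
    sample = tokenize("The sun rose and the moon set and the stars shone")
    for word, count in frequency_list(sample):
        print(f"{count:3d}  {word}")
    for left, key, right in concordance(sample, "moon", window=2):
        print(" ".join(left), f"[{key}]", " ".join(right))
```

     Because `\w` matches Unicode letters in Python 3, the same sketch works unchanged on Cyrillic text.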

    2. CLaRK is an XML-based software system for corpora development.

     It incorporates several technologies: 1) XML technology; 2) Unicode; 3) Regular Cascade Grammars; 4) Constraints over XML Documents.

     For document management, storing and querying, we chose XML technology because it is widely used and easy to understand. The core of CLaRK is an XML Editor, which is the main interface to the system. Besides the XML language itself, we implemented an XPath language for navigation in documents and an XSLT language for transformation of XML documents.
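
     As an illustration of XPath navigation over a corpus document, the following sketch uses Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. It does not use CLaRK's own XPath engine, and the element names (`corpus`, `text`, `line`, `w`), the attributes and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus fragment; the markup scheme is an assumption.
SAMPLE = """
<corpus>
  <text id="t1">
    <line n="1"><w>first</w><w>sample</w></line>
    <line n="2"><w>second</w><w>sample</w><w>line</w></line>
  </text>
</corpus>
"""


def words_in_line(root, n):
    """Return the word forms of the line whose @n attribute equals n."""
    return [w.text for w in root.findall(f".//line[@n='{n}']/w")]


def count_lines(root, text_id):
    """Count the <line> elements inside the <text> with the given @id."""
    return len(root.findall(f".//text[@id='{text_id}']/line"))


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(words_in_line(root, 2))
    print(count_lines(root, "t1"))
```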

     For multilingual processing tasks, CLaRK is based on a Unicode encoding of the information inside the system. There is a mechanism for the creation of a hierarchy of tokenisers. Tokenisers can be attached to the elements in the DTDs, so that different parts of a document are processed by different tokenisers.

     The basic mechanism of CLaRK for linguistic processing of text corpora is the cascade regular grammar processor. The main challenge for the grammars in question is how to apply them to an XML encoding of the linguistic information. The system offers a solution using an XPath language for constructing the input word to the grammar and an XML encoding of the categories of the recognised words.
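
     The general idea of a cascade can be sketched in Python as follows. This is not CLaRK's grammar formalism or processor: the rule syntax, the category names (`A`, `N`, `P`, `V`, `NP`, `PP`), the `pos` attribute and the placeholder sentence are assumptions. The sketch builds the "input word" as a string of category names read from sibling elements, matches ordinary regular expressions against it, and wraps each recognised span in a new XML element, so that the output of one level becomes the input of the next.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical tagged sentence (no whitespace between elements).
SAMPLE = (
    '<s><w pos="P">in</w><w pos="A">green</w><w pos="N">forest</w>'
    '<w pos="V">sings</w><w pos="N">bird</w></s>'
)

# Each level is a list of (new category, regex over space-terminated categories).
CASCADE = [
    [("NP", r"(?:A )*N ")],  # level 1: adjectives followed by a noun
    [("PP", r"P NP ")],      # level 2: preposition followed by a level-1 NP
]


def category(node):
    """Words carry their tag in @pos; recognised chunks use the element name."""
    return node.get("pos") if node.tag == "w" else node.tag


def apply_rule(nodes, new_cat, pattern):
    """Wrap every non-overlapping match of `pattern` in a <new_cat> element."""
    word = "".join(category(n) + " " for n in nodes)
    result, consumed = [], 0
    # The lookbehind forces matches to start at a category boundary.
    for m in re.finditer(r"(?<!\S)(?:" + pattern + ")", word):
        if not m.group():
            continue
        start = word.count(" ", 0, m.start())  # categories before the match
        length = m.group().count(" ")          # categories inside the match
        result.extend(nodes[consumed:start])
        chunk = ET.Element(new_cat)
        chunk.extend(nodes[start:start + length])
        result.append(chunk)
        consumed = start + length
    result.extend(nodes[consumed:])
    return result


def run_cascade(nodes, cascade):
    """Apply the levels in order; each level sees the previous level's output."""
    for level in cascade:
        for new_cat, pattern in level:
            nodes = apply_rule(nodes, new_cat, pattern)
    return nodes


if __name__ == "__main__":
    sentence = ET.fromstring(SAMPLE)
    parsed = ET.Element("s")
    parsed.extend(run_cascade(list(sentence), CASCADE))
    print(ET.tostring(parsed, encoding="unicode"))
```

     Keeping the recognised categories as element names means that a later level can refer to `NP` exactly as it refers to a word-level tag.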

     Several mechanisms for imposing constraints over XML documents are available. These constraints cannot be stated with standard XML technology. The following types of constraints are implemented in CLaRK: 1) finite-state constraints - additional constraints over the content of given elements based on a document context; 2) number restriction constraints - cardinality constraints over the content of a document; 3) value constraints - restriction of the possible content or parent of an element in a document based on a context. The constraints are used in two modes: checking the validity of a document against a set of constraints, and supporting the linguist in his/her work during the building of a corpus. The first mode allows the creation of constraints for the validation of a corpus according to given requirements. The second mode serves the underlying strategy of minimising human labour.
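
     The validation mode for two of these constraint types can be sketched as below. The sketch does not reproduce CLaRK's constraint language; the function names, the element names (`text`, `line`, `w`, `form`, `pos`) and the tag set are hypothetical. The first check is a simple cardinality (number restriction) test and the second a restriction on the allowed content of an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second <line> is empty, one <pos> value is unknown.
SAMPLE = """
<text>
  <line><w><form>forest</form><pos>N</pos></w></line>
  <line></line>
  <line><w><form>sings</form><pos>X</pos></w></line>
</text>
"""


def check_cardinality(root, parent_path, child_tag, minimum, maximum=None):
    """Report parents whose number of <child_tag> children is out of range."""
    errors = []
    for parent in root.findall(parent_path):
        n = len(parent.findall(child_tag))
        if n < minimum or (maximum is not None and n > maximum):
            errors.append(f"<{parent.tag}> has {n} <{child_tag}> children")
    return errors


def check_values(root, path, allowed):
    """Report elements whose text content is not in the allowed set."""
    errors = []
    for element in root.findall(path):
        value = (element.text or "").strip()
        if value not in allowed:
            errors.append(f"<{element.tag}> has unexpected content {value!r}")
    return errors


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    problems = check_cardinality(root, ".//line", "w", minimum=1)
    problems += check_values(root, ".//pos", {"N", "V", "A", "P"})
    for problem in problems:
        print(problem)
```

     In the second, supporting mode the same checks could be run while the linguist edits, for example to offer only the allowed values for an element, but that interaction is not modelled here.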

    The data of the electronic archive will be gradually increased and expanded; it will incorporate texts from different areas of the Bulgarian language territory.

    The creation of the electronic archive of the Bulgarian dialects will be supported by historical and etymological research. This is necessary because the dialectal material forms a territorial variant of the national language, and the peculiarities it has developed can be compared to the system of the modern Bulgarian literary language only through etymological investigation and by setting out the phonetic, morphological and other types of change that have taken place.

    Sponsors and participants in the project.

    The project is carried out with funding from the “Open Society” Foundation, Sofia, by a team from the Institute for the Bulgarian Language (IBL) and the Central Laboratory for Parallel Processing (CLPP) at the Bulgarian Academy of Sciences: Senior researcher Dr. Luchia Antonova-Vasileva (IBL), Senior researcher Dr. Maria Stambolieva (IBL), Research associate Kiril Ivanov Simov (CLPP). Under the supervision of K. Simov, two students of information science also work on the project: Alexander Dimitrov Simov and Milen Ognianov Kujlekov (Faculty of Mathematics and Information Science, Sofia University “St. Kliment Ohridski”).

   Stages of the Project.

    In 2001, work on the Electronic Archive of the Bulgarian Language centred on the automatic processing of the texts from the “Veda Slovena” collection, which is written in a dialect from the region of Gotse Delchev and Drama. The publication first appeared in 1874: Веда Словена. Български народни песни от предисторична и предхристиянска доба. Открил в Тракия и Македония и издал Стефан И. Веркович. кн. I, 1874, Београд; кн. II, 1881, С. Петербург. Although it has attracted special interest in research and cultural circles, this collection, which contains Bulgarian epic songs with historical and ritual subject matter that are close to folklore in nature and form, was republished in Bulgarian only in 1997, by the “Open Society” Foundation.