AN ELECTRONIC ARCHIVE OF THE
BULGARIAN DIALECTS
(based on folklore sources)
Aim of the project.
The aim of the project is the creation of an electronic archive of
Bulgarian dialectal texts from folklore sources. This archive will be used to study
the dialectal wealth of the Bulgarian language with the tools of computational linguistics
at different linguistic levels – phonetic, morphological, syntactic and lexical. In
Bulgaria, work on the creation of such archives began relatively recently and so far
covers only the literary language. Because of its specific character (the spontaneous
development of the dialectal system, together with a lack of conscious standardization),
the Bulgarian dialectal material closely reflects phenomena typical of the historical
development of the Bulgarian language: traces of a Balkan substratum, relation to the Old
Bulgarian language, similarities with and differences from other Slavonic languages,
effects of language contacts within the Balkan peninsula. This is particularly true of the
language of folklore, which has to a very great degree been shielded from literary influence.
That is why the language of folklore, even though it has a supradialectal character –
inasmuch as it reflects peculiarities of wider dialectal areas and of specific dialects
rather than the peculiarities of the dialects of individual settlements – is an
important source of linguistic investigation (Cf. Кочев Ив.
Многоаспектност на проблема за диалектното. –
Език и литература, 1979, кн. 1, 55-60). Folklore reflects those
peculiarities of folk speech which are related to folk mytho-poetic consciousness. It is
thus possible to find, in the language of folklore and especially folk poetry, words and
forms which arose through folk etymology. Owing to the phonetic and semantic
assimilation of words and forms, the texts preserve ancient forms and patterns that are
unknown in everyday speech, as well as specific variants which, if traced back, could give
valuable information on the people's cultural history.
As a result of the investigation of the dialectal material with the
methods of computational linguistics, new data will be presented on major problems of
Bulgarian linguistics – the development towards analyticity, changes in word
order and syntactic structure, the development of the category of
definiteness, the development of the semantic categories in the system of the Bulgarian
verb. The archive will also be useful for the study of problems of ethnolinguistics and
folklore.
Methods of work.
Recent developments in the area of computational corpus linguistics
are used for the storage and processing of the electronic archive of Bulgarian dialects.
Two software products are used:
1. “The Linguist’s Workbench”, a package of language processing tools developed at the
Linguistic Modelling Laboratory and the Institute for the Bulgarian Language at the
Bulgarian Academy of Sciences and funded by the Bulgarian National Science Foundation and
the “Open Society Foundation”, Sofia. The Workbench consists of the following interrelated
programmes:
BUILD – an indexing programme allowing the creation of alphabetical and frequency word lists and supplying information on the statistical characteristics of a text.
LEM/POS – a programme for semi-automatic lemmatization and tagging.
CONC – a concordancer displaying the co-occurrence patterns of word-forms and lemmas with 10 environment positions accessible for alphabetic or frequency sorting.
TREE – a programme supporting the manual construction of treebanks and their automatic querying.
MIX – a programme for semi-automatic alignment of parallel texts allowing the alignment and viewing of translation equivalents or different text versions in the same language.
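The kind of output that an indexing programme and a concordancer produce can be illustrated with a minimal Python sketch. It is not the Workbench's own code: the sample text, the function names and the `width` and `sort_pos` parameters are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical sample text; it is not taken from the archive.
TEXT = "the sun rose and the moon set and the stars shone"


def tokenize(text):
    """Split text into lowercase word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def word_lists(tokens):
    """Return (alphabetical list, frequency list) of word forms with counts."""
    counts = Counter(tokens)
    alphabetical = sorted(counts.items())
    # Descending frequency; ties are broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_frequency


def concordance(tokens, keyword, width=5, sort_pos=1):
    """Return keyword-in-context rows sorted on one context position.

    width    -- number of tokens kept on each side of the keyword
    sort_pos -- position used for sorting: +1 is the first token to the
                right of the keyword, -1 the first token to the left, etc.
    """
    if sort_pos == 0:
        raise ValueError("sort_pos must be a non-zero offset")
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, tok, right))

    def sort_key(row):
        left, _, right = row
        # Left context is reversed so that index 0 is the nearest token.
        ctx = right if sort_pos > 0 else left[::-1]
        idx = abs(sort_pos) - 1
        return ctx[idx] if idx < len(ctx) else ""

    return sorted(rows, key=sort_key)


tokens = tokenize(TEXT)
alphabetical, by_frequency = word_lists(tokens)
rows = concordance(tokens, "the", width=2, sort_pos=1)
for left, kw, right in rows:
    print(" ".join(left).rjust(12), f"[{kw}]", " ".join(right))
```

Sorting on a chosen context position is what lets recurring patterns to the left or right of a word-form line up visually; the same idea extends to lemmas once the text has been lemmatized.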
2. CLaRK is an XML-based software system for corpus development.
It incorporates several technologies: 1) XML technology; 2)
Unicode; 3) cascaded regular grammars; 4) constraints over XML documents.
For document management, storing and querying,
we chose XML technology because of its popularity and ease of understanding. The
core of CLaRK is an XML editor, which is the main interface to the system. Besides the XML
language itself, we implemented the XPath language for navigation in documents and the XSLT
language for transformation of XML documents.
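As an illustration of XPath-style navigation over an XML-encoded text, the following sketch uses Python's standard `xml.etree.ElementTree` module, which supports only a subset of XPath. It does not use CLaRK itself, and the element and attribute names (`archive`, `song`, `line`, `w`, `pos`, `region`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are illustrative only.
DOC = """
<archive>
  <song id="s1" region="area-1">
    <line n="1"><w pos="N">sun</w><w pos="V">rose</w></line>
    <line n="2"><w pos="N">moon</w><w pos="V">set</w></line>
  </song>
  <song id="s2" region="area-2">
    <line n="1"><w pos="N">star</w><w pos="V">shone</w></line>
  </song>
</archive>
"""

root = ET.fromstring(DOC)

# All tokens tagged as nouns, anywhere in the document.
nouns = [w.text for w in root.findall(".//w[@pos='N']")]

# The lines of one song, selected by an attribute value.
s1_lines = root.findall("./song[@id='s1']/line")

# Regions of the songs that contain a line numbered 2.
regions = [
    song.get("region")
    for song in root.findall("./song")
    if song.find("line[@n='2']") is not None
]

print(nouns, len(s1_lines), regions)
```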
For multilingual processing tasks, CLaRK is
based on a Unicode encoding of the information inside the system. There is a mechanism
for the creation of a hierarchy of tokenisers. They can be attached to the elements in the
DTDs, so that different parts of a document can have different tokenisers.
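The idea of element-specific tokenisers can be sketched as follows. This is a simplified model and not CLaRK's actual mechanism: it assumes that a tokeniser is a regular expression, that it is attached to an element by tag name, and that elements without their own tokeniser inherit the one of their parent. All names in the sketch are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document; element names are illustrative only.
DOC = ET.fromstring(
    "<text><title>Song No. 12</title>"
    "<verse>sun-rise, o sun!</verse><note>p. 12</note></text>"
)

# Hypothetical tokenisers: every regex match is one token.
TOKENISERS = {
    # Words and single punctuation marks.
    "default": re.compile(r"\w+|[^\w\s]"),
    # Same, but hyphenated words are kept together.
    "verse": re.compile(r"\w+(?:-\w+)*|[^\w\s]"),
}

# Tokeniser attached to each element type; unlisted elements inherit
# the tokeniser that applies to their parent.
ATTACHED = {"text": "default", "verse": "verse"}


def tokenise(elem, inherited="default"):
    """Return (tag, tokens) pairs for every element with leading text.

    Only the text directly inside an element is tokenised; tail text
    after child elements is ignored in this sketch.
    """
    name = ATTACHED.get(elem.tag, inherited)
    result = []
    if elem.text and elem.text.strip():
        result.append((elem.tag, TOKENISERS[name].findall(elem.text)))
    for child in elem:
        result.extend(tokenise(child, name))
    return result


tokens_by_element = tokenise(DOC)
for tag, toks in tokens_by_element:
    print(tag, toks)
```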
The basic mechanism of CLaRK for linguistic
processing of text corpora is the cascaded regular grammar processor. The main challenge for
the grammars in question is how to apply them to the XML encoding of the linguistic
information. The system offers a solution that uses the XPath language to construct the
input word for the grammar and an XML encoding of the categories of the recognised words.
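The general technique of applying a cascade of regular grammars to XML-encoded tokens can be sketched in Python. This is an illustration under stated assumptions, not CLaRK's processor: each token is a `w` element with a one-letter `pos` attribute, the input word is the string of those letters, each stage is an ordinary regular expression over that string, and recognised spans are wrapped in a new element. The tags `NP` and `PP` and the symbols `G` and `H` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sentence; the pos values are assumed to be single letters.
SENT = ET.fromstring(
    '<s><w pos="P">in</w><w pos="A">green</w><w pos="N">forest</w>'
    '<w pos="V">sings</w><w pos="A">small</w><w pos="N">bird</w></s>'
)

# One-letter symbols for elements built by earlier stages.
SYMBOL = {"NP": "G", "PP": "H"}

# The cascade: (new element tag, pattern over category symbols).
STAGES = [
    ("NP", r"A*N"),  # adjectives followed by a noun
    ("PP", r"PG"),   # preposition followed by an NP from stage 1
]


def category(elem):
    """Category symbol of a child: its pos value or the symbol of its tag."""
    return elem.get("pos") if elem.tag == "w" else SYMBOL[elem.tag]


def apply_stage(parent, tag, pattern):
    """Wrap every span of children matched by pattern in a new <tag> element."""
    children = list(parent)
    # One symbol per child, so regex offsets are child indices.
    word = "".join(category(c) for c in children)
    rebuilt, pos = [], 0
    for m in re.finditer(pattern, word):
        if m.end() == m.start():
            continue  # ignore empty matches
        rebuilt.extend(children[pos:m.start()])
        group = ET.Element(tag)
        group.extend(children[m.start():m.end()])
        rebuilt.append(group)
        pos = m.end()
    rebuilt.extend(children[pos:])
    parent[:] = rebuilt


for tag, pattern in STAGES:
    apply_stage(SENT, tag, pattern)

print(ET.tostring(SENT, encoding="unicode"))
```

Because each stage sees the output of the previous one, later rules can refer to categories that did not exist in the original markup, which is what makes the grammar a cascade.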
Several mechanisms for imposing constraints
over XML documents are available. These constraints cannot be stated in standard XML
technology. The following types of constraints are implemented in CLaRK: 1) finite-state
constraints – additional constraints over the content of given elements, based on the
document context; 2) number restriction constraints – cardinality constraints over the
content of a document; 3) value constraints – restrictions on the possible content or
parent of an element in a document, based on a context. The constraints are used in two
modes: 1) checking the validity of a document against a set of constraints; 2) supporting
the linguist's work while a corpus is being built. The first mode allows the
creation of constraints for validating a corpus according to given requirements.
The second mode supports the underlying strategy of minimising human labour.
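The validation mode can be illustrated with a small Python sketch of two of the constraint types, a number restriction and a value constraint. It does not reproduce CLaRK's constraint language; the document, the tagset in `ALLOWED_POS` and the function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus fragment; names and values are illustrative only.
DOC = ET.fromstring(
    "<corpus>"
    "<s id='1'><w pos='N'>sun</w><w pos='V'>rose</w></s>"
    "<s id='2'><w pos='X'>moon</w></s>"
    "<s id='3'></s>"
    "</corpus>"
)

ALLOWED_POS = {"N", "V", "A", "P"}  # hypothetical tagset


def check_min_children(root, parent_tag, child_tag, minimum):
    """Number restriction: each <parent_tag> needs >= minimum <child_tag>.

    Returns the id attributes of the parents that violate the constraint.
    """
    return [
        p.get("id")
        for p in root.iter(parent_tag)
        if len(p.findall(child_tag)) < minimum
    ]


def check_values(root, tag, attr, allowed):
    """Value constraint: the attr of every <tag> must be in allowed.

    Returns (text, value) pairs for the offending elements.
    """
    return [
        (e.text, e.get(attr))
        for e in root.iter(tag)
        if e.get(attr) not in allowed
    ]


empty_sentences = check_min_children(DOC, "s", "w", 1)
bad_tags = check_values(DOC, "w", "pos", ALLOWED_POS)
print(empty_sentences, bad_tags)
```

In an interactive setting the same checks could be run while the annotator works, so that violations are flagged as soon as they are introduced rather than after the corpus is complete.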
The data of the electronic archive will be gradually increased and
expanded; it will incorporate texts from different areas of the Bulgarian language
territory.
The creation of the electronic archive of the Bulgarian dialects
will be supported by historical and etymological research. This is necessitated by the
specific nature of the dialectal material: it forms a territorial variant of the national
language in which peculiarities have developed that can be compared to the system of the
modern Bulgarian literary language only on the basis of etymological investigation and a
description of the phonetic, morphological and other types of change that have taken
place.
Sponsors and participants in the project.
The project is carried out with funding from the “Open Society”
Foundation, Sofia, by a team from the Institute for the Bulgarian Language (IBL) and
the Central Laboratory for Parallel Processing (CLPP) at the Bulgarian Academy of Sciences:
Senior researcher Dr. Luchia Antonova-Vasileva (IBL), Senior researcher Dr. Maria
Stambolieva (IBL), Research associate Kiril Ivanov Simov (CLPP). Under the supervision of
K. Simov, two students of information science also work on the project – Alexander
Dimitrov Simov and Milen Ognianov Kujlekov (Faculty of Mathematics and Information
Science, Sofia University “St. Kliment Ohridski”).
Stages of the Project.
In the year 2001 work on the Electronic Archive of the Bulgarian
Language was centered round the automatic processing of the texts from the “Veda
Slovena” collection – a collection written in a dialect from the region of Gotse
Delchev and Drama. The publication first appeared in 1874: Веда Словена.
Български народни песни от предисторична и
предхристиянска доба. Открил в Тракия и
Македония и издал Стефан И. Веркович. кн. I, 1874,
Београд; кн. II, 1881, С. Петербург. Although it has been the object of
special interest on the part of research and cultural circles, this collection,
containing Bulgarian epic songs which have a historical and ritual subject-matter and
which are close to folklore in their nature and form, was only republished in Bulgarian in
1997, by the “Open Society” Foundation.