HPSG-based Syntactic Treebank of Bulgarian

The main objective of the BulTreeBank Project was to create a high-quality set of syntactic structures of Bulgarian sentences within the framework of HPSG. Ideally, the treebank should contain samples of all the syntactic structures of the language. These sentences should serve as templates for future corpus development, could become the basis for a more comprehensive test suite for NLP applications, and can be used as a source for grammar extraction and for linguistic research. Within the project we also continued the development of the CLaRK system, an XML-based system for corpora development.

The BulTreeBank Project was based at the Linguistic Modelling Laboratory (LML), Institute for Parallel Processing, Bulgarian Academy of Sciences. The project was funded by the Volkswagen Stiftung, Federal Republic of Germany, under the Programme "Cooperation with Natural and Engineering Scientists in Central and Eastern Europe". The work on the project was carried out mainly at LML in close cooperation with researchers at the Seminar für Sprachwissenschaft (SfS), Eberhard-Karls-Universität, Tübingen, Germany. The people responsible for the project were Prof. Dr. Erhard Hinrichs from SfS and Kiril Simov from LML.

A description of the BulTreeBank project

Petya Osenova, Kiril Simov. Формална граматика на българския език (Formal Grammar of the Bulgarian Language). Institute for Parallel Processing, Bulgarian Academy of Sciences. Sofia, 18 December 2007.

Here is a first draft of Petya Osenova's habilitation thesis (in Bulgarian): Bulgarian Noun Phrases in HPSG (original title: Именните фрази в българския език (с оглед на Опорната фразова граматика)). Any comments are welcome.

CLaRK system - XML-based system for corpora development

The core of CLaRK is an XML editor, which is the main interface to the system. Besides the XML language itself, we implemented an XPath language for navigation in documents and an XSLT language for transformation of XML documents. CLaRK uses Unicode to encode the information inside the system. The basic mechanism of CLaRK for linguistic processing of text corpora is the cascaded regular grammar processor. Several mechanisms are also available for imposing constraints over XML documents that cannot be stated with standard XML technology.
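
To make the idea of a cascaded regular grammar over XML more concrete, here is a minimal Python sketch. It is not CLaRK's grammar formalism or API: the category names (D, A, N, V, P), the two-level cascade, and the helper functions are all assumptions made for illustration. Each level applies regular expressions over the sequence of element names and wraps every match in a new phrase element, so later levels can build on the output of earlier ones. The result is then queried with a simple XPath expression.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical cascade (not CLaRK's rule syntax). Each level is a list of
# (phrase label, regex over space-terminated category names).
CASCADE = [
    [("NP", r"(?:D )?(?:A )*N ")],  # level 1: simple noun phrases
    [("PP", r"P NP ")],             # level 2: prepositional phrases over level-1 output
]

def apply_rule(nodes, label, pattern):
    """Wrap each run of sibling elements whose tags match `pattern` in a new element."""
    regex = re.compile(pattern)
    tags = "".join(node.tag + " " for node in nodes)
    # offsets[i] is the position in `tags` where the tag of nodes[i] starts.
    offsets = [0]
    for node in nodes:
        offsets.append(offsets[-1] + len(node.tag) + 1)
    index_at = {offset: i for i, offset in enumerate(offsets)}
    out, i = [], 0
    while i < len(nodes):
        match = regex.match(tags, offsets[i])
        if match and match.end() > offsets[i]:
            j = index_at[match.end()]
            phrase = ET.Element(label)
            phrase.extend(nodes[i:j])
            out.append(phrase)
            i = j
        else:
            out.append(nodes[i])
            i += 1
    return out

def parse(tagged_words):
    """Build one element per word, then run every level of the cascade in order."""
    nodes = []
    for word, tag in tagged_words:
        element = ET.Element(tag)
        element.text = word
        nodes.append(element)
    for level in CASCADE:
        for label, pattern in level:
            nodes = apply_rule(nodes, label, pattern)
    sentence = ET.Element("s")
    sentence.extend(nodes)
    return sentence

sentence = parse([
    ("the", "D"), ("old", "A"), ("man", "N"), ("sat", "V"),
    ("on", "P"), ("the", "D"), ("bench", "N"),
])
print(ET.tostring(sentence, encoding="unicode"))

# XPath-style navigation: collect the text of every noun phrase at any depth.
noun_phrases = [" ".join(np.itertext()) for np in sentence.findall(".//NP")]
print(noun_phrases)
```

In this sketch the second level can only recognise the prepositional phrase because the first level has already grouped "the bench" into an NP element, which is the point of running the grammars as a cascade.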

Technical Reports

Project Members

Kiril Simov

Principal researcher in Computational Linguistics and Represented Knowledge - CLaRK (Project leader)

Petya Osenova

Researcher in Syntax, Semantics, Corpus Linguistics

Sia Kolkovska

Researcher in Syntax, Lexicography

Elisaveta Balabanova

Doctoral student in Formal Linguistics, Syntax

Alexander Simov

Research associate in Computer Science and Computational Linguistics

Milen Kouylekov

PhD student in information and communication technology in Trento, Italy

Krasimira Ivanova

Research associate in Computer Science and Computational Linguistics

Dimitar Doikoff

Master's student in Computational Linguistics

Ilko Grigorov

Master's student in Computer Science

Hristo Ganev

Master's student in Computer Science

We would like to thank Gergana Popova (PhD student, Department of Language and Linguistics, University of Essex) and Atanas Kiryakov (Head of OntoText Lab.) for their help in preparing the project proposal.

We would like to thank Dr. Frank Richter from SfS for helping us better understand the latest trends in HPSG during his visit to the Linguistic Modelling Laboratory at the beginning of the BulTreeBank Project.


  • A set of Bulgarian sentences marked up with detailed syntactic information. These sentences are mainly extracted from authentic Bulgarian texts. They are chosen according to two criteria. First, they cover the variety of syntactic structures of Bulgarian. Second, they show the statistical distribution of these structures in real texts.
  • Inside the TreeBank, a core set of sentences is designated. The purpose of this core set is to serve as a test suite for software applications that process Bulgarian texts at the level of syntactic description.
  • A reliable partial grammar for automatic parsing of phrases in Bulgarian. This grammar is extensively tested and used during the creation of the TreeBank. It can be used as a module separate from the TreeBank in tasks that require only partial parsing of natural language texts, such as Information Retrieval, Information Extraction, and Data Mining from Texts.
  • Software modules for compiling, manipulating and exploring the data in the TreeBank. This software supports both the creation of the TreeBank and its use for different purposes, such as automatic extraction of grammars for Bulgarian.


