
The main objective of the
BulTreeBank Project was to create a high-quality set of syntactic
structures of Bulgarian sentences within the framework of HPSG.
Ideally, the treebank should contain samples of all the
syntactic structures of the language. These sentences should
serve as templates for future corpora development, could become
the basis for a more comprehensive test suite
for NLP applications, and can be used as a source for grammar
extraction and for linguistic research. Within the project we also
continued the development of the CLaRK
System, an XML-based system for corpora development.
The BulTreeBank Project was
based at the Linguistic Modelling
Laboratory (LML), Institute
for Parallel Processing, Bulgarian Academy of Sciences. The
project was funded by the Volkswagen Stiftung,
Federal Republic of Germany, under the programme
"Cooperation with Natural and Engineering Scientists in
Central and Eastern Europe". The work on the project
was carried out mainly at LML in close cooperation with
researchers at the Seminar für
Sprachwissenschaft (SfS), Eberhard-Karls-Universität,
Tübingen, Germany. The people responsible for the project
were Prof. Dr. Erhard Hinrichs from SfS
and Kiril Simov from LML.
A description of the BulTreeBank project
New book
Petya Osenova, Kiril Simov. Formal Grammar of the Bulgarian Language (Формална граматика на българския език). Institute for Parallel Processing, Bulgarian Academy of Sciences. Sofia, 18 December 2007.
Here is a first draft of Petya Osenova's habilitation thesis (in Bulgarian): Bulgarian Noun Phrases in HPSG. Any comments are welcome.
Job opportunity:
We are looking for young people to work on new international projects. Main requirements: programming in Java, English, and a desire to do research in natural language processing and ontologies (Semantic Web). Those interested should write to: Kiril Simov.
The core of CLaRK is an XML editor, which is
the main interface to the system. Besides the XML language itself, we
implemented the XPath language for navigation in documents and the XSLT language
for transformation of XML documents. CLaRK uses a Unicode encoding for
the information inside the system. The basic mechanism of CLaRK for linguistic
processing of text corpora is the cascaded regular grammar processor. Several
mechanisms for imposing constraints over XML documents are also available; these
constraints cannot be stated with standard XML technology.
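As a rough illustration of the cascaded regular grammar idea mentioned above, the following Python sketch applies a sequence of regular-expression rules to part-of-speech tags, so that each level groups the output of the previous level into larger phrases. It is not CLaRK's implementation: the tag names, the rules, the sample tokens, and the function names are all hypothetical, and CLaRK itself works on XML documents, which this sketch does not attempt to model.

```python
import re

# Hypothetical cascade: each level is a list of (phrase label, pattern) rules.
# Patterns are regular expressions over labels, each label followed by one space.
CASCADE = [
    [("NP", r"(?:Det )?(?:Adj )*N ")],   # level 1: noun phrases
    [("PP", r"Prep NP ")],               # level 2: prepositional phrases
    [("VP", r"V (?:NP |PP )*")],         # level 3: verb phrases
]

def apply_rule(nodes, label, pattern):
    """Group every non-overlapping match of the pattern into a new node."""
    seq = "".join(name + " " for name, _ in nodes)
    # The lookbehind makes sure a match starts at the beginning of a label.
    regex = re.compile(r"(?<!\S)(?:" + pattern + ")")
    grouped, pos = [], 0
    for match in regex.finditer(seq):
        if match.start() == match.end():
            continue  # ignore empty matches
        # Each label ends with one space, so counting spaces gives node indices.
        start = seq.count(" ", 0, match.start())
        end = seq.count(" ", 0, match.end())
        grouped.extend(nodes[pos:start])
        grouped.append((label, nodes[start:end]))
        pos = end
    grouped.extend(nodes[pos:])
    return grouped

def parse(tokens, cascade=CASCADE):
    """tokens: list of (tag, word) pairs; returns the list of top-level nodes."""
    nodes = list(tokens)
    for level in cascade:
        for label, pattern in level:
            nodes = apply_rule(nodes, label, pattern)
    return nodes

def bracket(node):
    """Render a node as a bracketed string, leaves as tag:word."""
    label, body = node
    if isinstance(body, str):
        return f"{label}:{body}"
    return "[" + label + " " + " ".join(bracket(child) for child in body) + "]"

if __name__ == "__main__":
    # Placeholder English tokens, used only to keep the example readable.
    sample = [("Det", "the"), ("Adj", "old"), ("N", "man"), ("V", "reads"),
              ("Det", "a"), ("N", "book"), ("Prep", "in"), ("N", "Sofia")]
    print(" ".join(bracket(node) for node in parse(sample)))
```

Because every level only sees the labels produced by the levels before it, the rules stay simple regular expressions while the resulting structure is nested.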
Project Members
Kiril Simov | Principal researcher in Computational Linguistics and Represented Knowledge - CLaRK (Project leader)
Petya Osenova | Researcher in Syntax, Semantics, Corpus Linguistics
Sia Kolkovska | Researcher in Syntax, Lexicography
Elisaveta Balabanova | Doctoral student in Formal Linguistics, Syntax
Alexander Simov | Research associate in Computer Science and Computational Linguistics
Milen Kouylekov | PhD student in information and communication technology in Trento, Italy
Krasimira Ivanova | Research associate in Computer Science and Computational Linguistics
Dimitar Doikoff | Master's student in Computational Linguistics
Ilko Grigorov | Master's student in Computer Science
Hristo Ganev | Master's student in Computer Science
We would like to thank Gergana Popova (PhD student, Department of Language
and Linguistics, University of Essex) and Atanas Kiryakov
(Head of OntoText Lab.) for their help in
preparing the project proposal.
We would like to thank Dr. Frank Richter from SfS for helping us better
understand the latest trends in HPSG during his visit to the Linguistic Modelling
Laboratory at the beginning of the BulTreeBank Project.
Results
- A set of Bulgarian sentences marked up with detailed syntactic information. These sentences are mainly extracted from authentic Bulgarian texts. They are chosen according to two criteria. First, they cover the variety of syntactic structures of Bulgarian. Second, they show the statistical distribution of these phenomena in real texts.
- Inside the TreeBank a core set of sentences is designated. The purpose of this core set is to serve as a test suite for software applications that process Bulgarian texts at the level of syntactic description.
- A reliable partial grammar for automatic parsing of phrases in Bulgarian. This grammar was extensively tested and used during the creation of the TreeBank. It can be used as a module separate from the TreeBank in tasks that require only partial parsing of natural language texts, such as Information Retrieval, Information Extraction, and Data Mining from Texts.
- Software modules for compiling, manipulating and exploring
the data in the TreeBank. This software supports both
the creation of the TreeBank, and its use for different
purposes such as automatic extraction of grammars for
Bulgarian.
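To give a sense of what automatic grammar extraction from a treebank can look like, the Python sketch below counts phrase-structure rules in a tree encoded as XML. It is only an illustration under stated assumptions: the element names and the sample sentence are hypothetical and do not reflect the actual BulTreeBank annotation scheme, and the project's own software is not shown here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XML encoding of one syntactic tree (not the BulTreeBank schema).
SAMPLE = (
    "<S>"
    "<NP><Det>the</Det><N>dog</N></NP>"
    "<VP><V>sees</V><NP><N>cats</N></NP></VP>"
    "</S>"
)

def extract_rules(xml_text):
    """Count rules of the form parent -> child1 child2 ... in one tree."""
    root = ET.fromstring(xml_text)
    rules = Counter()
    for node in root.iter():
        children = tuple(child.tag for child in node)
        if children:  # skip leaves, which contain only the word itself
            rules[(node.tag, children)] += 1
    return rules

if __name__ == "__main__":
    for (parent, children), count in sorted(extract_rules(SAMPLE).items()):
        print(f"{parent} -> {' '.join(children)}  ({count})")
```

Summing such counts over all trees in a treebank gives rule frequencies, which is one common starting point for building a grammar from annotated data.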
Contacts
Kiril Simov
BulTreeBank Project
Linguistic Modelling Laboratory, IPP,
Bulgarian Academy of Sciences
Acad. G.Bonchev St. 25A
1113 Sofia, Bulgaria
Tel: (+359 2) 979 2825
Fax: (+359 2) 870 72 73
www.BulTreeBank.org