
The main objective of the BulTreeBank Project is to create a
high-quality set of syntactic structures of Bulgarian sentences
within the framework of HPSG. Ideally, the tree bank should contain
samples of all the syntactic structures of the language. These
sentences should serve as templates for future corpus development,
as the basis for a more comprehensive test suite for NLP
applications, and as a source for grammar extraction and linguistic
research. The development of such a tree bank is inextricably linked
to the development of a formal grammar and a parser for Bulgarian.
We propose to write two grammars for Bulgarian, neither of them
exhaustive, but each in our view a necessary stage on the way to a
complete formal grammar of the language: one imposing very general
constraints on syntactic structure and one for partial parsing.
Within the project we will also continue the development of the
CLaRK system, an XML-based system for corpus development.
The BulTreeBank Project is based at the Linguistic Modelling
Laboratory (LML), Institute for Parallel Processing, Bulgarian
Academy of Sciences. The project is funded by the Volkswagen
Stiftung, Federal Republic of Germany, under the Programme
"Cooperation with Natural and Engineering Scientists in Central and
Eastern Europe". The work on the project will be carried out mainly
at LML in close cooperation with researchers at the Seminar für
Sprachwissenschaft (SfS), Eberhard-Karls-Universität, Tübingen,
Germany. The people responsible for the project are Prof. Dr. Erhard
Hinrichs from SfS and Kiril Simov from LML.
New book
Petya Osenova, Kiril Simov.
Formal Grammar of the Bulgarian Language (Формална граматика на българския език). Institute for Parallel Processing, Bulgarian Academy of Sciences. Sofia, 18 December 2007.
Job opportunity:
We are looking for young people to work on new international projects. Main requirements: Java programming, English, and a willingness to do research in natural language processing and ontologies (Semantic Web). Those interested should write to: Kiril Simov.
The core of CLaRK is an XML editor, which is the main interface to
the system. Besides the XML language itself, we have implemented an
XPath language for navigation in documents and an XSLT language for
transformation of XML documents. All information inside the system
is encoded in Unicode. The basic mechanism of CLaRK for linguistic
processing of text corpora is the cascaded regular grammar
processor. Several mechanisms are available for imposing constraints
over XML documents, covering constraints that cannot be stated with
standard XML technology.
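The idea of a cascaded regular grammar working over an XML document can be illustrated with a small sketch. This is not CLaRK's implementation or its grammar format: the element names `s` and `w`, the `pos` attribute, the one-letter tags, the two rules, and the English toy sentence are all assumptions made for illustration. It uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: one sentence whose tokens carry a part-of-speech tag.
SENTENCE = (
    "<s><w pos='D'>the</w><w pos='A'>old</w><w pos='N'>parser</w>"
    "<w pos='V'>reads</w><w pos='D'>a</w><w pos='N'>file</w></s>"
)

# Each level is a list of (label, pattern) rules. A pattern is a regular
# expression over category names, each name followed by one space.
# Later levels may refer to labels built by earlier levels (the cascade).
CASCADE = [
    [("NP", r"(?:D )?(?:A )*N ")],   # level 1: simple noun phrases
    [("VP", r"V (?:NP )?")],         # level 2: verb plus optional NP
]


def category(elem):
    """Tokens are categorised by their pos attribute, phrases by their tag."""
    return elem.get("pos", elem.tag)


def apply_rule(parent, label, pattern):
    """Wrap each left-to-right match of pattern among parent's children."""
    children = list(parent)
    cats = "".join(category(c) + " " for c in children)
    # The lookbehind forces a match to begin at the start of a category name.
    regex = re.compile(r"(?<![^ ])(?:" + pattern + ")")
    result, copied = [], 0
    for m in regex.finditer(cats):
        # Counting spaces converts string offsets back to child indices.
        first = cats.count(" ", 0, m.start())
        last = cats.count(" ", 0, m.end())
        if first == last:
            continue  # ignore empty matches
        result.extend(children[copied:first])
        chunk = ET.Element(label)
        chunk.extend(children[first:last])
        result.append(chunk)
        copied = last
    result.extend(children[copied:])
    parent[:] = result


def run_cascade(xml_text, cascade=CASCADE):
    """Parse the sentence and apply the grammar levels in order."""
    root = ET.fromstring(xml_text)
    for level in cascade:
        for label, pattern in level:
            apply_rule(root, label, pattern)
    return root


def phrase_text(elem):
    """Join the token texts below an element."""
    return " ".join(w.text for w in elem.iter("w"))


if __name__ == "__main__":
    tree = run_cascade(SENTENCE)
    print(ET.tostring(tree, encoding="unicode"))
    # XPath-style navigation: every NP at any depth.
    for np in tree.findall(".//NP"):
        print(phrase_text(np))
```

Each level sees only the categories produced so far, so the second rule can refer to `NP` without knowing how noun phrases were recognised. That separation is what makes a cascade convenient for partial parsing: levels can be added or tested one at a time.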
Project Members
- Kiril Simov: Principal researcher in Computational Linguistics and Represented Knowledge - CLaRK (Project leader)
- Petya Osenova: Researcher in Syntax, Semantics, Corpus Linguistics
- Sia Kolkovska: Researcher in Syntax, Lexicography
- Elisaveta Balabanova: Doctoral student in Formal Linguistics, Syntax
- Alexander Simov: Research associate in Computer Science and Computational Linguistics
- Milen Kouylekov: PhD student in information and communication technology in Trento, Italy
- Krasimira Ivanova: Research associate in Computer Science and Computational Linguistics
- Dimitar Doikoff: Master's student in Computational Linguistics
- Ilko Grigorov: Master's student in Computer Science
- Hristo Ganev: Master's student in Computer Science
We would like to thank Gergana Popova (PhD student, Department of Language
and Linguistics, University of Essex) and Atanas Kiryakov
(Head of OntoText Lab.) for their help in
preparing the project proposal.
We would like to thank Dr. Frank Richter from SfS for helping us better
understand the latest trends in HPSG during his visit to the Linguistic Modelling
Laboratory at the beginning of the BulTreeBank Project.
Expected Results
- A set of Bulgarian sentences marked up with detailed syntactic
information. These sentences will be extracted mainly from authentic Bulgarian texts. They
will be chosen with regard to two criteria. First, they will have to cover the variety of
syntactic structures of Bulgarian. Second, they will show the statistical distribution of
these phenomena in real texts.
- Inside the TreeBank a core set of sentences will be designated. The purpose of this core
set is to serve as a test suite for software applications that process Bulgarian texts
at the level of syntactic description.
- A reliable partial grammar for automatic parsing of phrases in Bulgarian. This grammar
will be extensively tested and used during the creation of the TreeBank. It will also be
usable as a module separate from the TreeBank in tasks that require only partial parsing of
natural language texts, such as Information Retrieval, Information Extraction, and Data
Mining from Texts.
- Software modules for compiling, manipulating and exploring
the data in the TreeBank. This software will support both
the creation of the TreeBank, and its use for different
purposes such as automatic extraction of grammars for
Bulgarian.
Contacts
Kiril Simov
BulTreeBank Project
Linguistic Modelling Laboratory, IPP,
Bulgarian Academy of Sciences
Acad. G.Bonchev St. 25A
1113 Sofia, Bulgaria
Tel: (+359 2) 979 2825
Fax: (+359 2) 870 72 73
www.BulTreeBank.org