Methodology
The main goal of the BulTreeBank project is to develop a high-quality set (TreeBank) of syntactic trees for Bulgarian within the framework of Head-driven Phrase Structure Grammar (HPSG); see (Pollard and Sag 1994). The term syntactic tree is used here for convenience; in fact, the actual syntactic structure for each sentence (or phrase) will be a graph, in accordance with the common view of linguistic objects in HPSG. Our hope is to demonstrate the varieties of syntactic patterns in Bulgarian more exhaustively than has been done so far, and within a contemporary linguistic theory, HPSG.
The descriptions of the linguistic information in the TreeBank will be very detailed in order to demonstrate the information flow in the syntactic structure of the sentences in the TreeBank.
An annotation scheme usually has to be theory independent in order to allow different interpretations of the tagged texts in different linguistic frameworks. We think, however, that at a certain level of granularity we will have to exploit some linguistic descriptions that are theory dependent (as mentioned above, the linguistic descriptions in the BulTreeBank will be very detailed in order to demonstrate the information flow in the syntactic structure). We choose HPSG for the following reasons:
The Logical Formalism
We not only choose HPSG as the linguistic theory within which we will explicate the syntactic structure of Bulgarian texts, but go a step further and choose the actual formalism that we will use in the annotation process: namely, SRL augmented (in an RSRL style) with some relations common in HPSG. For SRL see (King 1989), (King 1994) and (King 1999); for RSRL see (Richter 1997), (Richter 1999), (Richter, Sailer and Penn 1998), (Richter, Sailer and Penn 1999). For the annotation we will use SRL descriptions called feature graphs, which correspond to clauses in a normal form of an SRL theory. In their complete form, feature graphs are morphs in the sense of (King 1994). Such detailed descriptions will be extremely useful in the future exploitation of the TreeBank, but they might be difficult to use in the annotation process. Here we hope to use the (special) inference mechanisms of SRL and some of the HPSG principles (universal and specific to Bulgarian) so that the annotator needs to provide only part of the required information, with the rest inferred automatically (see (King and Simov 1998) for a special inference for such purposes).
Partial Analyses and Predictions
As was mentioned in the previous paragraph, the detailed level of the envisaged syntactic descriptions will require entering a huge amount of linguistic information for each sentence. In order to minimize the necessary human intervention, we will exploit all possibilities to provide an automatic partial analysis of the input string before the actual annotation starts. We will also use the partial information entered by the annotator in order to predict or constrain the possible analyses in other parts of the whole description of the element. In this way we will exploit all the constraints available from pre-encoded grammars.
The development of an HPSG-based bank of syntactic trees will comprise the following tasks:
The linguistic knowledge represented in the sort hierarchy will be the backbone of the syntactic descriptions of the Bulgarian data. It will define all possible linguistic structures that will be further constrained by the grammar and/or the information entered by the annotators. The annotation schemata that will be employed during the project will allow for composite tag definitions. It will be possible to decompose each tag in the syntactic structure so that the grammatical information represented by the tag and its elements get distributed to the relevant substructures.
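The role of the sort hierarchy as a space of possible structures can be illustrated with a minimal sketch. The sort names below are hypothetical and only loosely modelled on common HPSG sorts; the project's actual hierarchy is not given in the text.

```python
# Hypothetical fragment of a sort hierarchy (not the project's actual one):
# each sort maps to its immediate supersort, the top sort maps to None.
SUPERSORT = {
    "sign": None,
    "word": "sign",
    "phrase": "sign",
    "headed-phrase": "phrase",
    "head-complement-phrase": "headed-phrase",
    "head-adjunct-phrase": "headed-phrase",
}


def subsumes(general, specific):
    """Return True if `general` is `specific` or one of its supersorts."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = SUPERSORT[current]
    return False


def maximal_subsorts(sort):
    """List the most specific sorts (leaves of the hierarchy) compatible with `sort`."""
    has_subsort = set(SUPERSORT.values())
    return sorted(
        s for s in SUPERSORT if s not in has_subsort and subsumes(sort, s)
    )
```

Under these assumptions, an annotator who has only decided that a node is a `phrase` leaves the system with the candidates returned by `maximal_subsorts("phrase")`, which further information can then narrow down.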
Several well-defined subgrammars are already under development. These subgrammars follow ideas from information extraction and are concerned with closed sets of expressions like numerical expressions, dates, names like “the city of Sofia” or “Prof. Petrov”. Each of these subgrammars will assign to each such expression an appropriate description that will allow the consequent incorporation of the expression within the whole syntactic structure of the sentence.
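As a rough illustration of what such a subgrammar does, the sketch below labels closed-class expressions with regular expressions. The patterns, labels, and English example are assumptions made for illustration only; the project's subgrammars are not shown in the text and would target Bulgarian.

```python
import re

# Hypothetical patterns for illustration; earlier entries take priority.
PATTERNS = [
    ("date", re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")),
    ("titled-name", re.compile(r"\bProf\. [A-Z][a-z]+")),
    ("city-name", re.compile(r"\bthe city of [A-Z][a-z]+")),
    ("number", re.compile(r"\b\d+(?:\.\d+)?\b")),
]


def find_expressions(text):
    """Return (start, end, label, matched text) tuples for non-overlapping matches.

    A match is kept only if it does not overlap a match found earlier,
    so patterns listed first win over later, more general ones.
    """
    found = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            if all(m.end() <= s or m.start() >= e for s, e, _, _ in found):
                found.append((m.start(), m.end(), label, m.group()))
    return sorted(found)
```

Each labelled span could then be treated as a single unit when the description of the whole sentence is built.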
The result of applying to each sentence the constraints encoded in the HPSG sort hierarchy, the HPSG principles, the morphological analyzers and the partial grammar could be considered similar to the supertag ideas in (Joshi and Srinivas 1994). The annotators will have access to all linguistic information relevant to a given sentence and will have to choose the right analyses where the information is ambiguous.
Example sentences will be extracted in the first instance from published Bulgarian grammars, which already give some analyses and provide a certain coverage of syntactic phenomena in Bulgarian. With respect to syntactic patterns that are not covered by the existing grammars we will proceed as follows: (1) we will first search for the respective patterns in real texts using some of the concordance tools we have already implemented in the CLaRK system; and (2) if we cannot find exemplary sentences we will construct artificial examples.
In this core set of sentences we will try to cover the following basic syntactic phenomena: sentence and clause types, complementation, agreement, modification, diathesis, modality, tense and aspect, word order, coordination, negation. These universal phenomena will be further detailed with respect to the language-specific characteristics of Bulgarian. Some of these are:
Bulgarian is a highly inflected, null-subject language. It allows object doubling and imposes special word order constraints on both pronominal and verbal clitics. It has a rich analytical verbal complex with discontinuous behavior. One of its peculiarities is a clitic-like definite article.
We can view this part of the TreeBank as a “draft” test-suite (see the TSNLP project (Oepen et al. 1994) and the Polish HPSG TreeBank (Marciniak et al. 1999)). Following the guidelines for test-suites, we will strive to give negative constraints and examples within the core set of sentences. These will also be used during the second phase of the annotation of the sentences. Whenever the sentences chosen turn out to be ambiguous, we will rank the readings according to their plausibility. At this stage of the development of the TreeBank the plausibility of an analysis will be judged by the annotators on the basis of their intuitions about the usability of the corresponding sentence or phrase. We hope that such plausibility information will turn out to be useful during the annotation of the base set of sentences as well.
Sentence extraction. Sentences will be chosen from the text database collected during the CLaRK Programme (this database is being constantly updated and enlarged). A description of this text archive is given below. We will use the concordancer developed under the CLaRK project. The concordance software will have access to the morphological information for the words in the texts and it will allow queries similar to, for example: “Find all sentences in which a personal pronoun is followed by a passive participle which is followed by a preposition.”
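A query of this kind can be sketched as a search for a sequence of tags over morphologically tagged sentences. The tag names (`PRON-PERS`, `PART-PASS`, `PREP`), the toy English data, and the reading of "followed by" as "immediately followed by" are all assumptions; the CLaRK concordancer's actual query language is not described here.

```python
def matches_sequence(tagged_sentence, tag_sequence):
    """Return True if the tags in `tag_sequence` occur on consecutive words."""
    tags = [tag for _, tag in tagged_sentence]
    n = len(tag_sequence)
    wanted = list(tag_sequence)
    return any(tags[i:i + n] == wanted for i in range(len(tags) - n + 1))


def concordance(corpus, tag_sequence):
    """Return every sentence of `corpus` that contains the tag sequence."""
    return [s for s in corpus if matches_sequence(s, tag_sequence)]


# Toy corpus with hypothetical tags: each sentence is a list of (word, tag) pairs.
corpus = [
    [("they", "PRON-PERS"), ("invited", "PART-PASS"), ("by", "PREP"), ("us", "PRON-PERS")],
    [("they", "PRON-PERS"), ("sleep", "VERB")],
]

hits = concordance(corpus, ("PRON-PERS", "PART-PASS", "PREP"))
```

Here `hits` contains only the first toy sentence, since the second has no passive participle or preposition.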
Automatic pre-processing. Each sentence to be annotated will be pre-processed so that the system can add linguistic information that potentially describes its syntactic structure. This pre-processing will include:
(i) Morphosyntactic tagging — each word will be marked up with the appropriate morphological information, namely: part of speech, gender, number, tense, person, transitivity, etc.;
(ii) Part-of-speech disambiguator — for each ambiguous word the most probable part-of-speech will be predicted;
(iii) Partial parsing — where possible minimal constituents will be identified;
(iv) Addition of syntactic information — on the basis of the results from the previous steps and the general information available from the type hierarchy and the universal grammar the possible syntactic information will be given.
Obviously, the precision of the information added with each step decreases. We can be quite certain of the accuracy of the analyses on the morphological level but the syntactic analyses will be incorrect in a large number of cases or will contain a high degree of ambiguity. Therefore, each consecutive step will require increased human intervention, with most of it needed on the syntactic level. The aim of the pre-processing is to minimize the amount of this human effort.
The results of the automatic pre-processing will be loaded into the CLaRK system in two forms:
(i) XML mark-up. This will be the actual annotation of the sentence. The annotator will have to edit this information in order to describe the sentences fully.
(ii) Constraints over XML documents. The constraints from the grammars will be encoded as constraints over XML documents and will guide the annotator in the process of annotation but this information will not be part of the actual syntactic description of the sentence.
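To make the distinction between mark-up and constraints concrete, here is a minimal sketch of one constraint checked over an XML annotation. The element and attribute names (`phrase`, `word`, `role`, `id`) and the constraint itself ("every phrase has exactly one head daughter") are hypothetical; the text does not give the CLaRK constraint format.

```python
import xml.etree.ElementTree as ET


def phrases_violating_single_head(root):
    """Return ids of <phrase> elements lacking exactly one daughter with role="head"."""
    bad = []
    for phrase in root.iter("phrase"):
        heads = [daughter for daughter in phrase if daughter.get("role") == "head"]
        if len(heads) != 1:
            bad.append(phrase.get("id"))
    return bad


# Hypothetical partial annotation: the inner phrase has no head marked yet.
doc = ET.fromstring("""
<sentence>
  <phrase id="p1">
    <word role="head">reads</word>
    <phrase id="p2" role="complement">
      <word>the</word>
      <word>book</word>
    </phrase>
  </phrase>
</sentence>
""")

unfinished = phrases_violating_single_head(doc)
```

The check reports `p2` as still needing the annotator's attention without itself adding anything to the annotation, which mirrors the idea that constraints guide the annotator but are not part of the syntactic description.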
To avoid the possible combinatorial explosion on the last level we plan to apply special compilation techniques which will distribute syntactic information locally onto the overall structure and will encode part of that information as constraints over the structure.
Manual annotation. The ambiguities remaining after the pre-processing stage will have to be resolved manually by the annotators. The system will be able to support this process by propagating constraints that follow automatically from the information supplied by the annotator. To give a simple example, if the annotator has had to specify one of two daughters as the head daughter, the system will automatically percolate the relevant head-features to the mother and further up the tree.
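The head-daughter example can be sketched as follows. This is a simplified illustration under stated assumptions: the `Node` class, the `mark_head` function, and the feature names are hypothetical, and head features are modelled as plain dictionaries rather than SRL feature graphs.

```python
class Node:
    """A tree node with head features and an optional head-daughter index."""

    def __init__(self, label, head_features=None, daughters=(), head_index=None):
        self.label = label
        self.head_features = dict(head_features or {})
        self.daughters = list(daughters)
        self.head_index = head_index  # index of the daughter marked as head
        self.parent = None
        for daughter in self.daughters:
            daughter.parent = self


def mark_head(mother, index):
    """Record the annotator's choice of head daughter and percolate upward.

    The mother copies the head features of its head daughter; the copy is
    repeated further up as long as each node is the head daughter of its parent.
    """
    mother.head_index = index
    node = mother
    while node is not None and node.head_index is not None:
        node.head_features = dict(node.daughters[node.head_index].head_features)
        parent = node.parent
        if (
            parent is None
            or parent.head_index is None
            or parent.daughters[parent.head_index] is not node
        ):
            break
        node = parent


# Toy tree: S -> NP VP, VP -> V NP; VP is already marked as the head of S.
verb = Node("V", {"pos": "verb", "vform": "finite"})
obj = Node("NP", {"pos": "noun"})
vp = Node("VP", daughters=[verb, obj])
subj = Node("NP", {"pos": "noun"})
s = Node("S", daughters=[subj, vp], head_index=1)

mark_head(vp, 0)  # the annotator marks V as the head daughter of VP
```

After the single call to `mark_head`, both `vp` and `s` carry the verb's head features, so the annotator does not have to enter them again at each level.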
References
[Abney St 1990] Syntactic Affixation and Performance Structures. In: Bouchard D., Leffel K. (eds), Views on Phrase Structure. Kluwer Academic Publishers, Dordrecht.
[Abney St 1991] Parsing By Chunks. In: Berwick R., Abney St., Tenny C. (eds), Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht.
[Avgustinova T 1997] Word Order and Clitics in Bulgarian. Saarbrücken Dissertations in Computational Linguistics and Language Technology, Volume 5. University of Saarbrücken.
[Joshi AK, Srinivas B 1994] Disambiguation of Super Parts of Speech (or Supertags): Almost Parsing. In: Proceedings of the 17th International Conference on Computational Linguistics (COLING '94). Kyoto, Japan.
[King PJ 1989] A Logical Formalism for Head-Driven Phrase Structure Grammar. Doctoral thesis. Department of Mathematics, University of Manchester, Manchester, England.
[King PJ 1994] An expanded logical formalism for Head-driven Phrase Structure Grammar. Sonderforschungsbereich 340 technical report 59. Sonderforschungsbereich 340, Seminar für Sprachwissenschaft, Eberhard-Karls-Universität, Tübingen, Germany.
[King PJ 1999] Towards Truth in HPSG. In: Kordoni V. (ed), Tübingen Studies in Head-Driven Phrase Structure Grammar. Volume 2, pages 301-352. Arbeitspapiere des SFB 340, Bericht Nr. 132. SFB 340, Tübingen, Germany.
[King PJ, Simov K 1998] The automatic deduction of classificatory systems from linguistic theories. Grammars, 1(2): 103-153. Kluwer Academic Publishers, The Netherlands.
[Oepen S, Netter K, Klein J 1998] TSNLP - test suites for natural language processing. In: Nerbonne J (ed), Linguistic Databases. CSLI Lecture Notes. CSLI Publications, Stanford, USA.
[Marciniak M, Mykowiecka A, Przepiórkowski A, Kupść A 1999] Construction of an HPSG TreeBank for Polish. In: Proceedings of the ATALA conference.
[Pollard C, Sag I 1994] Head-Driven Phrase Structure Grammar. University of Chicago Press, Chicago, Illinois, USA.
[Popov D, Simov K, Vidinska S 1998] A Dictionary of Writing, Pronunciation and Punctuation of Bulgarian Language. (In Bulgarian). Atlantis LK, Sofia, Bulgaria.
[Richter F 1997] Die Satzstruktur des Deutschen und die Behandlung langer Abhängigkeiten in einer Linearisierungsgrammatik. Formale Grundlagen und Implementierung in einem HPSG-Fragment. In: Hinrichs E, Meurers D, Richter F, Sailer M, Winhart H (eds) Ein HPSG-Fragment des Deutschen. Teil 1: Theorie. SFB Report 95 Universität, Tübingen, Germany.
[Richter F 1999] RSRL for HPSG. In: Kordoni V. (ed) Tübingen Studies in Head-Driven Phrase Structure Grammar. Arbeitspapiere des SFB 340, Nr. 132, Volume 1. Universität, Tübingen, Germany.
[Richter F, Sailer M, Penn G 1998] A Formal Interpretation of Relations and Quantification in HPSG. In: Bouma G, Kruijff G.-J.M., Oehrle R.T. (eds): Proceedings of the FHCG-98. Saarbrücken, Germany.
[Richter F, Sailer M, Penn G 1999] A Formal Interpretation of Relations and Quantification in HPSG. In: Bouma G, Hinrichs E, Kruijff G.-J.M., Oehrle R.T. (eds): Constraints and Resources in Natural Language Syntax and Semantics. CSLI Publications. Stanford, USA.
[Simov K, Popov D 1996] Creating a morphological dictionary of the Bulgarian language. In: Proceedings of COMPLEX'96 Conference. Budapest, Hungary.
[Simov K, Angelova G, Paskaleva E 1990] MORPHO-ASSISTANT: The proper treatment of morphological knowledge. In: Proceedings of the 15th International Conference on Computational Linguistics (COLING '90), volume 3, pp 453-457. Helsinki, Finland.
[Simov K, Paskaleva E, Damova M, Slavcheva M 1992] MORPHO-ASSISTANT - a knowledge based system for Bulgarian morphology. In: Proceedings of Demo Descriptions of Third conference on Natural Language Application. Trento, Italy.
[Text Encoding Initiative 1997] Guidelines for Electronic Text Encoding and Interchange. Sperberg-McQueen C.M., Burnard L (eds).