CLaRK is an XML-based software system for corpus
development. The main aim behind the design of the system is to minimise
human intervention during the creation of language resources. It incorporates several technologies:
XML technology;
Unicode;
Regular Cascaded Grammars;
Constraints over XML Documents.
For document management, storage and querying, we chose
XML technology because of its popularity and ease of understanding. The core
of CLaRK is a Unicode XML editor, which is the main interface to the system.
Besides the XML language itself, we implemented the XPath language for navigation
in documents and the XSLT language for transformation of XML documents.
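As an illustration of XPath-style navigation in an XML document, here is a minimal Python sketch. It does not use CLaRK itself: it relies on the standard library module xml.etree.ElementTree, which supports only a limited subset of XPath. The sample document and its element and attribute names (s, w, pos, id) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are assumptions.
SAMPLE = """
<text>
  <s id="s1"><w pos="N">dogs</w><w pos="V">bark</w></s>
  <s id="s2"><w pos="N">cats</w><w pos="V">sleep</w></s>
</text>
"""

root = ET.fromstring(SAMPLE)

# Select every <w> element, at any depth, whose pos attribute equals "N".
nouns = [w.text for w in root.findall(".//w[@pos='N']")]

# Select the words of one particular sentence.
s2_words = [w.text for w in root.findall("./s[@id='s2']/w")]

print(nouns)
print(s2_words)
```

The same path expressions could serve as the selection step for any later processing, since each one returns the matching elements in document order.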
For multilingual processing tasks, CLaRK uses a
Unicode encoding of the information inside the system. There is a mechanism for
creating a hierarchy of tokenisers. Tokenisers can be attached to the elements
in the DTDs, so that different parts of a document can be processed by
different tokenisers.
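The idea of element-specific tokenisers can be sketched in Python as follows. This is not CLaRK's mechanism: the tokeniser names, the regular expressions, and the rule that a child element inherits its parent's tokeniser unless it has one of its own are all assumptions made for illustration, and the attachment is done here in a plain dictionary rather than in a DTD.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical tokenisers: each regular expression describes one token.
TOKENISERS = {
    "default": re.compile(r"\w+|[^\w\s]"),                # words, single punctuation marks
    "date": re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}|\S+"),   # keep a full date as one token
}

# Hypothetical attachment of tokenisers to element names.
ATTACHED = {"date": "date"}


def tokenise(element, inherited="default"):
    """Yield (element tag, token) pairs, choosing the tokeniser per element."""
    name = ATTACHED.get(element.tag, inherited)
    pattern = TOKENISERS[name]
    if element.text:
        for token in pattern.findall(element.text):
            yield element.tag, token
    for child in element:
        # Assumption: children inherit the tokeniser of their parent.
        yield from tokenise(child, name)
        # Text that follows a child belongs to the current element.
        if child.tail:
            for token in pattern.findall(child.tail):
                yield element.tag, token


doc = ET.fromstring("<p>Signed on <date>12.03.2001</date>, in Sofia.</p>")
tokens = list(tokenise(doc))
print(tokens)
```

With the default tokeniser the date would be split at each full stop; attaching a dedicated tokeniser to the date element keeps it as a single token.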
The basic mechanism of CLaRK for linguistic processing of
text corpora is the cascaded regular grammar processor. The main challenge for
such grammars is how to apply them to an XML encoding of the linguistic
information. The system offers a solution that uses the XPath language to
construct the input word for the grammar and an XML encoding of the categories
of the recognised words.
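The general technique can be sketched in Python. This is a simplified illustration, not CLaRK's grammar processor: the input word is built from the children of a sentence element (the pos attribute of a word, or the tag of an already recognised phrase), each rule is an ordinary regular expression over that word, and a recognised span is encoded by wrapping it in a new element. The function names, the rule set and the sample sentence are assumptions.

```python
import re
import xml.etree.ElementTree as ET


def category(node):
    # Input symbol for a node: the pos attribute of a word, else the tag name.
    return node.get("pos") if node.tag == "w" else node.tag


def apply_rule(parent, label, pattern):
    """Wrap each run of children whose categories match `pattern` in <label>."""
    children = list(parent)
    # The input word of the grammar: one <CAT> symbol per child.
    word = "".join(f"<{category(child)}>" for child in children)
    rebuilt, position = [], 0
    for match in re.finditer(pattern, word):
        if match.start() == match.end():
            continue  # ignore empty matches
        # Map character offsets back to child indices by counting symbols.
        first = word.count("<", 0, match.start())
        last = word.count("<", 0, match.end())
        rebuilt.extend(children[position:first])
        wrapper = ET.Element(label)
        wrapper.extend(children[first:last])
        rebuilt.append(wrapper)
        position = last
    rebuilt.extend(children[position:])
    parent[:] = rebuilt


# Hypothetical two-level cascade: the second rule uses the output of the first.
CASCADE = [
    ("NP", r"(?:<D>)?(?:<A>)*<N>"),
    ("PP", r"<P><NP>"),
]

sentence = ET.fromstring(
    '<s><w pos="D">the</w><w pos="A">old</w><w pos="N">dog</w>'
    '<w pos="V">slept</w><w pos="P">in</w>'
    '<w pos="D">the</w><w pos="N">garden</w></s>'
)
for label, pattern in CASCADE:
    apply_rule(sentence, label, pattern)

print(ET.tostring(sentence, encoding="unicode"))
```

Because each level writes its result back into the XML tree, the next grammar in the cascade can treat the newly created elements as ordinary input symbols.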
Several mechanisms for imposing constraints over XML
documents are available. These constraints cannot be stated with standard XML
technology. The following types of constraints are implemented in CLaRK:
Regular expression constraints - additional constraints over the
content of given elements based on a context;
Number restriction constraints - cardinality constraints over the
content of a document;
Value constraints - restriction of the possible content or parent of
an element in a document based on a context.
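Two of these constraint types can be sketched in Python under simplifying assumptions. The context is given as a path expression (in the limited XPath subset of xml.etree.ElementTree), a number restriction bounds how many nodes the path selects, and a regular expression constraint is checked against the sequence of child tags of each selected element. The function names and the sample dictionary are hypothetical and do not reflect CLaRK's own constraint syntax.

```python
import re
import xml.etree.ElementTree as ET


def check_number(root, path, minimum, maximum):
    """Number restriction: the path must select between minimum and maximum nodes."""
    return minimum <= len(root.findall(path)) <= maximum


def check_regex(root, path, pattern):
    """Regular expression constraint over the child tags of each selected element.

    Returns the elements whose content does not match the pattern.
    """
    failing = []
    for element in root.findall(path):
        content = " ".join(child.tag for child in element)
        if re.fullmatch(pattern, content) is None:
            failing.append(element)
    return failing


# Hypothetical sample data; element and attribute names are assumptions.
DICTIONARY = ET.fromstring(
    '<dict>'
    '<entry type="verb"><hw>run</hw><pos>V</pos><sense>move fast</sense></entry>'
    '<entry type="verb"><hw>walk</hw><sense>move on foot</sense></entry>'
    '<entry type="abbr"><hw>etc.</hw><sense>and so on</sense></entry>'
    '</dict>'
)

# Exactly three headwords are expected in this small sample.
headword_count_ok = check_number(DICTIONARY, "./entry/hw", 3, 3)

# Only entries in the context type="verb" must contain hw, pos, then senses.
bad_entries = check_regex(DICTIONARY, "./entry[@type='verb']", r"hw pos( sense)+")

print(headword_count_ok, [entry.find("hw").text for entry in bad_entries])
```

The second check depends on an attribute value of the element, which is the kind of context-dependent condition a DTD content model cannot state.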
The constraints are used in two modes: checking the validity
of a document with respect to a set of constraints, and supporting the linguist in his/her
work during the building of a corpus. The first mode allows the creation of
constraints for the validation of a corpus according to given requirements. The
second mode serves the underlying strategy of minimising human
labour.
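The two modes can be illustrated with a single value constraint. The Python sketch below assumes a hypothetical constraint on the parent of a PP element: the parent may be NP or VP, but only VP when the left sister of the PP is a VP. The function names and the sample tree are assumptions, not part of CLaRK.

```python
import xml.etree.ElementTree as ET

# Parents of a PP permitted by the (hypothetical) DTD.
DTD_PARENTS = {"NP", "VP"}


def allowed_parents(left_sister):
    """Parents still possible for a PP, given the tag of its left sister."""
    if left_sister == "VP":
        return {"VP"}
    return set(DTD_PARENTS)


def validate(root):
    """Validation mode: return every PP whose parent violates the constraint."""
    bad = []
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag != "PP":
                continue
            left = children[index - 1].tag if index > 0 else None
            if parent.tag not in allowed_parents(left):
                bad.append(child)
    return bad


def suggest(left_sister):
    """Support mode: return the parent if only one is possible, else None."""
    options = allowed_parents(left_sister)
    return next(iter(options)) if len(options) == 1 else None


TREE = ET.fromstring(
    '<S>'
    '<NP><N/><PP id="a"/></NP>'
    '<VP><VP><V/></VP><PP id="b"/></VP>'
    '<NP><VP/><PP id="c"/></NP>'
    '</S>'
)
invalid_ids = [pp.get("id") for pp in validate(TREE)]
print(invalid_ids, suggest("VP"), suggest("N"))
```

In validation mode the constraint reports existing errors; in support mode the same constraint narrows the choices, so that when only one value remains the annotator does not have to make the decision by hand.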
We envisage several uses for our system:
Corpora markup. Here users work with the XML
tools of the system in order to mark up texts with respect to an
XML DTD. This task usually requires an enormous human effort and
comprises both the mark-up itself and its validation afterwards.
Using the available grammar resources, such as morphological
analysers or partial parsers, the system can state local
constraints reflecting the characteristics of a particular kind of
text or mark-up. An example of such a constraint is the
following: according to the DTD, a PP can have an NP or a VP as its parent,
but if its left sister is a VP, then the only possible parent is
VP. The system can use constraints of this kind to
support the user and minimise his/her work.
Dictionary compilation for human users.
The system will support the creation of the actual lexical entries,
whose structure will be defined via an appropriate DTD. The XML
tools will also be used for corpus investigation that provides
appropriate examples of word usage in the available corpora.
The constraints incorporated in the system will be used for
writing a grammar of the sublanguages of the definitions of the
lexical items, and for imposing constraints over elements of lexical
entries and over the dictionary as a whole.
Corpora investigation. The CLaRK System
offers a rich set of tools for searching over tokens and mark-up
in XML corpora, including cascaded grammars and the XPath language. Their
combinations are used for tasks such as extraction of elements
from a corpus (for example, extraction of all NPs in the corpus) and
concordance (for example, listing all NPs in the context of their
use, ordered by a user-defined set of criteria).
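The extraction and concordance tasks mentioned for corpora investigation can be sketched in Python. This is an illustration with the standard library only, not CLaRK's search tools: the corpus layout (s, NP and w elements), the context width and the sorting key are assumptions.

```python
import xml.etree.ElementTree as ET


def concordance(corpus, tag="NP", width=2, key=None):
    """Return (left context, match, right context) rows for every <tag> element.

    `width` is the number of context words on each side; `key` is a
    user-defined sorting criterion applied to the rows.
    """
    rows = []
    for sentence in corpus.iter("s"):
        words = list(sentence.iter("w"))
        for phrase in sentence.iter(tag):
            inner = list(phrase.iter("w"))
            if not inner:
                continue
            # Locate the phrase inside the sentence by object identity.
            start = next(i for i, w in enumerate(words) if w is inner[0])
            end = start + len(inner)
            left = " ".join(w.text for w in words[max(0, start - width):start])
            right = " ".join(w.text for w in words[end:end + width])
            rows.append((left, " ".join(w.text for w in inner), right))
    return sorted(rows, key=key)


# Hypothetical sample corpus.
CORPUS = ET.fromstring(
    '<corpus>'
    '<s><NP><w>the</w><w>dog</w></NP><w>chased</w><NP><w>a</w><w>cat</w></NP></s>'
    '<s><NP><w>birds</w></NP><w>sing</w></s>'
    '</corpus>'
)

# Extraction: the text of all NPs in the corpus, in document order.
all_nps = [" ".join(w.text for w in np.iter("w")) for np in CORPUS.iter("NP")]

# Concordance: NPs in context, ordered by the text of the NP itself.
rows = concordance(CORPUS, key=lambda row: row[1])
for left, match, right in rows:
    print(f"{left:>12} [{match}] {right}")
```

Passing a different key function, for example one that sorts on the right context, changes the ordering without touching the extraction step.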