CLaRK - an XML Based System For Corpora Development
Unicode XML Editor, XPath Engine, XSLT Engine, XML Constraints, XML Cascaded Regular Grammar Engine
CLaRK System version 3.0 is now available. (Download)
Job opportunity:
We are looking for young people to work on new international projects. Main requirements: programming in Java, English, and a desire to do research in natural language processing and ontologies (Semantic Web). Those interested should write to: Kiril Simov.
If you have any problems with downloading the system, please contact us for support!
CLaRK Support Team
CLaRK is an XML-based software system for corpora development, implemented in Java. The main aim behind the design of the system is to minimise human intervention during the creation of language resources. It incorporates several technologies:
For document management, storage and querying, we chose XML because it is widely used and easy to understand. The core of CLaRK is a Unicode XML Editor, which is the main interface to the system. Besides XML itself, we implemented an XPath engine for navigation in documents and an XSLT engine for transformation of XML documents.
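To make the role of XPath navigation concrete, here is a minimal sketch in Python. It is not CLaRK code: it uses the limited XPath subset of the standard `xml.etree.ElementTree` module as a stand-in for the system's own XPath engine, and the element names (`text`, `s`, `w`) and the `pos` attribute are assumptions made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus fragment; element and attribute names are illustrative only.
DOC = """<text>
  <s id="s1"><w pos="N">corpora</w><w pos="V">grow</w></s>
  <s id="s2"><w pos="N">grammars</w></s>
</text>"""


def select_text(xml_source, path):
    """Return the text of every element matched by a path expression.

    ElementTree understands only a subset of XPath, which is enough here.
    """
    root = ET.fromstring(xml_source)
    return [el.text for el in root.findall(path)]


# All words tagged as nouns, anywhere in the document.
nouns = select_text(DOC, ".//w[@pos='N']")

# All words of the sentence whose id is "s1".
first_sentence = select_text(DOC, "./s[@id='s1']/w")
```

The same kind of path expression can select either a set of nodes to inspect or a context to transform, which is presumably why the navigation language sits at the centre of the other tools.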
For multilingual processing tasks, all information inside CLaRK is encoded in Unicode. There is a mechanism for creating a hierarchy of tokenisers. Tokenisers can be attached to elements in the DTDs, so that different parts of a document are processed by different tokenisers.
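One plausible reading of a tokeniser hierarchy is sketched below in Python. This is an assumption about the general idea, not the CLaRK implementation: the `Tokeniser` class, the category names, and the `TOKENISER_FOR` mapping from element names to tokenisers are all hypothetical. A child tokeniser adds its own token categories and falls back to its parent's for everything else.

```python
import re


class Tokeniser:
    """A set of token categories that falls back to a parent tokeniser."""

    def __init__(self, rules, parent=None):
        self.rules = [(cat, re.compile(pat)) for cat, pat in rules]
        self.parent = parent

    def all_rules(self):
        # Own rules take priority; inherited rules follow.
        inherited = self.parent.all_rules() if self.parent else []
        return self.rules + inherited

    def tokenise(self, text):
        tokens, pos = [], 0
        rules = self.all_rules()
        while pos < len(text):
            for cat, pat in rules:
                m = pat.match(text, pos)
                if m and m.end() > pos:
                    tokens.append((cat, m.group()))
                    pos = m.end()
                    break
            else:
                # No rule applies: emit the single character as OTHER.
                tokens.append(("OTHER", text[pos]))
                pos += 1
        return tokens


# Base tokeniser; [^\W\d_]+ matches letters of any script, not only Latin.
base = Tokeniser([("SPACE", r"\s+"), ("WORD", r"[^\W\d_]+"), ("NUMBER", r"\d+")])

# Specialised tokeniser that recognises ISO dates before falling back to base.
dates = Tokeniser([("DATE", r"\d{4}-\d{2}-\d{2}")], parent=base)

# Hypothetical attachment of tokenisers to element names.
TOKENISER_FOR = {"p": base, "date": dates}

date_tokens = TOKENISER_FOR["date"].tokenise("on 2002-09-01")
plain_tokens = TOKENISER_FOR["p"].tokenise("2002-09")
cyrillic_tokens = base.tokenise("Кирил")
```

With this arrangement the same string is split differently depending on the element it occurs in: inside `date` the whole date is one token, while inside `p` it breaks into numbers and punctuation.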
The basic mechanism of CLaRK for linguistic processing of text corpora is the cascaded regular grammar processor. The main challenge for such grammars is how to apply them to an XML encoding of the linguistic information. The system offers a solution that uses XPath expressions to construct the input word for the grammar (the sequence of symbols the grammar reads) and an XML encoding of the categories of the recognised words.
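The idea of running a cascade of regular grammars over XML can be sketched as follows. This is a simplified illustration in Python, not the CLaRK grammar engine: the function `apply_rule`, the `<CAT>` symbol encoding, the `pos` attribute, and the rule syntax (ordinary regular expressions over those symbols) are all assumptions. A path expression selects the elements that form the input word, each element contributes one symbol, and every match is wrapped in a new element named after the recognised category.

```python
import re
import xml.etree.ElementTree as ET


def apply_rule(parent, pattern, category, path="*", attr="pos"):
    """Wrap each run of children whose symbols match `pattern` in <category>.

    `path` must select direct children of `parent`. Each selected element
    contributes one symbol: its `attr` value, or its tag if `attr` is absent.
    """
    items = parent.findall(path)
    word = "".join("<%s>" % el.get(attr, el.tag) for el in items)

    # Convert character offsets of matches into element index ranges.
    spans = []
    for m in re.finditer(pattern, word):
        if m.group():
            start = word.count("<", 0, m.start())
            spans.append((start, start + m.group().count("<")))

    # Rewrite right to left so earlier positions stay valid.
    for start, end in reversed(spans):
        group = ET.Element(category)
        pos = list(parent).index(items[start])
        for el in items[start:end]:
            parent.remove(el)
            group.append(el)
        parent.insert(pos, group)
    return parent


s = ET.fromstring(
    '<s><w pos="D">the</w><w pos="A">old</w><w pos="N">man</w>'
    '<w pos="V">saw</w><w pos="D">a</w><w pos="N">dog</w></s>'
)

# Level 1: optional determiner, any adjectives, then a noun form an NP.
apply_rule(s, r"(<D>)?(<A>)*<N>", "NP")

# Level 2: operates on the output of level 1, where NP is now a symbol.
apply_rule(s, r"<V><NP>", "VP")

top_level = [child.tag for child in s]
vp_words = [w.text for w in s.find("VP").iter("w")]
```

Because the recognised category is written back as an element, the next grammar in the cascade sees it as an ordinary symbol, which is what makes the levels composable.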
Several mechanisms for imposing constraints over XML documents are available. These are constraints that cannot be expressed with standard XML technology. The following types of constraints are implemented in CLaRK:
The constraints are used in two modes: checking the validity of a document with respect to a set of constraints, and supporting the linguist in his/her work while a corpus is being built. The first mode allows constraints to be written that validate a corpus against given requirements. The second mode serves the underlying strategy of minimising human labour.
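The two modes can be illustrated with a small Python sketch. It does not reproduce any of CLaRK's actual constraint types: `ValueConstraint`, its parameters, and the tag set are hypothetical, and the assumption that the supporting mode works by offering permitted values is only one way such support might look.

```python
import xml.etree.ElementTree as ET


class ValueConstraint:
    """Hypothetical constraint: selected elements must carry `attr`
    with one of the allowed values."""

    def __init__(self, path, attr, allowed):
        self.path, self.attr, self.allowed = path, attr, set(allowed)

    def violations(self, root):
        # Mode 1 (validation): list every selected element that breaks the rule.
        return [el for el in root.findall(self.path)
                if el.get(self.attr) not in self.allowed]

    def suggestions(self, el):
        # Mode 2 (support): offer the annotator the values that would satisfy it.
        if el.get(self.attr) in self.allowed:
            return []
        return sorted(self.allowed)


def validate(root, constraints):
    """A document is valid if no constraint reports a violation."""
    return all(not c.violations(root) for c in constraints)


doc = ET.fromstring('<s><w pos="N">dog</w><w>barks</w><w pos="X">!</w></s>')
pos_constraint = ValueConstraint(".//w", "pos", {"N", "V", "PUNCT"})

bad = pos_constraint.violations(doc)
bad_texts = [el.text for el in bad]
choices = pos_constraint.suggestions(bad[0])
is_valid = validate(doc, [pos_constraint])
```

The same constraint object serves both purposes: run over a finished corpus it reports errors, and run on the element currently being edited it narrows the annotator's choices.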
We envisage several uses for our system:
CLaRK Overview (not updated since version 1.0)
Kiril Simov, Zdravko Peev, Milen Kouylekov, Alexander Simov, Marin Dimitrov, Atanas Kiryakov. CLaRK - an XML-based System for Corpora Development. In: Proc. of the Corpus Linguistics 2001 Conference, pages 558-560. (Zipped PDF version).
Kiril Simov, Milen Kouylekov, Alexander Simov. Cascaded Regular Grammars over XML Documents. In: Proc. of the 2nd Workshop on NLP and XML (NLPXML-2002), Taipei, Taiwan. September 1, 2002. (to appear). (Postscript version, Zipped Postscript version, Zipped PDF version).
Kiril Simov, Alexander Simov, Milen Kouylekov, Krassimira Ivanova. CLaRK System: Construction of Treebanks. In: Proc. of The First Workshop on Treebanks and Linguistic Theories (TLT2002), 20th and 21st September 2002, Sozopol, Bulgaria. pages 183-198.
Kiril Simov, Alexander Simov, Milen Kouylekov. Constraints for Corpora Development and Validation. In: Proc. of the Corpus Linguistics 2003 Conference, pages 698-705.
Kiril Simov, Alexander Simov, Milen Kouylekov, Krasimira Ivanova, Ilko Grigorov, Hristo Ganev. Development of Corpora within the CLaRK System: The BulTreeBank Project Experience. In: Proc. of the Demo Sessions of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL'03), Budapest, Hungary. 2003.