This tool handles documents that contain symbols not supported by the local hardware architecture. It
substitutes the symbols with entities according to the ISO 8879 standard, and vice versa. Currently, this tool
supports 19 sub-sets of entity-character conversions. Each of them can be activated or deactivated.
One reason for excluding some of the sub-sets is that sometimes not all the symbols have to be converted,
for example commas, dots, colons and semicolons.
Example ("дума" is the Bulgarian equivalent of "word"):
"дума" <-- entity conversion --> "&dcy;&ucy;&mcy;&acy;"
The tool operates on the document which is
currently opened in the system or on a set of documents from the Internal documents database.
It can be started from menu item: Tools/Entity Converters.
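The two conversion directions can be illustrated with a minimal Python sketch. It is not the tool's implementation: the filter below is a hypothetical four-entry subset of the ISO 8879 Cyrillic entities, and the function names are assumptions made for this example. Characters not covered by an active filter (such as the comma) are left unchanged.

```python
import re

# Hypothetical filter: a tiny subset of the ISO 8879 Cyrillic entity set.
CYRILLIC_FILTER = {
    "dcy": "\u0434",  # Cyrillic small letter de
    "ucy": "\u0443",  # Cyrillic small letter u
    "mcy": "\u043c",  # Cyrillic small letter em
    "acy": "\u0430",  # Cyrillic small letter a
}

def char_to_entity(text, filters):
    """Replace every character covered by an active filter with its entity."""
    reverse = {ch: name for f in filters for name, ch in f.items()}
    return "".join(f"&{reverse[c]};" if c in reverse else c for c in text)

def entity_to_char(text, filters):
    """Replace every entity known to an active filter with its character."""
    forward = {name: ch for f in filters for name, ch in f.items()}
    # Entities that no active filter knows are left untouched.
    return re.sub(r"&(\w+);", lambda m: forward.get(m.group(1), m.group(0)), text)

word = "\u0434\u0443\u043c\u0430"  # the Bulgarian word from the example
encoded = char_to_entity(word + ", x", [CYRILLIC_FILTER])
decoded = entity_to_char(encoded, [CYRILLIC_FILTER])
```

Passing an empty filter list corresponds to deactivating all sub-sets: the text is returned unchanged.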
The dialog window looks as follows:
The window shows a list of converters (filters) which will be used in the replacement
procedure. The list content can be managed with the following buttons:
- Add Filter - adds a new filter or a set of filters to the current list content.
After pressing this button, the user is shown a list of all available filters which are not yet present in
the working list. The list is placed in a new dialog window with the following layout:

Here the user selects filters to be added. The control buttons are as follows:
- Add - includes the selected item(s) in the working list;
- Preview - shows the currently selected filter in the list in detail;
- Cancel - closes the dialog without any other action;
- Add All - includes all list items in the working list.
- Remove Filter - removes the selected item(s) from the working list.
- View Filter - shows detailed information about the selected item (filter) in the list.
The information is visualized in a table form, where each row represents a single character-to-entity
mapping. The table has three columns: Entity (the literal representations of the entities),
Value (the character (uni)codes in hexadecimal format) and Preview (the characters themselves).

The direction of conversion is determined by the two radio buttons: Entity to Character and
Character to Entity.
Additionally, the user can restrict the scope of conversion application, i.e. the conversion can be applied
only to certain places in the documents, leaving the rest unchanged. If no restriction is used the conversion
is applied to all attributes, text and comment nodes in the document(s). To set a restriction, the
Enable Filtering checkbox must be selected. The user is then expected to supply an XPath expression
which will select the nodes on which the conversion will be applied. Each application on a node also includes
conversion of the whole content of the node, i.e. all descending nodes suitable for this operation. Thus, if for
example, only data included in paragraphs must be processed, the XPath expression must select only the
paragraph nodes.
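The scope restriction can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the limited XPath subset of the standard xml.etree.ElementTree module, it handles attributes and text but not comment nodes, and the converter to_entity is a hypothetical stand-in for a real filter.

```python
import xml.etree.ElementTree as ET

def convert_subtree(node, convert):
    """Apply `convert` to the attributes and text of `node` and all its descendants."""
    for el in node.iter():
        for name, value in list(el.attrib.items()):
            el.set(name, convert(value))
        if el.text:
            el.text = convert(el.text)
        # The tail of the selected node itself lies outside its content.
        if el is not node and el.tail:
            el.tail = convert(el.tail)

def apply_restricted(root, xpath, convert):
    """Convert only the nodes selected by `xpath`; return how many were selected."""
    selected = root.findall(xpath)
    for node in selected:
        convert_subtree(node, convert)
    return len(selected)

def to_entity(text):
    # Hypothetical one-character converter used only for this example.
    return text.replace("\u00e9", "&eacute;")

doc = ET.fromstring(
    '<doc><title>caf\u00e9</title>'
    '<p lang="\u00e9">caf\u00e9 <b>\u00e9</b> tail \u00e9</p></doc>'
)
count = apply_restricted(doc, ".//p", to_entity)
```

Only the paragraph and its descendants are converted; the title keeps its original character.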
This tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool Application Modes.
The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.
A Grammar in the CLaRK System is defined as a set of rules.
Each rule consists of three regular expressions and a category
(represented as an XML fragment, called Return Markup). The three regular
expressions are called: Regular Expression, Left Regular Expression, Right
Regular Expression. The Regular Expression determines the content which the
rule can be applied to. The Left and the Right Regular Expression determine the
left and the right context of the content the rule recognises (if there are no
constraints on one of the contexts, then the corresponding expression is
empty). When the rule is applied and recognises some part of an XML document,
the part is substituted by the return markup of the rule. If it is necessary to
keep the recognised part, it can be cited by using the variable
\w. If the user needs the literal string \w in the return markup, he/she can
escape the \w variable in the following way: ^\w
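The substitution of \w and its escaped form ^\w can be sketched in Python. The function name expand_return_markup is an assumption made for this example and is not part of the CLaRK System.

```python
import re

def expand_return_markup(markup, match):
    r"""Replace each unescaped \w with the match; ^\w yields a literal \w."""
    def replace(m):
        # An escaped occurrence starts with the ^ character.
        return r"\w" if m.group(0).startswith("^") else match
    return re.sub(r"\^?\\w", replace, markup)

wrapped = expand_return_markup(r"<np>\w</np>", "the dog")
escaped = expand_return_markup(r"<x>^\word</x>", "M")
unescaped = expand_return_markup(r"<x>\word</x>", "M")
```

The last call shows the pitfall: without the escape, the beginning of \word is treated as the variable.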
The regular grammars in the CLaRK System work over token and/or
element values generated from the content of an XML document and they
incorporate their results back in the document as XML mark-up.
The tokens are determined by the corresponding tokenizer.
Before having been used in the grammar, each XML element is converted
into a list of textual items. This list is called element value for the XML
element. The element values are defined with the help of XPath keys, which
determine the important information for each element.
In the grammars, the token and element values are described
by token and element descriptions. These descriptions could contain wildcard
symbols and variables. The variables are shared among the token descriptions
within a regular expression and can be used for the treatment of phenomena like
agreement.
Here is the list of the token and element descriptions:
"token" -> describes the token itself. This
description can be matched to the token itself and nothing else.
$TokenCategory -> describes all tokens of
the category TokenCategory. In the grammar input this description
is matched against exactly one token of this category.
Wildcard Symbols: #, @,
% -> describe substrings of a given token. # -
describes a substring of arbitrary length (zero or more characters), @ -
describes a substring of length 0 or 1, % -
describes a substring of length exactly one. Here are some examples:
"lov#": matches exactly one token which
starts with "lov": "lov", "love", "loves", and many others.
"lo#ve": matches exactly one token which
starts with "lo" and ends with "ve": "love", "locative", "locomotive" etc.
"%og": matches exactly one token which consists of a single character
followed by "og": "bog", "dog", "fog", "jog" etc.
"do@": matches exactly one token which is "do" optionally followed
by a single character: "do", "doe", "dog", "don" and others.
Variables: &V -> describes some substring in
a token. On its first use the variable is initialised with that substring; afterwards it matches only the same substring, in
the same token or in some following tokens. The scope of a variable is one grammar
rule. The variable can be used in the return mark-up and in this case the value
of the variable is copied into the return mark-up. Each variable consists of the
symbol & followed by a single Latin letter. Each variable has
positive and/or negative constraints over the possible values. Both the positive
and the negative constraints over variables are given by lists of token
descriptions. The value assigned to a variable during the application of the
grammar has to be described by one of the positive constraints and must not be
described by any of the negative constraints. These token descriptions can
contain wildcard symbols, but no other variables. Here are some examples:
"A&N&G", "Nc&N&G". These token descriptions can be
used in a rule to ensure the agreement in number and gender between an adjective
and a noun.
Complex token descriptions. The user can combine the
above descriptions in one token description. Some examples:
"lov%#", "Vp&N&G#&P"
An element description is a regular expression in angle
brackets: < Regular Expression >. Here the
Regular Expression is over token descriptions and is matched against the
element value. Examples:
<w>: matches exactly one
w element.
<$TokenCategory>: matches exactly
one element, whose element value is a token description with category
TokenCategory
<"token"> : matches exactly one
element, whose element value is the token itself and nothing else. A token can
contain wildcard symbols and variables.
<<N>> : matches exactly one
element, whose element value is the XML element N.
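The wildcard symbols used in token descriptions can be approximated with ordinary regular expressions. The following Python sketch is an assumption-based illustration, not the system's matcher: it translates # to "any substring", @ to "at most one character" and % to "exactly one character", and requires the whole token to be matched. Variables and token categories are not handled.

```python
import re

# Translation of the three wildcard symbols into regular expression fragments.
WILDCARDS = {"#": ".*", "@": ".?", "%": "."}

def description_to_regex(description):
    """Build a compiled regular expression for a wildcard token description."""
    parts = [WILDCARDS.get(ch, re.escape(ch)) for ch in description]
    return re.compile("".join(parts), re.DOTALL)

def matches(description, token):
    """True if the description matches the whole token."""
    return description_to_regex(description).fullmatch(token) is not None

examples = {
    ("lov#", "loves"): matches("lov#", "loves"),
    ("lo#ve", "locomotive"): matches("lo#ve", "locomotive"),
    ("%og", "dog"): matches("%og", "dog"),
    ("%og", "frog"): matches("%og", "frog"),
    ("do@", "dogs"): matches("do@", "dogs"),
}
```

In this sketch "%og" accepts "dog" but rejects "frog", because % stands for exactly one character.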
The application of a rule works in the following way. The
element value is scanned from left to right. Its Regular Expression is evaluated
from the current point. If it recognises a part of the element value (this part
we will call a match of the rule), then the regular expressions for the
left and for the right contexts are evaluated (if they are not empty). If they
are satisfied by the context of the match, then the match is substituted in the
return markup for each occurrence of the variable \w. (The user must be careful: if
the return markup contains, for example, text like \word, the
beginning \w will be substituted by the match. In this case
the variable must be escaped.) After these substitutions, the new markup is
inserted in the XML document in place of the match.
When a regular expression is evaluated from a given point
within the element value, there is a possibility for several matches to the
expression. For instance, the expression (A,B)+ over the element
value L,A,B,A,B,A can recognise two matches from the second
position: A,B and A,B,A,B. This allows for a
non-deterministic choice in this place. One can choose either the shortest
match, or the longest one, or some in between. Generally, there are no universal
principles for making such a choice. This is why the CLaRK system allows the
user to define a strategy for choosing among several candidate matches. There are four strategies: shortest
match - in this case the system always selects the shortest possible match; longest match - in
this case the system always selects the longest possible match; any up -
in this case the possible matches are enumerated from the shortest to the
longest possible match up to the moment when the left and/or the right context
of the match satisfy the Left and/or the Right Regular Expression. If there are
no Left and Right Regular Expressions, then the any up strategy is the same as
shortest match; any down - it is similar to any up except for the fact
that the possible matches are enumerated from the longest to the shortest one.
These strategies are specified within the grammar queries. This allows the same
grammar to be applied with different strategies over different documents.
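The four strategies can be illustrated on the example above, (A,B)+ over L,A,B,A,B,A. The Python sketch below is a simplification under stated assumptions: candidate_matches handles only expressions of the form (unit)+, the context test is passed in as an ordinary function, and all names are hypothetical.

```python
def candidate_matches(items, start, unit):
    """End indices of all matches of (unit)+ beginning at `start`, shortest first."""
    ends, pos = [], start
    while items[pos:pos + len(unit)] == unit:
        pos += len(unit)
        ends.append(pos)
    return ends

def choose_match(ends, strategy, context_ok=lambda end: True):
    """Pick the end index of a match according to the strategy, or None."""
    if strategy == "shortest":
        candidates = ends[:1]
    elif strategy == "longest":
        candidates = ends[-1:]
    elif strategy == "any up":
        candidates = ends          # shortest to longest
    elif strategy == "any down":
        candidates = ends[::-1]    # longest to shortest
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # The first candidate whose context is satisfied wins.
    return next((end for end in candidates if context_ok(end)), None)

items = ["L", "A", "B", "A", "B", "A"]
ends = candidate_matches(items, 1, ["A", "B"])

def right_context(end):
    # Hypothetical right context: exactly one A remains after the match.
    return items[end:] == ["A"]

results = {s: choose_match(ends, s, right_context)
           for s in ("shortest", "longest", "any up", "any down")}
```

With this right context the shortest strategy fails (its only candidate is rejected), while any up continues to the longer match.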
The definition and application of a grammar are
separated within the CLaRK system. The grammar itself is defined at one place,
the parameters for its application are defined at another place in the form of
grammar queries. This separation allows the use of the same grammar with
different parameters like different tokenizers, different element values,
different filters etc. Each of them has an XML representation. These XML
representations allow grammars and their queries to be exchanged among different
users. This also allows grammars to be constructed outside the system and
then imported into it.
The grammar definition consists of a set of rules, variable
definitions and a context evaluation parameter (Check Context Order). The rules
have already been discussed. The variable definitions are given by the positive and
negative constraints over the variable. The context evaluation parameter
determines which context is checked first - the
left and then the right, or vice versa.
The grammar application determines: the elements which the
grammar will be applied to; the element values for those elements, including the
tokenizer and the filter; whether the textual elements will be normalized; the
application strategy (longest, shortest match, any up and any down match).
The grammar manager is the user interface in the CLaRK System
for management of grammar definitions. It supports the user in the
creation, modification and deletion of grammars.
The main dialog is the Entry Manager
with additional buttons. It contains all of the available grammars arranged in a tree hierarchy,
some of their features (Editable, Compiled) and buttons for management of the grammars.
Each grammar has to be compiled in order to be used in the
system. The compilation converts the regular expressions into a finite-state
automaton. Because compilation is a heavy process, sometimes it is better to postpone the
grammar compilation (for example, when a large grammar is
imported into the system). The column Compiled in the table
shows whether the grammar is compiled or not. If the grammar is
compiled the corresponding box is checked.
The system also allows the user to export and import already
compiled grammars. This option is very useful in cases of large grammars, when
the compilation takes a long time or the user wants to exchange just the compiled
grammar, but not its source. If such a grammar is used in the system, it cannot
be edited. In such cases, the check box in the column Editable is
not checked.
Here is the main window of the Grammar Manager:

There are several grammars. The grammar Slovnik
One is editable and also compiled. Thus if necessary, the user can open
it in the Grammar Editor and can modify it. The grammar Slovnik
One[1] is editable, but not compiled. Thus it cannot be used
immediately, first it has to be compiled. The grammar Slovnik
One[3] is available only in compiled form. It cannot be modified in this
form, but it can be used to process the relevant documents.
The manager window consists of two main parts:
- The panel on the left. It contains the tree representations of the group hierarchy. When the user
selects a node in this tree, the content of the corresponding group is loaded in the component on the
right side.
- Current group monitor. This is the panel situated on the right side of the window. It is
a list with the content of the currently selected group in the tree. The list components,
which are in blue color, are sub-groups. The other ones are the grammars included in
this group. They are colored in black or red, depending on whether they are valid or not,
according to their DTDs. The user can sort all the grammars in a group by clicking on the
Name column of the table header. It is possible to rearrange the grammars in a group by simply using
the drag-and-drop technique, i.e. pressing a grammar and moving it upwards or downwards until the
desired position is reached.
There are six additional buttons which can be used for
modifying the content of the current group:
- New Group - creates a new sub-group of the current one. The user is asked to give a
name for it;
- Remove - removes the selected grammars and/or groups from the list. The removal
is preceded by a confirmation message. If the selection includes sub-groups, they are also removed
with their entire contents.
- Rename - gives a new name to the selected grammar from the list.
- Copy - saves the data of the selected grammar from the list under a different name
given by the user.
- Add Grammars - gives a list of all grammars which are not present in the
current group. The user is expected to choose one or more grammars to be included in the
current group.
- Delete! - this function can be used for removing grammars from the internal
grammar database. It can be applied only to single grammars, not to whole groups. Groups are
excluded from any selections. The removal of grammars is preceded by a confirmation message.
The grammars to be removed are excluded from all the groups they may belong to.
Navigation in the group structure can be made also in the panel on the right.
When the user wants to see the content of a certain sub-group of the current group, s/he just has
to perform a double click on the desired sub-group. This will change the current group to
the new one. This represents the movement from a group to a sub-group. The movement in the
other direction is also possible. For each grammar group (excluding the Grammars), a special
sub-group is included, named: ". .". By performing a double click on it,
similarly to most file systems, the current group is changed to its parent one.
The Grammar Manager also provides a list of all the grammars, no
matter which group they are included in. The following information appears for each
grammar: date - when it has been last modified; if it is a query, which tool it refers
to; and which is its DTD. The user can sort all the grammars by clicking on the Name
column of the table header. When grammars are selected, the right mouse button
displays a pop-up menu with the following operations on the selected grammars:

- Info - This item shows the following information about the selected grammars:
- grammar name
- grammar size
- grammar's DTD name
- whether the grammar is valid
- group of the grammar
- Add In Group... - This item shows a dialog with the hierarchical structure for the groups
in the system and the user can choose a group in which to place all the selected grammars.
- Delete! - It is described above.
- Rename - It is described above.
- Copy - It is described above.
Under the table with the available grammars there are 3
buttons (New, Edit, Compile) and 3 menus (Compiled
Grammar, File I/O, XML Editor). They can be used to manage the grammars.
Buttons:
New - creates a new empty grammar
with a name specified by the user and opens the Grammar Editor;
Edit - opens the selected grammar in
the Grammar Editor;
Compile - compiles the selected
grammars. This operation is relatively slow. For large grammars (with thousands
of rules) it might take several minutes;
Apply - switches from the Grammar Manager dialog to the
Apply Grammar
dialog and allows the user to apply some of the grammars;
Exit - exits the Grammar Manager
dialog.
Menus:
The three menus allow the import and export of grammars from
and to the system. As mentioned above, the grammars in CLaRK have an XML
representation. Thus the user can load such a grammar from an external file, or
from a document within the system. Also, the user can save a grammar created
within the system in an external file or as a document in the system. In this
way the user can exchange grammars with other users, make backup copies of
them, or process them with other tools in the system, such as sorting,
searching etc. Additionally, there is a format for saving and loading
compiled grammars.
Compiled Grammar

This menu gives the user a possibility to store and load
grammars in compiled format into/from a file. It has the following items:
Load compiled grammar from file - the user is
asked to choose a file which contains the compiled grammar. Then the system
reads the file, interprets it as a CLaRK finite-state automaton and stores it in
the grammar database of the system. Such a grammar cannot be modified, but it
can be applied.
Save compiled grammar to file - the user can save
a compiled grammar into a file. The file has a special format and it cannot be
modified in any reasonable way, thus it can be used just for exchanging grammars in
compiled form.
File I/O

This menu gives the user a possibility to store and load
grammars in XML format into/from a file. It has the following items:
Load grammars from file - the user is asked to
choose a file which contains the grammars in XML format. Then the system reads
the file, interprets it as CLaRK grammars and stores them in the grammar database
of the system in a table format. Such grammars can be modified within the
system.
Save grammars to file - the user can save
editable grammars into a file.
The file is an XML document and it can be modified outside the system.
Save grammars with groups to file - the user can save
editable grammars into a file together with the group structure for the selected grammars.
The file is an XML document and it can be modified outside the system.
XML Editor

This menu allows the user to store and load
editable grammars in XML format into/from XML documents internal to the
system. It has the following items:
Load current document as grammar - the user has to
open the document which contains the grammar as a current document in the system. When this item is chosen,
the system reads the document, interprets it as a CLaRK grammar and stores it in the grammar database of the
system in a table format. Such a grammar can be modified within the system. This option
allows the user to create grammars from the documents produced by other tools in
the system and load them in the Grammar Manager.
Edit grammar in editor - the selected grammar is
converted from table format into an XML format and is loaded as a document in
the system. Thus it becomes a current document of the system. The user can
manipulate the document with the tools of the system. Useful processing can
include sorting, searching, etc.
Grammar Editor
This is the editor for grammars in the CLaRK system in table
format. The editor contains the following elements: Rules,
Option, Variables, and three buttons:
Save, Compile, Exit. Here is the main window of the Grammar Editor:

Rules
The table Rules contains the rules of the
grammar. Each row of the table represents one rule. The columns follow the
structure of the grammar rules in CLaRK. The column Regular
Expression has to contain the regular expression which will be matched with
respect to the element content. The column Return Markup is the
second obligatory element in a rule. It contains the XML fragment which will
replace the matched part of the element content. There are two columns
for the regular expressions which determine the left and the right context of
the match - Left Regular Expression and Right Regular
Expression. The last column is for comments on the rule.
The content of the table cells is not checked for
consistency before the compilation.
Context Check Order
This option gives the user a possibility to choose which
context will be checked first - the right or the left. In this way a preference
over rules can be defined. The two orders are: first the left context, then the
right one (Left->Right); the right context, then the left one
(Right->Left).
Variables
This table contains the definitions of the variables for the
grammar. Each row of the table represents the definition of one variable. Each
definition consists of the following elements: the name of the variable
(Name), which is a capital Latin letter; the positive constraints
for the variable, given as a list of token descriptions in the cell
Positive Values (if the cell is empty, then the variable is not
constrained and can have any non-empty value); the negative constraints for the
variable, given as a list of token descriptions in the cell Negative
Values (all the values that can be described by some of the token
descriptions in the cell are forbidden as values of the variable; if the cell is
empty, then the variable is not negatively constrained and can have any value
described by the positive constraints); and the match option. Like every token description in the CLaRK
system, a variable can match several strings starting from a given position.
The user can define which match is to be chosen. There are two
options for the Match cell: Longest - in this case the
variable is assigned the longest possible value, and Shortest - in
this case it receives the shortest possible value.
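The interaction of positive constraints, negative constraints and the Match option can be sketched in Python. This is an illustration only: it looks at a single variable in isolation, whereas in the system the value is determined while the whole token description is matched. The function names and the sample token are assumptions.

```python
import re

# Same wildcard translation as for token descriptions: # any, @ 0 or 1, % exactly 1.
WILDCARDS = {"#": ".*", "@": ".?", "%": "."}

def describes(description, value):
    """True if the wildcard description matches the whole value."""
    pattern = "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in description)
    return re.fullmatch(pattern, value, re.DOTALL) is not None

def variable_value(token, start, positive, negative, match="Longest"):
    """Choose the value a variable takes from token[start:], or None."""
    max_len = len(token) - start
    if match == "Longest":
        lengths = range(max_len, 0, -1)
    else:
        lengths = range(1, max_len + 1)
    for n in lengths:  # only non-empty values are considered
        value = token[start:start + n]
        allowed = not positive or any(describes(p, value) for p in positive)
        forbidden = any(describes(q, value) for q in negative)
        if allowed and not forbidden:
            return value
    return None

longest = variable_value("Amsd", 1, ["m#"], ["msd"], "Longest")
shortest = variable_value("Amsd", 1, ["m#"], ["msd"], "Shortest")
```

Here the longest candidate "msd" is excluded by the negative constraint, so the Longest option falls back to "ms".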
The management of the variable table is done by a pop-up menu which appears when the user
clicks with the right button of the mouse on the cells in the table. The possible choices are: Insert
row which allows the user to define a new variable; Delete row
which allows the user to delete the definition of a variable; Edit Cell which allows the
user to modify the list of token
descriptions for the positive or the negative constraints for the variable; Up and
Down allow the user to rearrange the list of the variable descriptions for her/his own
convenience.
When the user edits the constraints for a variable (Edit Cell) the following dialog appears for the positive constraints:

or for the negative constraints:

In both cases the constraints are represented as a list of
token descriptions (one per line). The user has the possibility to enter a new
description by the button Add Value - in this case a text edit field
appears and the new token description has to be entered in it. The user can
delete some token descriptions by selecting them in the list and clicking on the
button Remove Value(s). The buttons OK and
Cancel are used for the acceptance or rejection of the changes.
Buttons
The buttons at the bottom of the dialog give the user the
possibility to save the grammar - Save; to compile the grammar -
Compile (in case of errors in the grammar a corresponding error
message appears); to exit the dialog - Exit (the system prompts for
unsaved changes).
As described in the Grammars menu, the definition and
application of a grammar are separated within the CLaRK system. The Grammars menu describes the
grammars and how to construct their definitions in the CLaRK system. This section describes
how to apply a grammar over one or several XML documents.
Here is the main dialog of the Apply Grammar tool:
The application of a grammar requires the following types of
information: the name of the grammar (text field Grammar), the target of
application (text field Apply to), how the input is to be prepared (the
second row of options: Tokenizer, Filter,
Normalize, and Element Values), how the rules of the
grammar are to be applied (the third row of options: the match options for the left
context regular expression (Left), for the main regular expression
(Body), for the right context regular expression
(Right)), and whether the context can be backtracked (Context
Backtracking). A combination of all these options is called a
grammar query.
Additionally, the dialog allows the user to consult the
definitions which are connected with some DTD in the system. Generally, if in a
particular grammar query some of the necessary information is not present,
then the system checks the corresponding information connected with the DTD
of the document. These definitions can be consulted through the menu Features.
The user can save the current settings of a query as an XML
document by selecting the Queries check box. Then the user has the
possibility to save the query with some comments in the Info text
field. Also, the user can select a previous query from the list of queries. The
grammar query XML documents are saved in the group
SYSTEM:Queries:Grammar.
Also, the user can specify whether the grammar is to be applied
over the currently opened document or on some documents stored internally in the
system. This can be done by choosing the Multiple Apply option. In this
case the user can select several documents to apply the grammar to.
At the bottom of the dialog there are three buttons:
Apply for application of the currently stated query (also one
loaded from the XML representation); Close for closing the dialog;
and Select for navigation over the current document and manual
annotation (see below).
Element Values
When a grammar is applied over a content which contains XML
elements, the system converts each such element into an element value. How
exactly this conversion is performed is stated as Element Value definitions.
These definitions can be connected with a particular DTD, but the grammar query
allows the user to override the DTD settings and to define them locally in the
query. Each element value is connected with an element tag and a sequence of
XPath expressions (called keys) which define the sequence of tokens or elements
for the element value. See below for examples. The dialog of the Element Values
editor is as follows:
Each row of the table represents an element value for an
element. The first column represents the names of the elements for which element
value is defined. In this case they are w and pt. The
Keys column contains the keys for each definition. In this case the
value of the element w is the text in its ph element and
the text in its ta element. For the pt elements the
definition says that the element value is their text content.
The user can edit the value of the right column by selecting
the Edit Cell item from the pop-up menu which appears when the user
presses the right mouse button over the cell (the menu is visible in the screen
shot). There are two modes for the element value: Tool and
User. In the following screen shots the w
element is shown in both modes:

Each row of the table represents one key. All keys in the
table define the value of one element. One key consists of a key name (option),
a key value (XPath expression), a normalize option and a tokenizer name. If the
element value is in Tool mode, then the normalize option and
the tokenizer are taken from the grammar. The key in the table is called Grammar
Key. It can be saved and loaded into the system memory. This is done by selecting
Load Key and Save Key menu items in the context menu
shown when the user right-clicks over a cell in the table. The user can also add or
remove keys with items from this menu. The normalize option and the tokenizer
name determine the input word, which is created for the elements when the grammar is
applied. An interesting option here is the "No Tokenizer" tokenizer. If it is
selected, then the text nodes are treated as one token. When the OK
button is clicked all XPath expressions in the table are compiled.
The element values are calculated in the following way:
If the XPath expression selects textual content or the
value of an attribute, the corresponding text is tokenized by the relevant
tokenizer. Then the value is a sequence of tokens.
If the XPath expression selects one or more elements then
each element is represented as <tagname>, where
tagname is the tag for the element.
If there is no element value definition for the element
then it is represented as <tagname>, where
tagname is the tag for the element. The difference from the
previous case is that if the element value is defined by self::*,
then the element value will be <<tagname>>.
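The calculation of element values described above can be sketched in Python. This is a simplified illustration, not the system's procedure: the standard xml.etree.ElementTree module supports only a subset of XPath, so the sketch interprets the text() step and self::* itself, uses whitespace splitting as a hypothetical tokenizer, and does not handle attributes. The sample w element with ph and ta children is an assumption modelled on the example in the text.

```python
import xml.etree.ElementTree as ET

def element_value(element, keys, tokenize=str.split):
    """Return the element value as a list of tokens and element markers."""
    if keys is None:
        # No element value definition: the element is represented by its tag.
        return [f"<{element.tag}>"]
    value = []
    for key in keys:
        if key == "self::*":
            value.append(f"<<{element.tag}>>")
        elif key == "text()":
            value.extend(tokenize(element.text or ""))
        elif key.endswith("/text()"):
            # Textual content of the selected elements is tokenized.
            for child in element.findall(key[:-len("/text()")]):
                value.extend(tokenize(child.text or ""))
        else:
            # Selected elements are represented as <tagname>.
            value.extend(f"<{child.tag}>" for child in element.findall(key))
    return value

w = ET.fromstring("<w><ph>the dog</ph><ta>Nc</ta></w>")
pt = ET.fromstring("<pt>,</pt>")
w_value = element_value(w, ["ph/text()", "ta/text()"])
pt_value = element_value(pt, ["text()"])
```

Replacing str.split with a function that returns its argument as a single-item list would correspond to the "No Tokenizer" option.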
Select
This button allows the user to apply a grammar in an
interactive mode. In this mode the grammar is executed on the current document
and for each match it stops and allows the user to see the selected content and
to perform some actions like: to go to the next selection (Next
button), to go to the previous selection (Previous button), to add the
return mark-up to the content (Mark button), and to exit the mode
(Exit).
This tool is a means for applying a set of Grammar Queries in a row. The application
itself is done in a cascaded grammar style, i.e. the output from each grammar is an input for
the next one. The result from the last grammar is the result of the whole tool. The advantage
of this tool is better efficiency, which results from the fact that the input for the grammars
is prepared only once. Otherwise, the input would have to be prepared (preprocessed) each time a single
grammar query is applied. This can be crucial for huge amounts of data.
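The cascaded style of application can be sketched as a simple pipeline in Python. The two "grammar queries" below are hypothetical toy functions over a list of items, and prepare_input stands in for the real preprocessing; the point of the sketch is only that the input is prepared once and each step consumes the output of the previous one.

```python
def prepare_input(document_text):
    """Hypothetical preprocessing step, performed once for the whole cascade."""
    return document_text.split()

def mark_determiners(items):
    """Toy grammar query: wrap the word 'the' in det markup."""
    return [f"<det>{t}</det>" if t == "the" else t for t in items]

def group_noun_phrases(items):
    """Toy grammar query: join a det item with the following item into an np."""
    out, i = [], 0
    while i < len(items):
        if items[i].startswith("<det>") and i + 1 < len(items):
            out.append(f"<np>{items[i]} {items[i + 1]}</np>")
            i += 2
        else:
            out.append(items[i])
            i += 1
    return out

def apply_cascade(document_text, grammar_queries):
    """Apply the queries in order; each one receives the previous output."""
    items = prepare_input(document_text)
    for apply_query in grammar_queries:
        items = apply_query(items)
    return items

result = apply_cascade("the dog barks", [mark_determiners, group_noun_phrases])
swapped = apply_cascade("the dog barks", [group_noun_phrases, mark_determiners])
```

The second call shows why the order of the queries in the list matters: the grouping step finds nothing to group when it runs first.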
Here is what the Grammar Groups dialog window looks like:

The tool dialog basically represents a list of Grammar queries. The user can Add Grammar
Query to the end of the list and/or Remove Grammar Query from the list by using
the buttons on the right side of the panel. The order of the different grammar queries can be changed
by selecting a query and dragging it to the desired position.
This tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool Application Modes.
The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.
Regular Expression Constraints are a means of setting restrictions on the content
of certain nodes in a document. The restrictions are set as regular expression patterns.
The syntax of the patterns is the same as the one in the Grammar
tool. The nodes whose content the constraint will be applied to are selected by an XPath
expression. The selected nodes must only be of type Element (as no other types of
nodes can have content). A node satisfies a certain constraint if its content matches the
pattern given in the constraint. The application of a constraint gives the user the
possibility to navigate either through the nodes which satisfy the constraint or through
the nodes which do not satisfy it.
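The idea of a constraint can be sketched in Python. The sketch rests on several assumptions: the pattern is written in Python's re syntax rather than in the Grammar tool syntax, node selection uses the limited XPath subset of xml.etree.ElementTree, the tokenizer is whitespace splitting, and the filter is an ordinary predicate over tokens. It returns the two groups of nodes through which the user could navigate.

```python
import re
import xml.etree.ElementTree as ET

def check_constraint(root, xpath, pattern, tokenize=str.split, keep=lambda tok: True):
    """Split the selected nodes into those that satisfy the pattern and those that do not."""
    regex = re.compile(pattern)
    satisfied, violated = [], []
    for node in root.findall(xpath):
        # Tokenize the textual content and discard filtered tokens.
        tokens = [t for t in tokenize("".join(node.itertext())) if keep(t)]
        target = satisfied if regex.fullmatch(" ".join(tokens)) else violated
        target.append(node)
    return satisfied, violated

doc = ET.fromstring(
    "<doc><date>12 May 2004</date><date>May 2004 ,</date><date>soon</date></doc>"
)
ok, bad = check_constraint(
    doc,
    ".//date",
    r"(\d{1,2} )?[A-Z][a-z]+ \d{4}",   # optional day, month name, year
    keep=lambda tok: tok != ",",       # hypothetical filter discarding commas
)
```

The second date element satisfies the constraint only because the filter discards the trailing comma before the pattern is tested.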
The constraint application is similar to a DTD validation of certain nodes of a
document, but here we offer a more powerful instrument. On the one hand, the nodes
which the constraints are applied to are selected not only by name (as it is in the DTD), but
depending on the context in which they appear (XPath determined context). The context can
be an absolute or relative document position, based on properties of the selected nodes or
nodes relative to them, properties located in other documents, etc. The node selection
qualification uses the full expressive power of the XPath engine, implemented in the
system. On the other hand, the regular expressions of the Grammar allow writing
patterns not only at the level of text nodes, but at the level of tokens within the text
nodes as well. Moreover, the user can specify a tokenizer which
will be used for segmenting the text and a filter to discard
the meaningless tokens during application. In the patterns the user can write wildcard
symbols in token descriptions in the same way they are written in the Grammar tool.
With the help of these constraints the user can cover some of the features of XML
Schema usage, by defining patterns for text nodes.
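The following Python sketch illustrates the idea of a token-level constraint applied to nodes selected by a path expression. It is not the CLaRK implementation: the toy tokenizer, the one-letter category codes and the function names are assumptions made for the example, and the pattern is a hand translation of a Grammar-style pattern such as '$NUMBER+,$SPACE*' (one or more NUMBER tokens followed by zero or more SPACE tokens) into a Python regular expression over token categories.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical toy tokenizer: digit runs are NUMBER tokens, whitespace runs are
# SPACE tokens, everything else is OTHER. Real tokenizers live inside the system.
TOKEN_RE = re.compile(r"(?P<NUMBER>\d+)|(?P<SPACE>\s+)|(?P<OTHER>[^\d\s]+)")
CODES = {"NUMBER": "N", "SPACE": "S", "OTHER": "O"}


def categories(text: str) -> str:
    """Encode the token categories of `text` as one letter per token."""
    return "".join(CODES[m.lastgroup] for m in TOKEN_RE.finditer(text))


# '$NUMBER+,$SPACE*' rewritten over the one-letter category codes.
PATTERN = re.compile(r"N+S*")


def partition(root: ET.Element, path: str):
    """Split the elements selected by `path` into (valid, non_valid) lists."""
    valid, non_valid = [], []
    for node in root.findall(path):
        text = "".join(node.itertext())
        if PATTERN.fullmatch(categories(text)):
            valid.append(node)
        else:
            non_valid.append(node)
    return valid, non_valid


doc = ET.fromstring("<r><n>1234</n><n>256 </n><n>12 a</n><n> 7</n></r>")
valid, non_valid = partition(doc, ".//n")
```

Here the elements with content "1234" and "256 " end up in the valid group, while "12 a" and " 7" end up in the non-valid group. Note that ElementTree supports only a subset of XPath, so the selection step is much weaker than the XPath engine described above.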
Regular Expression Constraints Structure
Each Regular Expression Constraint consists of 2 main parts and 3
additional (optional) parts:
- Constraint name (obligatory) - a unique identifier for the constraint in the
system;
- Regular expression (obligatory) - a valid regular expression which represents the
constraint over the nodes' content;
- Default XPath expression - it is an XPath expression defining the selection of
the nodes to be processed by the constraint. This expression appears as a default text in
the appropriate specification area;
- Tokenizer - it is used when the constraint tests text nodes' content. If the
tokenizer remains unspecified, then the processor takes the tokenizer, which is specified
in the DTD of the current document;
- Filter - it is used to filter the tokenizer categories when the text nodes'
content is tested. In other words, all filtered tokens are discarded from the
selection before passing it to the constraint engine.
This section is responsible for the regular expression constraint management.
Here RECs can be created, modified, removed, saved to a file and loaded from a file.
Here is a picture of the dialog window:

The left side of the window is a table with all RECs in the system. The first column
contains the names of the constraints. The second one contains the regular expressions
for each of the constraints. Having selected a row in this table, the user can apply a
manipulation over a constraint by using the buttons on the right.
Description of the buttons on the right:
- New - creates a new Regular Expression Constraint. Having pressed the
"New" button, a new constraint editor window appears on the screen (for more
details, see below);
- Edit - the currently selected constraint is opened for editing in a new editor
window;
- Remove - removes the currently selected constraint in the table. The removal is
preceded by a confirmation message;
- OK - updates the current changes in the constraints and closes the manager
window;
- Cancel - closes the manager window without saving the changes (if any) in the
constraints;
- Save To File - serializes all the RECs into an external file in an XML format.
This function can be used for two main purposes: back-ups and interaction with external
applications. The description of the output XML file (the DTD) can be found in the file: regConstraint.dtd;
- Load From File - loads the REC(s) from an external file. The external file must be
an XML document, valid with respect to the DTD in the file: regConstraint.dtd.
Regular Expression Constraint Editor
Here is the interface view of the editor window for the REC:

The last three fields are optional. The tokenizer and filter lists contain all
tokenizers and filters defined in the system. The regular expression may consist of: tags,
token categories, token values and token value templates (wildcard descriptions).
The actual application of the Regular Expression Constraints can be performed in
two ways:
- by selecting a node from the tree panel and choosing a constraint;
- by selecting a set of nodes with the help of an XPath expression and then applying a
certain constraint on each of them.
Here we describe the latter case. The user chooses 'Apply Regular Expression
Constraints' from the menu Tools/Constraints/Regular Expression Constraints/Apply
Regular Expression Constraints. Then the following dialog window appears:

The first input field Select nodes contains the XPath which is evaluated in
order to select nodes for the constraints operation. If the default XPath expression is
specified for the constraint, then it appears in this field as a default text.
The second field selects by name a constraint to be applied.
The last two fields are activated when the tokenizer and filter of the current constraint are to be ignored
and new ones have to be defined explicitly.
Having pressed the Apply button, the XPath is evaluated and a set of nodes is
selected. Then for each of them the constraint is applied. If the node's content satisfies
the constraint, then the node is marked as Valid. Otherwise it is marked as Non
Valid. In this way two groups of nodes are formed and each of them can be observed
separately. Here is a picture of the navigation panel window:

In this window the user can change the group under observation by using the two radio buttons.
By pressing the Next and Previous buttons, the user changes the current selection in
the editor. At the top of the window there is some information about the constraint and
the nodes which satisfy or do not satisfy it. For the example above, the pattern
'$NUMBER+,$SPACE*' concerns the content of text nodes. The items which satisfy the
constraint with this pattern are all element nodes whose text content is a sequence of one
or more tokens of category NUMBER, followed by zero or more tokens of category SPACE.
Thus, strings which match the pattern are: "1234", "256 ",
"666 ", etc.
The constraint engine is a means for setting restrictions on the content or other
related information of nodes in XML documents, which cannot be expressed by the DTD.
The nature of the restrictions is based on the existence of certain values (tokens and/or tags) at
certain places. The constraints of this type specify the pieces of information which are restricted
and define the set of admissible values for each of them (usually by pointing to a location they
are stored in, or by encoding the values themselves explicitly).
Value Constraint Structure
In general a value constraint consists of two parts: a target section and a source
section.
Target Section
In this part one can find a description of the nodes which the constraint will be applied to.
The target nodes for a constraint are selected by an XPath expression evaluated on the
document which the given constraint is to be applied to. The result from the evaluation is expected
(required) to be a node set with nodes compatible with the specific constraint application.
If the result set contains nodes of types other than the required ones, they are automatically
excluded (example: the selection contains text and attribute nodes, but the constraint checks
the child nodes of its targets).
This way of target selection uses the full expressive power of the XPath language so that
context dependencies can be expressed.
Source Section
Here the possible values for the target nodes
(selected by the previous section) are defined. The possible values are tag names and
tokens depending on the type of the constraint. The source list can be selected by an
XPath expression or by typing the choices explicitly as an XML markup. If the selection is
made by a relative XPath expression, then the current target node is taken as a context
node for the constraint. If a text node is selected as a source, then its text value is
tokenized and the tokens are added to the source list, excluding the node itself. It is
possible that the source for the constraint is an external document. The only requirements
in such cases are the following: the external document has to be in the internal database
of the system and the XPath expression cannot be relative.
There are four types of value constraints, currently supported by the system. They are
distinguished by their target and the way of their usage. Here is a description of each
value constraint separately:
- Parent Constraint
This type of value constraint sets limits on
the possible parents of a node. There are two ways of applying this constraint type: by
changing the parent of a node (local) or explicitly running the constraint engine (global).
The first possibility is changing the parent of a node (or a set of nodes at one level).
The list of all the relevant parent nodes can be restricted further by applying other
constraints. The final list contains the intersection between the source of the
constraints and its former content. If the operation of changing the parent of a set of
nodes is performed, then all compatible (parent) constraints are applied.
The second possibility is running the Constraint Engine. It works in the following way.
First, the targets are selected (by their tag names and an XPath restriction). Then the
source is compiled. If there is more than one choice, the user is asked to select one
option from a list. If the choice happens to be exactly one element, it can be automatically
inserted as a parent of the target. The action of a constraint depends on the Application Mode set for the constraint.
The source list of each constraint must contain only tag names. All tokens in the list
are ignored.
- All Children Constraint
This type of value constraint sets limits on
the names of a node's children and the content of its text children. All children that
are tags must have names coinciding with the name of some node from the source list. Then
all the data in text children is tokenized and a list A of tokens is formed. After that
all the data in text nodes in the source list is tokenized and a list B of tokens is
formed. For every token in A there must exist a token in B such that the values (not
categories) of the two tokens are equal. This type of value constraint can be applied (checked)
from the menu item Apply All Children or from the toolbar button. The list of all
nodes that are invalid according to the constraints is given in the Error message area together with
the rest of the errors (if any). The user can navigate through all the nodes that are
invalid with respect to the constraints.
- Some Children Constraint
This is a special type of value
constraint, because its main task is not only to set limits on the node's content.
It can also be used for a value restriction when the operation of inserting a child in
a node is performed. This constraint type is not applied each time a new node is
inserted. These constraints are used separately. Here the target node is the node where
the insertion takes place. The constraint is blocked when:
- there is a child of the target node that is a tag and there is a node in the source list,
such that both nodes have identical names.
- there is a text node in the target node that has a token, whose value equals the value
of a token in the source list.
To sum up, when there is a non-empty
intersection between the source list and the target node's content, the constraint is
satisfied and there is nothing more to be done. In cases when the source list is empty and
the target content is also empty, then the constraint is satisfied.
When the source list is not empty and there is
no intersection with the target's content, the user is offered a list with the possible
values from the source list for the target node. The user can choose one item to insert.
The action of a constraint depends on the Application Mode
set for the constraint.
- Some Attributes Constraint
This constraint is very similar to the
previous one. The only difference is that the target here is an attribute of an Element
node. Also the target selection includes a selection of an attribute defined in the DTD
for the selected tag name. The action of a constraint depends on the Application Mode set for the constraint.
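The checks performed by the All Children and Some Children constraints can be sketched in Python as follows. This is a minimal illustration, not the system's implementation: the whitespace tokenizer, the function names and the representation of the source list as two plain Python lists (tag names and token values) are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def tokenize(text):
    """Toy tokenizer (whitespace split); the real one is configured in the system."""
    return (text or "").split()


def content_of(node):
    """Return (child tag names, tokens of the text children) of an element."""
    tags = [child.tag for child in node]
    toks = tokenize(node.text)
    for child in node:
        toks += tokenize(child.tail)   # ElementTree keeps later text in .tail
    return tags, toks


def all_children_ok(target, source_tags, source_tokens):
    """All Children: every child tag and every token value must occur in the source."""
    tags, toks = content_of(target)
    return (all(t in source_tags for t in tags)
            and all(t in source_tokens for t in toks))


def some_children_ok(target, source_tags, source_tokens):
    """Some Children: return (satisfied, values offered to the user)."""
    tags, toks = content_of(target)
    if set(tags) & set(source_tags) or set(toks) & set(source_tokens):
        return True, []                # non-empty intersection
    if not (tags or toks or source_tags or source_tokens):
        return True, []                # both source and target are empty
    # Assumption: an empty source with a non-empty target offers nothing to insert.
    return False, list(source_tags) + list(source_tokens)


SOURCE_TAGS = ["b"]
SOURCE_TOKENS = ["red", "green", "blue"]
good = ET.fromstring("<p>red <b/> blue</p>")
bad = ET.fromstring("<p>pink <i/></p>")
```

For the element held in good both checks succeed. For the element held in bad the All Children check fails, and the Some Children check returns the whole source list as the set of values that could be offered for insertion.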
The Value Constraints have two modes of application, concerning the treatment
of the target nodes:
- Validation Mode - the constraint points to the target nodes which do
not satisfy it, showing all the possibilities for the specific places. On demand the user
can insert a value from the list of possibilities.
- Insertion Mode - The constraint points to the target nodes which do not
satisfy it and expects the user to select one of the possible values to be inserted. If
the list of possibilities for a certain place contains only one entry, it is automatically
inserted. Then, if the constraint is of type Some Children, the user can
specify how the new value is inserted. If the new value is a token, the user can
specify the position in the content where it must be inserted. The first position is
denoted by 1 (not 0). If a position is not specified, the new value is inserted as the last
element. The elements in the content which are counted are either unfiltered tokens or Element
nodes. If the new value for insertion is an Element node, the counting of the
content entries is done in terms of DOM structures (Text nodes, Element
nodes).
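The position counting used in Insertion Mode can be illustrated with a small Python sketch. The function name insert_value and the is_counted predicate are hypothetical. The behaviour for a position larger than the number of counted items (appending at the end) is an assumption, since it is not specified here.

```python
def insert_value(content, new_value, position=None, is_counted=lambda item: True):
    """Insert new_value so that it becomes the position-th counted item (1-based).

    Items rejected by is_counted (for example filtered tokens) stay in place
    but are skipped while counting. A missing position means 'last'.
    """
    if position is None:
        return content + [new_value]
    if position < 1:
        raise ValueError("position must be a positive integer (1 = first)")
    seen = 0
    for index, item in enumerate(content):
        if is_counted(item):
            seen += 1
            if seen == position:
                return content[:index] + [new_value] + content[index:]
    # Assumption: a position past the end behaves like 'last position'.
    return content + [new_value]


def not_space(token):
    """Toy filter: whitespace tokens are not counted."""
    return not token.isspace()


tokens = ["a", " ", "b"]
first = insert_value(tokens, "X", 1, not_space)
second = insert_value(tokens, "X", 2, not_space)
last = insert_value(tokens, "X")
```

With the whitespace token filtered out, position 2 refers to the token "b", so the new value is placed directly before it.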
The screenshot in Fig. 1 shows the dialog window of the value constraints editor.

Fig. 1
The editor is separated into 5 sections which are responsible for different parts of the
constraint definition. At the top of the window there is a Summary
information panel which shows the current constraint settings (Type, Mode,
Target, etc.). The sections are:
- General (Fig. 1) - here the user supplies a unique Constraint name
for the constraint (free text) which is obligatory. The constraint is identified by this
name later in applications. Optionally, some additional Constraint descriptions
can be written in the second text box. In this section one of the most important aspects
of the constraint is defined - the Type of constraint. This
determines the whole behaviour of the constraint. The options are: Parent, All
Children, Some Children and Some Attributes (described above).
- Options (Fig. 2) - this section offers several options related to the
application of the constraints. The options here do not concern constraints of type All
Children. The first part of the section defines the Application Mode
in which the constraint will be applied. For Insertion Mode the user can set an
insertion position and a token Separator (when
needed) (Note: Position and Separator options concern only Some
Children constraints). The position must be a positive integer, where 1 denotes the
first position. Leaving this field empty means 'last position'. The separator can be an
arbitrary string. For details see Application Mode.
The remaining options in this part are as follows:
- Show status before - indicates the number of the target nodes the constraint will
be applied to, i.e. nodes count before the real application;
- Show status after - indicates the number of the target nodes the constraint has
already been applied to, i.e. after the real application;
- Disable Intersection Checking !!! - disables/enables the checking
whether the constrained target nodes data has common parts with the source data of the
constraint. Disabling this checking allows the user to insert more than one possible value
in Insertion Mode during the application;
- Restrict to a single choice run - this option is relevant when a constraint is applied
on the current document in an insertion mode. When selected, it restricts the execution to the cases
when a single value is determined by the source evaluation, i.e. the constraint works only for the cases
where no user decision is needed in application. If more than one entry is selected as a source, the
corresponding target is skipped. The tool behaves as if it works in Multiple Apply mode, but
on the current document.
- Prompt for save on each: ... applications of constraint - this item is
used for making backups of the current state of the document while applying the
constraints. In order to use this option, the check box must be marked and in the text
field a number must be entered. It indicates the number of the successful applications,
after which the system prompts the user to save the document.

Fig. 2
- Target (Fig. 3) - here the definition of the target nodes for the constraint is
given. In field Target XPath the user is expected to supply an XPath expression
which will determine the target nodes for the constraint. The XPath expression must return
a node-set in which the nodes must be of a proper type (depending on the constraint type).
In this XPath expression the user may (if needed) define context restrictions for the targets.
If the current constraint is of type Some Attributes the user must supply
a valid Target Attribute name. This field is disabled for all other types of
constraints.

Fig. 3
- Source (Fig. 4) - this section defines the source list for the constraint. The
text field content is either an XPath expression, or an XML markup. It depends on the
radio button, which has been currently selected for the Source Type. If
the source type is XML Mark-up, then the content of the text field is XML.
Otherwise it must be an XPath expression. If the selected type is Local Document,
then the XPath expression is evaluated for each target node as a context. If the type is External
Document, then the choice box gets enabled and the user is expected to choose a
document. The XPath expression is evaluated on this document and the root node is the
context. In the latter case it is expected for the XPath expression to be absolute.

Fig. 4
- Advanced (Fig. 5) - here a tokenizer can be activated (Set a Tokenizer)
for the constraint or it can be blocked in order not to treat the text nodes as a set of
tokens but as a whole. Also a filter (Use Filter) can be set in order to
exclude some "garbage" categories, such as separators, from the source list.
Another restriction can be set here by defining token value and category Templates.
The templates are defined in the same way as those in the grammar tools (using @ and #
symbols for wildcards). Another facility, which can be relied upon here, is the Help
Document. This option ensures the following possibility: while listing the different
choices, the user can get brief information about the meaning of each choice. This
information must be stored in an internal document. Its structure is described in a DTD in
the file: resources/dtds/helpFile.dtd. The information about a given choice appears in the
status bar of the editor when the mouse pointer is over the choice.

Fig. 5
In the preceding section a description of the Constraint Editor was presented. It is
invoked whenever a Value Constraint needs to be changed or a new constraint is
defined. The Value Constraint management is handled by the following
manager dialog window:

Within the CLaRK System this module can be invoked from the menu: Tools/Constraints/Edit
Value Constraints.
The Value Constraint Manager is an
Entry Manager
with additional buttons. It contains all of the available value constraints arranged in a tree hierarchy,
some of their features (Description), and buttons for managing the constraints.
There is an additional context XPath text field located below the table,
which determines the context for each constraint group and is used when applying Constraint groups.
First, a context node is selected and then all the constraints from the group are applied within this
context. This XPath value can be changed by pressing the Edit button and entering the new value.
The manager window consists of two main parts:
- The panel on the left. It contains the tree representations of the group hierarchy. When the user
selects a node in this tree, the content of the corresponding group is loaded in the component on the
right side.
- Current group monitor. This is the panel situated on the right side of the window. It is
a list with the content of the currently selected group in the tree. The list components,
which are in blue color, are sub-groups. The other ones are the constraints included in
this group. They are colored in black or red, depending on whether they are valid or not,
according to their DTDs. The user can sort all the constraints in a group by clicking on the
Name
column of the table header. It is possible to rearrange the constraints in a group by using the
drag-and-drop technique, i.e. pressing a constraint and moving it upwards or downwards until the
desired position is reached.
There are six additional buttons which can be used for
modifying the content of the current group:
- New Group - creates a new sub-group of the current one. The user is asked to give a
name for it;
- Remove - removes the selected constraints and/or groups from the list. The removal
is preceded by a confirmation message. If the selection includes sub-groups, they are also removed
with their entire contents.
- Rename - gives a new name to the selected constraint from the list.
- Copy - saves the data of the selected constraint from the list under a different name
given by the user.
- Add Constraints - gives a list of all constraints which are not present in the
current group. The user is expected to choose one or more constraints to be included in the
current group.
- Delete! This function can be used for removing constraints from the internal
constraint database. It can be applied only to single constraints, not to whole groups. Groups are
excluded from any selections. The removal of constraints is preceded by a confirmation message.
The constraints to be removed are excluded from all the groups they may belong to.
Navigation in the group structure can also be performed in the panel on the right.
When the user wants to see the content of a certain sub-group of the current group, s/he just has
to perform a double click on the desired sub-group. This will change the current group to
the new one. This represents the movement from a group to a sub-group. The movement in the
other direction is also possible. For each constraint group (excluding the Value Constraints), a special
sub-group is included, named: ". .". By performing a double click on it,
similarly to most file systems, the current group is changed to its parent one.
The Value Constraints Manager also provides a list of all the constraints, no
matter which group they are included in. The following information appears for each
constraint: date - when it has been last modified; if it is a query, which tool it refers
to; and which is its DTD. The user can sort all the constraints by clicking on the Name
column of the table header. When selecting constraints, the right button of the mouse is
used to visualize the Pop-up menu with the following operations on selected constraints:

- Info - This item shows the following information about the selected constraints:
- constraint name
- constraint size
- constraint's dtd name
- whether the constraint is valid
- group of the constraint
- Add In Group... - This item shows a dialog with the hierarchical structure for the groups
in the system and the user can choose a group in which to place all the selected constraints.
- Delete! - It is described above.
- Rename - It is described above.
- Copy - It is described above.
Under the table with the available constraints there are 5
buttons (New, Edit,
Apply Constraints, Cancel,
Done) and 1 menu (Load / Save Constraints). They can be used to manage the constraints.
Buttons:
- New - creates a new Value constraint by calling the Constraint Editor;
- Edit - edits the selected Value constraint by calling the Constraint Editor;
- Apply Constraints - first saves the changes on the constraints (if any).
Then it switches from the Value Constraints Manager dialog to the
Apply Constraints
dialog and allows the user to apply some of the value constraints and constraint groups;
- Cancel - closes the dialog window without saving the changes on the constraints
(if any);
- Done - closes the dialog window by saving the changes on the constraints (if
any).
Menu Load / Save Constraints:

This menu gives the user a possibility to store and load
value constraints in XML format into/from a file. It has the following items:
Load constraints from file - the user is asked to
choose a file which contains the value constraints in XML format. Then the system reads
the file, interprets it as CLaRK value constraints and stores them in the value constraint database
of the system in a table format. Such constraints can be modified within the system.
Save constraints to file - the user can save value constraints into a file.
The file is an XML document and it can be modified outside the system.
Save constraints with groups to file - the user can save
value constraints into a file and the group structure for the selected constraints.
The file is an XML document and it can be modified outside the system.
This is a tool specialized in applying Value Constraints on the current
document or on a set of documents. The user is expected to create a list of single Value
Constraints or whole Value Constraint Groups. The constraints and groups are
applied in the order they appear in the list. They can be reordered by using the
drag-and-drop technique, i.e. pressing a list entry and moving it upwards or downwards until
the desired position is reached. Each entry in the list contains: either a constraint name
followed by constraint description in brackets (for a constraint) or a group name followed
by the '(group)' suffix (for a constraint group).
The user can modify the content of the Constraints List by pressing the
following buttons:
- Add Constraint - appends one or more constraints to the end of the
list. The user is shown a list of all Value Constraints in the system, including
the ones which are already in the list. In this way one constraint can be included more
than one time (if it is needed for certain processing);
- Add Group - appends one or more constraint groups to the end of the
list. The user has to choose from a tree of all constraint groups in the system using Value Constraints Groups Hierarchy dialog;
- Remove - excludes the selected entries from the list (constraints or
groups of constraints). The exclusion is NOT preceded by a warning message.

This tool supports two modes of application: to the current document and Multiple
Apply mode. For details see Tool
Application Modes. If the tool is run in Multiple Apply mode there is
one significant difference in the application: if a constraint uses Insertion Mode,
a real insertion is performed only if there is exactly one possible source value. If there are
more choices, some human intervention is needed, but Multiple Apply does
not allow it.
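The difference between the two modes can be summarised by the following Python sketch. The function choose_value and its parameters are hypothetical names introduced only for this illustration. The ask_user callback stands for the dialog in which the user picks one of the offered values.

```python
def choose_value(candidates, interactive, ask_user=None):
    """Return the value to insert, or None when the target must be skipped.

    candidates  - possible source values for one target node
    interactive - True for the current document, False for Multiple Apply
    ask_user    - hypothetical callback that lets the user pick one candidate
    """
    if len(candidates) == 1:
        return candidates[0]            # a single choice is inserted automatically
    if interactive and candidates and ask_user is not None:
        return ask_user(candidates)     # several choices: the user decides
    return None                         # no choice, or no user to ask: skip


def pick_second(options):
    """Stand-in for a user who always selects the second entry."""
    return options[1]


single = choose_value(["np"], interactive=False)
skipped = choose_value(["np", "vp"], interactive=False)
asked = choose_value(["np", "vp"], interactive=True, ask_user=pick_second)
```

In the non-interactive case the target with two candidate values is skipped, which corresponds to the behaviour of Multiple Apply described above.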
The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.
This constraint type in general restricts numeric values related to nodes and their properties
within XML documents. Such values can be: a number of occurrences of some specific elements
within the content, values returned by XPath functions or operators. The target
constrained values are produced from the evaluation of an XPath
expression. This XPath is evaluated relative to the result of evaluating the Context
XPath expression, which determines the nodes which the constraint will be applied to. Independently, for each
initially selected context node, one XPath result is produced. Depending on the result, a
numeric value is formed as follows:
- node-set - the number of the nodes;
- string - if the string represents a valid number, the new value is this
value. Otherwise the Not-A-Number identifier is produced;
- number - the number itself;
- boolean - if it is a true value - 1, otherwise 0.
Note: if the newly produced value is the Not-A-Number value then the
corresponding context node does not satisfy the constraint.
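The conversion rules listed above can be written down as a short Python function. This is a sketch under assumptions: a node-set is modelled as a Python list, the name to_number is hypothetical, and Python's float() is used for parsing strings, which accepts some forms (for example "1e3" or "inf") that an XPath engine would not treat as valid numbers.

```python
import math


def to_number(result):
    """Turn an XPath-like result into a number following the rules above."""
    if isinstance(result, bool):          # checked first: bool is a subclass of int
        return 1.0 if result else 0.0
    if isinstance(result, (int, float)):
        return float(result)              # number: the number itself
    if isinstance(result, str):
        try:
            return float(result.strip())  # string holding a valid number
        except ValueError:
            return math.nan               # otherwise Not-A-Number
    return float(len(result))             # node-set: the number of nodes


examples = {
    "node-set": to_number(["n1", "n2", "n3"]),
    "numeric string": to_number("42"),
    "other string": to_number("abc"),
    "number": to_number(2.5),
    "true": to_number(True),
    "false": to_number(False),
}
```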
A context node satisfies a constraint if the result numeric value of the XPath is
within the range of Minimum Size and Maximum Size
values. The latter two values can be either numbers or XPath expressions which are
expected to return numeric values. If an XPath returns a non-numeric value, the system
tries to transform it automatically to a number. In case the Minimum Size value
for a constraint is not defined or the defined value produces Not-A-Number value,
the system assumes that at this place there is no limitation and all target values which
are under the corresponding maximum satisfy the constraint. By analogy, if a Maximum
Size value is omitted or it is Not-A-Number, then no upper limitation is
assumed. If an XPath expression is used for setting minimal or maximal limit, its context
for evaluation is each initially selected context node. In this way the boundaries of one
constraint can vary for different contexts.
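The range test can be sketched in Python as follows. The function name satisfies is hypothetical, and the bounds are assumed to be inclusive, which is not stated explicitly here. A missing bound is represented by None, and a bound that evaluates to Not-A-Number is treated in the same way.

```python
import math


def satisfies(value, minimum=None, maximum=None):
    """Check one context node's numeric value against optional bounds."""
    if math.isnan(value):
        return False                       # a Not-A-Number value never satisfies
    if minimum is not None and not math.isnan(minimum) and value < minimum:
        return False                       # below a defined lower limit
    if maximum is not None and not math.isnan(maximum) and value > maximum:
        return False                       # above a defined upper limit
    return True                            # undefined or NaN bounds do not limit


inside = satisfies(3, minimum=1, maximum=5)
too_big = satisfies(7, minimum=1, maximum=5)
no_upper = satisfies(7, minimum=1)
nan_lower = satisfies(0, minimum=math.nan, maximum=5)
nan_value = satisfies(math.nan, minimum=1, maximum=5)
```

Since the bounds may themselves come from XPath expressions evaluated for each context node, such a function would be called with different minimum and maximum values for different contexts.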
The Number Constraint Manager dialog:

In the example above, the fourth constraint has no upper limit. The fourth column (Use
It?) is responsible for the activation/deactivation of the constraints. It
becomes a necessity when the user would like to apply only a certain subset from all the
constraints. Applying the (active) constraints can be done by pressing the Apply
button. This button is disabled when there is no document in the editor. After applying
the constraints, the user receives information about each applied constraint and the
number of the satisfied nodes (contexts) as well as the non-satisfied ones. In the picture
below there is an example result dialog:

Here the user can navigate through all the nodes that satisfy, or do not satisfy, a
certain constraint. This can be done by selecting a row in the result info table and
using the Details button. The user is shown a small navigation dialog which
allows successive traversal of the nodes in forward or reverse direction. Here follows a
picture of the navigation dialog from the preceding example picture:

The dialog contains several sections:
This function applies an XSL Transformation either on the current document in the system editor or on a set of selected internal documents. In the case a transformation is applied on the current document, the result XML document is loaded automatically in the system. If the transformation which has been applied does not produce any result, a warning message 'No result produced!' appears. Otherwise the user is asked to supply a DTD for the result document. In case a transformation is applied on a set of internal documents the user has to specify a DTD and result document names in advance.
Another, non-traditional application of XSL Transformations is the so-called Local Transformations.
They are applied in the following way: a set of nodes is extracted from a document (current or internal).
For each of them, an XSLT is performed independently and the result (if any) is incorporated back at the original
location from which the extract was taken. Thus, no new result document is created but the original document is modified.
The extracted nodes to be transformed are selected either by an XPath expression or by direct selection in the tree.
The result from each transformation application is a Document Fragment (DOM) which substitutes the context node for which it is produced. The context node is removed from the tree and all sub elements of the fragment are inserted at the position of the context. The application of the transformation is followed by a result information message. It contains four pieces of information:
- The number of nodes selected by the XPath expression as contexts;
- The number of context nodes which have been replaced by a result fragment;
- The number of context nodes for which the transformation did not produce any result;
- The number of context nodes which have been lost during applying a preceding transformation. This can happen when a node and its descendant node are selected as contexts and the transformation has succeeded on the parent node.
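The bookkeeping behind these four numbers can be illustrated with a Python sketch based on xml.dom.minidom. It is not the CLaRK implementation. The Python standard library has no XSLT processor, so the transformation is modelled by an ordinary callable that returns a DOM DocumentFragment. The names apply_local, is_attached and transform are hypothetical.

```python
from xml.dom import minidom


def is_attached(node, document):
    """True if `node` still belongs to the tree of `document`."""
    while node.parentNode is not None:
        node = node.parentNode
    return node is document


def apply_local(document, contexts, transform):
    """Replace each context node by the fragment produced for it; return counts."""
    stats = {"selected": len(contexts), "replaced": 0, "no_result": 0, "lost": 0}
    for node in contexts:
        if not is_attached(node, document):
            stats["lost"] += 1             # removed together with an ancestor
            continue
        fragment = transform(node)         # stand-in for running the XSLT
        if not fragment.hasChildNodes():
            stats["no_result"] += 1
            continue
        parent = node.parentNode
        parent.insertBefore(fragment, node)  # inserts the fragment's children
        parent.removeChild(node)
        stats["replaced"] += 1
    return stats


doc = minidom.parseString("<r><a><b>x</b></a><c>y</c></r>")
a = doc.getElementsByTagName("a")[0]
b = doc.getElementsByTagName("b")[0]
c = doc.getElementsByTagName("c")[0]


def transform(node):
    """Toy transformation: no result for <c>, a <done> element otherwise."""
    fragment = doc.createDocumentFragment()
    if node.tagName != "c":
        new = doc.createElement("done")
        new.appendChild(doc.createTextNode(node.tagName))
        fragment.appendChild(new)
    return fragment


stats = apply_local(doc, [a, b, c], transform)
```

In this example the element a is replaced, its descendant b is lost because it disappeared together with a, and c produces no result.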

If no transformations are available in the system, a warning message appears. The user can apply an XSL
Transformation by means of the Multiple Apply module, or save the current settings for further use from
the Queries module.
For details about the management of the transformations see module XSLT Manager.
This tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool
Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.
This component is responsible for the management of all XSL Transformations in the CLaRK System.
Although the documents containing XSLT are regular well-formed XML documents, they are treated as a
separate class of read-only documents. They are stored in a special separate list and only tools using XSLT can
access it. The permitted operations on the list are adding, removing and overwriting a list entry
(an XSL Transformation document).
Here is the dialog window of the manager:

Buttons:
- Add New - Opens an Internal Documents Manager window, in which the user can select one or more
documents containing XSL Transformations to be added to the list of all transformations. Each document is tested for
validity using the XSLT Validator module. If a document is not a valid XSLT, it is not included in the list and a
message describing the error appears. Once included in the list, an XSL Transformation can be used in all
system tools which deal with XSLT.
- Add Current - Adds the current document of the system editor to the list of all available XSL
Transformations. The document is cloned, so further editing will not affect the transformation. If the
transformation needs modification, it has to be extracted from the manager by using the button
Open In Editor and added again afterwards.
- Remove - Removes the selected transformation(s) from the list of all XSL Transformations. The
transformation data is lost. The removal is preceded by a confirmation message. Multiple selection is allowed
here.
- Open In Editor - Loads the selected transformation(s) from the list into the system editor as
XML documents.
- Apply - Applies the selected XSL Transformation to the current document in the editor. Here
only a single selection is allowed. If there is no current document this button is disabled.
- Close - Closes the XSLT Manager dialog window. All changes on the list of
transformations (addition, removal) are updated.
This option checks whether the current document in the editor can be used as an XSL Transformation. It
can be used when a new transformation document is created (or imported) in the editor. The XSLT
Validator checks the content and, if there are no errors, an information message is shown. Otherwise, an error
message appears and the location of the error in the document is indicated.
Concordance - a system tool for information extraction. It allows the extraction of certain units (words,
phrases, etc.) within bigger units (sentences, paragraphs, etc.). The extraction result is shown in a table with
one result per row. The searched items and the left and right contexts are shown in separate
columns. The tool is implemented on the basis of the XPath engine, the regular grammar engine and a sorting
module.
The field at the top of the dialog (Define Context) is used for defining the context nodes within
which the extraction will be done. The user is expected to supply an XPath expression which after evaluation
returns a node set. The context for evaluation is the root node of the document. The user can perform two types
of concordance extraction: grammar based and XPath based. They are described in detail in the following
paragraphs.
The result from the concordance is stored in an XML document and for convenience it is shown in a table.
The structure of the XML document is a sequence of <L> elements standing for the found items
(lines). Each concordance line has the following XML structure:
<L>
<LC> the left context </LC>
<I> the data we are searching for </I>
<RC> the right context </RC>
<!-- user commentary -->
</L>
and the corresponding table representation:

When the user sees a concordance result in a table, s/he must always keep in mind that there is an
XML structure behind the dialog window, especially when the table rows have to be sorted.
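The relation between the XML structure and the table rows can be sketched in Python with the standard xml.etree.ElementTree module. This is only an illustration: the root element name concordance, the sample data and the function concordance_rows are assumptions, since only the structure of the <L> element is given above.

```python
import xml.etree.ElementTree as ET

# Hypothetical result document: only the <L> structure comes from the manual.
SAMPLE = (
    "<concordance>"
    "<L><LC>John </LC><I>loves</I><RC> Mary</RC><!-- checked --></L>"
    "<L><LC>Mary </LC><I>loves</I><RC> John</RC></L>"
    "</concordance>"
)

def concordance_rows(xml_text):
    """Return one (left context, item, right context, comment) tuple per <L>."""
    # insert_comments=True keeps the user commentary in the tree (Python 3.8+).
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    rows = []
    for line in root.iter("L"):
        comment = " ".join(c.text.strip() for c in line if c.tag is ET.Comment)
        rows.append((line.findtext("LC", ""), line.findtext("I", ""),
                     line.findtext("RC", ""), comment))
    return rows

rows = concordance_rows(SAMPLE)
# Sorting the table rows by the right context is a sort over the <RC> content.
by_right_context = sorted(rows, key=lambda row: row[2])
```

Each tuple corresponds to one table row, so reordering the rows amounts to reordering the <L> elements in the underlying document.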
Grammar Concordance Search
This type of concordance uses regular expression patterns for searching for items within other items. The
patterns are defined as grammars in the Grammar tool. The
items which match the patterns (tokens, XML tags or mixed) are shown in the result table (document),
accompanied by the context in which they appear. Initially the context is determined by the XPath expression
mentioned above. Further restrictions on the context can be set by another grammar pattern, i.e. the target
searched items will be extracted only from items matched by another grammar pattern as contexts discarding
everything else.
If the Text only (The mark-up will be ignored)? option is selected, the concordance
engine will ignore the mark-up inside the initially selected contexts while checking. The following example shows
how the mark-up can be ignored during the data extraction. Consider the following simple XML document as a
source of extraction:

If the target item of extraction is the word loves within the context of the TEXT element,
the Grammar pattern must describe only the word itself, without specifying that it appears inside any mark-up (in this
case in the content of the tag verb). Here is the result of the query 'loves' within the context of
'TEXT':

If the target of interest is the sequence John loves Mary within the context of the TEXT
element, the query pattern will be: "John","loves","Mary". Although these three words appear
in different XML structures, after filtering the mark-up the result will be:

Here is how the Concordance dialog window appears in a configuration set for a Grammar
concordance search:
The Concordance dialog offers three sub-dialogs for setting a grammar search query corresponding to
three levels of complexity, each supplying different sets of options. Each of the three dialogs is accessible by
choosing the corresponding item from the Usage Mode panel. The possible items are:
- Simplified
This mode of usage offers a very basic set of options, which is convenient for
relatively simple search queries. The user is expected to supply a Query String
which must be a regular expression (the same syntax as in the
Grammar tool). For text preprocessing the user can
specify a tokenizer, a filter and a normalization.

- Normal
In this mode the user can use a previously defined grammar from the
Grammar tool. Similarly to the grammar application
the user here can define Element Values for performing a more flexible search. For text
preprocessing the user can specify a tokenizer, a filter and normalization.

- Queries
Here, the user does not specify anything directly related to the search process. What
is needed in advance is at least one grammar query (see the
Apply Grammar tool description), which in turn
requires a compiled grammar. In this dialog the user just points to the Search Query to
be applied. Additionally a restriction on the context can be set by selection of a
Restriction Query which determines the context for the Search Query. In this
case the context is formed by the output of the restriction query and, if there are initially selected
nodes by the XPath expression, for which the restriction grammar does not match anything, they
are excluded from further processing.

XPath Concordance Search
Searching items in this mode of concordance extraction is based on XPath queries within initially
selected context nodes. The content which will be shown in the Item column will be a result from the
evaluation of the XPath expression from the field Search Elements. If for a context the returned result
contains more than one node, each node will appear in a separate row in the table. For each single result node
the XPath expressions from Left Context and Right Context are evaluated to form the
content of the corresponding table cells for the contexts. If any of these two fields does not contain an XPath
expression, in the corresponding table cell the whole content before/after the found item will be shown.
Here is the Concordance dialog window in a configuration set for XPath concordance search:

Concordance Options
The options which appear in both modes of extraction are:
- Text only (The mark-up will be ignored)?
This option works only in Grammar Concordance Search mode. As described above,
it filters the XML tags out of the input data for the concordance search engine and leaves plain
text.
- Add number attribute ?
This option adds a number attribute to each result tag <L>,
with a value giving the sequence number of the result item.
- Add source attribute ?
This option adds a source attribute to each result tag <L>,
with a value naming the source document from which the extraction was taken.
- Add path attribute ?
This option adds a path attribute to each result tag <L>,
with a value which is an (abbreviated) XPath expression giving the absolute address of the corresponding
located item in the original source document.
The Table View tool presents the information extracted by the Concordance tool in a more
readable table form. Each line of the table represents one line of the concordance result.
To use this feature, the user has to open an XML document which was produced as a
result of the Concordance tool. When the Table View menu item is selected, the system tries
to detect the required structure in the document currently open in the system editor. In case of failure, the
system produces an appropriate error message. Otherwise, the document content is shown in a table (the picture
below). The required structure of the input document for the Table View is described in the Concordance tool.
The data in the "Context" columns does not represent the whole context but only the amount of
data that can fit in the column length. Initially this is only 30 symbols. To increase the context the user
should press the settings button and set the context length in symbols there. The user can also set the
width of the comment column. If the user wants to see the context without expanding the column data s/he can
do so by right-clicking on the "Left Context" or "Right Context" column. If the user
wants to add some commentary to a concordance line s/he can do so by filling in a value in the
"Comment" column or by right-clicking a row in the "Item" column. To navigate faster
through the table the user can use the combo box at the top to jump to a row.
The user can also sort the lines of the table. To do so, s/he must select which column to apply each sort
key to (i.e. which element of the concordance line will be the context: LC, I, RC or the comments). If no column is
selected then the key is evaluated with the line element as context.
A useful option of the Table View is the "Edit Layout". The user can filter the tags that are
shown in the table. For example, if the POS information is separated in a tag, the user can hide it in order to view
only the text.
The task of the Extract tool is to extract nodes from a document or from multiple documents and to save them as a
new document. The data extraction is based on XPath expressions. The text field at the top of the
dialog is used for defining an XPath expression which selects the elements in the document(s). The context node
for this evaluation is the root node of the document(s). The result of the extraction is an XML
document in which all extracted nodes are children of the root element (this element is named
"Extract" by the system).
The Include subtree option allows the extraction not only of the selected nodes but
the entire subtrees below them as well.
The Create result tag option gives each extracted node a parent element which is used to
separate the different results. For example, if only text nodes are extracted, then in the new document all the text
nodes will be concatenated. If Create result tag is selected, a parent element is added for each
result node. The name of the parent element is taken from the corresponding text field.
If the Create source attribute option is selected, then the extract tool adds an attribute with the source
document name either to the auxiliary element or to the root element of each result. The name of the
attribute is taken from the corresponding text field. In case the result is not an element-rooted structure and no
auxiliary element is added, this option does not change anything.
If Create path attribute option is selected, then the extract tool adds to each result structure an
attribute with a value which is an XPath expression expressing the location of the result in the original source
document. In other words if this XPath expression is evaluated on the source document, the result will be exactly
the extracted result node. The name of the attribute is taken from the corresponding text field. In case the result is
not an element-rooted structure and no auxiliary element is added, this option does not change anything.
If Create number attribute option is selected then the extract tool adds an attribute with the extract
result number to the auxiliary node or to the root element. The name of the attribute is taken from the
corresponding text field. In case the result is not an element-rooted structure and no auxiliary element is added,
this option does not change anything.
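The effect of the main options can be sketched in Python with the standard xml.etree.ElementTree module. This is a rough illustration under several assumptions: the function extract and its parameter names are hypothetical, ElementTree supports only a subset of XPath and selects element nodes only, and the source and path attributes are left out.

```python
import copy
import xml.etree.ElementTree as ET

def extract(doc_root, path, result_tag=None, number_attr=None, include_subtree=True):
    """Collect the nodes selected by `path` under a new <Extract> root."""
    out = ET.Element("Extract")
    for n, node in enumerate(doc_root.findall(path), start=1):
        if include_subtree:
            item = copy.deepcopy(node)                    # node plus its subtree
        else:
            item = ET.Element(node.tag, dict(node.attrib))  # the node alone
        item.tail = None
        if result_tag:
            target = ET.SubElement(out, result_tag)       # auxiliary parent element
            target.append(item)
        else:
            out.append(item)
            target = item
        if number_attr:
            # The number goes to the auxiliary element, or else to the result root.
            target.set(number_attr, str(n))
    return out

source = ET.fromstring(
    "<lib><book><title>A</title></book><book><title>B</title></book></lib>"
)
result = extract(source, ".//title", result_tag="res", number_attr="n")
result_xml = ET.tostring(result, encoding="unicode")
```

With a result tag, each extracted title is wrapped in its own res element carrying the number attribute; without one, the attribute is set on the extracted element itself.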
This tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool
Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.
The sort tool is used for reordering nodes in the tree representation of a document. The sort operation
changes the order only for nodes which are siblings, i.e. nodes with the same parent. If sort is applied on a set of
nodes with different parent nodes, the nodes will be sorted only within the scope of their parents. The
selection of nodes for sorting and the sort criteria (Sort Keys) are written as XPath expressions.
For sorting the user has to specify the following two things:
- The target nodes for sorting.
- The keys for each node.
The first is done by defining an XPath expression in the Select Elements field. If the field is empty,
the sort tool will show an error message. As context node the XPath engine takes the node selected in the
tree panel of the system or the root elements of the internal documents. The sort tool compares only element
nodes which have a common parent. The sort tool splits the result returned from the XPath evaluation into
groups according to the parent node. Each group is sorted separately.
Keys are calculated for every node the user wants to sort. Each row in the table represents one key. The sort
tool compares two nodes key by key. The key is the list of nodes returned from the XPath engine after
evaluating the expression defined in the column Key of the table. The context node in this evaluation coincides
with the node for which the user wants to create the key. The other columns of the table represent settings used
in the list comparison. The lists are compared node by node as follows.
- If the nodes are both elements then the sort tool asks the DTD which one is defined to be smaller (Element
Features/Sort Values).
- If the nodes are both text nodes they are compared by their textual content.
- The attribute nodes are compared by the textual content of their values only if they have the same name and
their parents are elements with the same name.
- The textual content (text) of text and attribute nodes is compared in the following way:
  - The text is compared symbol by symbol.
  - If the user chooses a tokenizer then the symbols are compared with respect to the tokens created by the
  primitive tokenizer of the selected tokenizer, i.e. the primitive tokenizer which is an ancestor of the selected
  tokenizer, or the selected tokenizer itself if it is primitive. The symbols
  are compared with respect to their token category (the order of the categories in the primitive tokenizer) and by
  their position in the definition of the token category value. If the normalization option is selected, the sort engine
  will use the normalization table of the primitive tokenizer to determine the token category and value of each symbol.
  - If the user selects [No Tokenizer], the sort tool will use the Unicode table to compare symbols. In
  this case the normalization option means converting capital letters into small letters for Cyrillic and
  Latin.
  - If the user selects the Reverse option for the key, the text will be reversed before the comparison
  ("erga" => "agre").
  - If the user selects the Trim option for the key, leading and trailing
  whitespace characters (TAB, SPACE, LF, CR, etc.) will be removed from the text before comparing.
  - If the user selects the Number option for the key, the texts will be converted into numbers and
  compared by their numeric value.
- If the current nodes are not of the same type, then the following order applies: attribute < text
< element.
- If the key value for one element contains fewer nodes than the key value for another element, and all of its
nodes are equal to the corresponding nodes of the longer key value, then the shorter key value is
assumed to be smaller.
For each key the user can define a different order (Ascending | Descending). The order of the keys in the table
is very important because this is the order in which they will be used. If two key values have equal nodes but one of
them has additional nodes, then the one with the smaller number of nodes is considered smaller.
The difference between the DTD sort and the Advanced one is that the sort tool takes the tokenizer and the
number option from the DTD (Element Features, Attribute Features). For attribute nodes the sort tool also takes
from the DTD the order of enumeration values.
Examples:
- Example 1: Sorting books by pages and title. The elements to sort are the book children of the
context node. They will be sorted by the content of their pages element and title element. Key 1 is
the text in the pages element of the book. It will be trimmed and converted to a number when sorted. For this key we
do not need a tokenizer because the whole node will be converted to a number. If two elements are equal
according to the first key (two books have the same number of pages) then they are compared with respect to the
second key. Key 2 is the text in the title element of the book. It will be trimmed and normalized when
sorted. For normalization the sort tool will use the normalization defined in the Mixed Word tokenizer.
The order of this key is descending, which means that this key will sort books by title in reverse
order.
- Example 2: Sorting TEI divisions by their heads. The sort tool takes all divisions in the document and sorts
them according to the text in their head element. If a division does not have a head element then
it is assumed to be smaller.
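The key logic of Example 1 can be sketched in Python with the standard xml.etree.ElementTree module. This is an approximation rather than the CLaRK sort engine: the function sort_books and the sample data are hypothetical, and normalization is assumed to be plain lower-casing instead of the Mixed Word tokenizer table.

```python
import xml.etree.ElementTree as ET

def sort_books(parent):
    """Sort the <book> children of `parent` by pages (number), then title (descending)."""
    books = [child for child in parent if child.tag == "book"]
    # Python sorts are stable, so the keys are applied from the last to the first.
    # Key 2: title, trimmed and normalized (assumed: lower-cased), descending.
    books.sort(key=lambda b: (b.findtext("title") or "").strip().lower(), reverse=True)
    # Key 1: pages, trimmed and converted to a number, ascending.
    books.sort(key=lambda b: float((b.findtext("pages") or "0").strip()))
    # Only siblings are reordered: the books go back into the slots they occupied.
    slots = [i for i, child in enumerate(parent) if child.tag == "book"]
    for i, book in zip(slots, books):
        parent[i] = book

library = ET.fromstring(
    "<lib>"
    "<book><title>beta</title><pages> 200 </pages></book>"
    "<book><title>Alpha</title><pages>200</pages></book>"
    "<book><title>Gamma</title><pages>90</pages></book>"
    "</lib>"
)
sort_books(library)
titles = [book.findtext("title") for book in library]
```

The book with 90 pages comes first; the two books with 200 pages are tied on the first key and are therefore ordered by title in descending order.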

Example 1
Example 2
This tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool
Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.
This tool (Fig. 1) makes it possible to set a certain attribute on
nodes selected by an XPath expression. The user specifies an Attribute Name
and Value. Additionally, s/he can tune the tool to set the attribute only
on nodes which do not have it yet. If the checkbox Skip Existing Attributes
is unselected, the tool will set the given attribute with the given value on each element
node returned by the XPath. Otherwise, it will skip all element nodes which already have
this attribute and in this way it will preserve their original values. If the result of
the evaluation of the XPath expression includes nodes other than Element nodes,
they are ignored during processing.
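The behaviour of the Skip Existing Attributes checkbox can be shown with a short Python sketch using the standard xml.etree.ElementTree module. The function set_attribute and the sample document are hypothetical, and ElementTree's limited XPath subset stands in for the system's XPath engine.

```python
import xml.etree.ElementTree as ET

def set_attribute(root, path, name, value, skip_existing=False):
    """Set `name`=`value` on each selected element; return how many were changed."""
    changed = 0
    for elem in root.findall(path):
        if skip_existing and name in elem.attrib:
            continue                 # preserve the original value
        elem.set(name, value)
        changed += 1
    return changed

doc = ET.fromstring('<s><w pos="N">cat</w><w>sat</w></s>')
changed = set_attribute(doc, ".//w", "pos", "X", skip_existing=True)
values = [w.get("pos") for w in doc.findall(".//w")]
```

With skip_existing=True only the second w element receives the new value; with skip_existing=False both elements would end up with pos="X".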
This tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool
Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.

Fig. 1
This tool (Fig. 2) allows the user to insert certain child nodes in the content of
other Element nodes. The target nodes to which the insertion will be applied are
selected by an XPath expression. The result of the evaluation of the XPath expression
must be a list of Element nodes. All other types of nodes are discarded. This
tool can insert two types of child nodes: Element and Text nodes. If the
new children are of type Element, the tool expects the user to supply a
valid tag name. Otherwise, i.e. when the new children are Text nodes, the tool accepts
any non-empty textual data. The user can also set the position at which the
new children will appear in their parents' content. Here the counting starts from 0, i.e.
the first child is denoted by 0, the second by 1, etc. If the position field remains
empty, then the new nodes will be appended to the target nodes' content. Any non-numerical
data in the position field will produce an error.
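The position handling can be sketched in Python with the standard xml.dom.minidom module. The function insert_child is hypothetical, the targets are selected by tag name instead of XPath, and appending when the position exceeds the number of children is an assumption that the description above does not cover.

```python
from xml.dom import minidom

def insert_child(doc, targets, kind, data, position=""):
    """Insert a new Element or Text child into every target element.

    `position` mimics the text field: empty means append, otherwise a
    0-based index. Non-numerical input raises ValueError.
    """
    index = None if position.strip() == "" else int(position)
    if index is not None and index < 0:
        raise ValueError("position must be zero or a positive integer")
    for target in targets:
        new = doc.createElement(data) if kind == "element" else doc.createTextNode(data)
        children = target.childNodes
        if index is None or index >= len(children):
            target.appendChild(new)          # assumed behaviour past the end
        else:
            target.insertBefore(new, children[index])

doc = minidom.parseString("<p><a/><b/></p>")
targets = doc.getElementsByTagName("p")
insert_child(doc, targets, "element", "new", "1")   # becomes the second child
insert_child(doc, targets, "text", "hi")            # empty position: appended
result_xml = doc.documentElement.toxml()
```

The new element lands between a and b because position 1 denotes the second child, while the text node is appended after b.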
This tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool
Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.

Fig. 2
This tool (Fig. 3) enables the insertion of parent Element nodes for nodes selected by
an XPath expression. The target nodes selected by the XPath expression can be either
Element or Text nodes. Any other types of nodes are discarded from the
selection. This tool expects the user to specify a valid tag name for the new parent
nodes.
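Inserting a parent node amounts to wrapping the target in a new element at the same position. A minimal Python sketch with the standard xml.dom.minidom module follows; the function insert_parent and the sample document are hypothetical, and the targets are picked directly instead of by XPath.

```python
from xml.dom import minidom

def insert_parent(doc, node, tag):
    """Wrap `node` (an Element or Text node) in a new element named `tag`."""
    wrapper = doc.createElement(tag)
    node.parentNode.replaceChild(wrapper, node)   # wrapper takes the node's place
    wrapper.appendChild(node)                     # the node becomes its only child
    return wrapper

doc = minidom.parseString("<s>John <w>loves</w> Mary</s>")
insert_parent(doc, doc.getElementsByTagName("w")[0], "verb")   # Element target
insert_parent(doc, doc.documentElement.firstChild, "name")     # Text target
result_xml = doc.documentElement.toxml()
```

Both kinds of target keep their position among their siblings; only the nesting level changes.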
The tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool
Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.

Fig. 3
This tool (Fig. 4) allows inserting sibling nodes of nodes selected by an XPath
expression. The selected target nodes can be of any kind except Attribute
nodes. If the root node is selected, it is discarded during processing. The new nodes
for insertion can be of type Element or Text. If the new nodes are of
type Element, the tool expects the user to supply a valid tag name.
Otherwise, i.e. when the new siblings are Text nodes, the tool accepts any non-empty
textual data. The user can also set the position where the new sibling nodes will appear.
The options are: previous (the new sibling precedes the target node) and next
(the new sibling follows the target node).
The tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool
Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.

Fig. 4
This tool (Fig. 5) makes it possible to remove parts
of XML documents selected by an XPath expression. The target selection can contain all
types of nodes (including attributes). The only node which cannot be removed is the root
of the document; if it is included in the selection, it is discarded during processing
and a warning message is shown. The removal
can be done in two modes: either removing the selected nodes together with their content (when Delete
subtree is selected) or removing only the nodes without their content. In the
latter case, the content of the deleted nodes is inserted in the content of their parent(s), in the places where the
deletion was performed. The attribute nodes are not considered content of the Element nodes they
belong to, so they are removed in both cases.
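The two removal modes can be sketched in Python with the standard xml.dom.minidom module. The function remove_node is hypothetical and covers Element and Text targets only; attribute removal and the warning message are left out.

```python
from xml.dom import minidom, Node

def remove_node(node, delete_subtree):
    """Remove `node`; return False if it is the root, which cannot be removed."""
    parent = node.parentNode
    if parent is None or parent.nodeType == Node.DOCUMENT_NODE:
        return False
    if not delete_subtree:
        # Keep the content: move the children to the position of the node.
        for child in list(node.childNodes):
            parent.insertBefore(child, node)
    parent.removeChild(node)
    return True

doc = minidom.parseString("<p>I <em>really <b>like</b></em> it</p>")
remove_node(doc.getElementsByTagName("em")[0], delete_subtree=False)
unwrapped = doc.documentElement.toxml()

remove_node(doc.getElementsByTagName("b")[0], delete_subtree=True)
pruned = doc.documentElement.toxml()

root_removed = remove_node(doc.documentElement, delete_subtree=True)
```

In the first call the em tags disappear but their content stays in place; in the second call the b element is removed together with its text; the third call leaves the root untouched.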
The tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool
Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.

Fig. 5
This operation uses the "Before" and "After" columns from the
Element Features
to determine whether to insert a space symbol before and after the deleted element.
This tool (Fig. 6) allows the user to rename Element nodes selected by an XPath
expression in a document. The user is expected to supply a valid New name.
The selected nodes are renamed without changing their attributes and content. If the
selection contains nodes of a type other than Element, a warning message is
shown and these nodes are discarded from further processing.
The tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool
Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported
here.

Fig. 6
This is a tool for applying various transformations to a document or documents.
It is specified by two main sets of nodes, source and target, and by other features, which are described below.
The target nodes are the nodes to which information will be added.
The source nodes are the nodes which supply the information to be added.
Here is the main window of the XPath Transformation:

There are three modes which specify which documents are related with the source and target
fields and where the result will be saved. The modes are: Local Source, External Source
and Distributed Source.
Local Source
In this mode the Source and the Target nodes are from the same document.
If the Multiple Apply check box is not selected, the target and the source nodes are taken from the
document currently open in the system.
If the check box is selected, the target and the source nodes are taken from the documents in the Input
column of the Internal Documents table. The result for each document is saved as a document with the name
given in the corresponding row of the Result column of the same table.
External Source
There is a table with one column, which specifies the source documents. The source nodes are
taken from them.
If the Multiple Apply check box is not selected, the target nodes are taken from the document
currently open in the system.
If the check box is selected, the target nodes are taken from the documents in the Input column of
the Internal Documents table. The result for each document is saved as a document with the name given in
the corresponding row of the Result column of the same table.
Distributed Source
There is a table with two columns, Source and Target, which specify the
source and the target documents. The user can manage the Source column with the buttons on the right side
of the table and the Target column with the buttons in the Internal Documents field. When the
user adds a document to the Input column, this document is also added to the Target column of
the Source/Target table. The number of source and target documents has to be the same.
The rest of the features of the XPath Transformations dialog are:
Target
An XPath expression defining the target list of nodes, i.e. the nodes where the source will be
included.
as a parent
The nodes from the source become parents (ancestors) of the target nodes. The system requires
Element nodes for source and Element and Text nodes for target.
as a child
The nodes from the source become children of the target nodes at the position specified
in the at position field. The system requires zero or a positive integer for the position,
non-Attribute nodes for source and Element nodes for target. If the value returned as a source
is a number, a string or a boolean value, it is treated as a text node.
as a sibling
The nodes from the source become siblings of the target nodes in a position before or
after a target node depending on the at offset field. The system requires non-Attribute nodes for
source and Element nodes for target. If the value returned as a source is a number, a string or a
boolean value, it is treated as a text node.
as attribute
The nodes from the source become attributes of the target nodes with the name specified
in the with name field. The system requires non-Element nodes for source and Element
nodes for target.
Relative to Source
This check box is used only when the source is treated as an XPath expression
(XML check box is not selected).
When this check box is not selected, the target XPath is evaluated from the root of the target
document.
When this check box is selected, the target XPath is evaluated for every node from the source
as a context. As a result there is a list of nodes for each node in the source.
Source XPath/XML
This field specifies the source nodes. They could be nodes returned by XPath expression
(evaluated on a specified document) or specified by an XML fragment. Whether the source is treated as XPath
expression or XML fragment is specified by the XML checkbox.

All nodes from the source list will be processed for each target node.

Each node from the source list will be processed for each target node.
Equals
If this check box is selected and there is a difference between the number of source and target
nodes, the system reports an error.
Copy
If this button is selected, the source nodes are copied to the target nodes in a way specified
by the tool fields.
Move
If this button is selected, after performing an operation for a node
from the source list, the tool removes the node from the source location.
Include subtree
If this check box is selected, then the source list will contain for each selected node the
entire subtree. If it is not selected, then only the local information for each node is put in the source list.
The local information includes the tag name and the set of attributes as well as their values.
When only a node with the local information is chosen and it has to be removed, then its children are
inserted as immediate children of its parent. The insertion is made in the position of the deleted
node.
XML
By this check box the treatment of the Source XPath/XML field is controlled.
If it is selected, then the source is treated as XML markup data. If the XML markup data does not contain
tags, then it is treated as text.
If the check box is not selected, then the source is treated as an XPath expression.

The Statistics tool is used for counting the number of node and/or token occurrences in XML document(s).
The items to be counted are initially selected by an XPath expression (field Select (XPath)). The
selection returned by the XPath evaluation is a node set. At this point the Value Keys defined by the user
are taken into consideration. Each key contains an XPath expression which is meant to point to the essential
properties of the selected nodes. The value keys are similar to the ones in the Sort tool. For each node of the initial selection the values of the Value
Keys are calculated independently. If the corresponding calculated values for two nodes are the same, the nodes
are assumed to belong to the same class. In this way each of the selected nodes is assigned to exactly one class. If the
statistics has to be applied not only to XML nodes but also to tokens, the user must select a tokenizer in the
Choose Tokenizer: field. In this way the text nodes will be segmented into meaningful tokens. In addition
the user can filter the tokens by category in order to receive information only for certain types of tokens (using
the button Customize). Only tokens whose categories are in the list will be counted. All the rest will be
discarded. If no tokenizer is selected, each text node will be processed as a whole.
The result of applying the statistics is a list
of all classes formed by the selected nodes. The following information is kept for each class:
- Searched Item - the item found by the selection (tag name or token);
- Item Category - the category of the searched item: the token category if the item is a token, otherwise <Element>;
- Number of occurrences - the number of items from the selection which belong to the class;
- Percentage - the percentage of the items belonging to the class, relative to the whole selection;
- Keys Value - a string representation of the key value(s) for the class.
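The classification of the selected nodes by their key values can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: the sample document, the function name classify and the dictionary field names are invented, the value keys are modelled as Python functions instead of XPath expressions, and xml.etree.ElementTree supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample document, used only for this illustration.
SAMPLE = """<text>
  <w pos="N">cat</w>
  <w pos="V">saw</w>
  <w pos="N">dog</w>
  <w pos="N">cat</w>
</text>"""


def classify(root, select_path, key_functions):
    """Group the selected nodes by their tuple of key values and count each class."""
    selected = root.findall(select_path)
    counts = Counter()
    for node in selected:
        # Each key is evaluated independently for every selected node.
        key_values = tuple(key(node) for key in key_functions)
        counts[(node.tag, key_values)] += 1

    total = len(selected)
    rows = []
    for (tag, key_values), number in counts.most_common():
        rows.append({
            "item": tag,                         # Searched Item
            "category": "<Element>",             # Item Category
            "number": number,                    # Number of occurrences
            "percent": 100.0 * number / total,   # Percentage
            "keys": " | ".join(key_values),      # Keys Value
        })
    return rows


root = ET.fromstring(SAMPLE)
# Stand-ins for two value keys: the "pos" attribute and the text content.
rows = classify(
    root,
    ".//w",
    [lambda node: node.get("pos", ""), lambda node: node.text or ""],
)
```

Two nodes fall into the same class only when all of their key values are equal, so the two "cat" nodes form one class while "dog" forms another, although all three share the same pos value.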
This tool supports two modes of application: on the current document and Multiple
Apply mode. For details see Tool Application Modes.
The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
Statistics on Current Document:
The result of this type of statistics is shown as a table below.

The "Category" column contains the categories from the filter which occur in the chosen
text nodes, "<Element>" if the row represents an element node, or "#text"
if the node is a text node.
The "Element" column contains the tokenized text (the values of the filtered tokens) or the node names.
The "#" column contains the number of occurrences of the corresponding item.
The "%" column contains the percentage of the corresponding item.
The "Key Value" column contains the value of the sort keys created for the corresponding node, or nothing if the row contains a token.
When closing the table, the user can choose from the following options:
- to save the result of the statistics in XML format, using the DTD definition below;
- to open the result in the system.
Statistics on Multiple Apply:
The result is saved in XML format, and the document DTD has the following structure:
<!DOCTYPE statistics [
<!ELEMENT statistics (documents, item+, all)>
<!ELEMENT documents (document+)>
<!ELEMENT document (#PCDATA)>
<!ELEMENT item (category?, element, number?, percent?, keyvalue?)>
<!ELEMENT category (#PCDATA)>
<!ELEMENT element (#PCDATA)>
<!ELEMENT number (#PCDATA)>
<!ELEMENT percent (#PCDATA)>
<!ELEMENT keyvalue (#PCDATA)>
<!ELEMENT all (number, percent)>
]>
The documents tag lists the documents selected for the statistics; each document name appears in a document tag.
Each item tag corresponds to a row of the result table as follows:
- category tag corresponds to the "Category" column
- element tag corresponds to the "Element" column
- number tag corresponds to the "#" column
- percent tag corresponds to the "%" column
- keyvalue tag corresponds to the "Key Value" column
The all tag corresponds to the last row of the result table. It contains the total number of
occurrences of the selected elements and the percentage information.
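A result document with this structure can be read back with a few lines of Python. The sketch below is only an illustration: the sample data (document names, counts, percentages) is invented to follow the DTD above, and the function name read_statistics is an assumption. The numbers are kept as strings because the exact number format written by the tool is not specified here.

```python
import xml.etree.ElementTree as ET

# Invented sample data that follows the DTD above; the values are illustrative only.
RESULT_XML = """<statistics>
  <documents>
    <document>doc1.xml</document>
    <document>doc2.xml</document>
  </documents>
  <item>
    <category>&lt;Element&gt;</category>
    <element>w</element>
    <number>3</number>
    <percent>75</percent>
    <keyvalue>N</keyvalue>
  </item>
  <item>
    <element>s</element>
    <number>1</number>
    <percent>25</percent>
  </item>
  <all>
    <number>4</number>
    <percent>100</percent>
  </all>
</statistics>"""


def read_statistics(xml_text):
    """Return the document names, the table rows and the total from a result document."""
    root = ET.fromstring(xml_text)
    documents = [doc.text for doc in root.findall("documents/document")]

    rows = []
    for item in root.findall("item"):
        rows.append({
            # Optional children that are absent are reported as None.
            "category": item.findtext("category"),
            "element": item.findtext("element"),
            "number": item.findtext("number"),
            "percent": item.findtext("percent"),
            "keyvalue": item.findtext("keyvalue"),
        })

    total = root.findtext("all/number")
    return documents, rows, total


documents, rows, total = read_statistics(RESULT_XML)
```

Since category, number, percent and keyvalue are optional in the DTD, a reader of such documents has to accept item elements in which some of them are missing, as in the second item of the sample.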
Filtering XML Result Data
In many cases not all of the information from the Statistics needs to be saved. Sometimes
the result data is too large and its further processing is difficult. In such cases the Output Info options can
be used to specify which information should be kept and which should be removed from the result document.
The options are as follows:

Only the selected items will have a representation in the result XML document.
The following item gives information about the number of occurrences of specified tags or tokens in a set of internal documents. When starting this tool, the user is asked to provide several things:
- The type of information which is needed. The two possibilities are: counting tags and counting tokens.
- The documents for which