This tool handles documents that contain symbols not supported by the local hardware
architecture. It replaces such symbols with entities according to the ISO 8879 standard,
and vice versa. Currently the tool supports 19 sub-sets of entity-char conversions, each
of which can be activated or deactivated. The more sub-sets are active, the more time the
conversion takes. Another reason to exclude a sub-set is that some symbols often do not
need to be converted, for example commas, dots, colons and semicolons.
Example: ("дума" in Bulgarian is the equivalent of "word")
"дума" <-- entity conversion -->
"&dcy;&ucy;&mcy;&acy;"
The tool operates on the document which is currently opened in the system. It can be
started from the main menu:
- Tools/Entity Converters/Char --> Entity - converts all symbols covered by the currently
active conversion sub-sets into entities. The active sub-sets can be seen in
Tools/Entity Converters/Entity Management.
- Tools/Entity Converters/Entity --> Char - performs the opposite conversion, i.e. from
entities to symbols (characters).
- Tools/Entity Converters/Entity Management - opens the manager window, which is
responsible for activating and deactivating the different sub-sets of entity-char
converters.
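The two conversions and the role of the active sub-sets can be illustrated with a minimal
Python sketch. It is not the CLaRK implementation: the sub-set names, their contents and
the function names are assumptions made for illustration. Only the entity names (dcy, ucy,
mcy, acy, comma, semi) come from the ISO 8879 entity sets.

```python
import re

# Hypothetical conversion sub-sets (character -> entity name).
# The real tool ships 19 sub-sets; these two are illustrative only.
SUBSETS = {
    "cyrillic": {"д": "dcy", "у": "ucy", "м": "mcy", "а": "acy"},
    "punctuation": {",": "comma", ";": "semi"},
}

def char_to_entity(text, active):
    """Replace every character covered by an active sub-set with its entity."""
    table = {}
    for name in active:
        table.update(SUBSETS[name])
    return "".join(f"&{table[ch]};" if ch in table else ch for ch in text)

def entity_to_char(text, active):
    """Replace every entity known to an active sub-set with its character."""
    table = {ent: ch for name in active for ch, ent in SUBSETS[name].items()}
    # Entities outside the active sub-sets are left untouched.
    return re.sub(r"&(\w+);", lambda m: table.get(m.group(1), m.group(0)), text)

encoded = char_to_entity("дума, да", ["cyrillic"])
decoded = entity_to_char(encoded, ["cyrillic"])
```

With only the "cyrillic" sub-set active the comma is left as it is, which mirrors the
reason given above for deactivating sub-sets.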
The dialog window:
The window shows the currently active converters (filters). Each of them can be
deactivated by removing it from this list (button 'Remove filter').
To see the symbols (and the corresponding entities) of a filter, the user can press the
View filter button. A table then appears on the screen containing detailed information
about the filter. Each row represents one entity-symbol pair. The table has 3 columns:
the first holds the entities, the second the Unicode code points of the symbols, and the
third the symbols themselves.
To activate one or more filters that are not yet in the list, the user can press the
Add filter button. A new dialog window appears with a list of all available filters that
are not active. The user activates filters by selecting the checkbox next to each of
them. The content of the currently selected (non-active) filter can be inspected before
adding it (button Preview). Optionally, all filters can be added with the Add All button.
When the entity filter management is complete, the Done button stores the new settings
and closes the window.
There are two more buttons (Apply "Entity --> Char" and Apply "Char --> Entity"), which
apply the corresponding conversion to the current document. They perform the same action
as the conversion tools in the main menu of the CLaRK System. In addition, any changes
to the active filters are stored.
This dialog is used for various "mass" commands for document restructuring.
Generally, the scenario is the following: (1) a list of nodes (subtrees, text elements) is
chosen by the Source field. This defines what will be copied or moved in the document;
(2) a list of nodes is chosen by the Target field. This defines the place(s) where the
source elements will be copied or moved; (3) the elements of the source list are attached
to the elements of the target list. Several options define how this is done. They concern
whether the source elements are copied or cut from the document before being attached to
the target, and the mapping between source and target elements: either all elements of
the source are attached to each element of the target, or each element of the source is
attached to the corresponding element of the target.
A more detailed description of each field and action follows.
Source
This is a description of the data which is to be copied or moved. The description can
be an XPath expression, XML markup data or plain text. If the description is an XPath
expression, it is evaluated to a list of nodes in the document selected in the combo box
under the Source field. The XPath expression is evaluated before the actual change of the
document takes place. If the source is XML markup data, it need not have a root element;
the markup is parsed into a list of XML nodes. If only text is given, this text is treated
as a one-element list.
Copy versus Cut
When the Source is defined by an XPath expression, the elements in the resulting list
can be either kept in the document or deleted from it before the rest of the
processing. If one chooses copy, a copy of the source list is created first and the
operation proceeds with that copy. If one chooses cut, the elements are first removed
from the tree and then the operation continues. When the elements are cut, the
destination XPath cannot be relative.
Note: If you cut the elements, they will not be present in the tree when the
destination XPath is evaluated.
Include subtree check box
If this check box is on, the source list contains, for each chosen node, the entire
subtree under that node. If it is off, only the local information for each node is put
in the source list. The local information includes the tag name and the set of
attributes with their values. When a node is chosen with local information only and it
has to be cut, its children are inserted as immediate children of its parent, at the
position of the deleted node.
Treat source as XML markup check box
This check box controls how the Source field is treated. If it is on, the source is
treated as XML markup data. If the XML markup data does not contain tags, it is treated
as text.
If the check box is off, the content of the Source field is treated as an XPath
expression.
Target
An XPath defining the target list of nodes, i.e. the nodes where the source will be
included.
Absolute or Relative?
When Absolute, the XPath is evaluated from the beginning of the document selected in
the document combo box under the destination field, e.g. the XPath "self::*" will return
the document element.
When Relative, the XPath is evaluated for every node from the source with this node
as a context. As a result you get a list of nodes for each node in the source.
Insert node(s)
Defines the position where the source is to be included, relative to a node in the
target list.
As a parent
The nodes from the source become parents (ancestors) of the destination.
As a child
The nodes from the source become children of the destination at the position given in
the box. If the value in the box is less than 0 or is not a number, they become the last
children of the destination nodes.
As a sibling
The nodes from the source become siblings of the destination at the given position. 1
means next sibling, 2 means the sibling after the next sibling and so on. -1 means
previous sibling, -2 means the sibling before the previous sibling and so on.
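The sibling offsets described above can be illustrated with a small Python sketch based
on the standard xml.etree.ElementTree module. This is only one possible reading of the
rule, not the CLaRK code: the function name insert_as_sibling is hypothetical, and
clamping offsets that run past either end of the parent is an assumption.

```python
import copy
import xml.etree.ElementTree as ET

def insert_as_sibling(root, target, source, offset):
    """Insert a copy of `source` as a sibling of `target`.

    offset 1 = next sibling, 2 = the sibling after the next one,
    -1 = previous sibling, -2 = the sibling before the previous one.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of[target]
    index = list(parent).index(target)
    position = index + offset if offset > 0 else index + offset + 1
    # Assumption: offsets beyond the ends are clamped to first/last position.
    position = max(0, min(position, len(parent)))
    parent.insert(position, copy.deepcopy(source))

def tags_after(offset):
    root = ET.fromstring("<r><a/><b/><c/></r>")
    insert_as_sibling(root, root[1], ET.Element("x"), offset)
    return [child.tag for child in root]
```

Here tags_after(1) places the new node directly after "b", while tags_after(-1) places
it directly before "b".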
Options
Copy all nodes in the source list to every node in the target list
This is allowed only when the target XPath is Absolute. It works according to
the selected position.
- as parent
The elements in the source list are treated as a path in an XML
document where the first element is the parent of the second, the second of the third and
so on. If the source list contains whole subtrees, each element in the source list is
considered the last child of the previous element. The constructed path is inserted in the
document so that the first element takes the place of the node from the target
list and the target node is inserted as the last child of the last element of the source
list. For each node in the target list a new copy of the source list is taken.
- as child
The elements in the source list are inserted as children of each node
in the target list, in the order in which they appear in the source list. The insertion
is done at the indicated position. For each node in the target list a new copy of the
source list is taken.
- as sibling
The elements in the source list are inserted as siblings of each
node in the target list, in the order in which they appear in the source list. The
insertion is done at the indicated position. For each node in the target list a new copy
of the source list is taken.
Although the program always makes a copy, if cut is selected the nodes in the
source list will be deleted.
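The "as parent" case above can be sketched in Python with xml.etree.ElementTree. The
sketch covers only the simple situation where the source list consists of bare elements
(no subtrees) and there is a single target node that is not the document element. The
function name insert_as_parents is hypothetical.

```python
import xml.etree.ElementTree as ET

def insert_as_parents(root, target, path_tags):
    """Wrap `target` in a chain of new elements named by `path_tags`.

    The first tag takes the place of the target in the document; the
    target becomes the last child of the last element of the chain.
    """
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of[target]          # KeyError if target is the root
    index = list(parent).index(target)
    parent.remove(target)
    top = ET.Element(path_tags[0])
    last = top
    for tag in path_tags[1:]:
        last = ET.SubElement(last, tag)
    last.append(target)
    parent.insert(index, top)
    return top

root = ET.fromstring("<doc><w>a</w><w>b</w></doc>")
insert_as_parents(root, root[1], ["s", "np"])
result = ET.tostring(root, encoding="unicode")
```

For several target nodes the function would be called once per node, which corresponds
to taking a new copy of the source list for each target.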
Copy each node from the source list into the corresponding node in the target list
This option allows a pairwise inclusion of the source list into the target list. When
the target is Absolute, the first node from the source list is attached to the
first node in the target list, the second node from the source list to the second in the
target and so on, until there are no nodes left in the source or in the target list. When
the target is Relative, each node in the source list is attached to the first node
in its relative target list. There is an option to check whether the two lists are of
equal size (when Absolute) or whether each relative list contains a single node (when
Relative).
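The difference between the two mapping options for an Absolute target can be sketched as
follows. The function and mode names are hypothetical, and nodes are represented by plain
strings for brevity.

```python
def pair_nodes(source, target, mode, check_sizes=False):
    """Return the (source node, target node) attachments for an Absolute target."""
    if mode == "all_to_every":
        # Every target node receives (a copy of) the whole source list.
        return [(s, t) for t in target for s in source]
    if mode == "each_to_corresponding":
        if check_sizes and len(source) != len(target):
            raise ValueError("source and target lists differ in size")
        # Pairing stops as soon as either list is exhausted.
        return list(zip(source, target))
    raise ValueError(f"unknown mode: {mode}")

pairs = pair_nodes(["s1", "s2"], ["t1", "t2", "t3"], "each_to_corresponding")
product = pair_nodes(["s1", "s2"], ["t1", "t2", "t3"], "all_to_every")
```

With the size check switched on, the same pairwise call would raise an error instead of
silently ignoring the third target node.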
Copy all nodes from the source list as parents of all nodes in the target list
The elements in the source list are treated as a path in an XML document where the
first element is the parent of the second, the second of the third and so on. If the source
list contains whole subtrees, each element in the source list is considered to be the
last child of the previous element. This path is inserted in the document as a common path
of parents for all nodes in the target list. This is done by finding the nearest
common parent of all nodes in the target list (note that this parent always exists). Then
all children of that parent that lie between the first and the last node of the target
list are removed, and the first node of the source is put in their place. The removed
nodes are then added as the last children of the last node in the source list.
When the target XPath is Relative, the above procedure is repeated for each node
in the target list.
Buttons:
The current document can be transformed via an XSL Transformation. The user is asked to
choose a valid XML document which contains the XSLT (it must be an internal document). The
result of the transformation is a new document, which is loaded in the system.
The tool is a regular expression grammar editor. The user can edit an existing grammar or
create a new one. The available grammars can be selected from the combo box at the top of
the dialog. The user can create, remove, rename or update a grammar. In
order to start editing, the user must first create at least one grammar with
the New button. Grammars without a name cannot be edited. After each change, the
grammar has to be updated with the Update button. Grammars that are not updated are
removed from memory when the Exit button is pressed.
Each table line represents one grammar rule. The expressions in the Regular
Expression column (the bodies of the rules) have to match the tokens and mark-up in the
document. This column must never be empty. The expressions in the Left Regular Expression
(left context) and Right Regular Expression (right context) columns determine the context
in which the matched tokens and mark-up must appear for the rule to apply. If no left or
right context is specified, the grammar assumes that all contexts are valid. The XML
markup in the Return Markup column is used for marking the matched data. A comment field
is provided for the user's own commentary.
The Tokenizer combo box determines the current grammar tokenizer. If a tokenizer
is selected, it is used when the system creates the grammar input. If no tokenizer is
selected, the tokenizer from the DTD (Element Features) is used. The Filter
combo box determines the filter that is used when applying the current grammar. The
Matches combo boxes set the match option for the Left Regular Expression
(left context), the Regular Expression (body) and the Right Regular Expression (right
context). The Any Up and Any Down matches in the body combo box can be used for
backtracking. If the Any Up option is selected, the grammar finds the shortest sequence of
tokens and mark-up that is recognized by the body of a grammar rule and is correct
according to the left and right context of this rule. The difference from Shortest match
is that with Shortest match the grammar engine chooses the shortest possible sequence, and
if the left or right context fails, the whole match fails. Example:

In this example the grammar is applied to the following sequence of symbols: b,a,b,c, and
the grammar engine is positioned on the symbol "a". With Shortest match the grammar
uses the second rule, because it gives the shortest possible match; the left context of
this rule fails, and the grammar engine moves on to the next symbol "b". With Any Up match
the grammar chooses the first rule, although it matches a longer sequence.
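The behaviour of the match options can be mimicked with a small Python sketch. It is a
simplification and not the CLaRK engine: rule bodies and contexts are literal symbol
lists rather than regular expressions, and the two rules below are hypothetical, chosen
only so that they behave like the example just described.

```python
def matches_at(rules, seq, pos):
    """Return (body, contexts_ok) for every rule whose body matches at pos,
    ordered from the shortest body to the longest."""
    found = []
    for left, body, right in rules:
        end = pos + len(body)
        if seq[pos:end] != body:
            continue
        left_ok = seq[max(0, pos - len(left)):pos] == left
        right_ok = seq[end:end + len(right)] == right
        found.append((body, left_ok and right_ok))
    return sorted(found, key=lambda item: len(item[0]))

def apply_rule(rules, seq, pos, mode):
    found = matches_at(rules, seq, pos)
    if not found:
        return None
    if mode == "shortest":              # no backtracking
        body, ok = found[0]
        return body if ok else None
    if mode == "longest":               # no backtracking
        body, ok = found[-1]
        return body if ok else None
    # any_up: try shortest first; any_down: try longest first.
    ordered = found if mode == "any_up" else list(reversed(found))
    for body, ok in ordered:
        if ok:
            return body
    return None

# Hypothetical rules: (left context, body, right context).
rules = [(["b"], ["a", "b"], []),      # rule 1: longer body, left context "b"
         (["c"], ["a"], [])]           # rule 2: shorter body, left context "c"
seq = ["b", "a", "b", "c"]
```

Positioned on "a" (index 1), the "shortest" mode picks rule 2 and fails on its left
context, whereas "any_up" falls through to rule 1.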
If the Any Down option is selected, the grammar finds the longest sequence of tokens and
mark-up that is recognized by the body of a grammar rule and is correct according to the
left and right context of this rule. The CLaRK grammar engine provides four modes for
checking the left and right context:
- Left Right - checks the left context first and then the right one.
- Right Left - checks the right context first and then the left one.
- Backtracking Left - checks the left context first and then the right one. If the right
context fails the grammar engine will try to find a longer or shorter sequence of words (depending
on the type of match selected for the left context) in order to use the right context of another
rule instead. Example:

In this example the grammar is applied to the following sequence of symbols: c,c,a,b,a, and
the grammar engine is positioned on the symbol "a". In Left Right mode the grammar engine
uses the first rule because it matches the longest left context, but the rule fails
on the right context. In Backtracking Left mode the grammar engine prefers
the second rule, because it is correct even though it has a shorter left context.
- Backtracking Right - checks the right context first and then the left one. If the left
context fails the grammar engine will try to find a longer or shorter sequence of words (depending
on the type of match selected for the right context) in order to use the left context of another
rule instead.
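The difference between Left Right and Backtracking Left can be sketched in the same
simplified style. The rules are again hypothetical literal sequences, and the sketch
assumes that the longest left context is preferred. It is an illustration, not the
engine's algorithm.

```python
def pick_rule(rules, seq, pos, mode):
    """Return the name of the rule chosen at `pos`, or None if none applies."""
    # Keep the rules whose body and left context both match.
    candidates = []
    for name, left, body, right in rules:
        body_ok = seq[pos:pos + len(body)] == body
        left_ok = seq[max(0, pos - len(left)):pos] == left
        if body_ok and left_ok:
            candidates.append((name, left, body, right))
    # Assumption: the longest left context is tried first.
    candidates.sort(key=lambda rule: len(rule[1]), reverse=True)
    for name, left, body, right in candidates:
        end = pos + len(body)
        if seq[end:end + len(right)] == right:
            return name
        if mode == "left_right":
            return None                 # no backtracking: the first failure is final
    return None

# Hypothetical rules: (name, left context, body, right context).
rules = [("rule 1", ["c", "c"], ["a"], ["c"]),
         ("rule 2", ["c"], ["a"], ["b"])]
seq = ["c", "c", "a", "b", "a"]
```

On the first "a" (index 2), "left_right" stops after rule 1 fails on its right context,
while "backtracking_left" continues to rule 2.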
The XPath expression (Apply to text field) selects the nodes to which the grammar is
applied.
Buttons:
- New - creates a new grammar.
- Update - updates the grammar within the system memory.
- Remove - removes grammar from the system memory.
- Rename - renames a grammar.
- Exit - closes the grammar editor, prompts for unsaved grammars.
- Save Grammar - saves grammar(s) to file.
- Load Grammar - loads grammar(s) from a file.
- Apply Grammar - applies the grammar to the current document (if there
is one) using the grammar's XPath expression. (The user should save the grammar before
applying it in order to use the new grammar settings.)
- The Feature menu gives access to the DTD Element Features and Attribute Features.
For each grammar the user can define element values. The element values are XPath
expressions that are evaluated for the corresponding elements to determine their value
when the grammar is applied. If no element value is defined for an element, it is
taken from the DTD. Here is an example of element values.
Each row of the table represents an element value for an element. Neither column may
be empty. If one element has two element values, the first one is used. When the user
presses the OK button, the XPath expressions in the table are checked for correctness.
This menu item applies a grammar to the current document. In the Choose grammar
field the user selects a grammar to apply. In the Select nodes field the user has to
specify an XPath expression selecting the nodes to which the grammar is applied. If the
grammar has a defined XPath expression, it appears automatically when the grammar is
selected in the Choose grammar combo box. If no XPath expression is entered, the tool
produces an error message. The user can also select a tokenizer and a filter in the
Choose tokenizer and Choose filter combo boxes.

This tool applies one or more grammars or grammar groups to the current document in a
cascaded way. The grammars and groups are added to a list and executed in the order of
its items.

Buttons :
- Insert Grammar - inserts a grammar;
- Insert Group - inserts a grammar group;
- Insert Save - when this point in the queue is reached, the system prompts the user to
save the processed document;
- Remove - removes the chosen item;
- Apply - starts applying the constructed grammar queue to the current document;
- Exit - closes the dialog window.
This menu item executes a grammar on the current document. As a result the tokens and
mark-up recognized by the grammar are selected in the Text Area. Additionally it allows
the user to mark the recognized information with an XML mark-up.

Buttons:
- Search - executes the grammar on the current document;
- Next - finds the next group of tokens and mark-up that matches the grammar;
- Previous - finds the previous group of tokens and mark-up that matches the
grammar;
- Mark - marks the selected data with an XML mark-up taken from the grammar or
written by the user.
- Edit - opens the grammar editor tool;
- Exit - closes this dialog.
This menu item opens the grammar group editor. The user can group grammars
in order to apply them together. Groups can be created, modified, removed or
renamed. Grammar groups enable the user to apply several grammars
in cascade. Grammar groups are applied via the Apply
Multiple system tool.
Buttons :
- Insert Grammar - inserts a new grammar in the grammar group;
- Remove - removes the selected grammar from the grammar group list.
- New - creates a new Grammar group;
- Save - saves the grammar group;
- Rename - renames the Grammar Group;
- Remove Group - removes the currently opened grammar group.
- Exit - closes the dialog window.
The sort tool is used for reordering nodes.
For sorting, the user specifies the following:
- The nodes to be sorted.
- The keys for each node.
The first is done by defining an XPath expression in the Select Elements field.
If the field is empty, the sort tool returns an error message. As context node the
XPath engine uses the node selected in the tree in the main panel. The sort tool
compares only element nodes which have a common parent. It splits the result
of the XPath evaluation into groups according to the parent node, and each group is
sorted separately.
Keys are created for every node to be sorted. Each row in the table represents one
key. The sort tool compares two nodes key by key. A key is the list of nodes returned
by the XPath engine after evaluating the expression defined in the Key column of the
table. The context node of this evaluation is the node for which the key is
created. The other columns of the table hold settings used when comparing
the lists. The lists are compared node by node.
- If the nodes are both elements, the sort tool asks the DTD which one is defined to
be smaller (Element Features).
- If the nodes are both text, they are compared by their textual content.
- Attribute nodes are compared by the textual content of their values only if they
have the same name and their parents are elements with the same name.
- The textual content (text) of text and attribute nodes is compared in the following way:
- The text is compared symbol by symbol.
- If the user chooses a tokenizer, the symbols are compared based on the tokens created
by the primitive tokenizer of the selected tokenizer (the primitive tokenizer that is an
ancestor of the selected tokenizer, or the selected tokenizer itself if it is primitive).
The symbols are compared by their token category (the order of the categories in the
primitive tokenizer) and by their position in the definition of the token category value.
If the normalization option is selected, the sort engine uses the normalization table of
the primitive tokenizer to determine the token category and value of each symbol.
- If the user selects "No Tokenizer", the sort tool uses the Unicode table to
compare symbols. In this case the normalization option means converting capital
letters to small letters for Cyrillic and Latin.
- If you select the reverse option for the key, the text is reversed before the
comparison ("erga" => "agre").
- If you select the trim option for the key, whitespace (TAB, SPACE, LF, CR) is removed
from both ends of the text before comparing.
- If you select the number option for the key, the texts are converted into numbers and
compared by their numeric value.
- If the current nodes are not of the same type, the following order applies:
attribute < text < element.
- If the key for one element has more nodes than the key for another element, and all
nodes of the shorter key are equal to the corresponding nodes of the longer key, the
shorter key is considered smaller.
For each key the user can define a different order ( Ascending | Descending ). The order
of the keys in the table is very important, because this is the order in which they are
used. If two keys have equal nodes but one of them has additional nodes, the one
with the smaller number of nodes is considered smaller.
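The comparison rules above can be summarised in a short Python sketch. It is a
simplification under stated assumptions: a key is modelled as a list of (node type,
value) pairs, element values are compared as plain strings instead of by DTD order, and
no tokenizer is involved. The names prepare and compare_keys are hypothetical.

```python
# Order used when two nodes are of different types: attribute < text < element.
TYPE_RANK = {"attribute": 0, "text": 1, "element": 2}

def prepare(text, trim=False, reverse=False, number=False):
    """Apply the per-key options to a textual value before comparison."""
    if trim:
        text = text.strip(" \t\n\r")
    if reverse:
        text = text[::-1]
    return float(text) if number else text

def compare_keys(key_a, key_b):
    """Compare two keys node by node; return -1, 0 or 1."""
    for (type_a, value_a), (type_b, value_b) in zip(key_a, key_b):
        if type_a != type_b:
            return -1 if TYPE_RANK[type_a] < TYPE_RANK[type_b] else 1
        if value_a != value_b:
            return -1 if value_a < value_b else 1
    # All shared nodes are equal: the key with fewer nodes is smaller.
    return (len(key_a) > len(key_b)) - (len(key_a) < len(key_b))

short_key = [("text", "a")]
long_key = [("text", "a"), ("text", "b")]
```

A Descending order for a key would simply negate the result of compare_keys for that key.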
The difference between the DTD sort and the Advanced one is that in the DTD sort the sort
tool takes the tokenizer and the number option from the DTD (Element Features, Attribute
Features). For attribute nodes it also takes the order of enumerated values from the DTD.
Examples:
- Example 1: Sorting books by pages and title. The elements to sort are the book children
of the context node. They are sorted by the content of their pages element and title
element. Key 1 is the text in the pages element of the book. It is trimmed and
converted to a number when sorting. This key needs no tokenizer because the whole
node is converted to a number. If two elements are equal according to the first key
(two books have the same number of pages), they are compared based on the second key.
Key 2 is the text in the title element of the book. It is trimmed and normalized when
sorting. For normalization the sort tool uses the normalization defined in the
"Mixed Word" tokenizer. The order of this key is descending. This means that
this key sorts books by title in reverse order.
- Example 2: Sorting TEI divisions by their heads. The sort tool takes all divisions in
the document and sorts them according to the text in their head element. If a
division does not have a head element, it is considered smaller.
Example 1
Example 2
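Example 1 can be approximated outside the system with the following Python sketch. The
sample document is invented for illustration, and lower-casing the title stands in for
the normalization of the "Mixed Word" tokenizer, which is an assumption.

```python
import xml.etree.ElementTree as ET

# Invented sample data; element names follow Example 1.
library = ET.fromstring(
    "<library>"
    "<book><title> Beta </title><pages> 300 </pages></book>"
    "<book><title>alpha</title><pages>120</pages></book>"
    "<book><title>Gamma</title><pages>300</pages></book>"
    "</library>"
)

def pages_key(book):
    # Key 1: trimmed and converted to a number.
    return float(book.findtext("pages").strip())

def title_key(book):
    # Key 2: trimmed and normalized (here simply lower-cased).
    return book.findtext("title").strip().lower()

books = library.findall("book")
# Python's sort is stable, so sort by the secondary key first.
books.sort(key=title_key, reverse=True)   # Key 2, descending
books.sort(key=pages_key)                 # Key 1, ascending
library[:] = books                        # put the children back in sorted order

titles = [title_key(book) for book in library]
```

The two books with 300 pages end up ordered by title in descending order, after the
shorter book.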
This menu item opens the Tokenizer Editor dialog. The user can select different
tokenizers using the "Tokenizer" combo box. Note that there is always at least
one tokenizer in the system, because the "Default" tokenizer is not editable and
cannot be removed.
The user can create, remove or save a tokenizer. Each row in the table represents one
tokenizer category. The user can add, remove and reorder rows with the menu shown on a
right click over a row in the table. The first column is the category name. The
content of the second column depends on the type of the tokenizer: it contains the
category value (all the symbols in the category) if the tokenizer is "primitive",
or a regular expression if the tokenizer is "non-primitive". Here is an example:
This is a primitive tokenizer whose name is Mixed. It has the categories LAT, CYR, SYMBOL
... Category LAT represents the Latin symbols. Category NUMBER represents the digits from
0 to 9...
Buttons:
- Save - saves the current tokenizer.
- Remove - removes tokenizer.
- Sort Order - defines the order of categories of a primitive tokenizer.
- Change Parent- changes the parent of a non-primitive tokenizer.
In order to create a new tokenizer the user must press the New button and set
the options in the new tokenizer dialog.
If the tokenizer is primitive, the user must select the Primitive check box.
Otherwise the user must indicate the parent of the tokenizer in the Parent combo
box. The user can use an existing tokenizer as the basis for the new one with the Use
current button. That tokenizer has to be the one currently shown in the dialog window.
When defining the category value for a primitive tokenizer the user should be aware of
the following rules:
- The characters are quoted with single or double quotation marks or referenced by a number.
Example: "." or ";" or 'k' or 32 (Space)
- To use more than one symbol in a category, the user should separate the
symbols by commas. Example: "a","b","c",...
- To define a range of symbols, the starting and ending symbols must be
connected with a dash. Example: "A"-"Z".
- One character cannot belong to more than one category.
- A category can be defined on more than one row. The rows are interpreted as the union
of their values. Here is an example:
The tokenizer tool will interpret these lines as LAT "'a'-'z','A'-'Z'".
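The category value syntax described by these rules can be parsed with a short Python
sketch. This is an illustrative reading of the rules, not the parser used by the system;
the function name parse_category_value is hypothetical.

```python
import re

# One item: "x", 'x', a decimal character code, a dash, or a comma.
TOKEN = re.compile(r"""\s*(?:"(.)"|'(.)'|(\d+)|(-)|(,))""")
RANGE = object()   # marker for the dash between two range ends

def parse_category_value(spec):
    """Return the set of characters described by a category value."""
    spec = spec.rstrip()
    items, pos = [], 0
    while pos < len(spec):
        match = TOKEN.match(spec, pos)
        if not match:
            raise ValueError(f"cannot parse category value at offset {pos}")
        double, single, code, dash, comma = match.groups()
        if dash:
            items.append(RANGE)
        elif not comma:
            items.append(double or single or chr(int(code)))
        pos = match.end()
    chars, i = set(), 0
    while i < len(items):
        if items[i] is RANGE:
            raise ValueError("a range needs a start and an end symbol")
        if i + 2 < len(items) and items[i + 1] is RANGE:
            start, end = ord(items[i]), ord(items[i + 2])
            chars.update(chr(c) for c in range(start, end + 1))
            i += 3
        else:
            chars.add(items[i])
            i += 1
    return chars

# A category defined on two rows is the union of both rows.
lat = parse_category_value("'a'-'z'") | parse_category_value("'A'-'Z'")
mixed = parse_category_value("'a'-'c', \"x\", 32")
```

The rule that one character cannot belong to more than one category could be checked by
intersecting the resulting sets of different categories.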
Here is one example of a primitive tokenizer :
For each primitive tokenizer the user can define the sort order of the categories by
clicking the Sort Order button. Example:
This dialog is shown if the user presses the Sort Order button while the
Mixed tokenizer (the tokenizer in the first example) is on the screen. The user can reorder
the categories with the menu shown on a right click over a row in the table.
A normalize option can also be defined for each category of the tokenizer by right
clicking on the category line in the table. For instance, the following dialog appears
if we select normalize for the "LAT" category:
For each symbol of the category the user must select a corresponding normalized symbol.
The "New Category" combo box determines the category that the symbols
receive after the normalization. This category can coincide with the original category of
the symbols or be completely different.
When defining a non-primitive tokenizer, the user should observe the following rules:
- Each category Regular Expression must be a valid regular expression.
- Two categories cannot have a common token.
Each non-primitive tokenizer must have a parent tokenizer. The parent is
set when the tokenizer is created and can be changed with the Change
Parent button. Here is an example of a non-primitive tokenizer:
The parent of this tokenizer is the "Mixed" primitive tokenizer shown in the
first example. This non-primitive tokenizer uses the categories "LAT" and
"CYR" from the parent tokenizer.
The user can undo changes made to a tokenizer with the Undo button. Tokenizers can
be loaded from or saved to an external file. The Exit button closes the tokenizer editor
dialog. The system prompts the user to save all unsaved tokenizers on exit.
This menu item starts the filter editor. In order to browse the filters in the system,
the user can use the "Filter" combo box at the top of the dialog. The user can
add token categories from different tokenizers or add XPath expressions to filter element
nodes.
The "Token Types" list is the list of the filtered token categories. The user
can take categories from the tokenizers in the system (the "Choose From"
list on the left side of the dialog) and add them to the list of filtered token categories
with the arrow ("=>") button. In order to add an XPath expression for element
filtering, the user must press the "add XPath" button. The new XPath expression
is added to the "Expression" list. To remove a token category or an XPath expression from
one of the lists, the user must press the corresponding "Remove" button. The user
can remove and save filters or create new ones.
The Element Features tool is used to add information to the elements of a DTD.
The user can add the following additional information:
- Tokenizer for the elements of the DTD. Used for tokenization of the element text data by
the Grammar and Sort engines.
- Default Tokenizer for a DTD. If no tokenizer is defined for an element, the grammar and
sort engines fall back to the DTD default tokenizer. After compilation, the DTD receives
the "Default" primitive tokenizer as its default tokenizer.
- The user can state that the content of an element is a number.
- The user can define an XPath Value for each DTD element.
- The user can define an order over the DTD elements. This option is used for sorting.
In order to select the default tokenizer for the DTD, the user must select an item in
the "Default Tokenizer" combo box. The user can select a tokenizer for each
element of the DTD in the "Tokenizer" column of the table by clicking on a table
cell. The check boxes in the "Number" column are used by the sort tool to
determine whether the content of the current element can be treated as a number. For
example, pages and price can be treated as numbers when two books are compared. The
values in the "XPath Value" column are used by the Grammar Engine to define the
value of the element nodes.

The order of the elements can be defined in the sort table shown when the user presses
the "Sort Order" button. When two elements are compared, their position in the sort table
defines their order. The user can change the position of elements by dragging their rows
to the correct positions or by using the context menu opened by a right click on a sort
table row. Here is an example:


The Attribute Features tool is used to add information to the attributes of a DTD.
The "Element" column of the table lists all the elements in the DTD that
have an attribute. One element can appear on several rows because it can have more than
one attribute. The "Attribute" column lists all the attributes of the
DTD. In the "Tokenizer" column the user can select a different tokenizer for each
attribute. The sort tool uses the information from the "Number" column to decide
how to compare the values of the attributes (as plain text or as numbers).

An additional feature is the order of enumerated attribute values. Attributes with
enumerated values have the string "(e)" at the end of their name. In order to sort the
enumerated values, click on an attribute with enumerated values and then click the "Sort
Values" button. Example:
The values are sorted in ascending order. The user can change the order by dragging the
rows or by using the right mouse button.
The extract tool task is to extract nodes from a document or from multiple documents
and to save them as a new document. The text field at the top of the dialog is used for
defining an XPath expression which selects the elements in the document(s). The context
node for this evaluation is the root node of the document(s). The user can extract from
the currently active document in the system or from the internal documents. The result
from the extraction is an XML document in which all extracted nodes are children of the
root element (This element is named "Extract" by the system).
The "include subtree" option extracts not only the selected nodes but also the entire
subtrees below them.
Auxiliary tag (prompts for attribute name) - the extracted node can have a parent
element which is used to separate the results. For example, if only text nodes are
extracted without an auxiliary tag, all the text nodes are merged in the new document. If
the auxiliary tag is selected, the "Number" and "Source" options are shown in the dialog.
If "Number" is selected, the extract tool adds an attribute with the number of the
extraction result to the auxiliary tag.
If "Source" is selected, the extract tool adds an attribute with the name of the source
document to the auxiliary tag.
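The behaviour of the extract options can be sketched in Python with the standard xml.etree.ElementTree module. This is only an illustration under several assumptions, not the CLaRK implementation: the function name extract, the attribute names "n" and "src" and the auxiliary tag name "item" are hypothetical, and ElementTree supports only a limited subset of XPath.

```python
import copy
import xml.etree.ElementTree as ET


def extract(documents, xpath, include_subtree=True, aux_tag=None,
            number=False, source=False):
    """Collect the nodes selected by `xpath` from several documents.

    `documents` maps a document name to its XML text. All extracted nodes
    become children of a new root element named "Extract".
    """
    root = ET.Element("Extract")
    count = 0
    for name, xml_text in documents.items():
        doc_root = ET.fromstring(xml_text)
        for node in doc_root.findall(xpath):
            count += 1
            if include_subtree:
                clone = copy.deepcopy(node)  # node together with its subtree
            else:
                clone = ET.Element(node.tag, dict(node.attrib))  # node only
            clone.tail = None  # drop text that followed the node in the source
            parent = root
            if aux_tag is not None:
                # The auxiliary tag separates the individual results.
                parent = ET.SubElement(root, aux_tag)
                if number:
                    parent.set("n", str(count))  # hypothetical attribute name
                if source:
                    parent.set("src", name)      # hypothetical attribute name
            parent.append(clone)
    return root


docs = {
    "a.xml": "<text><p><w>one</w> <w>two</w></p></text>",
    "b.xml": "<text><w>three</w></text>",
}
result = extract(docs, ".//w", aux_tag="item", number=True, source=True)
print(ET.tostring(result, encoding="unicode"))
```

Each extracted w element ends up inside its own auxiliary element, which carries the running result number and the name of the document it came from.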

The Statistic Dialog is used to show information about the number of occurrences of
selected elements.
The user selects elements with an XPath expression (the "Search" field). It is
recommended always to check this expression in the system's search tool first.
The elements are selected from the internal documents in the system. The XPath expression
is applied to the root element of every document selected in the standard "internal
document selector".
The selected elements are sorted by sort keys identical to the keys described in the
sort tool manual. Because the information is extracted from multiple documents and a
single DTD cannot be determined, only the "Advanced" sort mode is available.
The user can choose a tokenizer from all tokenizers defined in the system (the
"Tokenizer" combo box) in order to process the text nodes. The "Tokenizer
Category Filter" is used to select among the categories produced by
tokenization with the selected tokenizer. If no tokenizer is selected, each text node is
processed as a whole. Only the tokens with the chosen categories are shown in the
result.
Buttons: the "OK" button shows the result; the "Cancel" button closes the window.
The result is shown as a table.

The "Category" column contains the categories from the filter which occur in the chosen
text nodes, "<Element>" if the row represents an element node, or
"#text" if the node is a text node.
The "Element" column contains the tokenized text (the values of the filtered
tokens) or the node names.
The "#" column contains the number of occurrences of the corresponding item.
The "%" column contains the percentage of the corresponding item.
The "Key Value" column contains the value of the sort keys created for the
corresponding node, or nothing if the line contains a token.
The result can be saved as an XML document which follows the structure of the result
table. The DTD of the document is as follows:
<!DOCTYPE statistics [
<!ELEMENT statistics (documents, item*, all)>
<!ELEMENT documents (document+)>
<!ELEMENT document (#PCDATA)>
<!ELEMENT item (category, element, number, percent)>
<!ELEMENT category (#PCDATA)>
<!ELEMENT element (#PCDATA)>
<!ELEMENT number (#PCDATA)>
<!ELEMENT percent (#PCDATA)>
<!ELEMENT keyvalue (#PCDATA)>
<!ELEMENT all (number, percent)>
]>
The documents tag lists the documents selected for the statistics; each
document name appears in a document tag. Each item tag corresponds to a
line of the result table, as follows:
- category tag corresponds to the "Category" column
- element tag corresponds to the "Element" column
- number tag corresponds to the "#" column
- percent tag corresponds to the "%" column
- keyvalue tag corresponds to the "Key Value" column
The all tag corresponds to the last row of the result table. It contains the
total number of occurrences of the selected elements and the corresponding percentage.
Pressing the "Cancel" button closes the window and returns to the main window of
the system.
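The saved statistics structure described above can be sketched in Python as follows. This is an illustration only: the function name build_statistics, the two-decimal percent format and the input representation (a list of category/element pairs) are assumptions, and the sketch follows the item content model of the DTD, which does not include keyvalue.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_statistics(doc_names, items):
    """Build a <statistics> element from a list of (category, element) pairs.

    Each distinct pair becomes one <item>; <all> holds the totals.
    """
    counts = Counter(items)
    total = sum(counts.values())
    stats = ET.Element("statistics")
    docs = ET.SubElement(stats, "documents")
    for name in doc_names:
        ET.SubElement(docs, "document").text = name
    for (category, element), n in counts.most_common():
        item = ET.SubElement(stats, "item")
        ET.SubElement(item, "category").text = category
        ET.SubElement(item, "element").text = element
        ET.SubElement(item, "number").text = str(n)
        # Percent format is an assumption made for this sketch.
        ET.SubElement(item, "percent").text = f"{100 * n / total:.2f}"
    all_row = ET.SubElement(stats, "all")
    ET.SubElement(all_row, "number").text = str(total)
    ET.SubElement(all_row, "percent").text = "100.00" if total else "0.00"
    return stats


selected = [("<Element>", "p"), ("<Element>", "p"),
            ("<Element>", "hi"), ("<Element>", "p")]
stats = build_statistics(["a.xml"], selected)
print(ET.tostring(stats, encoding="unicode"))
```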
Concordance - a system tool for information extraction. The field at the top of the
dialog is used to define the context. If the XPath expression in it is invalid, no elements
will be extracted. The root of the document is used as the context node. The user can switch
between two types of search with the tabbed pane. The result from the concordance is an
XML document in which the found data is separated into lines. Each line is an XML element
with the following structure:
<L>
<LC> "the left context" </LC>
<I> "the data we are searching for" </I>
<RC> "the right context" </RC>
<COM> "Element for user commentary" </COM>
</L>
1. The grammar search - The user must select a grammar or create one of his/her own to
search with. To create a grammar, select the "<custom grammar>" item in the
combo box and press the edit button. The concordance tool will open a standard grammar
editor with reduced options. After the user has finished editing, the system asks whether
he/she wants to use the created grammar. If the answer is no, the system will presume that
there is still no grammar. The user should press the "Save" button if he/she wants to
save the newly created grammar in the system memory. If the user selects a grammar from
the list, this grammar will be used for searching. If the edit button is pressed while a
grammar from the list is selected, the system will open the selected grammar in a standard
grammar editor. On exit the user will be asked whether he/she wants to use this grammar.
If the answer is positive, the modified grammar will appear in place of the custom
grammar in the combo box. The save option is available only for the custom and modified
grammars. The user can select another grammar to restrict the context in which the search
grammar will be applied. If the "Text Only" check box is selected, the grammar
will ignore the mark-up inside the context while checking.
2. The XPath search - If the user searches inside the context with an XPath expression,
the search field (the top text field in the tabbed pane) should contain a valid XPath
expression. All nodes selected by this expression will become items in the concordance
result. If the fields for the left or right context are not empty, the concordance tool
will use the expressions in them to form the left and right contexts of the items. If they
are empty, the contexts will be generated automatically.

If "Add Source Attribute ?" is selected, the concordance tool adds to every extracted
line an attribute with the name of the source file. If "Add
Number Attribute ?" is selected, all extracted lines receive a number attribute.
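The line structure of the concordance result can be sketched in Python. The sketch below is a simplification and not the way CLaRK works: it searches plain text with a regular expression instead of a grammar or an XPath expression, cuts fixed-width contexts, and uses the hypothetical attribute names "src" and "n" for the source and number attributes.

```python
import re
import xml.etree.ElementTree as ET


def concordance(text, pattern, width=30, source=None, number=False):
    """Return a list of <L> elements, one for each match of `pattern`."""
    lines = []
    for i, match in enumerate(re.finditer(pattern, text), start=1):
        line = ET.Element("L")
        if source is not None:
            line.set("src", source)  # hypothetical attribute name
        if number:
            line.set("n", str(i))    # hypothetical attribute name
        left_start = max(0, match.start() - width)
        ET.SubElement(line, "LC").text = text[left_start:match.start()]
        ET.SubElement(line, "I").text = match.group(0)
        ET.SubElement(line, "RC").text = text[match.end():match.end() + width]
        ET.SubElement(line, "COM")   # empty element for user commentary
        lines.append(line)
    return lines


lines = concordance("the cat sat on the mat", r"\bthe\b",
                    width=8, source="a.xml", number=True)
for line in lines:
    print(ET.tostring(line, encoding="unicode"))
```

Every match produces one L element with the same four children, so the result can be processed line by line.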
The Table View tool presents the information extracted by the
concordance tool in a more readable table form. Each row of the table represents one line
of the concordance result. The data in the "Context" columns does not represent
the whole context but only the amount of data that fits in the column length.
Initially it is only 30 symbols. To increase the context, the user should press the
settings button and set the context length in symbols there. The user can also set
the width of the comment column. To see the whole context without expanding
the column, the user can right-click on the "Left Context" or
"Right Context" column. To add a commentary to a concordance
line, the user can fill in a value in the "comment" column or right-click
a row in the "item" column. To navigate quickly through the table, the user can use
the combo box at the top to jump to a row. The user can also sort the lines of the table.
For each sort key, the user must select the column to which it applies (that is, which
element of the concordance line serves as the context: LC, I, RC or COM). If no column is
selected, the key is evaluated with the line element as context.
A useful option of the Table View is "Edit Layout". The user can filter the
tags that are shown in the table. For example, if the POS information is kept in a separate
tag, the user can hide it in order to view only the text.
The following item gives information about the number of occurrences of specified tags
or tokens in a set of internal documents. When the user starts this tool, s/he is asked to
provide several things:
- The type of information which is needed. The two possibilities are: counting tags and
counting tokens.
- The documents for which information is needed.
- An XPath expression which selects the nodes in each document for which the counting
will be performed.
Here is a screen-shot of the initial dialog window:

The major components of the dialog window are:
- XPath Field - selects the nodes in each document for which the counting will be
performed. If tags are counted, then for each node selected by this XPath
expression its descendant nodes are counted. If tokens are counted, then for each
text node in the selection its text content is tokenized and the resulting tokens are
counted.
- Tokenizer Selector - determines which tokenizer will be used when tokenizing the
text nodes for token counting. This component is disabled in case of tag counting.
- Info Type Selector - determines the type of elements which will be counted. The
options are: "Word Info" - for token counting and "Tag Info" - for tag
counting.
- Document Selector - this component is responsible for selecting the documents from
the internal document database to which the counting will be applied. This is a
universal component of the CLaRK system. For more information see Document Selector in menu File.
- Show Info button - starts calculating the information for the selected documents.
- Cancel button - closes the window and cancels further processing.
If the Show Info button is pressed, the system starts to process the selected
documents one by one. While the documents are processed, the status bar of the system
shows the current state of the process. Having processed all selected documents, the
system shows the result in a new window. Here are two example results, one for Word Info
and one for Tag Info:
- Word Info

The first column, Document, contains the names of the documents chosen in the
first dialog.
The second column, Category, contains the categories of the tokenizer which the
user has chosen.
The third column, #, contains the number of occurrences of each category in the text.
The content of the table can be saved in a file if the Save in file checkbox is
selected. The result file is an XML document. When the user presses the OK
button, s/he will be asked to supply a file name and a directory with a standard file
chooser.
If the Add information checkbox is selected, the relevant information will be
added to each of the documents. The word information added to a document has the following
form:
<extent>
<interpGrp>
<interp type ="LATw" value="20"></interp>
<interp type ="TAB" value="1"></interp>
<interp type ="CYRw" value="9350"></interp>
<interp type ="NUMBER" value="237"></interp>
<interp type ="SYMBOL" value="129"></interp>
<interp type ="PUNCT" value="1720"></interp>
<interp type ="SPACE" value="9376"></interp>
</interpGrp>
</extent>
If the DTD of a document is TEI, <extent> is added in the appropriate
position. Otherwise, <extent> is added after the first node.
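The token counting behind Word Info can be sketched in Python. The category names are taken from the sample above, but their definitions here are assumptions made for the illustration; in CLaRK the categories come from the tokenizer selected in the dialog. The function name word_info is also hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical category definitions, tried from left to right.
TOKEN_RE = re.compile(
    r"(?P<LATw>[A-Za-z]+)"
    r"|(?P<CYRw>[\u0400-\u04FF]+)"
    r"|(?P<NUMBER>[0-9]+)"
    r"|(?P<TAB>\t)"
    r"|(?P<SPACE> +)"
    r"|(?P<PUNCT>[.,;:!?])"
    r"|(?P<SYMBOL>.)",
    re.DOTALL,
)


def word_info(text):
    """Count token categories and wrap the counts in an <extent> element."""
    # lastgroup is the name of the alternative that matched the token.
    counts = Counter(match.lastgroup for match in TOKEN_RE.finditer(text))
    extent = ET.Element("extent")
    group = ET.SubElement(extent, "interpGrp")
    for category, n in counts.items():
        ET.SubElement(group, "interp", type=category, value=str(n))
    return extent


extent = word_info("дума word 42.")
print(ET.tostring(extent, encoding="unicode"))
```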
- Tag Info
The first column, Document, contains the names of the documents chosen in the
first dialog.
The second column, Tag, contains the tag names of all nodes which the user has selected
with the XPath expression in the first dialog.
The third column, #, contains the number of occurrences of each tag in the documents.
The content of the table can be saved in a file if the Save in file checkbox is
selected. The result file is an XML document. When the user presses the OK
button, s/he will be asked to supply a file name and a directory with a standard file
chooser.
If the Add information checkbox is selected, the relevant information is added
to each of the documents. The information added to each document has the following form:
<encodingDesc>
<tagsDecl>
<tagUsage gi="hi" occurs="40"></tagUsage>
<tagUsage gi="p" occurs="153"></tagUsage>
</tagsDecl>
</encodingDesc>
If the DTD of a document is TEI, <encodingDesc> is added in the
appropriate position. Otherwise, it is added after the first node.
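Tag counting can be sketched in Python in a similar way. This is an illustration under assumptions, not the CLaRK implementation: the function name tag_info is hypothetical, only the descendants of the selected nodes are counted (not the selected nodes themselves), and ElementTree supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_info(xml_text, xpath):
    """Count the tags below the nodes selected by `xpath`."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for node in root.findall(xpath):
        # iter() also yields the node itself, so it is skipped here.
        counts.update(d.tag for d in node.iter() if d is not node)
    desc = ET.Element("encodingDesc")
    decl = ET.SubElement(desc, "tagsDecl")
    for tag, n in sorted(counts.items()):
        ET.SubElement(decl, "tagUsage", gi=tag, occurs=str(n))
    return desc


doc = "<text><body><p>a <hi>b</hi></p><p>c</p></body></text>"
desc = tag_info(doc, "body")
print(ET.tostring(desc, encoding="unicode"))
```

If the XPath expression selects nested nodes, this sketch counts the tags of the inner node twice; whether CLaRK does the same is not stated in this manual.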