
CLaRK System Online User Manual


Menu Tools

Entity Converters

This tool handles documents containing symbols that are not supported by the local hardware architecture. It substitutes such symbols with entities according to the ISO 8879 standard, and vice versa. Currently the tool supports 19 sub-sets of entity-char conversions, each of which can be activated or deactivated. The more sub-sets are active, the more time the conversion takes. Another reason for excluding a sub-set is that sometimes not all symbols have to be converted, for example commas, dots, colons and semicolons.

Example: ("дума" in Bulgarian is the equivalent of "word")

"дума" <-- entity conversion --> "&dcy;&ucy;&mcy;&acy;"

The tool operates on the document which is currently opened in the system. It can be started from the main menu:

  • Tools/Entity Converters/Char --> Entity - Converts all symbols covered by the currently active conversion sub-sets into entities. The active sub-sets can be seen in Tools/Entity Converters/Entity Management.
  • Tools/Entity Converters/Entity --> Char - Performs the opposite conversion, i.e. from entities to symbols (characters).
  • Tools/Entity Converters/Entity Management - Opens the manager window in which the different sub-sets of entity-char converters are activated and deactivated.
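
The round trip between characters and entities can be sketched in Python as follows. This is only an illustration, not the CLaRK implementation. It assumes that the named character references in Python's html.entities.html5 table (which include ISO 8879 Cyrillic names such as dcy) can stand in for one conversion sub-set. The names build_filter, char_to_entity and entity_to_char, as well as the chosen code point range, are hypothetical.

```python
import re
from html.entities import html5


def build_filter(first, last):
    """Build one hypothetical conversion sub-set: code points first..last -> entity."""
    table = {}
    for name, value in sorted(html5.items()):
        # Keep only well-formed names (ending in ';') that denote a single character.
        if name.endswith(";") and len(value) == 1 and first <= ord(value) <= last:
            table.setdefault(value, "&" + name)
    return table


def char_to_entity(text, active_filters):
    """Replace every character covered by an active filter with its entity."""
    merged = {}
    for table in active_filters:
        for char, entity in table.items():
            merged.setdefault(char, entity)
    return "".join(merged.get(char, char) for char in text)


def entity_to_char(text, active_filters):
    """Replace every entity known to an active filter with its character."""
    reverse = {entity: char for table in active_filters for char, entity in table.items()}
    return re.sub(r"&[A-Za-z0-9]+;", lambda m: reverse.get(m.group(0), m.group(0)), text)


CYRILLIC = build_filter(0x0400, 0x04FF)  # assumed range for a Cyrillic sub-set
encoded = char_to_entity("дума, дума", [CYRILLIC])
decoded = entity_to_char(encoded, [CYRILLIC])
```

Because only the Cyrillic sub-set is active in this sketch, the comma and the space are left untouched, which mirrors the reason given above for deactivating sub-sets.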

The dialog window:

The window shows the current active converters (filters). Each of them can be deactivated by removing it from this list (button 'Remove filter').

To see the symbols (and the corresponding entities) of a filter, the user can press the View filter button. A table then appears with detailed information about the filter. Each row represents one entity-symbol pair. The table has three columns: the first holds the entities, the second the Unicode code points of the symbols, and the third the symbols themselves.

To activate one or more filters that are not yet in the list, the user can press the Add filter button. A new dialog window appears with a list of all available filters that are not active. Filters are activated by selecting the check box next to each of them. The content of the currently selected (non-active) filter can be inspected before adding it (button Preview). Optionally, all filters can be added with the Add All button.

When the entity filter management is complete, the Done button saves the new settings and closes the window.

There are two more buttons (Apply "Entity --> Char" and Apply "Char --> Entity"), which apply the corresponding conversion to the current document. They perform the same action as the conversion tools in the main menu of the CLaRK System. In addition, any changes to the active filters are saved.

Transformations

XPath Transformations

This dialog is used for various "mass" commands for document restructuring. The general scenario is the following: (1) a list of nodes (subtrees, text elements) is chosen by the Source field, which defines what will be copied or moved in the document; (2) a list of nodes is chosen by the Target field, which defines the place(s) where the source elements will be copied or moved; (3) the elements from the source list are attached to the elements of the target list. Several options define how this is done: whether the source elements are copied or cut from the document before being attached to the target, and how source and target elements are mapped, i.e. whether all elements of the source are attached to each element of the target, or each element of the source is attached to the corresponding element of the target.

A more detailed description of each field and action follows.

Source

This is a description of the data that is to be copied or moved. The description can be an XPath expression, XML markup data or plain text. If it is an XPath expression, it is evaluated to a list of nodes in the document selected in the combo box under the Source field. The XPath expression is evaluated before the document is actually changed. If the source is XML markup data, it need not have a root element; the markup is parsed to a list of XML nodes. If only text is given, the text is treated as a one-element list.

Copy versus Cut

When the Source is defined by an XPath expression, the elements in the resulting list can either be kept in the document or deleted from it before the rest of the processing. With Copy, a copy of the source list is created first and the operation proceeds with that copy. With Cut, the elements are first removed from the tree and then the operation continues. When the elements are cut, the destination XPath cannot be relative.
Note: If you cut the elements, they will not be present in the tree when the destination XPath is evaluated.

Include subtree check box

If this check box is on, then the source list contains for each chosen node the entire subtree under the chosen node. If it is off, then only the local information for each node is put in the source list. The local information includes the tag name and the set of attributes as well as their values. When only a node with the local information is chosen and it has to be cut then its children are inserted as immediate children of its parent. The insertion is made in the position of the deleted node.
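
The case of cutting a node without its subtree can be illustrated with a small Python sketch based on xml.etree.ElementTree. It is not CLaRK code: the function name cut_local_node is hypothetical, and the sketch ignores any text directly inside the removed node.

```python
import xml.etree.ElementTree as ET


def cut_local_node(parent, node):
    """Cut node without its subtree: its children take its place under parent."""
    index = list(parent).index(node)
    children = list(node)
    parent.remove(node)
    for offset, child in enumerate(children):
        parent.insert(index + offset, child)
    # The source list receives only the tag name and the attributes.
    return ET.Element(node.tag, dict(node.attrib))


doc = ET.fromstring('<p><b kind="x"><i>1</i><u>2</u></b><s>3</s></p>')
local = cut_local_node(doc, doc.find("b"))
result = ET.tostring(doc, encoding="unicode")
```

After the call, the former children of b sit directly under p at the position b used to occupy, and local carries only the tag name and the attribute.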

Treat source as XML markup check box

This check box controls how the Source field is treated. If it is on, the source is treated as XML markup data. If the XML markup data doesn't contain tags, it is treated as text.
If the check box is off, the content of the Source field is treated as an XPath expression.
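
A minimal Python sketch of how markup without a root element might be turned into a node list is given below. The wrapper element and the function name parse_source are assumptions made for the illustration, as is the decision to drop whitespace-only text between elements; the manual does not say how CLaRK handles these details.

```python
import xml.etree.ElementTree as ET


def parse_source(markup):
    """Turn the Source field content into a list of nodes (elements or text)."""
    # A dummy root lets the parser accept markup without a root element.
    wrapper = ET.fromstring("<wrapper>" + markup + "</wrapper>")
    if len(wrapper) == 0:
        return [wrapper.text or ""]  # no tags: the text is a one-element list
    nodes = []
    if wrapper.text and wrapper.text.strip():
        nodes.append(wrapper.text)
    for child in wrapper:
        tail, child.tail = child.tail, None
        nodes.append(child)
        if tail and tail.strip():
            nodes.append(tail)
    return nodes


text_only = parse_source("just some text")
mixed = parse_source("<a/>between<b>inside</b>")
kinds = [n if isinstance(n, str) else n.tag for n in mixed]
```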

Target

An XPath expression defining the target list of nodes, i.e. the nodes where the source will be included.

Absolute or Relative?

When Absolute, the XPath is calculated from the beginning of the document selected in the document combo box under the destination field, e.g. the XPath "self::*" will return the document element.

When Relative, the XPath is calculated for every node from the source with this node as a context. As a result you get a list of target nodes for each node in the source.
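
The difference can be illustrated with Python's xml.etree.ElementTree, which supports only a limited subset of XPath. The document and the paths below are invented for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc><sec><head>A</head><p>1</p></sec>"
    "<sec><head>B</head><p>2</p><p>3</p></sec></doc>"
)
sources = doc.findall("sec")

# Absolute: the target path is evaluated once, from the document element.
absolute_targets = doc.findall("sec/p")

# Relative: the target path is evaluated once per source node,
# with that node as the context, giving one target list per source node.
relative_targets = [source.findall("p") for source in sources]
```

Here the absolute evaluation yields one list with three p nodes, while the relative evaluation yields two lists, one per sec node.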

Insert node(s)

Defines the position where the source is to be included, relative to a node in the target list.

As a parent

The nodes from the source become parents (ancestors) of the destination nodes.

As a child

The nodes from the source become children of the destination nodes at the indicated position. If the number in the box is less than 0 or is not a number, they become the last children of the destination nodes.

As a sibling

The nodes from the source become siblings of the destination nodes at the indicated position. 1 means next sibling, 2 means the sibling after the next sibling and so on. -1 means previous sibling, -2 means the sibling before the previous sibling and so on.

Options

Copy all nodes in the source list to every node in the target list

This is allowed only when the target XPath is Absolute. It works according to the selected position.

  • as parent

    The elements in the source list are treated as a path in an XML document where the first element is the parent of the second, the second of the third and so on. If the source list contains whole subtrees, each element in the source list is considered the last child of the previous element. The constructed path is inserted in the document so that the first element is inserted in the place of the node from the target list and the target node is inserted as the last child of the last element of the source list. For each node in the target list a new copy of the source list is taken.

  • as child

    The elements in the source list are inserted as children of each node in the target list in the ordering in which they appear in the source list. The insertion is done at the indicated position. For each node in the target list a new copy of the source list is taken.

  • as sibling

    The elements in the source list are inserted as siblings of each node in the target list in the ordering in which they appear in the source list. The insertion is done at the indicated position. For each node in the target list a new copy of the source list is taken.

Although the program always makes a copy, if cut is selected the nodes in the source list will be deleted.
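
The "as child" variant of this option can be sketched in Python with xml.etree.ElementTree. The function name copy_all_as_children is hypothetical, and treating the position as a 0-based insertion index is an assumption; the manual only states that a negative or non-numeric value appends the nodes as last children.

```python
import copy
import xml.etree.ElementTree as ET


def copy_all_as_children(root, source_path, target_path, position=-1):
    """Attach a fresh copy of every source node to every target node."""
    sources = root.findall(source_path)  # evaluated before any change is made
    targets = root.findall(target_path)
    for target in targets:
        clones = [copy.deepcopy(node) for node in sources]  # new copy per target
        if position < 0:
            target.extend(clones)  # become the last children
        else:
            for offset, clone in enumerate(clones):
                target.insert(position + offset, clone)  # assumed 0-based index
    return root


doc = ET.fromstring("<doc><note>n</note><p>a</p><p>b</p></doc>")
copy_all_as_children(doc, "note", "p")
result = ET.tostring(doc, encoding="unicode")
```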

Copy each node from the source list into the corresponding node in the target list

This option allows pairwise inclusion of the source list into the target list. When the target is Absolute, the first node from the source list is attached to the first node in the target list, the second node from the source list to the second in the target list, and so on until there are no nodes left in the source or in the target list. When the target is Relative, each node in the source list is attached to the first node in its relative target list. There is an option to check whether the two lists are of equal size (when Absolute) or whether each relative list contains a single node (when Relative).
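
For an Absolute target, the pairwise attachment behaves much like Python's zip. The following sketch assumes the nodes are attached as last children; attach_pairwise and its require_equal_sizes flag are hypothetical names for the behaviour described above.

```python
import copy
import xml.etree.ElementTree as ET


def attach_pairwise(sources, targets, require_equal_sizes=False):
    """Append a copy of the i-th source node to the i-th target node."""
    if require_equal_sizes and len(sources) != len(targets):
        raise ValueError("source and target lists differ in size")
    # zip stops as soon as either list is exhausted.
    for source, target in zip(sources, targets):
        target.append(copy.deepcopy(source))


doc = ET.fromstring("<doc><n>1</n><n>2</n><n>3</n><p/><p/></doc>")
attach_pairwise(doc.findall("n"), doc.findall("p"))
result = ET.tostring(doc, encoding="unicode")
```

The third n node has no partner in the target list and is therefore not attached anywhere; with the size check switched on, the same call would raise an error instead.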

Copy all nodes from the source list as parents of all nodes in the target list

The elements in the source list are treated as a path in an XML document where the first element is the parent of the second, the second of the third and so on. If the source list contains whole subtrees, each element in the source list is considered to be the last child of the previous element. This path is inserted in the document as a common path of parents for all nodes in the target list. This is done by searching for the nearest common parent of all nodes in the target list (note that this parent always exists). Then all children of that parent that lie between the first and the last node in the target list are removed, and the first node of the source list is put in their place. The removed nodes are then added as the last children of the last node in the source list.

When the target XPath is Relative the above procedure is repeated for each node in the target list.
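
A simplified Python sketch of the common-parents procedure follows. It assumes that all target nodes are direct children of the same parent, so the search for the nearest common parent is skipped; the function name insert_as_common_parents is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def insert_as_common_parents(parent, targets, source_path):
    """Wrap the span of children covering all targets in a chain of new parents."""
    children = list(parent)
    positions = [children.index(target) for target in targets]
    first, last = min(positions), max(positions)
    moved = children[first:last + 1]  # everything between first and last target
    for child in moved:
        parent.remove(child)
    chain = [copy.deepcopy(node) for node in source_path]
    for outer, inner in zip(chain, chain[1:]):
        outer.append(inner)  # each element is the last child of the previous one
    chain[-1].extend(moved)  # removed nodes become children of the innermost element
    parent.insert(first, chain[0])
    return chain[0]


doc = ET.fromstring("<s><w>a</w><w>b</w><x>c</x></s>")
insert_as_common_parents(doc, doc.findall("w"), [ET.Element("np"), ET.Element("head")])
result = ET.tostring(doc, encoding="unicode")
```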

Buttons:

  • OK - applies the chosen replacement (or addition).

    First the Source description is evaluated and the resulting list is saved in a clipboard-like memory. If Cut is chosen, the corresponding nodes are deleted from the document. After that the Target description is evaluated and the source is attached to the target in the chosen way.

  • Cancel - exits the dialog without any processing.

XSLT transformations

The current document can be transformed via an XSL Transformation. The user is asked to choose a valid XML document that contains the XSLT (it must be an internal document). The result of the transformation is a new document, which is loaded into the system.

Grammars

Edit Grammar

The tool is a regular expression grammar editor. The user can edit an existing grammar or create a new one. The available grammars can be selected from the combo box at the top of the dialog. The user can create, remove, rename or update a grammar. In order to start editing, the user must first create at least one grammar with the New button. Grammars without a name cannot be edited. After each change, the grammar has to be updated with the Update button. Grammars that are not updated are removed from memory when the Exit button is pressed.

Each table line represents one grammar rule. The expressions in the Regular Expression column (the bodies of the rules) have to match the tokens and mark-up in the document. This column must never be empty. The expressions in the Left Regular Expression (left context) and Right Regular Expression (right context) columns determine the context in which the matched tokens and mark-up must appear for the rule to apply. If no left or right context is specified, the grammar presumes that all contexts are valid. The XML markup in the Return Markup column is used for marking the matched data. A comment field is provided for the user's own commentary.

The Tokenizer combo box determines the current grammar tokenizer. If a tokenizer is selected, it will be used when the system creates the grammar input. If no tokenizer is selected, the tokenizer from the DTD (Element Features) will be used. The Filter combo box determines the filter that will be used when applying the current grammar. The Matches combo boxes represent the match option for the Left Regular Expression (left context), the Regular Expression (body) and the Right Regular Expression (right context). The Any Up and Any Down matches in the body combo box can be used for backtracking. If the Any Up option is selected, the grammar finds the shortest sequence of tokens and mark-up that is recognized by the body of a grammar rule and is correct according to the left and right context of this rule. The difference from Shortest match is that with Shortest match the grammar engine chooses the shortest possible sequence, and if the left or right context fails, the whole match fails. Example:

In this example the grammar is applied to the following sequence of symbols: b,a,b,c and the grammar engine is at the symbol "a". With Shortest match the grammar uses the second rule because it is the shortest possible match; it fails on the left context of this rule, and the grammar engine moves on to the next symbol "b". With Any Up match the grammar chooses the first rule, although it matches a longer sequence.

If the Any Down option is selected, the grammar finds the longest sequence of tokens and mark-up that is recognized by the body of a grammar rule and is correct according to the left and right context of this rule. The CLaRK grammar engine provides four modes for checking the left and right context:

  1. Left Right - checks the left context first and then the right one.
  2. Right Left - checks the right context first and then the left one.
  3. Backtracking Left - checks the left context first and then the right one. If the right context fails the grammar engine will try to find a longer or shorter sequence of words (depending on the type of match selected for the left context) in order to use the right context of another rule instead. Example:

    In this example the grammar is applied to the following sequence of symbols: c,c,a,b,a and the grammar engine is at the symbol "a". In Left Right mode the grammar engine uses the first rule because it matches the longest left context, but the grammar then fails on the right context. In Backtracking Left mode the grammar engine prefers the second rule because it is correct, even though it has a shorter left context.

  4. Backtracking Right - checks the right context first and then the left one. If the left context fails the grammar engine will try to find a longer or shorter sequence of words (depending on the type of match selected for the right context) in order to use the left context of another rule instead.
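
The difference between the Shortest, Any Up and Any Down body matches can be sketched in Python. This is a strongly simplified model, not the CLaRK engine: rule bodies are literal symbol sequences rather than regular expressions, only a one-symbol left context is checked, and the two rules are hypothetical ones invented to reproduce the b,a,b,c scenario described earlier (the rules of the original screenshot are not available).

```python
from typing import NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    name: str
    left: Optional[str]    # symbol required just before the match (None = any context)
    body: Tuple[str, ...]  # literal symbols; the real engine uses regular expressions


def left_context_ok(rule, tokens, pos):
    return rule.left is None or (pos > 0 and tokens[pos - 1] == rule.left)


def match_at(rules, tokens, pos, mode):
    """Return the rule chosen at position pos under the given match mode, or None."""
    hits = [r for r in rules if tuple(tokens[pos:pos + len(r.body)]) == r.body]
    hits.sort(key=lambda r: len(r.body))  # shortest body first
    if not hits:
        return None
    if mode == "shortest":
        # Commit to the shortest body; if its context fails, the match fails.
        return hits[0] if left_context_ok(hits[0], tokens, pos) else None
    if mode == "any_down":
        hits.reverse()  # try the longest body first
    elif mode != "any_up":
        raise ValueError(mode)
    for rule in hits:  # backtrack to the next candidate when the context fails
        if left_context_ok(rule, tokens, pos):
            return rule
    return None


# Hypothetical rules: "long" needs a preceding b, "short" needs a preceding c.
RULES = [Rule("long", "b", ("a", "b")), Rule("short", "c", ("a",))]
TOKENS = ["b", "a", "b", "c"]
```

At position 1 (the symbol "a"), Shortest match commits to the rule named short and fails on its left context, while Any Up moves on to the rule named long, whose left context holds.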

The XPath expression (Apply to text field) selects the nodes to which the grammar is to be applied.

Buttons:

  • New - creates a new grammar.
  • Update - updates the grammar within the system memory.
  • Remove - removes grammar from the system memory.
  • Rename - renames a grammar.
  • Exit - closes the grammar editor, prompts for unsaved grammars.
  • Save Grammar - saves grammar(s) to file.
  • Load Grammar - loads grammar(s) from a file.
  • Apply Grammar - applies the grammar to the current document (if there is one) using the grammar's XPath expression. The user should save the grammar before applying it so that the new grammar settings are used.
  • The Feature menu gives access to the DTD Element Features and Attribute Features.

For each grammar the user can define element values. The element values are XPath expressions that are evaluated for the corresponding elements to determine their value when the grammar is applied. If no element value is defined for an element, it is taken from the DTD. Here is an example of element values.

Each row of the table represents an element value for an element. Neither column may be empty. If an element has two element values, the first one is used. When the user presses the OK button, the XPath expressions in the table are checked for correctness.

Apply Grammar

This menu item applies a grammar to the current document. In the Choose grammar field the user selects a grammar to apply. In field Select nodes the user has to specify an XPath expression to select nodes on which to apply the grammar. If the grammar has a defined XPath expression it will appear automatically upon grammar selection in the Choose grammar combo box. If no XPath expression is entered, the tool will produce an error message. The user can select also a tokenizer and a filter in the Choose tokenizer and Choose filter combo boxes.

Apply Multiple

This tool applies one or more grammars or grammar groups to the current document in a cascaded way. The grammars and groups are added to a list and executed in the order of its items.

Buttons :

  • Insert Grammar - inserts a grammar;
  • Insert Group - inserts a grammar group;
  • Insert Save - when this point in the queue is reached, the system prompts the user to save the processed document;
  • Remove - removes the chosen item;
  • Apply - starts applying the constructed grammar queue to the current document;
  • Exit - closes the dialog window.

Grammar Select

This menu item executes a grammar on the current document. As a result the tokens and mark-up recognized by the grammar are selected in the Text Area. Additionally it allows the user to mark the recognized information with an XML mark-up.

Buttons:

  • Search - executes the grammar on the current document;
  • Next - finds the next group of tokens and mark-up that matches the grammar;
  • Previous - finds the previous group of tokens and mark-up that matches the grammar;
  • Mark - marks the selected data with an XML mark-up taken from the grammar or written by the user.
  • Edit - opens the grammar editor tool;
  • Exit - closes this dialog.

Grammar Groups

This menu item represents a grammar group editor. The user can set grouping of grammars in order to apply them together. The groups can be created, modified, removed or renamed. Grammar groups are created in order to enable the user to apply several grammars in cascade style. Grammar groups are applied via the Apply Multiple system tool.

Buttons :

  • Insert Grammar - inserts a new grammar in the grammar group;
  • Remove - removes the selected grammar from the grammar group list.
  • New - creates a new Grammar group;
  • Save - saves the grammar group;
  • Rename - renames the Grammar Group;
  • Remove Group - removes the currently opened grammar group.
  • Exit - closes the dialog window.




Sort Tools

Sort

The sort tool is used for reordering nodes.

For sorting, the user specifies the following:

  1. The nodes to be sorted.
  2. The keys for each node.

The first is done by defining an XPath expression in the Select Elements field. If the field is empty, the sort tool returns an error message. As the context node the XPath engine uses the node selected in the tree in the main panel. The sort tool compares only element nodes that have a common parent: it splits the result of the XPath evaluation into groups according to the parent node, and each group is sorted separately.

Keys are created for every node we want to sort. Each row in the table represents one key. The sort tool compares two nodes key by key. The key is the list of nodes returned from the XPath engine after evaluating the expression defined in the column Key of the table. The context node in this evaluation coincides with the node for which we want to create the key. The other columns of the table represent settings used in the comparing of the lists. The lists are compared node by node.

  • If the nodes are both elements then the sort tool asks the DTD which one is defined to be smaller (Element Features).
  • If the nodes are both text they are compared by their textual content.
  • The attribute nodes are compared by the textual content of their values only if they have the same name and their parents are elements with same name.
  • The textual content(text) of text and attribute nodes is compared in the following way:
    1. The text is compared symbol by symbol.
    2. If the user chooses a tokenizer, the symbols are compared based on the tokens created by the primitive tokenizer of the selected tokenizer (the primitive tokenizer that is an ancestor of the selected tokenizer, or the selected tokenizer itself if it is primitive). The symbols are compared by their token category (the order of the categories in the primitive tokenizer) and then by their position in the definition of the token category value. If the normalization option is selected, the sort engine uses the normalization table of the primitive tokenizer to determine the token category and value of each symbol.
    3. If the user selects "No Tokenizer", the sort tool uses the Unicode table to compare symbols. In this case the normalization option means converting capital letters to small letters for Cyrillic and Latin.
    4. If you select the reverse option for the key, the text will be reversed before the comparison ("erga" => "agre").
    5. If you select the trim option for the key, the text will be cleared from the whitespaces (TAB,SPACE,LF,CR) in both ends before comparing.
    6. If you select the number option for the key the text will be converted into numbers and compared by their number value.
  • If the current nodes are not of the same type, the following order applies: attribute < text < element.
  • If the key for one element has more nodes than the key for another element, and all nodes of the shorter key are equal to the corresponding nodes of the longer key, the shorter key is considered smaller.
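
The tokenizer-based comparison of text can be sketched in Python as a sort key that ranks each symbol by its category first and by its position inside the category value second. The category table below is a made-up stand-in for a primitive tokenizer, and make_sort_key is a hypothetical name; symbols outside every category are ranked last by code point, which is an assumption of the sketch.

```python
CATEGORY_ORDER = [
    ("NUMBER", "0123456789"),
    ("LAT", "abcdefghijklmnopqrstuvwxyz"),
    ("CYR", "абвгдежзийклмнопрстуфхцчшщъьюя"),
]


def make_sort_key(categories):
    """Build a key function from an ordered list of (category, symbols) pairs."""
    rank = {}
    for cat_index, (_, symbols) in enumerate(categories):
        for position, symbol in enumerate(symbols):
            rank[symbol] = (cat_index, position)
    unknown = len(categories)  # symbols without a category sort after all others

    def key(text):
        return [rank.get(ch, (unknown, ord(ch))) for ch in text]

    return key


words = ["дума", "word", "42"]
default_order = sorted(words, key=make_sort_key(CATEGORY_ORDER))
cyrillic_first = sorted(words, key=make_sort_key(CATEGORY_ORDER[::-1]))
```

Reordering the categories changes the result without touching the words themselves, which is the effect the category sort order of a primitive tokenizer is meant to have. Python compares the key lists element by element and treats a proper prefix as smaller, in line with the rule for keys of different length.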

For each key the user can define a different order (Ascending | Descending). The order of the keys in the table is very important because this is the order in which they will be used. If two keys have equal nodes but one of them has additional nodes, the one with the smaller number of nodes is considered smaller.

The difference between the DTD sort and the Advanced one is that in the DTD sort the tool takes the tokenizer and the number option from the DTD (Element Features, Attribute Features). For attribute nodes the sort tool also takes the order of the enumerated values from the DTD.

Examples:

  • Example 1: Sorting books by pages and title. The elements to sort are the book children of the context node. They are sorted by the content of their pages element and title element. Key 1 is the text in the pages element of the book. It is trimmed and converted to a number when sorting. This key needs no tokenizer because the whole node is converted to a number. If two elements are equal according to the first key (two books have the same number of pages), they are compared by the second key. Key 2 is the text in the title element of the book. It is trimmed and normalized when sorting. For normalization the sort tool uses the normalization defined in the "Mixed Word" tokenizer. The order of this key is descending, which means that this key sorts books by title in reverse order.
  • Example 2: Sorting TEI divisions by their heads. The sort tool takes all divisions in the document and sorts them according to the text in their head element. If a division does not have a head element, it is considered smaller.
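
The two-key sort of Example 1 can be approximated in Python as follows. The sketch assumes book elements with title and pages children, and it replaces the normalization of the "Mixed Word" tokenizer with plain lower-casing; both the element names and the helper functions are illustrative only.

```python
import xml.etree.ElementTree as ET


def pages_key(book):
    # Key 1: text of <pages>, trimmed and compared as a number.
    return float(book.findtext("pages", default="0").strip())


def title_key(book):
    # Key 2: text of <title>, trimmed and normalized (here: lower-cased).
    return book.findtext("title", default="").strip().lower()


def sort_books(parent):
    """Reorder the book children of parent in place, leaving other children alone."""
    slots = [i for i, child in enumerate(parent) if child.tag == "book"]
    books = [parent[i] for i in slots]
    # Python's sort is stable, so the least significant key is applied first.
    books.sort(key=title_key, reverse=True)  # Key 2, descending
    books.sort(key=pages_key)                # Key 1, ascending
    for i, book in zip(slots, books):
        parent[i] = book


library = ET.fromstring(
    "<library>"
    "<book><title>B</title><pages> 300 </pages></book>"
    "<book><title>a</title><pages>120</pages></book>"
    "<book><title>C</title><pages>300</pages></book>"
    "</library>"
)
sort_books(library)
titles = [book.findtext("title") for book in library]
```

The book with 120 pages comes first; the two books with 300 pages are then ordered by title in descending order.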

Example 1

Example 2

Tokenizers

This menu item opens the Tokenizer Editor dialog. The user can select different tokenizers using the "Tokenizer" combo box. Note that there is always at least one tokenizer in the system because the "Default" tokenizer is not editable and cannot be removed.

The user can create, remove or save a tokenizer. Each row in the table represents one tokenizer category. The user can add, remove and reorder rows with the menu shown on a right click over a row in the table. The first column is the category name. The content of the second column depends on the type of the tokenizer: it contains the category value (all the symbols in the category) for a "primitive" tokenizer, or a regular expression for a "non-primitive" tokenizer. Here is an example:

This is a primitive tokenizer whose name is Mixed. It has the categories LAT, CYR, SYMBOL ... Category LAT represents the Latin symbols. Category NUMBER represents the digits from 0 to 9...

Buttons:

  • Save - saves the current tokenizer.
  • Remove - removes tokenizer.
  • Sort Order - defines the order of categories of a primitive tokenizer.
  • Change Parent - changes the parent of a non-primitive tokenizer.

In order to create a new tokenizer the user must press the New button and set the options in the new tokenizer dialog.

If the tokenizer is primitive, the user must select the Primitive check box. Otherwise the user must indicate the parent of the tokenizer in the Parent combo box. An existing tokenizer can be used as the basis for the new one with the Use current button; this tokenizer has to be the one shown in the dialog window.

When defining the category value for a primitive tokenizer the user should be aware of the following rules:

  1. Characters are quoted with single or double quotation marks or referenced by a number. Example: "." or ";" or 'k' or 32 (Space).
  2. To use more than one symbol for a category, separate the symbols by commas. Example: "a","b","c",...
  3. To define a range of symbols, connect the starting and ending symbols with a dash. Example: "A"-"Z".
  4. One character cannot have more than one category.
  5. A category can be defined on more than one row. The rows are interpreted as one category containing the symbols of all of them. Here is an example:

    The tokenizer tool will interpret these lines as LAT "'a'-'z','A'-'Z'".
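
The rules above can be made concrete with a small Python parser for category values. It is an illustration written from the rules as stated, not CLaRK's own parser; parse_category_value and build_tokenizer are hypothetical names, and numbers are assumed to be decimal character codes (as in 32 for Space).

```python
import re

# One item: a double-quoted character, a single-quoted character, or a number.
ITEM = r'''"[^"]"|'[^']'|\d+'''
# One entry: an item, optionally followed by a dash and a second item (a range).
ENTRY = re.compile(rf"({ITEM})(?:\s*-\s*({ITEM}))?")


def to_char(item):
    return chr(int(item)) if item.isdigit() else item[1]


def parse_category_value(spec):
    """Expand a category value such as '"A"-"Z",32' into a list of characters."""
    chars = []
    for match in ENTRY.finditer(spec):
        start = to_char(match.group(1))
        end = to_char(match.group(2)) if match.group(2) else start
        chars.extend(chr(code) for code in range(ord(start), ord(end) + 1))
    return chars


def build_tokenizer(rows):
    """Map each character to its category; rows with the same name are merged."""
    owner = {}
    for category, spec in rows:
        for char in parse_category_value(spec):
            if owner.setdefault(char, category) != category:
                raise ValueError(f"{char!r} is already in category {owner[char]}")
    return owner


parsed = parse_category_value('"A"-"C",\'k\',32')
tokenizer = build_tokenizer([("LAT", "'a'-'z'"), ("LAT", "'A'-'Z'"), ("NUMBER", '"0"-"9"')])
```

The two LAT rows end up in a single category of 52 letters, and assigning one character to two different categories raises an error, in line with rule 4.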

Here is one example of a primitive tokenizer:

For each primitive tokenizer the user can define the sort order of the categories by clicking the Sort Order button. Example:

This dialog is shown if the user presses the Sort Order button when the Mixed tokenizer (the tokenizer in the first example) is on the screen. The user can reorder the categories with the menu shown on a right click over a row in the table.

Also a normalize option can be defined for each category of the tokenizer by right clicking on the category line in the table. For instance the following dialog will appear if we select normalize for the "LAT" category:

For each symbol of the category the user must select a corresponding normalized symbol. The "New Category" combo box determines the new category that the symbols will receive after the normalization. This category can coincide with the original category of the symbols before the normalization or be completely different.

When defining a non-primitive tokenizer, the user should observe the following rules:

  1. Each category Regular Expression must be a valid regular expression.
  2. Two categories cannot have a common token.

Each non-primitive tokenizer must have a parent tokenizer. The parent of a tokenizer is set when the tokenizer is created and can be changed by the user when pressing the Change Parent button. Here is an example of a non-primitive tokenizer:

[Screenshot: a non-primitive tokenizer in the Tokenizer Editor]

The parent of this tokenizer is the "Mixed" primitive tokenizer shown on the first example. This non-primitive tokenizer uses the categories "LAT" and "CYR" from the parent tokenizer.

The user can undo changes made to a tokenizer with the Undo button. Tokenizers can be loaded from or saved to an external file. The Exit button closes the tokenizer editor dialog. The system prompts the user to save all unsaved tokenizers on exit.

Filters

This menu item starts the filter editor. In order to browse the filters in the system, the user can use the "Filter" combo box at the top of the dialog. The user can add token categories from different tokenizers or add XPath expressions to filter element nodes.

The "Token Types" list is the list of the filtered token categories. The user can take the categories from the tokenizers in the system (the "Choose From" list on the left side of the dialog) and add them to the list of filtered token categories with the arrow ("=>") button. To add an XPath expression for element filtering, the user must press the "add XPath" button. The new XPath expression will be added to the "Expression" list. To remove a token category or an XPath expression from one of the lists, the user must press the corresponding "Remove" button. The user can remove and save filters or create new ones.

Element Features

Element Features is used to add information to the elements of a DTD.

The user can add the following additional information:

  1. A tokenizer for the elements of the DTD. It is used for tokenization of the element text data by the Grammar and Sort engines.
  2. A default tokenizer for the DTD. If no tokenizer is defined for an element, the grammar and sort engines fall back to the DTD default tokenizer. After compilation, the DTD receives the "Default" primitive tokenizer as its default tokenizer.
  3. The user can state that the content of an element is a number.
  4. The user can define an XPath Value for each DTD element.
  5. The user can define an order over the DTD elements. This option is used for sorting purposes.

In order to select the default tokenizer for the DTD the user must select an item in the "Default Tokenizer" combo box. The user can select a tokenizer for each element of the DTD in the "Tokenizer" column of the table by clicking on a table cell. The check boxes in the "Number" column are used by the sort tool to determine whether the content of the current element can be treated as a number. For example pages and price can be treated as numbers during comparing of two books. The values in the "XPath Value" column are used by the Grammar Engine to define the value of the element nodes.

The order of the elements can be defined in the sort table shown when the user presses the "Sort Order" button. When two elements are compared, their position in the sort table defines their order. The user can change the position of elements by dragging their rows to the correct positions or by using the context menu opened with a right click on a sort table row. Here is an example:

Attribute Features

Attribute Features is used to add information to the attributes of a DTD.

The "Element" column of the table lists all the elements in the DTD that have an attribute. One element can appear on several rows because it can have more than one attribute. The "Attribute" column lists all the attributes of the DTD. In the "Tokenizer" column the user can select a different tokenizer for each attribute. The sort tool uses the information from the "Number" column to decide how to compare the values of the attributes (as plain text or as a number).

An additional feature is the order of enumerated attribute values. The attributes with enumerated values have "(e)" string at the end of the name. In order to sort the enumerated values click on an attribute with enumeration value and click the "Sort Values" button. Example:

The values are sorted in ascending order. The user can change the order by dragging the rows or by the right mouse button.

Extract

The extract tool extracts nodes from one or more documents and saves them as a new document. The text field at the top of the dialog is used for defining an XPath expression which selects the elements in the document(s). The context node for this evaluation is the root node of the document(s). The user can extract from the currently active document in the system or from the internal documents. The result of the extraction is an XML document in which all extracted nodes are children of the root element (this element is named "Extract" by the system).

The "include subtree" option allows the extraction not only of the selected nodes but the entire subtree below as well.

Auxiliary tag - each extracted node can be given a parent element (prompts for attribute name), which is used to separate the results. For example, if only text nodes are extracted without an auxiliary tag, all the text nodes will be merged in the new document. If the auxiliary tag is selected, the Number and Source options will be shown in the dialog.

If "Number" is selected then the extract tool adds an attribute with the extract result number to the auxiliary tag.

If "Source" is selected, then the extract tool adds an attribute with the source document name to the auxiliary tag.
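The behaviour described above can be approximated outside CLaRK with Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. The sketch below is an assumption-laden illustration, not the tool's implementation: the function name extract, the auxiliary tag name "aux" and the attribute names "n" and "src" are all hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def extract(documents, path, aux_tag=None, number=False, source=False):
    """Collect nodes matching `path` from each (name, root) pair."""
    result = ET.Element("Extract")  # root element name used by the system
    count = 0
    for name, root in documents:
        for node in root.findall(path):
            count += 1
            parent = result
            if aux_tag is not None:
                # Wrap every extracted node in its own auxiliary element.
                parent = ET.SubElement(result, aux_tag)
                if number:
                    parent.set("n", str(count))  # hypothetical attribute name
                if source:
                    parent.set("src", name)      # hypothetical attribute name
            # Copy the node together with its subtree.
            parent.append(copy.deepcopy(node))
    return result


doc = ET.fromstring("<book><title>A</title><title>B</title></book>")
out = extract([("books.xml", doc)], ".//title",
              aux_tag="aux", number=True, source=True)
xml_text = ET.tostring(out, encoding="unicode")
```

Copying the whole node corresponds to extraction with the "include subtree" option switched on.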

Statistics

The Statistic Dialog is used to show information about the number of occurrences of selected elements.

The user selects elements by an XPath expression ("Search" field). It is recommended to always check this expression in the system's search tool first.

The elements are selected from the internal documents in the system. The XPath expression is applied to the root element of every document selected in the standard "internal document selector".

The selected elements are sorted by sort keys identical to the keys described in the sort tool manual. Because the information is extracted from multiple documents and a single DTD cannot be determined, only the "Advanced" sort mode is available.

The user can choose a tokenizer from all tokenizers defined in the system ("Tokenizer" combo box) in order to process the text nodes. The "Tokenizer Category Filter" is used to select among the categories produced by tokenization with the selected tokenizer. Only the tokens with the chosen categories are shown in the result. If no tokenizer is selected, each text node is processed as a whole.

Buttons: the "OK" button shows the result; the "Cancel" button closes the window.

The result is shown as a table. Here is an example:

The "Category" column contains the categories from the filter which exist in the chosen text nodes, "<Element>" if the row represents an element node, or "#text" if the node is a text node.

The "Element" column contains the tokenized text (the value of the filtered tokens) or the node names.

The "#" column contains the number of occurrences of the corresponding item.

The "%" column contains the percentage of the corresponding item.

The "Key Value" column contains the value of the sort keys created for the corresponding node, or nothing if the line contains a token.

The result can be saved as an XML document following the structure of the result table. The DTD of the document has the following structure:

<!DOCTYPE statistics [
<!ELEMENT statistics (documents, item*, all)>
<!ELEMENT documents (document+)>
<!ELEMENT document (#PCDATA)>
<!ELEMENT item (category, element, number, percent)>
<!ELEMENT category (#PCDATA)>
<!ELEMENT element (#PCDATA)>
<!ELEMENT number (#PCDATA)>
<!ELEMENT percent (#PCDATA)>
<!ELEMENT keyvalue (#PCDATA)>
<!ELEMENT all (number, percent)>
]>

The documents tag is a list of the documents selected for the statistics, where each document name appears in a document tag. The item tag corresponds to a line of the result table as follows:

  • category tag corresponds to the "Category" column
  • element tag corresponds to the "Element" column
  • number tag corresponds to the "#" column
  • percent tag corresponds to the "%" column
  • keyvalue tag corresponds to the "Key Value" column

The all tag corresponds to the last row of the result table. It contains the total number of occurrences of the selected elements and the percentage information.

Pressing the "Cancel" button closes the window and returns to the main window of the system.
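To make the saved structure more concrete, here is a minimal sketch that counts occurrences and writes them out following the item content model of the DTD above (category, element, number, percent). The function build_statistics, the two-decimal percentage format and the sample data are assumptions made for illustration; the real tool may format its values differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_statistics(doc_names, observations):
    """observations: iterable of (category, element) pairs, one per occurrence."""
    counts = Counter(observations)
    total = sum(counts.values())
    root = ET.Element("statistics")
    docs = ET.SubElement(root, "documents")
    for name in doc_names:
        ET.SubElement(docs, "document").text = name
    # One <item> per distinct (category, element) pair, most frequent first.
    for (category, element), n in counts.most_common():
        item = ET.SubElement(root, "item")
        ET.SubElement(item, "category").text = category
        ET.SubElement(item, "element").text = element
        ET.SubElement(item, "number").text = str(n)
        ET.SubElement(item, "percent").text = f"{100 * n / total:.2f}"
    # The <all> element mirrors the last row of the result table.
    all_row = ET.SubElement(root, "all")
    ET.SubElement(all_row, "number").text = str(total)
    ET.SubElement(all_row, "percent").text = "100.00" if total else "0.00"
    return root


stats = build_statistics(
    ["a.xml", "b.xml"],
    [("<Element>", "p"), ("<Element>", "p"), ("<Element>", "hi"), ("#text", "word")],
)
```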

Concordance

Concordance - a system tool for information extraction. The field at the top of the dialog is used to define the context. If the XPath expression in it is invalid, no elements will be extracted. The root of the document is used as the context node for this expression. The user can switch between two types of search with the tabbed pane. The result of the concordance is an XML document in which the found data is separated into lines. A line is an XML element with the following structure:

<L>
<LC> "the left context" </LC>
<I> "the data we are searching for" </I>
<RC> "the right context" </RC>
<COM> "Element for user commentary" </COM>
</L>
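The line structure can be illustrated with a small sketch that builds such elements using Python's xml.etree.ElementTree. Note that the search here is a plain substring search with a fixed context width, swapped in for simplicity; it is not the grammar or XPath search of the concordance tool. The names make_line, concordance and width are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_line(left, item, right, comment=""):
    """Build one concordance line with the structure shown above."""
    line = ET.Element("L")
    ET.SubElement(line, "LC").text = left
    ET.SubElement(line, "I").text = item
    ET.SubElement(line, "RC").text = right
    ET.SubElement(line, "COM").text = comment
    return line


def concordance(text, keyword, width=20):
    """Return one <L> element per occurrence of keyword in text."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    lines = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        lines.append(make_line(left, keyword, right))
        start = text.find(keyword, end)
    return lines


lines = concordance("the cat sat on the mat", "the", width=4)
```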

1. The grammar search - The user must select a grammar or create a new one to search with. To create a grammar, select the "<custom grammar>" item in the combo box and press the edit button. The concordance tool will open a standard grammar editor with reduced options. After the user has finished editing, the system asks whether the created grammar should be used. If the answer is no, the system will presume that there is still no grammar. The user should press the "Save" button to save the newly created grammar in the system memory. If the user selects a grammar from the list, this grammar will be used for searching. If the edit button is pressed while a grammar from the list is selected, the system will open the selected grammar in a standard grammar editor. On exit, the user will be asked whether to use this grammar. If the answer is positive, the modified grammar will appear in place of the custom grammar in the combo box. The save option is available only for custom and modified grammars. The user can select another grammar to restrict the context in which the search grammar will be applied. If the "Text Only" check box is selected, the grammar will ignore the mark-up inside the context while checking. Example:

2. The XPath search - To search inside the context with an XPath expression, the search field (the top text field in the tabbed pane) should contain a valid XPath expression. All nodes selected by this expression will become items in the concordance result. If the fields for the left or right context are not empty, the concordance tool will use the expressions in them to form the left and right contexts of the items. Otherwise, the contexts will be generated automatically. Example:

If "Add Source Attribute ?" is selected, the concordance tool will add an attribute with the name of the source file to every extracted line. If "Add Number Attribute ?" is selected, all extracted lines will receive a number attribute.

Table View

The Table View tool presents the information extracted by the concordance tool in a more readable table form. Each line of the table represents one line of the concordance result. The data in the "Context" columns does not represent the whole context but only the amount of data that fits in the column length. Initially this is only 30 symbols. To increase the context, the user should press the settings button and set the context length in symbols there. The user can also set the width of the comment column. To see the context without expanding the column data, the user can right click on the "Left Context" or "Right Context" column. To add commentary to a concordance line, the user can fill in a value in the "comment" column or right click a row in the "item" column. To navigate quickly through the table, the user can use the combo box at the top to access a row. The user can also sort the lines of the table.

The user must select the column to which each sort key is applied (that is, which element of the concordance line will be the context: LC, I, RC or COM). If no column is selected, the key will be executed with the line element as context.

A useful option of the Table View is "Edit Layout". The user can filter the tags that are shown in the table. For example, if the POS information is separated in a tag, the user can hide it in order to view only the text.
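The following sketch imitates three of the behaviours described in this section on a list of concordance lines: hiding inner mark-up, shortening the contexts to a fixed number of symbols, and sorting by a chosen column. It is only an approximation under stated assumptions: the function names are hypothetical, the w element with its pos attribute is invented sample mark-up, and keeping the part of the context nearest to the item is a guess about how the table shortens the text.

```python
import xml.etree.ElementTree as ET

CONTEXT_LIMIT = 30  # initial context length in symbols


def column_text(line, column):
    """Text of one column (LC, I, RC or COM) with inner mark-up hidden."""
    node = line.find(column)
    return "" if node is None else "".join(node.itertext())


def display_row(line, limit=CONTEXT_LIMIT):
    """Shorten both contexts to the `limit` symbols nearest to the item."""
    left = column_text(line, "LC")
    right = column_text(line, "RC")
    return (
        left[max(0, len(left) - limit):],  # end of the left context
        column_text(line, "I"),
        right[:limit],                     # start of the right context
        column_text(line, "COM"),
    )


lines = [
    ET.fromstring("<L><LC>a big <w pos='A'>red</w> </LC><I>dog</I>"
                  "<RC> ran away</RC><COM/></L>"),
    ET.fromstring("<L><LC>the </LC><I>cat</I><RC> sat</RC><COM>note</COM></L>"),
]
# Sort the table by the item column.
by_item = sorted(lines, key=lambda line: column_text(line, "I"))
rows = [display_row(line, limit=4) for line in by_item]
```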

Node Info

This tool gives information about the number of occurrences of specified tags or tokens in a set of internal documents. When the user starts this tool, s/he is asked to provide several things:

  • The type of information which is needed. The two possibilities are: counting tags and counting tokens.
  • The documents for which information is needed.
  • An XPath expression which selects nodes in each document for which the counting will be performed.

Here is a screen-shot of the initial dialog window:

The major components of the dialog window are:

  • XPath Field - selects the nodes in each document for which the counting will be performed. If tags will be counted, then for each node from the selection of this XPath expression, its descending nodes will be counted. If tokens will be counted, then for each text node from the selection its text content will be tokenized and the result tokens will be counted.
  • Tokenizer Selector - determines which tokenizer will be used when tokenizing the text nodes for token counting. This component is disabled in case of tag counting.
  • Info Type Selector - determines the type of elements, which will be counted. The options are: "Word Info" - for token counting and "Tag Info" - for tag counting.
  • Document Selector - this component is responsible for selecting the documents from the internal document database to which the counting will be applied. This is a universal component of the CLaRK system. For more information see Document Selector in menu File.
  • Show Info button - starts calculating the information for the selected documents.
  • Cancel button - closes the window and cancels further processing.
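The two kinds of counting can be sketched as follows. This is a simplified illustration, not the system's implementation: the toy regular-expression tokenizer stands in for a CLaRK tokenizer, its category names are used only as labels, the selected node itself is assumed not to be counted among its descendants, and the text of a selected element is taken as the concatenation of all text below it. xml.etree.ElementTree supports only a subset of XPath.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Toy tokenizer: each alternative is one token category.
TOKEN_RE = re.compile(
    r"(?P<LATw>[A-Za-z]+)|(?P<NUMBER>[0-9]+)|(?P<SPACE>\s+)"
    r"|(?P<PUNCT>[.,;:!?])|(?P<SYMBOL>.)"
)


def count_tags(root, path):
    """Tag counting: count the descendants of every selected node by tag name."""
    counts = Counter()
    for node in root.findall(path):
        for descendant in node.iter():
            if descendant is not node:
                counts[descendant.tag] += 1
    return counts


def count_token_categories(root, path):
    """Token counting: tokenize the text of every selected node."""
    counts = Counter()
    for node in root.findall(path):
        text = "".join(node.itertext())
        counts.update(match.lastgroup for match in TOKEN_RE.finditer(text))
    return counts


doc = ET.fromstring("<text><p>Hello, <hi>world</hi> 42!</p><p>Bye</p></text>")
tags = count_tags(doc, ".")
words = count_token_categories(doc, ".//p")
```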

If the Show Info button is pressed, the system starts to process the selected documents one by one. While processing the documents, the status bar of the system shows the current process state. Having processed all selected documents the system shows the result in a new window. Here are two example results, one for Word Info and one for Tag Info:

  • Word Info

    The first column Document contains the names of the documents chosen from the first dialog.
    The second column Category contains the categories from the tokenizer which the user has chosen before.
    The third column # contains the number of occurrences of each category in the text.

    The content of the table can be saved in a file if the Save in file checkbox is selected. The result file is an XML document. When the user presses the OK button, s/he will be asked to supply a file name and a directory with a standard file chooser.

    If the Add information checkbox is selected, the relevant information will be added to each of the documents. The word information added to a document has the following form:

    <extent>
    <interpGrp>
    <interp type="LATw" value="20"></interp>
    <interp type="TAB" value="1"></interp>
    <interp type="CYRw" value="9350"></interp>
    <interp type="NUMBER" value="237"></interp>
    <interp type="SYMBOL" value="129"></interp>
    <interp type="PUNCT" value="1720"></interp>
    <interp type="SPACE" value="9376"></interp>
    </interpGrp>
    </extent>

    If the DTD for a document is TEI, <extent> is added in the appropriate position. Otherwise, <extent> is added after the first node.

  • Tag Info

    The first column Document contains the names of the documents chosen from the first dialog.
    The second column Tag contains the tag names of all nodes which the user has chosen with the XPath expression from the first dialog.
    The third column # contains the number of occurrences of each tag in the documents.

    The content of the table can be saved in a file if the Save in file checkbox is selected. The result file is an XML document. When the user presses the OK button, s/he will be asked to supply a file name and a directory with a standard file chooser.

    If the Add information checkbox is selected, the relevant information is added to each of the documents. The information added to each document has the following form:

    <encodingDesc>
    <tagsDecl>
    <tagUsage gi="hi" occurs="40"></tagUsage>
    <tagUsage gi="p" occurs="153"></tagUsage>
    </tagsDecl>
    </encodingDesc>

    If the DTD for a document is TEI, <encodingDesc> is added in the appropriate position. Otherwise, it is added after the first node.
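For readers who want to produce the same two fragments outside CLaRK, the sketch below turns a mapping of counts into the element structures shown above. The function names word_info and tag_info are hypothetical, and the sketch does not decide where the fragment is placed inside a document.

```python
import xml.etree.ElementTree as ET


def word_info(counts):
    """Build an <extent> fragment from a {category: count} mapping."""
    extent = ET.Element("extent")
    group = ET.SubElement(extent, "interpGrp")
    for category, n in counts.items():
        ET.SubElement(group, "interp", {"type": category, "value": str(n)})
    return extent


def tag_info(counts):
    """Build an <encodingDesc> fragment from a {tag name: count} mapping."""
    desc = ET.Element("encodingDesc")
    decl = ET.SubElement(desc, "tagsDecl")
    for tag, n in counts.items():
        ET.SubElement(decl, "tagUsage", {"gi": tag, "occurs": str(n)})
    return desc


words = word_info({"LATw": 20, "TAB": 1})
tags = tag_info({"hi": 40, "p": 153})
tags_xml = ET.tostring(tags, encoding="unicode")
```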
