1 Evaluation of Lexware application “Djupindexering”

The evaluation consists of a comparison of the keywords assigned automatically by Lexware with those assigned manually by indexers at Riksdagsbiblioteket for 1403 Riksdag documents.

The texts of these documents are provided in file txt.zip. A thesaurus of 3969 nodes, which was specially designed by Riksdagsbiblioteket for the purpose of indexing, is provided in file thesaurus.zip. Keywords are restricted to terms of this thesaurus.

Several lists, sorted in various ways, of the results of manual and automatic indexing are available to assist the evaluation analysis. File diff.txt consists of a list of documents, each with two sets of keys: manually assigned and automatically assigned. Documents are identified by a title and an id; the id is needed to look up document texts in txt.zip.

2 Installation

Download file txt.zip
Download file thesaurus.zip
Download file diff.zip

3 Contents of file thesaurus.zip (subdirectory thesaurus)

· rixlextes97.txt - thesaurus in the original form provided by Riksdagsbiblioteket,

· terms.txt - alphabetical list of all terms in the thesaurus,

· termtree.txt - thesaurus formatted as tree of terms (see comments in the beginning of this file).

4 Contents of file diff.zip (subdirectory diff)

· diff.txt – the main listing of the results of manual and automatic indexing. It is a list of documents, each provided with its manually and automatically assigned keys.

· stat.txt – a concise table showing, in percent, the number of matches of each distinguished type (see sec. 5 below) between manually and automatically assigned keywords.

· doc_avg.txt, doc_e.txt, doc_ebn.txt, doc_min.txt – differently sorted lists, each containing: document id; the measures E [%] and EBN [%] (cf. sec. 5), M-A (manual-automatic), AVG (average keyword weight), MIN (minimum keyword weight) and R% (reliability); document title; and document length in text word tokens.

· keydist.txt – shows the distribution of terms assigned as keywords, ordered from the most to the least frequently used term. Of the 3969 thesaurus terms in total, 1842 were used as keywords for the 1403 indexed documents.

· docperkey.txt – lists the documents for each key of the distribution list. The number of occurrences of a term as a keyword is thus followed by document id, title and length (in text words).

· mkey.txt, mkeydiv.txt, mtreeall.txt, mtreeusk.txt, man-auto.txt, termslen.txt, termsrel.txt, termsrelx.txt – listings showing the various measures used by Lexware to estimate the relevance of a term as a keyword for a document. These measures are listed below.

5 Types of matches of thesaurus terms

· EQUAL (E) means that the automatically assigned term matches exactly the manually assigned term.

· BROADER (B) means that the automatically assigned term is broader than the manually assigned term. It is less specific, i.e. higher up in the thesaurus tree.

· NARROWER (N) means that the automatically assigned term is narrower than the manually assigned term. It is more specific, i.e. lower in the thesaurus tree.

· SIBLING (S) means that the two terms share a common parent.

· TREE (T) means that the two terms share a common ancestor higher in the thesaurus tree, up to one of the 542 root nodes.

· DIFFERENT (D) means that the terms are not in any of the relations listed above, i.e. they are truly different.

· NON_TERM (NT) is a keyword not chosen from the terms of the thesaurus, which Lexware also proposes as a possible term to be added to the thesaurus. This happens whenever a salient concept in the document does not have a match in the thesaurus.
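
The match types above can be illustrated with a small Python sketch. It is not Lexware's implementation. It assumes that every term has at most one parent and that the thesaurus is held in a hypothetical dictionary mapping each term to its parent (None for a root node). The names ancestors, match_type and TOY_PARENT, and the toy terms, are invented for illustration only.

```python
from typing import Dict, List, Optional


def ancestors(term: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the ancestors of `term`, nearest parent first."""
    chain = []
    node = parent.get(term)
    while node is not None:
        chain.append(node)
        node = parent.get(node)
    return chain


def match_type(auto: str, manual: str, parent: Dict[str, Optional[str]]) -> str:
    """Classify an automatically assigned key against a manually assigned one."""
    if auto not in parent:
        return "NT"  # NON_TERM: not a thesaurus term at all
    if auto == manual:
        return "E"   # EQUAL
    auto_up = ancestors(auto, parent)
    manual_up = ancestors(manual, parent)
    if auto in manual_up:
        return "B"   # BROADER: auto lies above manual in the tree
    if manual in auto_up:
        return "N"   # NARROWER: auto lies below manual in the tree
    if auto_up and manual_up and auto_up[0] == manual_up[0]:
        return "S"   # SIBLING: same direct parent
    if set(auto_up) & set(manual_up):
        return "T"   # TREE: some common ancestor further up
    return "D"       # DIFFERENT: no relation found


# Hypothetical toy thesaurus (assumption): term -> parent, None marks a root.
TOY_PARENT = {
    "law": None,
    "civil law": "law",
    "criminal law": "law",
    "contract law": "civil law",
    "theft": "criminal law",
    "energy": None,
}

if __name__ == "__main__":
    for auto, manual in [("law", "contract law"), ("contract law", "theft")]:
        print(auto, "|", manual, "|", match_type(auto, manual, TOY_PARENT))
```

The checks are ordered from the most to the least specific relation, so a pair is reported as TREE only when it is neither BROADER, NARROWER nor SIBLING.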

6 Document texts - contents of file txt.zip (subdirectory txt)

There are 1403 txt-files with manually assigned keywords. These are used in the comparison with automatic indexing. Each document has an id by which it is identified in the lists comparing results, such as diff.txt. This id is also used as the name of the file containing the source text of the document.
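
Looking up a document text by its id could be sketched in Python as follows. The exact member names inside txt.zip and the text encoding are not stated here, so both are assumptions: the sketch accepts a member whose file name equals the id with or without an extension, and defaults to latin-1 decoding. The function name read_document is hypothetical.

```python
import zipfile


def read_document(archive, doc_id: str, encoding: str = "latin-1") -> str:
    """Return the text of the archive member whose file name matches doc_id.

    `archive` may be a path to txt.zip or an open binary file object.
    The encoding default is an assumption, not taken from the data.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            base = name.rsplit("/", 1)[-1]    # drop the subdirectory part
            stem = base.rsplit(".", 1)[0]     # drop a possible extension
            if base == doc_id or stem == doc_id:
                return zf.read(name).decode(encoding)
    raise KeyError(f"no document with id {doc_id!r} in archive")
```

A call such as read_document("txt.zip", some_id), with some_id taken from diff.txt, would then return the source text of that document, provided the assumptions about naming and encoding hold.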