PSICL

Annotation with lexical categories (abstract)

Annotation is becoming more and more knowledge intensive, which opens up opportunities for more precise queries and at the same time requires that attention be paid to the language objects and structures present in the annotation. Counts of occurrences of grammatical words are sometimes equated with counts of the basic concepts associated with these words. A better approximation of concept occurrences requires something more precise than lemmas, yet not as specific and controversial as word senses. Lemma occurrences in free compositions need to be separated from those in set constructions, because occurrences of the first type correspond more closely to the basic concepts associated with a lemma, while occurrences of the latter type disclose only the lexical productivity of the lemma. Support verbs are good examples of the latter use - "taking a decision" is not an instance of actually taking anything.
An example of how this separation can be approached is provided by the present investigation of Swedish texts, performed with the lexicon-based corpus tool Lexware. The objective of the investigation was to extract relevant material for an active Swedish-Polish compounding dictionary - verbal compounds which are not wholly compositional and which are new, i.e. not listed in the largest recent Swedish-Polish dictionary.
Occurrences of verb lemmas as simple words and as components of compounds were compared first. Participation in word formation (called "compounding ability") was calculated for each verb lemma: as expected, auxiliaries were absent from word formation, but some content verbs were also unexpectedly absent from compounding altogether. The next step in narrowing down the lemmas most relevant to non-compositional compound formation was the calculation of another measure, the "compounding productivity" of verb lemmas. This measure shows how many different collocates a lemma attracts; for instance, the complements of "-baserad"="based" vary throughout the whole range of nominal lemmas. Sorting of the selected material follows the assumption that the more operationalised a lemma is as a compound component, the broader its range of collocates. The pattern of some of the lemmas thus selected as operators in word formation can be described with selectional restrictions instead of collocational restrictions.
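The two measures can be illustrated with a minimal sketch. It rests on assumptions that go beyond this abstract: "compounding ability" is taken to be the share of a lemma's occurrences that fall inside compounds, "compounding productivity" the number of distinct collocate lemmas, and the input format, the function name `compounding_measures` and the toy data are all hypothetical. It does not reproduce the Lexware procedure.

```python
from collections import defaultdict

# Hypothetical input: one (lemma, collocate) pair per corpus occurrence.
# collocate is None when the lemma occurs as a simple word; otherwise it is
# the lemma of the other compound component. The data are invented.
occurrences = [
    ("basera", None),
    ("basera", "kunskap"),
    ("basera", "dator"),
    ("basera", "kunskap"),
    ("vara", None),
    ("vara", None),
]

def compounding_measures(occurrences):
    """Return, per lemma, an assumed 'ability' and 'productivity' value."""
    total = defaultdict(int)         # all occurrences of the lemma
    in_compounds = defaultdict(int)  # occurrences inside compounds
    collocates = defaultdict(set)    # distinct compound partners
    for lemma, collocate in occurrences:
        total[lemma] += 1
        if collocate is not None:
            in_compounds[lemma] += 1
            collocates[lemma].add(collocate)
    return {
        lemma: {
            # Assumed definition: share of occurrences found in compounds.
            "ability": in_compounds[lemma] / total[lemma],
            # Assumed definition: number of different collocate lemmas.
            "productivity": len(collocates[lemma]),
        }
        for lemma in total
    }

measures = compounding_measures(occurrences)
# Sort lemmas so that those with the broadest collocate range come first.
ranked = sorted(measures, key=lambda l: measures[l]["productivity"], reverse=True)
```

Under these assumptions a lemma that never enters compounds receives an ability of zero, which corresponds to the behaviour reported for auxiliaries, while sorting by productivity places lemmas with the broadest range of collocates first.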
The investigation helps separate lemmas operationalised in word formation from their origin lemmas and provides separate counts of the two. For instance, "fri"="free" does not occur in its basic sense when used in compounds: "barnfri"="child free" does not mean free as a child, only free from children.