Incorporation of Relevance Data in the Term Discrimination Value
Date
1987-09
Publisher
Addis Ababa University
Abstract
Indexing in information retrieval is used to obtain a suitable vocabulary
of index terms and optimum assignment of these terms to documents for
increasing the effectiveness and efficiency of the retrieval system. A
great many automatic indexing models have been developed over the years
in an effort to produce indexing methods that are both effective and usable
in practice. One of the most elegant approaches for automatic selection
and weighting of index terms is the term discrimination value that has been
developed by Salton and his co-workers. This model ranks the index terms
in accordance with how well they are able to discriminate the documents
of a collection from each other; that is, the value of an index term depends
on how much the average separation between individual documents changes
when the given term is assigned for content identification. It is suggested
that the most useful index terms, those which achieve greatest separation,
are the medium frequency terms.
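
As an illustration of the idea described above, the following is a minimal sketch, not the implementation used in this study. It assumes documents are given as equal-length lists of term weights, measures the density of the document space as the average pairwise cosine similarity, and takes the discrimination value of a term as the density with the term removed minus the density with the term assigned. The function names and the toy collection are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def density(docs):
    """Average pairwise cosine similarity; a higher value means a more compact space."""
    pairs = [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))]
    if not pairs:
        return 0.0
    return sum(cosine(docs[i], docs[j]) for i, j in pairs) / len(pairs)


def discrimination_values(docs):
    """For each term k, return density(without k) - density(with k).

    A positive value means that assigning the term spreads the documents
    apart (a good discriminator); a negative value means it pulls them together.
    """
    base = density(docs)
    values = []
    for k in range(len(docs[0])):
        reduced = [[w for t, w in enumerate(d) if t != k] for d in docs]
        values.append(density(reduced) - base)
    return values


# Hypothetical toy collection: term 0 occurs in every document,
# terms 1 to 3 each occur in exactly one document.
toy_docs = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
toy_values = discrimination_values(toy_docs)
```

In this toy collection the term shared by all documents receives a negative value and the rarer terms receive positive values. The brute-force recomputation shown here is quadratic in the number of documents for every term, so it is suitable only for small examples.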
Since the basic requirement in effective retrieval is the separation between
documents which are relevant to a given query and documents which
are not relevant to that query, a more complete picture of a term's behavior
may be obtained by the consideration of its ability to effect greater separation
between relevant and non-relevant documents while at the same time
moving relevant documents close to each other.
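
One possible reading of this requirement can be sketched as follows. This is an assumption made for illustration and not the exact measure used in this study: for a single query, the effect of a term is taken as the change in average similarity among relevant documents, and between relevant and non-relevant documents, when the term is assigned. The function names and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_similarity(group_a, group_b=None):
    """Average similarity within group_a, or between group_a and group_b if given."""
    if group_b is None:
        pairs = [(a, b) for i, a in enumerate(group_a) for b in group_a[i + 1:]]
    else:
        pairs = [(a, b) for a in group_a for b in group_b]
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


def drop_term(docs, k):
    """Return copies of the document vectors with term k removed."""
    return [[w for t, w in enumerate(d) if t != k] for d in docs]


def relevance_effect(relevant, non_relevant, k):
    """Return (relevant-relevant change, relevant-non-relevant change) for term k.

    Each change is the similarity with the term assigned minus the similarity
    without it. Under this reading, a useful term gives a positive first value
    (relevant documents move closer together) and a negative second value
    (relevant documents move away from non-relevant ones).
    """
    rr_change = mean_similarity(relevant) - mean_similarity(drop_term(relevant, k))
    rn_change = (mean_similarity(relevant, non_relevant)
                 - mean_similarity(drop_term(relevant, k), drop_term(non_relevant, k)))
    return rr_change, rn_change


# Hypothetical judgments for one query: term 0 occurs only in the relevant documents.
toy_relevant = [[1, 1, 0], [1, 0, 1]]
toy_non_relevant = [[0, 1, 0], [0, 0, 1]]
rr_change, rn_change = relevance_effect(toy_relevant, toy_non_relevant, 0)
```

Keeping the two changes separate, rather than combining them into one score, makes it possible to see whether a term helps on one criterion while hurting on the other.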
This study was aimed at testing the extent to which the discrimination
value model considers relevance characteristics of documents in ranking the
index terms. An overview of the more important ideas current in automatic
indexing is provided. The term discrimination value model is discussed
in greater detail. An efficient technique for computing exact term
discrimination values for the relevant/non-relevant document distinction is introduced.
The study is conducted using the KEEN, CRANFIELD, EVANS,
HARDING and LISA document collections and their associated queries and
relevance judgments.
While some of the results are consistent with those derived by previous
workers, in some cases, especially in the case of relevant/relevant discrimination,
the results obtained appear to be in complete disagreement with
Salton's theory: in these cases the medium frequency terms do not appear
to be the most useful terms.
Keywords
Information Science