Incorporation of Relevance Data in the Term Discrimination Value






Addis Ababa University


Indexing in information retrieval is used to obtain a suitable vocabulary of index terms and an optimum assignment of these terms to documents, in order to increase the effectiveness and efficiency of the retrieval system. A great many automatic indexing models have been developed over the years in an effort to produce indexing methods that are both effective and usable in practice. One of the most elegant approaches to the automatic selection and weighting of index terms is the term discrimination value model developed by Salton and his co-workers. This model ranks index terms according to how well they discriminate the documents of a collection from each other; that is, the value of an index term depends on how much the average separation between individual documents changes when the given term is assigned for content identification. It is suggested that the most useful index terms, those which achieve the greatest separation, are the medium-frequency terms. Since the basic requirement in effective retrieval is the separation between documents which are relevant to a given query and documents which are not relevant to that query, a more complete picture of a term's behavior may be obtained by considering its ability to effect greater separation between relevant and non-relevant documents while at the same time moving relevant documents closer to each other. This study was aimed at testing the extent to which the discrimination value model considers the relevance characteristics of documents in ranking the index terms. An overview of the more important ideas current in automatic indexing is provided. The term discrimination value model is discussed in greater detail. An efficient technique for computing exact term discrimination values for the relevant/non-relevant document distinction is introduced.
The study is conducted using the KEEN, CRANFIELD, EVANS, HARDING and LISA document collections and their associated queries and relevance judgments. While some of the results are consistent with those derived by previous workers, in some cases, especially in the case of relevant - relevant discrimination, the results obtained appear to be in complete disagreement with Salton's theory: they indicate that the medium-frequency terms are not the most useful terms.
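
As a rough illustration of the discrimination value idea described above, the following sketch scores each term by how much the average similarity of the documents changes when that term is removed. It is not the technique introduced in the study: the use of cosine similarity, the centroid-based density measure, and the function names `cosine`, `density` and `discrimination_values` are all assumptions made for the example. A positive value means that removing the term makes the documents look more alike, so the term helps to separate them.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def density(docs):
    """Average similarity of each document to the collection centroid.

    Assumption: the centroid is used as a cheap stand-in for averaging
    over all document pairs.
    """
    n = len(docs)
    centroid = [sum(column) / n for column in zip(*docs)]
    return sum(cosine(doc, centroid) for doc in docs) / n


def discrimination_values(docs):
    """Return one score per term: density without the term minus density with it."""
    base = density(docs)
    values = []
    for k in range(len(docs[0])):
        # Drop term k from every document vector and re-measure the density.
        reduced = [doc[:k] + doc[k + 1:] for doc in docs]
        values.append(density(reduced) - base)
    return values
```

With rows as documents and columns as terms, `discrimination_values([[1, 1, 0], [1, 0, 1]])` gives a negative score to the first term, which occurs in every document, and positive scores to the two terms that each occur in only one document. Restricting `docs` to the relevant documents of a query, or comparing relevant against non-relevant sets, would be one way to probe the relevance-based separation the study investigates, again only as a sketch.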



Information Science