Concept-based Amharic Documents Similarity (CADS) Measure

Date

2013-12

Publisher

Addis Ababa University

Abstract

Similarity measures are important in NLP applications such as search engines, information extraction and document classification. Such NLP applications have been implemented for the Amharic language. However, most of them rely on simple matching techniques or probabilistic methods to measure similarity. These approaches do not always accurately capture conceptual relatedness as judged by humans. Some studies consider the semantic nature of a document but do not handle the ambiguity of words. In this research, we propose Concept-based Amharic Document Similarity (CADS), built on an Amharic WordNet (AmhWordNet). The objective of this research is to implement an effective document similarity measure that accounts for polysemy, synonymy and semantic relationships between words. The main components of the proposed system (CADS) are AmhWordNet and the Concept-based Similarity Measure (CSM). CSM consists of Word Sense Disambiguation (WSD), Concept Tree Extraction and Semantic Similarity Measure modules. AmhWordNet is used as input during concept tree extraction and to implement the WSD module. The extracted concept tree, together with the WSD module, helps to find the semantic similarity between words. The output of word similarity is used to compute sentence similarity. Finally, document similarity is computed from the sentence similarities. The performance of CADS is evaluated using the precision, recall and F-measure metrics. CADS without WSD (CADS WoWSD), Pointwise Mutual Information (PMI), Jaccard and Cosine similarity measures are also implemented so that the five systems can be compared. According to the results of the experiment we conducted, the proposed system performs better than the existing ones.
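
For readers unfamiliar with the baselines and metrics named in the abstract, the following is a minimal sketch of Jaccard similarity, Cosine similarity, and precision, recall and F-measure. It is not the thesis implementation. The whitespace tokenizer, the function names, and the representation of evaluation data as sets of items are all assumptions made for illustration; the record does not describe preprocessing or the evaluation protocol.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization. The thesis may apply
    # Amharic-specific normalization or stemming that is not shown here.
    return text.split()


def jaccard_similarity(doc_a, doc_b):
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    set_a, set_b = set(tokenize(doc_a)), set(tokenize(doc_b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(doc_a, doc_b):
    # Cosine of the angle between raw term-frequency vectors.
    freq_a, freq_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def precision_recall_f1(predicted, relevant):
    # predicted: items the system labels as similar (hypothetical input format).
    # relevant: items judged similar by humans (hypothetical input format).
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Both baselines compare surface tokens only, so two documents that express the same concept with different synonyms score low. That is the gap the abstract says the concept-based measure is meant to address.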

Keywords

Word Sense Disambiguation, Concept Tree Extraction, Amharic WordNet, Concept-based Similarity Measure.
