Concept-based Amharic Documents Similarity (CADS) Measure
Date
2013-12
Publisher
Addis Ababa University
Abstract
Similarity measures are important in NLP applications such as search engines,
information extraction and document classification. These NLP applications have been implemented for the
Amharic language. However, most of them rely on simple matching techniques or probabilistic
methods to measure similarity. These approaches do not always accurately capture conceptual
relatedness as judged by humans. Some studies try to consider the semantic nature of a
document without handling the ambiguity of words. In this research, we propose Concept-based
Amharic Document Similarity (CADS) by building AmhWordNet.
The objective of this research is to implement an effective document similarity measure by
considering issues such as polysemy, synonymy and semantic relationships between words. The
main components of the proposed system (CADS) are AmhWordNet and the Concept-based
Similarity Measure (CSM). CSM consists of Word Sense Disambiguation (WSD), Concept Tree
Extraction and Semantic Similarity Measure modules.
AmhWordNet is used as input during concept tree extraction and to implement the WSD
module. The extracted concept tree, together with the WSD module, helps to find the semantic
similarity between words. The output of word similarity is used to compute sentence similarity.
Finally, document similarity is computed based on sentence similarities.
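The word-to-sentence-to-document pipeline described above can be illustrated with a minimal sketch. The abstract does not state how word similarities are aggregated, so the symmetric best-match averaging used here is an assumption, not the CADS method. The names `best_match_avg`, `symmetric`, `sentence_similarity`, `document_similarity` and `toy_word_sim` are hypothetical, and the toy word-similarity table (with English placeholder words) merely stands in for a WordNet-based measure.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def best_match_avg(xs: Sequence[T], ys: Sequence[T],
                   sim: Callable[[T, T], float]) -> float:
    """Average, over items in xs, of the best similarity to any item in ys."""
    if not xs or not ys:
        return 0.0
    return sum(max(sim(x, y) for y in ys) for x in xs) / len(xs)

def symmetric(xs: Sequence[T], ys: Sequence[T],
              sim: Callable[[T, T], float]) -> float:
    """Average the two directions so the score does not depend on argument order."""
    return 0.5 * (best_match_avg(xs, ys, sim) + best_match_avg(ys, xs, sim))

def sentence_similarity(s1: Sequence[str], s2: Sequence[str],
                        word_sim: Callable[[str, str], float]) -> float:
    """Sentence similarity built from word similarities (assumed aggregation)."""
    return symmetric(s1, s2, word_sim)

def document_similarity(d1: Sequence[Sequence[str]], d2: Sequence[Sequence[str]],
                        word_sim: Callable[[str, str], float]) -> float:
    """Document similarity built from sentence similarities (assumed aggregation)."""
    return symmetric(d1, d2, lambda a, b: sentence_similarity(a, b, word_sim))

# Hypothetical stand-in for a concept-based word similarity measure.
TOY_SIMILARITIES = {frozenset({"car", "automobile"}): 0.9}

def toy_word_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SIMILARITIES.get(frozenset({a, b}), 0.0)

doc_a = [["car", "fast"]]
doc_b = [["automobile", "fast"]]
score = document_similarity(doc_a, doc_b, toy_word_sim)
```

Here a document is a list of sentences and a sentence is a list of tokens. Swapping `toy_word_sim` for a sense-aware measure would leave the two aggregation layers unchanged, which is the point of the layered design.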
The performance of CADS is evaluated using precision, recall and F-measure.
CADS without WSD (CADS WoWSD), Pointwise Mutual Information (PMI), Jaccard and
Cosine similarity measures are also implemented so that the five systems can be
compared. According to the results of the experiment we conducted, the proposed system
performs better than the existing ones.
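For reference, two of the baseline measures and the evaluation metrics named above can be sketched as follows. This is a minimal illustration using the standard textbook definitions over token lists and sets of document identifiers. The function names are hypothetical, the abstract does not describe the actual experimental setup, and the PMI baseline is omitted because it requires corpus co-occurrence counts.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Sequence, Tuple

def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity: shared distinct tokens over all distinct tokens."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def cosine(a: Sequence[str], b: Sequence[str]) -> float:
    """Cosine similarity between raw term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def precision_recall_f1(predicted: Iterable[Hashable],
                        relevant: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-measure for a set of predicted items."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Placeholder inputs for illustration only (not data from the thesis).
j = jaccard(["a", "b", "c"], ["b", "c", "d"])
c = cosine(["a", "b"], ["a", "b"])
p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 3, 5})
```

Both baselines compare surface tokens only, so two documents that express the same idea with different words score low, which is the limitation the concept-based approach is meant to address.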
Keywords
Word Sense Disambiguation, Concept Tree Extraction, Amharic WordNet, Concept-based Similarity Measure.