Word Sense Disambiguation for Amharic Text Retrieval: A Case Study for Legal Documents
Date
2009-04
Publisher
Addis Ababa University
Abstract
This study demonstrates how linguistic disambiguation based on semantic vector analysis can
improve the effectiveness of an Amharic document query retrieval algorithm.
Accurate document retrieval based on query criteria is important in every knowledge domain.
Retrieving appropriate documents is made more difficult by the fact that many words
can have different meanings in different contexts. If search engines could disambiguate those
words, more accurate document retrieval should be achievable.
For this study, an Amharic disambiguation algorithm was developed based on the principles of
semantic vectors and implemented in Java. The disambiguation algorithm was then used to
develop a document search engine.
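
The general idea of vector-based sense selection can be sketched as follows. This is a minimal illustration, not the study's implementation: the class name, the method names, the sense labels, and the English placeholder tokens are all assumptions made for the example, and plain term-frequency vectors with cosine similarity stand in for whatever vector construction the study actually used.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: choose the sense whose vector is closest to the query context.
final class SenseDisambiguationSketch {

    /** Builds a sparse term-frequency vector from a list of tokens. */
    static Map<String, Double> toVector(List<String> tokens) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : tokens) {
            vector.merge(token, 1.0, Double::sum);
        }
        return vector;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity of two sparse vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Returns the label of the best-matching sense, or null if no senses are given. */
    static String disambiguate(List<String> context,
                               Map<String, Map<String, Double>> senseVectors) {
        Map<String, Double> contextVector = toVector(context);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Map<String, Double>> sense : senseVectors.entrySet()) {
            double score = cosine(contextVector, sense.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = sense.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder sense inventory; a real system would hold Amharic terms.
        Map<String, Map<String, Double>> senses = new HashMap<>();
        senses.put("sense_court", toVector(List.of("judge", "court", "appeal")));
        senses.put("sense_sport", toVector(List.of("ball", "court", "player")));
        System.out.println(
            disambiguate(List.of("the", "judge", "heard", "the", "appeal"), senses));
    }
}
```

In a search engine, the selected sense would then be used to restrict or re-rank the documents matched by the ambiguous keyword.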
A set of 865 Ethiopian Amharic-language legal statute documents was selected as the
document population to be searched. Ten queries containing Amharic keywords with
ambiguous meanings were selected. An expert identified which documents should
ideally be retrieved by each query. Depending on the query, the expert identified between 6 and
25 documents that should be retrieved.
The semantic vector query algorithm created in this study was compared to the well-known
Lucene algorithm. Each query was run using both algorithms. The 20 most relevant documents
were identified for each query from each algorithm.
For each query, the list of documents retrieved by each algorithm was compared to the list of
documents identified by the expert. The number of correct (consistent with the expert's choices)
documents retrieved by each algorithm was measured.
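
The comparison described above amounts to intersecting each algorithm's top-ranked list with the expert's list. The sketch below illustrates that count under stated assumptions: the class and method names and the document identifiers are hypothetical, and normalizing by the size of the expert list is only one possible way to turn the count into a percentage, since the abstract does not say how its percentages were computed.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: score one ranked result list against an expert's relevant set.
final class RetrievalScoringSketch {

    /** Counts how many of the top `cutoff` ranked documents appear in the expert set. */
    static int countCorrect(List<String> ranked, Set<String> expertRelevant, int cutoff) {
        int limit = Math.min(cutoff, ranked.size());
        Set<String> top = new LinkedHashSet<>(ranked.subList(0, limit));
        top.retainAll(expertRelevant);
        return top.size();
    }

    /** Fraction of the expert's documents found in the top `cutoff` (assumed normalization). */
    static double fractionOfExpertSetFound(List<String> ranked,
                                           Set<String> expertRelevant,
                                           int cutoff) {
        if (expertRelevant.isEmpty()) {
            return 0.0;
        }
        return (double) countCorrect(ranked, expertRelevant, cutoff) / expertRelevant.size();
    }

    public static void main(String[] args) {
        // Placeholder identifiers; the study used 865 legal statute documents.
        List<String> ranked = List.of("doc1", "doc2", "doc3", "doc4");
        Set<String> expert = Set.of("doc2", "doc4", "doc9");
        System.out.println(countCorrect(ranked, expert, 20));
        System.out.println(fractionOfExpertSetFound(ranked, expert, 20));
    }
}
```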
The semantic vector algorithm was superior for 6 of the 10 queries (Lucene was superior on 2
queries, and on 2 they were tied). This difference was not statistically significant. However, if
the total number of correct document identifications is taken into account (not just which
algorithm was superior for each query), then the semantic vector algorithm averaged 82% correct
identification of documents, whereas the Lucene algorithm was only 49% accurate. This difference
was highly statistically significant (p < 0.02), below the significance level (p < 0.05) for
rejecting the null hypothesis.
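
To illustrate why a 6-to-2 split across queries can fall short of significance, the sketch below applies an exact two-sided sign test to the win counts, with tied queries excluded. The abstract does not name the statistical test the study used, so the choice of the sign test is an assumption, as are the class and method names.

```java
// Hypothetical sketch: exact two-sided sign test on per-query wins (ties dropped).
final class SignTestSketch {

    /** Binomial coefficient C(n, k), computed iteratively as a double. */
    static double binomial(int n, int k) {
        double c = 1.0;
        for (int i = 1; i <= k; i++) {
            c = c * (n - k + i) / i;
        }
        return c;
    }

    /** Two-sided p-value under the null hypothesis that either side wins with probability 0.5. */
    static double twoSidedSignTest(int winsA, int winsB) {
        int n = winsA + winsB;
        if (n == 0) {
            return 1.0;
        }
        int larger = Math.max(winsA, winsB);
        double tail = 0.0;
        for (int i = larger; i <= n; i++) {
            tail += binomial(n, i);
        }
        return Math.min(1.0, 2.0 * tail / Math.pow(2.0, n));
    }

    public static void main(String[] args) {
        // 6 queries favored the semantic vector algorithm, 2 favored Lucene, 2 were tied.
        System.out.println(twoSidedSignTest(6, 2)); // 74/256, about 0.29
    }
}
```

Under this assumed test, the p-value for 6 wins against 2 is about 0.29, well above 0.05, which is consistent with the per-query difference not being significant.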
The conclusion is that, for Amharic legal statute documents and queries that include ambiguous
keywords, the semantic vector algorithm is superior to the Lucene algorithm.
Keywords: word sense disambiguation, semantic vectors, Information retrieval.