Word Sense Disambiguation for Amharic Text Retrieval: A Case Study for Legal Documents

dc.contributor.advisorAlemayehu Nega (PhD)
dc.contributor.authorKassie Teshome
dc.date.accessioned2018-06-25T13:03:17Z
dc.date.accessioned2023-11-29T04:05:49Z
dc.date.available2018-06-25T13:03:17Z
dc.date.available2023-11-29T04:05:49Z
dc.date.issued2009-04
dc.description.abstractThis study demonstrates how linguistic disambiguation based on semantic vector analysis can improve the effectiveness of an Amharic document retrieval algorithm. Accurate document retrieval based on query criteria is important in every knowledge domain. Retrieving appropriate documents is made more difficult by the fact that many words have different meanings in different contexts. If search engines could disambiguate those words, more accurate retrieval should be achievable. For this study, an Amharic disambiguation algorithm was developed based on the principles of semantic vectors and implemented in Java. The disambiguation algorithm was then used to develop a document search engine. A set of 865 Ethiopian Amharic-language legal statute documents was selected as the document population to be searched. Ten queries containing Amharic keywords with ambiguous meanings were selected. An expert identified which documents should ideally be retrieved by each query; depending on the query, the expert identified between 6 and 25 such documents. The semantic vector query algorithm created in this study was compared to the well-known Lucene algorithm. Each query was run using both algorithms, and the 20 most relevant documents returned by each algorithm were identified for each query. For each query, the list of documents retrieved by each algorithm was compared to the list identified by the expert, and the number of correct documents (those consistent with the expert's choices) retrieved by each algorithm was counted. The semantic vector algorithm was superior for 6 of the 10 queries, Lucene was superior for 2, and the two algorithms tied on 2. This difference was not statistically significant.
However, when the total number of correct document identifications is taken into account (not just which algorithm was superior for each query), the semantic vector algorithm averaged 82% correct identification of documents, whereas the Lucene algorithm was only 49% accurate. This difference was highly statistically significant (p < 0.02, below the p < 0.05 level used for rejecting the null hypothesis). The conclusion is that, for Amharic legal statute documents and queries that include ambiguous keywords, the semantic vector algorithm is superior to the Lucene algorithm. Keywords: word sense disambiguation, semantic vectors, information retrieval.en_US
dc.identifier.urihttp://etd.aau.edu.et/handle/123456789/3315
dc.language.isoenen_US
dc.publisherAddis Ababa Universityen_US
dc.subjectWord Sense Disambiguationen_US
dc.subjectSemantic Vectorsen_US
dc.subjectInformation Retrievalen_US
dc.titleWord Sense Disambiguation for Amharic Text Retrieval: A Case Study for Legal Documentsen_US
dc.typeThesisen_US
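
The evaluation step described in the abstract (comparing each algorithm's top-ranked documents with the expert's list and counting the matches) can be illustrated with a small Java sketch. This is not the thesis code: the class name TopKOverlap, the method countCorrect, the document IDs, and the cutoff of 3 are all hypothetical, chosen only to keep the example short (the study used the 20 most relevant documents per query). The abstract does not say how the 82% and 49% figures were normalized, so the sketch stops at the raw count.

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch only; names and data are hypothetical, not from the thesis.
public class TopKOverlap {

    /**
     * Counts how many of the first k documents in a ranked result list
     * also appear in the expert's set of relevant documents.
     */
    static int countCorrect(List<String> ranked, Set<String> expert, int k) {
        int hits = 0;
        int limit = Math.min(k, ranked.size()); // a result list may be shorter than k
        for (int i = 0; i < limit; i++) {
            if (expert.contains(ranked.get(i))) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical expert judgement for one query.
        Set<String> expert = Set.of("doc-03", "doc-07", "doc-11");

        // Hypothetical ranked output of two retrieval systems for the same query.
        List<String> systemA = List.of("doc-07", "doc-02", "doc-03", "doc-11");
        List<String> systemB = List.of("doc-02", "doc-07", "doc-09", "doc-05");

        int k = 3; // assumed cutoff for this toy example
        System.out.println("System A correct in top " + k + ": " + countCorrect(systemA, expert, k));
        System.out.println("System B correct in top " + k + ": " + countCorrect(systemB, expert, k));
    }
}
```

Running the same count for every query and both systems would give the per-query comparison (which system retrieved more correct documents) as well as the totals that the abstract aggregates.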

Files

Original bundle
Name: Teshome Kassie.pdf
Size: 701.79 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.71 KB
Format: Plain Text