Word Sense Disambiguation for Amharic Text Retrieval: A Case Study for Legal Documents

Kassie Teshome

dc.contributor.advisor	Alemayehu Nega (PhD)
dc.contributor.author	Kassie Teshome
dc.date.accessioned	2018-06-25T13:03:17Z
dc.date.accessioned	2023-11-29T04:05:49Z
dc.date.available	2018-06-25T13:03:17Z
dc.date.available	2023-11-29T04:05:49Z
dc.date.issued	2009-04
dc.description.abstract	This study demonstrates how linguistic disambiguation based on semantic vector analysis can improve the effectiveness of an Amharic document query retrieval algorithm. Accurate document retrieval based on query criteria is important in every knowledge domain. Retrieving appropriate documents is made more difficult by the fact that many words have different meanings in different contexts. If search engines could disambiguate those words, they should be able to retrieve documents more accurately. For this study, an Amharic disambiguation algorithm was developed based on the principles of semantic vectors and implemented in Java. The disambiguation algorithm was then used to develop a document search engine. A set of 865 Ethiopian Amharic-language legal statute documents was selected as the document population to be searched. Ten queries containing Amharic keywords with ambiguous meanings were selected. An expert identified which documents should ideally be retrieved by each query. Depending on the query, the expert identified between 6 and 25 documents that should be retrieved. The semantic vector query algorithm created in this study was compared to the well-known Lucene algorithm. Each query was run using both algorithms. The 20 most relevant documents were identified for each query from each algorithm. For each query, the list of documents retrieved by each algorithm was compared to the list of documents identified by the expert. The number of correct (consistent with the expert’s choices) documents retrieved by each algorithm was measured. The semantic vector algorithm was superior for 6 of the 10 queries, Lucene was superior on 2 queries, and the two were tied on 2. This difference was not statistically significant.
However, if the total number of correct document identifications is taken into account (not just which algorithm was superior for each query), the semantic vector algorithm averaged 82% correct identification of documents, whereas the Lucene algorithm was only 49% accurate. This difference was highly statistically significant (p < 0.02), below the significance level (p < 0.05) for rejecting the null hypothesis. The conclusion is that, for Amharic legal statute documents and queries that include ambiguous keywords, the semantic vector algorithm is superior to the Lucene algorithm. Keywords: word sense disambiguation, semantic vectors, information retrieval.	en_US
dc.identifier.uri	http://etd.aau.edu.et/handle/123456789/3315
dc.language.iso	en	en_US
dc.publisher	Addis Ababa University	en_US
dc.subject	Word Sense Disambiguation	en_US
dc.subject	Semantic Vectors	en_US
dc.subject	Information Retrieval	en_US
dc.title	Word Sense Disambiguation for Amharic Text Retrieval: A Case Study for Legal Documents	en_US
dc.type	Thesis	en_US
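
The evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code (which the abstract says was written in Java), and the abstract does not name the statistical test used for the per-query comparison. The two-sided sign test below is an assumption, chosen because it is a common test for win/loss counts with ties dropped. The function names and the document IDs are hypothetical.

```python
from math import comb


def correct_in_top_k(retrieved, relevant, k=20):
    """Count how many of the first k retrieved documents the expert marked relevant."""
    return len(set(retrieved[:k]) & set(relevant))


def sign_test_p_value(wins_a, wins_b):
    """Two-sided sign test p-value for per-query wins (assumed test, ties excluded)."""
    n = wins_a + wins_b
    # Probability of a split at least this lopsided under a fair coin, one tail.
    tail = sum(comb(n, i) for i in range(max(wins_a, wins_b), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical ranked result list and expert judgement for one query.
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d3", "d5"}
print(correct_in_top_k(retrieved, relevant))  # 2

# Per-query outcome reported in the abstract: 6 wins, 2 losses, 2 ties.
print(round(sign_test_p_value(6, 2), 3))  # 0.289
```

Under this assumed test, a 6-to-2 split gives p of about 0.29, which agrees with the abstract's statement that the per-query difference was not statistically significant.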

Files

Original bundle

Name: Teshome Kassie.pdf
Size: 701.79 KB
Format: Adobe Portable Document Format

Collections

Environmental Science