Word Sense Disambiguation for Amharic Text Retrieval: A Case Study for Legal Documents

Kassie, Teshome

Files

Teshome Kassie.pdf (17 MB)

Date

2009-04

Authors

Kassie, Teshome

Publisher

Addis Ababa University

Abstract

This study demonstrates how linguistic disambiguation based on semantic vector analysis can improve the effectiveness of an Amharic document query retrieval algorithm. Accurate document retrieval based on query criteria is important in every knowledge domain. Retrieving appropriate documents is made more difficult by the fact that many words have different meanings in different contexts. If search engines could disambiguate those words, more accurate document retrieval should be achievable. For this study, an Amharic disambiguation algorithm was developed based on the principles of semantic vectors and implemented in Java. The disambiguation algorithm was then used to develop a document search engine. A set of 865 Ethiopian Amharic-language legal statute documents was selected as the document population to be searched. Ten queries containing Amharic keywords with ambiguous meanings were selected. An expert identified which documents should ideally be retrieved by each query. Depending on the query, the expert identified between 6 and 25 documents that should be retrieved. The semantic vector query algorithm created in this study was compared to the well-known Lucene algorithm. Each query was run using both algorithms. The 20 most relevant documents were identified for each query from each algorithm. For each query, the list of documents retrieved by each algorithm was compared to the list of documents identified by the expert. The number of correct (consistent with the expert's choices) documents retrieved by each algorithm was measured. The semantic vector algorithm was superior for 6 of the 10 queries (Lucene was superior on 2 queries, and on 2 they were tied). This difference was not statistically significant.
However, if the total number of correct document identifications is taken into account (not just which algorithm was superior for each query), then the semantic vector algorithm averaged 82% correct identification of documents, whereas the Lucene algorithm was only 49% accurate. This difference was highly statistically significant (p < 0.02), which is below the significance level (p < 0.05) for rejecting the null hypothesis. The conclusion is that for Amharic legal statute documents, for queries that include ambiguous keywords, the semantic vector algorithm is superior to the Lucene algorithm. Keywords: word sense disambiguation, semantic vectors, information retrieval.
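The two comparisons described in the abstract (which algorithm wins each query, and the pooled share of correct documents) can be illustrated with a minimal Python sketch. This is not the thesis code, which was written in Java. The function names, the document and query identifiers, and the toy data are hypothetical, and the sketch assumes that "correct identification" means the fraction of the expert's documents found in the top 20 results.

```python
def correct_at_k(ranked_ids, expert_ids, k=20):
    """Count documents in the top-k results that the expert also selected."""
    return len(set(ranked_ids[:k]) & set(expert_ids))


def compare(runs_a, runs_b, expert, k=20):
    """Compare two retrieval systems per query and over all queries pooled.

    runs_a, runs_b: dict mapping query id -> ranked list of document ids.
    expert: dict mapping query id -> list of document ids chosen by the expert.
    """
    wins_a = wins_b = ties = 0
    total_a = total_b = total_expert = 0
    for query, expert_ids in expert.items():
        a = correct_at_k(runs_a[query], expert_ids, k)
        b = correct_at_k(runs_b[query], expert_ids, k)
        if a > b:
            wins_a += 1
        elif b > a:
            wins_b += 1
        else:
            ties += 1
        total_a += a
        total_b += b
        total_expert += len(expert_ids)
    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": ties,
        # Assumption: pooled accuracy = correct documents / expert documents.
        "pooled_a": total_a / total_expert,
        "pooled_b": total_b / total_expert,
    }


# Hypothetical toy data (not taken from the thesis).
expert = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
runs_semantic = {"q1": ["d1", "d2", "d3", "d9"], "q2": ["d5", "d8"]}
runs_baseline = {"q1": ["d1", "d9", "d8", "d7"], "q2": ["d5", "d6"]}

result = compare(runs_semantic, runs_baseline, expert)
print(result)
```

Under this assumed definition, a query for which the expert selected more than 20 documents can never reach 100% with a 20-document cutoff, which is one reason the per-query and pooled views can disagree.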

Keywords

word sense disambiguation, semantic vectors, information retrieval

URI

http://etd.aau.edu.et/handle/123456789/21819

Collections

Computer Science
