Topic-Based Amharic Text Summarization
Date
2011-03
Publisher
Addis Ababa University
Abstract
Automatic text summarization is important in today’s information age, where vast amounts of
information are produced for consumption. Ethiopia is no exception: the country has
seen steady growth in digital content available to the general public. Compared with work on major
international languages, text summarization research for Ethiopia’s local languages in general, and for
Amharic in particular, is still at an early stage of development. More work is therefore
needed to meet present and future demand for high-quality
information extracted from large collections of data in a timely manner.
This thesis investigates the problem of building a concept-based, single-document Amharic text
summarization system. Because local languages like Amharic lack extensive linguistic resources, we
propose to build our summarizer on a statistical approach known as topic modeling. The
proposed algorithms are language- and domain-independent and can therefore also be used for other local
languages. More specifically, we use probabilistic latent
semantic analysis (PLSA) as the topic model.
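As an illustration of the kind of model PLSA is, the following minimal sketch fits the standard asymmetric formulation, P(w|d) = sum over z of P(w|z) P(z|d), by expectation-maximization. It is not the thesis implementation. The function names `plsa` and `normalize`, the toy count matrix, and the choice to treat each row of `counts` as one text unit (for example, a sentence of the document) are all assumptions made for this example.

```python
import random


def normalize(row):
    """Scale a list of non-negative numbers so it sums to 1 (uniform if all zero)."""
    total = sum(row)
    if total <= 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM.

    counts[d][w] is the count of term w in text unit d (assumed layout).
    Returns (p_w_z, p_z_d) with p_w_z[z][w] = P(w|z) and p_z_d[d][z] = P(z|d).
    """
    rng = random.Random(seed)
    n_units, n_terms = len(counts), len(counts[0])
    # Random positive initialization of both parameter tables.
    p_w_z = [normalize([rng.random() + 0.01 for _ in range(n_terms)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.01 for _ in range(n_topics)])
             for _ in range(n_units)]

    for _ in range(n_iter):
        acc_w_z = [[0.0] * n_terms for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in range(n_units)]
        for d in range(n_units):
            for w in range(n_terms):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z | d, w) proportional to P(w|z) * P(z|d).
                post = normalize([p_w_z[z][w] * p_z_d[d][z]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    acc_w_z[z][w] += n * post[z]
                    acc_z_d[d][z] += n * post[z]
        # M-step: renormalize the expected counts into probabilities.
        p_w_z = [normalize(row) for row in acc_w_z]
        p_z_d = [normalize(row) for row in acc_z_d]
    return p_w_z, p_z_d


# Toy data: 4 text units over a 4-term vocabulary (hypothetical counts).
counts = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
p_w_z, p_z_d = plsa(counts, n_topics=2)
```

Read column-wise per term, `p_w_z` plays the role of a term-by-concept matrix: each topic `z` is a "concept", and `p_w_z[z][w]` says how strongly term `w` is associated with it.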
We show that a principled use of the term-by-concept matrix that results from a PLSA model can
help produce summaries that capture the main topics of a document. We propose six algorithms that
explore the use of this matrix. All of the algorithms share two steps. In
the first step, keywords of the document are selected using the term-by-concept matrix. In the second
step, the sentences that best cover the keywords are selected for inclusion in the summary. To take
advantage of the kind of text we experiment with (news articles), the algorithms always include the
first sentence of the document in the summary.
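The two common steps can be sketched as follows. This is a hypothetical illustration, not one of the six algorithms in the thesis: the rule for picking keywords (top terms per topic), the sentence score (number of distinct keywords in the sentence), the rounding of the extraction rate, and the names `select_keywords` and `summarize` are all assumptions. A small hand-written matrix stands in for a fitted term-by-concept matrix, and English placeholder tokens stand in for Amharic terms.

```python
import math


def select_keywords(p_w_z, vocab, per_topic=2):
    """Step 1 (assumed rule): take the highest-probability terms of each topic."""
    keywords = set()
    for row in p_w_z:
        ranked = sorted(range(len(vocab)), key=lambda w: row[w], reverse=True)
        keywords.update(vocab[w] for w in ranked[:per_topic])
    return keywords


def summarize(sentences, keywords, rate=0.25):
    """Step 2 (assumed rule): keep the sentences containing the most keywords.

    sentences is a list of token lists; returns the indices of the selected
    sentences in document order. The first sentence is always included.
    """
    budget = max(1, math.ceil(rate * len(sentences)))
    chosen = [0]  # news-article heuristic: always keep the first sentence
    # Rank the remaining sentences by keyword overlap, earlier sentences first on ties.
    ranked = sorted(
        range(1, len(sentences)),
        key=lambda i: (-len(keywords & set(sentences[i])), i),
    )
    chosen += ranked[:budget - 1]
    return sorted(chosen)


# Hypothetical term-by-concept probabilities P(w|z) for two topics.
vocab = ["bank", "loan", "rain", "crop"]
p_w_z = [[0.5, 0.4, 0.05, 0.05],
         [0.05, 0.05, 0.5, 0.4]]

sentences = [
    ["news", "today"],
    ["bank", "loan"],
    ["rain", "crop"],
    ["bank", "rain"],
]

keywords = select_keywords(p_w_z, vocab, per_topic=1)
summary = summarize(sentences, keywords, rate=0.5)
```

With one keyword per topic, the keyword set is `{"bank", "rain"}`; at a 50% extraction rate the summary keeps sentence 0 (always included) and sentence 3, the only one that contains both keywords.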
We evaluated the proposed algorithms in terms of precision and recall for summaries at 20%, 25% and 30%
extraction rates. The best results achieved are 0.45511 at 20%, 0.48499 at 25% and
0.52012 at 30%. Using our Amharic data set, we also compared our systems with earlier
topic-modeling-based summarization methods developed for other languages.
Our results show that the proposed algorithms perform better at all extraction rates.
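For readers unfamiliar with how an extractive summary is scored, the sketch below computes precision and recall by comparing the sentence indices chosen by a system against those in a reference summary. The abstract does not spell out the exact evaluation protocol, so this sentence-overlap definition, the function name `precision_recall`, and the example indices are assumptions.

```python
def precision_recall(system, reference):
    """Sentence-level precision and recall of an extractive summary.

    system and reference are collections of selected sentence indices.
    """
    system, reference = set(system), set(reference)
    overlap = len(system & reference)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical selections: the system picked 3 sentences, the reference has 4.
precision, recall = precision_recall([0, 3, 5], [0, 2, 3, 7])
```

Here two of the three system sentences appear in the reference, so precision is 2/3 and recall is 2/4. When the system and reference summaries contain the same number of sentences, the two values coincide.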
Keywords: Amharic Text Summarization, Keyword Approach, Probabilistic Latent Semantic Analysis
Keywords
Amharic Text Summarization; Keyword Approach; Probabilistic Latent Semantic Analysis