Efficient Language Independent Text Summarization Using Graph Based Approach

No Thumbnail Available

Date

2015-06

Journal Title

Journal ISSN

Volume Title

Publisher

Addis Ababa University

Abstract

Concise and Informative summary facilitates information consumption, by shortening large documents focusing on important topics in the document. Automatic summary generation can be abstractive or extractive where the abstractive generation tries to rewrite concepts in a document the extractive one summarizes by selecting the most important sentences from the document. Much of the automatic text summarization researches have focused on the extractive summary due to the Natural Language Processing (NLP) requirement for the abstractive summary generation, and the requirement of NLP in generating abstract summaries becomes a deterrent factor for further research in abstractive summary generation as NLP itself is yet another Computer Science field which is in its developmental stage. Interest in automating the process of document summarization began in the late 1950, and through the past six to seven decades a number of researchers have introduced different approaches to automatic text summarization. Generally speaking the extractive summarization techniques can be classified as supervised, which requires training data and unsupervised, which doesn’t require training data. A summary can also be generic or genre specific. In this thesis we proposed a graph based automatic text summarization algorithm which is generic and unsupervised. In graph based text summarization sentences in the document are represented as nodes of the graph and similarity between them as edges between the nodes, and the nodes (sentences) will be ranked based on their similarity with other sentences. Previous works using graph based approach, namely TextRank and LexRank adopt already existing ranking algorithms which are iterative and that were originally designed for ranking web pages by analyzing their citation links. Given the fact that there are some differences between the graphical representations of web references and text documents we proposed a new algorithm which will rank each sentence node of a document graphical representation in a more efficient way than PageRank and generate a better informative summary. Two algorithms called Independent Rank (IR) and Sentence Rank (SR) which will be used in combination to rank each sentence. The IR rank each sentence based on the degree of the sentence and the weight of its edges, giving us the measure of how important that sentence is in that document assuming that a sentence is important if it shares contents with more sentences. SR follows similar line of thinking with the only difference of considering the importance of those sentences to which it is sharing a content i.e. the IR of the sentence, the main assumption being a sentence is important not only depending the size of its edges and its degree but also based on the importance of those sentences to which it is sharing a content with. And to take advantage of our newly proposed approach we suggested a new similarity measures which simply counts the number of contents shared between two sentences regardless of the size of the sentences and we called this new similarity measure Content Overlap (CO). II Our proposed algorithm reduces the polynomial order iterations (which starts with a minimum of n4 of iterations) of PageRank represented by the big O notation as O(nc), where c depends on the number of sentences n, the weight of edges, and the convergence value selected to a maximum of quadratic order iteration represented by the big O notation as O(n2) and generates a better informative summary which is evidenced by the improved Recall Oriented Understudy for Gisting Evaluation (ROUGE) result. Our algorithm reported a ROUGE-1 result of 0.5238 on half of 2002 DUC dataset for stemmed and stop words removed summarization which is an elevated improvement both against the top performing system in DUC 2002, the highest of them reporting 0.4405 ROUGE-1 result and TextRank, that uses a graph based approach, with reported ROUGE-1 result of 0.4229 on the same data set. Generally graph based sentence ranking algorithms are language independent, and our new algorithm is language independent which we have shown by building a prototype which was experimented on English and Amharic test data. Keywords: Summarization, Graph Based Sentence Ranking, Node Centrality Measures.

Description

Keywords

Summarization; Graph Based Sentence Ranking; Node Centrality Measures

Citation

Collections