Design of Local Web Content Observatory System

Tsegaye, Gashaw

Files

Gashaw Tsegaye.pdf (1.02 MB)

Date

2015-03

Authors

Tsegaye, Gashaw

Publisher

Addis Ababa University

Abstract

The amount of information on the Web and the number of Internet users are both growing rapidly. Web content is becoming increasingly multilingual and covers increasingly diverse subjects. For a particular group or country, it is very difficult to know how much Web content is published, in which languages, and on which subjects. Knowing the status of the local Web content of a country or culture is critically important for making informed decisions on policy and strategy for the development of a multilingual and multicultural Web. This research therefore aims to design a local Web content observatory system that periodically measures and reports the qualitative and quantitative content of different domains. The system consists of six main components: the crawler, content extractor, statistical tracker, language identifier, Web document categorizer, and report generator. The crawler downloads documents; the language identifier then detects the language of each crawled Web document and inserts the detected language into a database. The statistical tracker monitors the crawler and records statistical data. The Web document categorizer assigns the collected documents to the selected subject categories. The report generator provides statistical information about the detected languages and the distribution of Web documents per language across the selected sets of domains. To test and evaluate the system, we selected all domains hosted under the .et domain. About two thousand seed URLs under the .et domain were used, and the crawler collected 263,031 Web documents. The language identifier achieved an accuracy rate of 98.67%. To demonstrate the effectiveness of the Web document categorizer, precision, recall, and F-measure tests were conducted: an average precision of 91.7%, recall of 97.2%, and F-measure of 94.25% were obtained for English documents, and a precision of 91.7%, recall of 87.85%, and F-measure of 86.65% for Amharic documents. The average accuracy rate of the statistical tracker is 98.72%.
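The categorizer figures above are reported as averages, and the F-measures are not exactly the harmonic means of the reported precision and recall. That is what one would expect if the metrics were computed per subject category and then averaged, although the abstract does not state the averaging method. The following is a minimal sketch under that assumption (macro-averaging); the category names and counts are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Counts:
    """Per-category confusion counts (hypothetical values in the example)."""
    tp: int  # documents correctly assigned to the category
    fp: int  # documents wrongly assigned to the category
    fn: int  # documents of the category that were missed


def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Counts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f_measure(p: float, r: float) -> float:
    # Harmonic mean of precision and recall (F1).
    return 2 * p * r / (p + r) if (p + r) else 0.0


def macro_average(per_category: dict[str, Counts]) -> tuple[float, float, float]:
    """Average precision, recall and F-measure over categories.

    Assumption: each metric is computed per category first, then averaged.
    """
    ps = [precision(c) for c in per_category.values()]
    rs = [recall(c) for c in per_category.values()]
    fs = [f_measure(p, r) for p, r in zip(ps, rs)]
    return mean(ps), mean(rs), mean(fs)


# Hypothetical counts for two illustrative categories.
example = {
    "news": Counts(tp=90, fp=10, fn=5),
    "sport": Counts(tp=40, fp=10, fn=20),
}
avg_p, avg_r, avg_f = macro_average(example)
print(f"precision={avg_p:.4f} recall={avg_r:.4f} F-measure={avg_f:.4f}")
```

With this averaging, the mean F-measure generally differs from the harmonic mean of the mean precision and mean recall, so the three reported numbers need not satisfy the F-measure formula exactly.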

Keywords

Information Retrieval; Crawler; Language Identification; Web Document Categorization; Local Web Content Observatory System

URI

http://etd.aau.edu.et/handle/123456789/1935

Collections

Environmental Science
