The Automatic Extraction of Bibliographic Information from Locally Published Journals in Ethiopia: A Feasibility of OCR

Date

2000-05

Publisher

Addis Ababa University

Abstract

Research and development communities use journals as a mechanism of communication among themselves. As the volume of research output increases from time to time, however, it becomes impossible to access each and every report that appears in journals. Therefore, journal articles have to be indexed to facilitate access and control. The activity of indexing has to be systematic, so that research outputs remain accessible to the scientific community. To achieve this lofty goal, indexing has to be done on a regional/national basis to serve as part of the universal bibliographic control of journals. In order to maintain the goal of collecting and indexing publications produced in the country, Ethiopia has established a bibliographic control centre called the Legal Deposit and National Bibliography Team (LDANBT), which is affiliated to the National Archives and Library Agency (NALA). Unfortunately, the LDANBT has not produced a journal index (for article-level access) in either printed or electronic format. In this thesis, therefore, an attempt has been made to develop programme modules that automatically create electronic records out of OCR text obtained from printed journal article title pages. In doing so, the nature of national bibliographic control with respect to journal articles is discussed. As well, techniques for automatically generating bibliographic records from different printed documents are examined. These techniques mainly consist of document analysis and document understanding, which are based on the geometric and non-geometric features of documents. For document analysis, two levels of segmentation are used. The first-level segmentation divides an input text into four zones (a first text zone consisting of journal title, volume, issue number, year and page range; article title; author(s); and author abstract), using white line spacing as the end of a text zone.
The second-level segmentation decomposes the contents of the first text zone into journal title, volume, issue number, year and page range. The results of the two-level segmentation algorithms are then considered for field classification (document understanding). Classification of fields is based on geometric and non-geometric features. The geometric feature, zone order, is used to label article title, author(s) and author abstract. On the other hand, the non-geometric features (different punctuation marks consisting of comma, colon, braces, etc.) serve to label the fields in the first text zone as journal title, volume, issue number, year, and page range. The system is 85.57% successful in correctly segmenting and labelling bibliographic fields. The recognised fields are converted to ISO 2709 format for export into CDS/ISIS for Windows.
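
The two-level segmentation and labelling described in the abstract can be illustrated with a minimal sketch. It is not the thesis's programme modules: the function names, the `HEADER` pattern, and the sample header layout ("Title, volume(issue), year: pages") are all assumptions chosen only to show how blank lines can delimit zones, how zone order can label them, and how punctuation can split the first zone.

```python
import re

# Zone order used as the geometric feature (labels are illustrative names).
ZONE_LABELS = ["first_text_zone", "article_title", "authors", "abstract"]

# Hypothetical header layout: "Journal Title, 12(2), 1998: 45-60".
# Comma, parentheses and colon act as the non-geometric (punctuation) cues.
HEADER = re.compile(
    r"^(?P<journal_title>.+?),\s*(?P<volume>\d+)\s*\((?P<issue>\d+)\),"
    r"\s*(?P<year>\d{4})\s*:\s*(?P<pages>\d+\s*-\s*\d+)\.?$"
)


def segment_zones(ocr_text):
    """First level: treat one or more blank lines as the end of a text zone."""
    zones = re.split(r"\n\s*\n", ocr_text.strip())
    # Collapse internal line breaks and repeated spaces inside each zone.
    return [" ".join(zone.split()) for zone in zones if zone.strip()]


def split_first_zone(zone):
    """Second level: split the first zone into its fields using punctuation."""
    match = HEADER.match(zone)
    return match.groupdict() if match else None


def extract_record(ocr_text):
    """Label zones by their order, then expand the first zone into fields."""
    zones = segment_zones(ocr_text)
    if len(zones) != len(ZONE_LABELS):
        raise ValueError("expected four zones, found %d" % len(zones))
    record = dict(zip(ZONE_LABELS, zones))
    header = split_first_zone(record.pop("first_text_zone"))
    if header is None:
        raise ValueError("first text zone does not match the assumed layout")
    record.update(header)
    return record


# Invented sample input, standing in for OCR text of a title page.
sample = """Hypothetical Journal of Studies, 12(2), 1998: 45-60

An Example Article Title
Spanning Two Lines

A. Author and B. Author

This is the author abstract."""

record = extract_record(sample)
```

A real title page would need a more tolerant pattern, since OCR errors in the punctuation itself would cause this strict expression to reject the header.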

Keywords

Automatic Extraction of Bibliographic
