The Automatic Extraction of Bibliographic Information from Locally Published Journals in Ethiopia: A Feasibility of OCR
Date
2000-05
Publisher
Addis Ababa University
Abstract
Research and development communities use journals as a means of communication
among themselves. As the volume of research output grows over time, however, it becomes
impossible to access each and every report that appears in journals. Therefore, journal
articles have to be indexed to facilitate access and control. The activity of indexing has to be
systematic, so that research outputs remain accessible to the scientific community. To
achieve this goal, indexing has to be carried out on a regional/national basis to serve as part of
the universal bibliographic control of journals.
In order to collect and index the publications produced in the
country, Ethiopia has established a bibliographic control centre called the Legal Deposit and
National Bibliography Team (LDANBT), which is affiliated to the National Archives and
Library Agency (NALA). Unfortunately, the LDANBT has not produced a journal index (for
article-level access) in either printed or electronic format. In this thesis, therefore, an
attempt has been made to develop programme modules that automatically create electronic
records out of OCR text obtained from printed journal article title pages. In doing so, the
nature of national bibliographic control with respect to journal articles is discussed. As well,
techniques for automatically generating bibliographic records from different printed
documents are examined. These techniques mainly consist of document analysis and document
understanding, which are based on the geometric and non-geometric features of documents.
For document analysis, two levels of segmentation are used. The first-level segmentation
divides an input text into four zones: a first text zone (consisting of journal title, volume, issue
number, year and page range), article title, author(s) and author abstract. It uses white line
spacing to mark the end of a text zone. The second-level segmentation breaks down the contents of
the first text zone into journal title, volume, issue number, year and page range. The
results of the two-level segmentation algorithms are then considered for field classification
(document understanding). Classification of fields is based on geometric and non-geometric
features. The geometric feature, zone order, is used to label article title, author(s)
and author abstract. On the other hand, the non-geometric features (different punctuation
marks consisting of comma, colon, braces, etc.) serve to label the fields in the first text zone
as journal title, volume, issue number, year, and page range. The system is 85.57%
successful in correctly segmenting and labelling bibliographic fields. The recognised fields
are converted to ISO 2709 format for export into CDS/ISIS for Windows.
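The two-level segmentation and labelling described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the blank-line rule for zone boundaries, and especially the citation-line layout ("Title, Vol. N, No. N (YYYY): pp-pp") are assumptions chosen for illustration, and the journal name in the sample is hypothetical.

```python
import re

# Zone order as described in the abstract: the first text zone, then
# article title, author(s), and author abstract.
ZONE_LABELS = ["first_text_zone", "article_title", "authors", "abstract"]


def segment_zones(ocr_text):
    """First-level segmentation: treat blank lines as zone boundaries."""
    blocks = re.split(r"\n\s*\n", ocr_text.strip())
    # Collapse internal line breaks and repeated spaces inside each zone.
    zones = [" ".join(block.split()) for block in blocks]
    return [zone for zone in zones if zone]


def label_zones(zones):
    """Label zones by their order (the geometric feature)."""
    if len(zones) < len(ZONE_LABELS):
        raise ValueError("expected at least four text zones")
    # Assumption: anything after the third zone belongs to the abstract.
    merged = zones[:3] + [" ".join(zones[3:])]
    return dict(zip(ZONE_LABELS, merged))


# Assumed layout of the first text zone; real journals will differ, so
# this pattern is a placeholder for punctuation-based rules.
FIRST_ZONE = re.compile(
    r"^(?P<journal_title>.+?),\s*Vol\.\s*(?P<volume>\d+),"
    r"\s*No\.\s*(?P<issue>\d+)\s*\((?P<year>\d{4})\)\s*:"
    r"\s*(?P<pages>\d+\s*-\s*\d+)$"
)


def split_first_zone(zone):
    """Second-level segmentation: use punctuation to split the first zone."""
    match = FIRST_ZONE.match(zone)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "Ethiopian Journal of Example Studies, Vol. 12, No. 3 (1998): 45-60\n"
        "\n"
        "A Sample Article Title\n"
        "Spread over Two Lines\n"
        "\n"
        "A. Author and B. Author\n"
        "\n"
        "This is the abstract text.\n"
    )
    record = label_zones(segment_zones(sample))
    record.update(split_first_zone(record.pop("first_text_zone")) or {})
    print(record)
```

A zone that does not match the assumed pattern returns None rather than a guess, so a real system would need one rule per journal layout or a more tolerant parser. Converting the resulting fields to ISO 2709 is a separate step not shown here.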
Keywords
Automatic Extraction of Bibliographic