The Automatic Extraction of Bibliographic Information from Locally Published Journals in Ethiopia: A Feasibility of OCR

dc.contributor.advisor: Birru, Getachew (PhD)
dc.contributor.author: Yifru, Enchalew
dc.date.accessioned: 2022-05-11T09:40:11Z
dc.date.accessioned: 2023-11-18T12:48:47Z
dc.date.available: 2022-05-11T09:40:11Z
dc.date.available: 2023-11-18T12:48:47Z
dc.date.issued: 2000-05
dc.description.abstract: Research and development communities use journals as a mechanism of communication among themselves. As the volume of research output increases over time, however, it becomes impossible to access each and every report that appears in journals. Therefore, journal articles have to be indexed to facilitate access and control. The activity of indexing has to be systematic, so that research outputs remain accessible to the scientific community. To achieve this lofty goal, indexing has to be done on a regional/national basis to serve as part of the universal bibliographic control of journals. In order to maintain the goal of collecting and indexing publications produced in the country, Ethiopia has established a bibliographic control centre called the Legal Deposit and National Bibliography Team (LDANBT), which is affiliated to the National Archives and Library Agency (NALA). Unfortunately, the LDANBT has produced a journal index (for article-level access) in neither printed nor electronic format. In this thesis, therefore, an attempt has been made to develop programme modules that automatically create electronic records out of OCR text obtained from printed journal article title pages. In doing so, the nature of national bibliographic control with respect to journal articles is discussed. As well, techniques of automatically generating bibliographic records from different printed documents are examined. These techniques mainly consist of document analysis and document understanding, which are based on the geometric and non-geometric features of documents. For document analysis, two levels of segmentation are used. The first-level segmentation divides an input text into four zones (the first text zone, consisting of journal title, volume, issue number, year and page range; article title; author(s); and author abstract), treating white-line spacing as the end of a text zone.
The second-level segmentation decomposes the contents of the first text zone into journal title, volume, issue number, year and page range. The results of the two-level segmentation algorithms are then considered for field classification (document understanding). Classification of fields is based on geometric and non-geometric features. The geometric feature, zone order, is used to label article title, author(s) and author abstract. On the other hand, the non-geometric features (different punctuation marks consisting of comma, colon, braces, etc.) serve to label the fields in the first text zone as journal title, volume, issue number, year and page range. The system is 85.57% successful in correctly segmenting and labelling bibliographic fields. The recognised fields are converted to ISO 2709 format for export into CDS/ISIS for Windows. [en_US]
dc.identifier.uri: http://etd.aau.edu.et/handle/12345678/31614
dc.language.iso: en [en_US]
dc.publisher: Addis Ababa University [en_US]
dc.subject: Automatic Extraction of Bibliographic [en_US]
dc.title: The Automatic Extraction of Bibliographic Information from Locally Published Journals in Ethiopia: A Feasibility of OCR [en_US]
dc.type: Thesis [en_US]
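
The two-level segmentation and field labelling described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the zone labels, and in particular the layout of the first text zone ("Journal Title, volume(issue): pages, year") are assumptions made only for illustration, and the sample text in the usage note is invented.

```python
import re

# Zone order is the "geometric" feature: the n-th blank-line-separated
# block is given the n-th label (labels are hypothetical names).
ZONE_LABELS = ["citation", "article_title", "authors", "abstract"]

# "Non-geometric" feature: punctuation marks delimit the fields of the
# first zone. The layout matched here is an assumption, for example:
#   "Sample Journal of Science, 12(3): 45-67, 1998"
CITATION_RE = re.compile(
    r"^(?P<journal_title>.+?),\s*"
    r"(?P<volume>\d+)\s*\((?P<issue>\d+)\)\s*:\s*"
    r"(?P<pages>\d+\s*-\s*\d+),\s*"
    r"(?P<year>\d{4})$"
)


def segment_zones(text):
    """First-level segmentation: split OCR text on blank lines."""
    zones = re.split(r"\n\s*\n", text.strip())
    # Collapse line breaks and repeated spaces inside each zone.
    return [" ".join(zone.split()) for zone in zones]


def classify(text):
    """Label zones by order, then split the first zone by punctuation."""
    record = dict(zip(ZONE_LABELS, segment_zones(text)))
    match = CITATION_RE.match(record.get("citation", ""))
    if match:
        # Second-level segmentation succeeded: replace the raw zone
        # with its individual fields.
        del record["citation"]
        record.update(match.groupdict())
    return record
```

Calling `classify` on OCR text whose four zones are separated by blank lines returns a dictionary of labelled fields; if the first zone does not follow the assumed layout, it is kept unsplit under the `citation` key. Real title pages vary far more than this single pattern allows, which is consistent with the abstract reporting a success rate below 100%.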

Files

Original bundle
Name: Enchalew Yifru.pdf
Size: 19.66 MB
Format: Adobe Portable Document Format