Ethio-Semitic Proto-Language Reconstruction with In-Context Learning and LSTM Encode-Decode Model
| dc.contributor.advisor | Fitsum Assamnew | |
| dc.contributor.author | Elleni Sisay | |
| dc.date.accessioned | 2026-04-24T13:27:49Z | |
| dc.date.available | 2026-04-24T13:27:49Z | |
| dc.date.issued | 2024-12 | |
| dc.description.abstract | As languages evolve, words acquire new meanings and lose old ones, which makes the reconstruction of earlier forms a critical area of study. Proto-Ethio-Semitic languages, in particular, remain underexplored despite their cultural and historical significance. This research investigates Historical Language Reconstruction (HLR) for Proto-Ethio-Semitic languages at the word level, focusing on two core objectives: cognate identification and proto-word reconstruction. A three-way dictionary was used to compile a dataset of 14,100 semantically related words from Amharic, Ge’ez, and Tigrinya. Linguists manually identified a gold-standard dataset of 74 cognate pairs from Swadesh-list concepts translated into the three languages of interest and reconstructed their proto-forms, while automated methods (SCA and LexStat) extracted an additional 1,847 cognates from the dataset, significantly increasing its scale. Building on these results, synthetic proto-forms were generated using in-context learning with GPT-4o, which was selected because it achieved a reconstruction accuracy of 85% when evaluated against the gold-standard data. Furthermore, an LSTM-based encode-decode model was trained on the generated data to predict proto-forms from cognates, achieving a prediction accuracy of 91% and an average edit distance of 0.21. This work establishes a foundation for reconstructing ancestral languages within the Afro-Semitic family by integrating linguistic expertise, automated cognate extraction tools, and state-of-the-art large language models. The findings underscore the potential of interdisciplinary approaches in preserving and understanding linguistic heritage, with implications for future studies in historical linguistics and language preservation. | |
| dc.identifier.uri | https://etd.aau.edu.et/handle/123456789/8078 | |
| dc.language.iso | en_US | |
| dc.publisher | Addis Ababa University | |
| dc.subject | Cognates | |
| dc.subject | Proto-word | |
| dc.subject | In-context learning | |
| dc.subject | GPT-4o | |
| dc.subject | LSTM-based encode-decode | |
| dc.title | Ethio-Semitic Proto-Language Reconstruction with In-Context Learning and LSTM Encode-Decode Model | |
| dc.type | Thesis |
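
The abstract reports two evaluation figures for predicted proto-forms: a prediction accuracy and an average edit distance. This record does not say how either was computed, so the following is only a minimal sketch under stated assumptions: accuracy is taken to be exact string match, and edit distance is taken to be Levenshtein distance normalized by the length of the longer string. The function names and the idea of passing `(predicted, gold)` pairs are hypothetical and are not drawn from the thesis.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(pred: str, gold: str) -> float:
    """Assumed normalization: divide by the longer string's length."""
    longest = max(len(pred), len(gold))
    return levenshtein(pred, gold) / longest if longest else 0.0


def evaluate(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (exact-match accuracy, mean normalized edit distance)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    accuracy = sum(p == g for p, g in pairs) / len(pairs)
    mean_dist = sum(normalized_distance(p, g) for p, g in pairs) / len(pairs)
    return accuracy, mean_dist
```

Calling `evaluate` on a list of `(predicted, gold)` proto-form pairs yields both figures in one pass. A different normalization, such as dividing by the gold form's length, would change the distance values, so reported numbers are only comparable when the convention is stated.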