Design of Hidden Web Crawler Using Word2vec Model

dc.contributor.advisor: Getahun, Fekade (PhD)
dc.contributor.author: Kebede, Engdawerk
dc.date.accessioned: 2021-03-31T07:18:04Z
dc.date.accessioned: 2023-11-29T04:06:24Z
dc.date.available: 2021-03-31T07:18:04Z
dc.date.available: 2023-11-29T04:06:24Z
dc.date.issued: 2020-10-09
dc.description.abstract: The World Wide Web (WWW) is a huge repository of hyperlinked documents containing useful information. From the user's point of view, it can be broadly classified into two types: the surface web and the hidden web. The surface web consists of static, hyperlinked web pages that can be crawled and indexed by general-purpose search engines. The hidden web, on the other hand, refers to dynamic web pages that can be accessed only through specific query interfaces. A web crawler is a program specialized in downloading web content. Conventional web crawlers can easily search and analyze the surface web, with its interlinked HTML pages, but they are limited in fetching data from the deep web because of the query interface: to access the deep web, a user must request information from a particular database through that interface. These traditional crawlers retrieve content from pages that are connected by hyperlinks and ignore the information hidden behind form pages, which cannot be reached by following the hyperlink structure alone. As a result, they miss the large amount of data hidden behind search forms. In this work, we propose a hidden web crawler that uses a word2vec model. The proposed crawling approach uses a word2vec model built from e-commerce product review text to extract relevant keywords from e-commerce hidden web pages. Automatically extracting keywords that account for the semantic relatedness between words, and using them to fill the fields of a hidden web form, leads to more accurate and relevant results. The results of the proposed approach were analyzed and found to meet our expectations. [en_US]
dc.identifier.uri: http://etd.aau.edu.et/handle/123456789/25831
dc.language.iso: en [en_US]
dc.publisher: Addis Ababa University [en_US]
dc.subject: Hidden Web [en_US]
dc.subject: Surface Web [en_US]
dc.subject: Hidden Web Crawler [en_US]
dc.subject: Word2vec Model [en_US]
dc.title: Design of Hidden Web Crawler Using Word2vec Model [en_US]
dc.type: Thesis [en_US]

Files

Original bundle
Name: Engdawerk Kebede 2020.pdf
Size: 2.08 MB
Format: Adobe Portable Document Format