Design of Hidden Web Crawler Using Word2vec Model
Date
2020-10-09
Publisher
Addis Ababa University
Abstract
The World Wide Web (WWW) is a huge repository of hyperlinked documents containing useful information. From the user's point of view, the WWW can be broadly classified into two types: the surface web and the hidden web. The surface web consists of static hyperlinked web pages that can be crawled and indexed by general search engines. The hidden web, on the other hand, refers to dynamic web pages that can be accessed through specific query interfaces. A web crawler is a program that specializes in downloading web content. Conventional web crawlers can easily search and analyze the surface web, with its interlinked HTML pages, but they are limited in fetching data from the deep web because of the query interface. To access the deep web, a user must request information from a particular database through a query interface.
Traditional web crawlers can easily crawl the surface web but are not able to crawl the hidden portion of the web. They retrieve content from web pages that are linked by hyperlinks and ignore the information hidden behind form pages, which cannot be reached through the hyperlink structure alone. As a result, they miss a large amount of data hidden behind search forms.
In this work, we propose a hidden web crawler based on the word2vec model. The proposed crawling approach uses a word2vec model built from e-commerce product review text to extract relevant keywords from e-commerce hidden web pages. Automatically extracting keywords that take the semantic relatedness between words into account, and using them to fill the fields of a hidden web form, leads to more accurate and relevant results. The results of the proposed approach were analyzed and found to meet our expectations.
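The keyword-selection idea can be illustrated with a minimal sketch. It is not the thesis implementation: the toy vectors, the field name, and the helper names `related_keywords` and `build_form_queries` are all hypothetical, and a real crawler would presumably load embeddings from a word2vec model built on product review text. The sketch ranks candidate words by cosine similarity to a seed term and turns the closest ones into form submissions.

```python
import math

# Hypothetical toy embeddings (assumption): stand-ins for vectors that a
# word2vec model built on e-commerce review text would provide.
TOY_VECTORS = {
    "laptop":   [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "battery":  [0.6, 0.5, 0.1],
    "shirt":    [0.0, 0.1, 0.9],
    "cotton":   [0.1, 0.2, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_keywords(seed, vectors, top_n=2):
    """Return the top_n words most similar to the seed, best first."""
    if seed not in vectors:
        return []
    seed_vec = vectors[seed]
    scored = [(word, cosine(seed_vec, vec))
              for word, vec in vectors.items() if word != seed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_n]]


def build_form_queries(field_name, seed, vectors, top_n=2):
    """Build one form payload for the seed and one per related keyword."""
    keywords = [seed] + related_keywords(seed, vectors, top_n)
    return [{field_name: keyword} for keyword in keywords]


if __name__ == "__main__":
    # "q" is a hypothetical name for the search field of a hidden web form.
    for payload in build_form_queries("q", "laptop", TOY_VECTORS):
        print(payload)
```

Each payload would then be submitted to the query interface of the hidden web page. Keeping the ranking separate from the form-filling step makes it possible to swap the toy dictionary for real model vectors without touching the rest of the code.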
Keywords
Hidden Web, Surface Web, Hidden Web Crawler, Word2vec Model