Research on Incremental Deep Web Acquisition Technology
|Course||Computer Software and Theory|
|Keywords||Deep Web; Data Source Discovery; Incremental Data Access; Data Annotation|
With the rapid development of the World Wide Web, a tremendous amount of information is “hidden” in the Deep Web, and its volume is increasing rapidly. The Deep Web has become an important source of information, but its heterogeneous and dynamic nature makes information discovery the main problem that users must overcome. As the importance of integrating Deep Web data locally has become evident, Deep Web data access has attracted attention from a growing number of scholars. This paper studies the technologies relevant to Deep Web data acquisition and gives solutions for incremental data acquisition. The main contents are as follows:
(1) To analyse the background of incremental Deep Web data acquisition technology and, on that basis, to state the purpose and significance of this study.
(2) To implement a data source discovery spider based on search engines, that is, a focused crawler that analyses the results of traditional search engines to find query forms in a given domain.
(3) To propose a complete set of methods for identifying and classifying data sources, including a series of heuristic rules for filtering out invalid forms and a data source classification method based on form similarity calculation.
(4) To propose an automated Web record extraction method, which extracts Web records using visual features and annotates the data with a hybrid two-dimensional conditional random field.
(5) To study the change frequency of a sample of Deep Web databases and to propose an incremental acquisition strategy for the Deep Web that allocates download resources at two levels of granularity: the data source level and the query word level.
In addition, the paper reports experiments on the methods and techniques studied, and analysis of the experimental results further demonstrates that the proposed methods are effective.
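The resource allocation idea in item (5) can be illustrated with a minimal sketch. It is not the strategy developed in the paper, whose details the abstract does not give. It assumes that the number of changes observed per data source is already known and splits a fixed download budget in proportion to those counts. The function name `allocate_budget`, its parameters, and the proportional rule itself are hypothetical choices made for illustration.

```python
def allocate_budget(observed_changes, total_requests):
    """Split a download budget across data sources.

    observed_changes: dict mapping a source name to the number of changes
        seen in past crawls (an assumed input, not defined in the paper).
    total_requests: total number of download requests available.
    Returns a dict mapping each source to an integer share of the budget.
    """
    weights = dict(observed_changes)
    # If no source has changed so far, fall back to an even split.
    if sum(weights.values()) == 0:
        weights = {source: 1 for source in weights}
    total_weight = sum(weights.values())

    alloc, remainders = {}, {}
    for source, weight in weights.items():
        # Integer quota plus the remainder left over for this source.
        alloc[source], remainders[source] = divmod(
            total_requests * weight, total_weight
        )

    # Largest-remainder rule: hand out the requests lost to rounding,
    # one each, to the sources with the largest remainders.
    leftover = total_requests - sum(alloc.values())
    ranked = sorted(remainders, key=remainders.get, reverse=True)
    for source in ranked[:leftover]:
        alloc[source] += 1
    return alloc


budget = allocate_budget({"books": 5, "jobs": 3, "flights": 2}, 10)
```

Under the same assumptions, the function could be applied a second time within one data source, with query words in place of source names, to mirror the two levels of granularity mentioned in item (5).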