Research on Key Technologies of Deep Web Information Integration
|Course||Applied Computer Technology|
|Keywords||Deep Web Information Integration, Data Source Discovery, Data Source Classification, Incremental Crawling, Schema Extraction, Data Extraction|
As the amount of information on the Web grows rapidly, the Web has deepened with the prevalence of online searchable databases. Traditional crawlers cannot index this information for technical reasons; because it remains hidden and invisible to users, it is called the Deep Web. To help users exploit the Deep Web effectively, research on its information integration has become an urgent issue with broad application prospects and practical value, and it has become a focus of research in recent years.

This dissertation analyses the status quo and development trends of Deep Web information integration. Based on the preliminary work of our research group, it addresses several key technical problems: Deep Web source discovery, source classification and clustering, incremental data crawling strategies, and schema and data extraction. Achievements have been made in the following five respects.

(1) To cope with the dynamic and sparse distribution of Deep Web sources, we propose a query-interface-focused crawling method to discover Deep Web sources, specifically visiting links that may lead to Deep Web entrance pages and avoiding the download of unnecessary pages. In addition to the features of the entrance pages, links, and anchor texts, the features of the link path to the goal pages are also considered. Experiments show that our method effectively improves the efficiency of Deep Web source discovery.

(2) By organizing Deep Web sources into a domain hierarchy, users can browse these valuable resources conveniently; this is also a critical step toward large-scale Deep Web information integration. In this dissertation, we propose a Deep Web source classification method based on query interface features and a source clustering method based on query interface link graphs.
These methods scale well because they require no sampling of the data hidden behind Deep Web sources, and query interface pages are convenient to crawl; Deep Web sources can thus be organized automatically by domain.

(3) Because Deep Web sources are updated autonomously and independently, their content must be crawled periodically to check for updates. Since different data sources, and different records within them, change at different frequencies, refreshing all data at the same frequency wastes resources. We propose two Deep Web incremental refresh policies at different granularities, data-source level and data-record level, so that the appropriate granularity can be selected for each application. Experiments show that, for the same total download budget, the proposed policies significantly improve the freshness of local data.

(4) Query interfaces and result pages of the Deep Web are mainly described in HTML, which leaves the data semi-structured or unstructured. Pages are designed for people to browse, not for machines to process, so by exploiting the visual features of rendered pages we can simulate how humans read them. In this dissertation, we propose a Deep Web data extraction method based on visual features. Existing solutions rely primarily on analysis of the HTML DOM tree and HTML tags; our method avoids this dependence on the HTML definition, so data can be described in HTML or other markup languages, including nonstandard HTML, giving it high adaptability.

(5) Based on these key technologies and practical requirements, we propose a Deep Web information integration architecture and implement a prototype integration system, with functions including source discovery, source organization, and data extraction.
Practical application shows that the system has practical value.

This work is partially supported by the Natural Science Foundation of China under grant No. 60673092, the High-Technology Research Program of Jiangsu Province under grant No. BG2005019, and the Higher Education Graduate Research Innovation Program of Jiangsu Province in 2007 under grant No. cx07b-122z.
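The focused-crawling idea in contribution (1), prioritizing links that appear likely to lead to query interface pages, can be sketched as a best-first frontier. This is a minimal illustration only, not the dissertation's algorithm: the token list, the path-depth penalty, and all function names are assumptions made for the example.

```python
import heapq
from itertools import count

# Illustrative tokens suggesting a link may lead to a query interface
# (an assumption for this sketch; the dissertation uses richer features
# of entrance pages, anchor texts, and link paths).
PROMISING_TOKENS = {"search", "query", "advanced", "find", "lookup"}

def score_link(anchor_text, url, path_depth):
    """Score a link: higher means more likely to reach a query interface."""
    tokens = set(anchor_text.lower().split())
    tokens |= set(url.lower().replace("/", " ").replace("-", " ").split())
    hits = len(tokens & PROMISING_TOKENS)
    # Penalize links far along the path from already-relevant pages.
    return hits - 0.1 * path_depth

def crawl_frontier(seed_links):
    """Return URLs in visiting order: best-scored (most promising) first."""
    tie = count()  # tie-breaker so equal scores never compare URLs' tuples oddly
    heap = [(-score_link(a, u, d), next(tie), u) for a, u, d in seed_links]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, url = heapq.heappop(heap)
        order.append(url)
    return order
```

With seeds such as `("Advanced Search", "http://example.com/search", 0)` and `("About us", "http://example.com/about", 1)`, the search link is visited first, mirroring how a focused crawler spends its downloads on pages likely to be Deep Web entrances.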
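The source-level refresh policy in contribution (3), spending a fixed download budget unevenly so that frequently changing sources are re-crawled more often, can be sketched as a proportional allocation. This is a simplified illustration under assumed inputs (estimated change rates per source), not the policies evaluated in the dissertation.

```python
def allocate_refresh_budget(change_rates, total_downloads):
    """Split a fixed download budget across data sources in proportion
    to their estimated change frequencies, so that sources updated more
    often are refreshed more often than a uniform schedule would allow."""
    total_rate = sum(change_rates.values())
    if total_rate == 0:
        # No observed changes yet: fall back to a uniform split.
        share = total_downloads // len(change_rates)
        return {src: share for src in change_rates}
    return {src: round(total_downloads * rate / total_rate)
            for src, rate in change_rates.items()}
```

For example, with estimated change rates `{"news": 8.0, "archive": 1.0, "catalog": 1.0}` and a budget of 100 downloads, the policy assigns 80 downloads to the fast-changing source and 10 to each slow one; the same idea applies at record granularity by treating each record's change rate separately.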