Research on Key Issues in Deep Web Data Integration
|Course| |Computer Software and Theory|
|Keywords| |Deep Web data integration; query interface matching; Web data extraction; semantic annotation of Web data; duplicate records|
With the rapid development of Internet technology, the Web has become a vast information source holding massive amounts of data. These data are valuable to many applications, such as market intelligence analysis, which urgently need to analyze and mine them to obtain useful knowledge and support decision making. However, Web data are large-scale, heterogeneous, autonomous, and distributed, which makes analyzing and mining them particularly difficult; integrating them to provide high-quality data for analysis and mining is therefore imperative. Deep Web data far exceed Surface Web data in both quantity and quality and thus have higher value. Consequently, how to integrate Deep Web data so that they can be analyzed and mined more effectively has important practical significance and broad application prospects. Existing Deep Web research focuses mainly on query-oriented Deep Web data integration; this approach obtains only a limited amount of data, which suits users' ad hoc query needs but cannot serve analysis and mining applications. This dissertation is devoted to analysis-oriented Deep Web data integration: the goal is to acquire as many Deep Web pages as possible and to apply extraction, de-duplication, and structuring techniques to produce high-quality data that support further analysis and mining.
Analysis-oriented Deep Web data integration must solve the following problems: (1) analysis and mining require large amounts of data, and these data come from Deep Web pages dynamically generated by the multiple Web databases of a domain, so these pages must be acquired automatically and as completely as possible; (2) analysis and mining require well-structured, semantically rich data, yet these data reside in complex, semi-structured Deep Web pages, so structured data must be accurately extracted from the pages and semantically understood; (3) analysis and mining require unified, high-quality data, yet duplicates exist across the multiple Web databases of a domain, so duplicate records must be detected across those Web databases. Taking analysis-oriented Deep Web data integration as its goal, this dissertation addresses these key issues; the main work and contributions are summarized below.

1. A Deep Web query interface matching approach based on extended evidence theory is proposed, which effectively solves the problem of semantically understanding the query interfaces of the Web databases to be crawled in a domain. A domain contains a large number of Web databases, and their query interface schemas are heterogeneous, so a crawler cannot identify the attributes of different query interfaces in a uniform way, which hampers the acquisition of Deep Web pages. To solve this problem, the proposed method matches the query interface of the Web database to be crawled against the domain's query interfaces and uses the resulting correspondences to understand the semantics of its query interface attributes.
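As a minimal sketch of the evidence-combination idea behind this approach (the matcher beliefs, mass values, and candidate correspondences below are invented for illustration), Dempster's rule can fuse the outputs of two interface matchers over candidate attribute correspondences; the thesis extends this by dynamically predicting each matcher's credibility, which is not shown here:

```python
# Dempster's rule of combination: fuse two matchers' mass functions over
# candidate correspondences between query-interface attributes.
# Hypotheses are frozensets of candidate matches; THETA (the whole frame)
# expresses a matcher's residual ignorance.

THETA = frozenset({"author~writer", "title~name"})

def combine(m1, m2):
    """Dempster's rule: m(C) is proportional to the sum of m1(A)*m2(B)
    over all pairs with A & B == C, for nonempty C."""
    fused = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                fused[inter] = fused.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    norm = 1.0 - conflict
    return {h: v / norm for h, v in fused.items()}

# Matcher 1 (e.g. name-based): strong belief in the match "author~writer".
m1 = {frozenset({"author~writer"}): 0.7, THETA: 0.3}
# Matcher 2 (e.g. instance-based): weaker belief, more ignorance.
m2 = {frozenset({"author~writer"}): 0.5, THETA: 0.5}

fused = combine(m1, m2)
print(fused[frozenset({"author~writer"})])  # 0.85: above either matcher alone
```

Combining independent sources concentrates mass on the correspondence both matchers support, which is why fusing several weak matchers can outperform any single one.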
The method exploits multiple features of query interfaces to build different matchers; it extends evidence theory to dynamically predict the credibility of each matcher and combines the multiple matching results, improving the adaptability of the combination; it then makes match decisions with a top-k global optimal strategy and heuristic rules over the interface tree structure, and uses the final correspondences to understand the query interface of the Web database to be crawled. Experimental results show that the method achieves higher matching accuracy, effectively overcoming the poor adaptability and low accuracy of existing query interface matching methods.

2. A Web database crawling method based on a query-word harvest-rate model is proposed, which effectively solves the problem of acquiring massive numbers of Deep Web pages. Analysis and mining applications require large amounts of Deep Web data, and these data come from pages dynamically generated by the multiple Web databases of a domain; but a Web database is accessible only through its query interface, so traditional search engine crawlers cannot reach its content. To solve this problem, this dissertation presents a crawling method based on a model that predicts each candidate query word's harvest rate, i.e., the proportion of new records it will return. The method samples the Web database and selects multiple features from the sampled data to build training samples automatically, avoiding manual labeling; it then trains the harvest-rate model on these samples with multiple linear regression and iteratively submits queries, using the model to select each next query word, until the Web database has been crawled.
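The crawling loop above can be sketched as follows. This is an illustrative toy, not the thesis's implementation: the "database" is a tiny invented inverted index, a single feature (document frequency) stands in for the multiple features, and one-variable least squares stands in for multiple linear regression:

```python
# Harvest-rate-driven crawling sketch: a linear model predicts the
# fraction of new records a query word would return, and the crawler
# repeatedly submits the word with the highest predicted rate.

def fit_linear(xs, ys):
    """Least squares for y = w0 + w1*x (single-feature normal equations)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    return my - w1 * mx, w1

# Toy back-end database: query word -> set of record ids it returns.
db = {
    "java":   {1, 2, 3, 4},
    "python": {3, 4, 5, 6, 7},
    "web":    {1, 5, 8},
    "sql":    {8, 9},
}

# Feature: document frequency of each word in the sampled data.
df = {w: len(r) for w, r in db.items()}
# Invented training samples (feature value, observed harvest rate).
w0, w1 = fit_linear([4, 5, 3, 2], [0.4, 0.6, 0.3, 0.2])

seen, order = set(), []
candidates = set(db)
while candidates:
    # Submit the candidate with the highest predicted harvest rate.
    best = max(candidates, key=lambda w: w0 + w1 * df[w])
    candidates.remove(best)
    order.append(best)
    seen |= db[best]

print(order, len(seen))  # ['python', 'java', 'web', 'sql'] 9
```

In a real crawler the model would be retrained as new result pages arrive, and stopping criteria (coverage estimates, query budgets) would end the loop before the candidate pool is exhausted.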
Experimental results show that crawling with this method achieves high coverage, effectively overcoming the limitation of existing crawling methods that select query words with single, experience-based heuristic rules; moreover, the learned harvest-rate model can be applied effectively to crawling other Web databases in the same domain.

3. A Deep Web data extraction method based on hierarchical clustering is proposed, which effectively solves the problem of automatically extracting structured data from Deep Web pages. Deep Web pages are semi-structured, which makes their structured data difficult to process automatically. To solve this problem, the method uses information from the query result list page to help identify the content blocks of a Deep Web page, thereby determining the data extraction region; it then combines multiple structural and content features of Deep Web pages to hierarchically cluster the feature vectors of the content nodes within corresponding content blocks, thereby extracting the Web data records. Experimental results show that the method achieves high extraction accuracy, overcoming the lower accuracy of most existing methods, which rely only on page structure.

4. A Deep Web data semantic annotation method based on a constrained conditional random field is proposed, which effectively solves the problems that Deep Web data lack semantics and that data record schemas are heterogeneous across Web sites. If annotation of extracted Web data records relies solely on the semantic labels already present in Deep Web pages, records lacking such labels cannot be handled; moreover, different sites usually use different labels, so the schemas of data records from different sites are heterogeneous. To solve these problems, this dissertation proposes a semantic annotation method based on a constrained conditional random field.
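The hierarchical-clustering step of the extraction method in contribution 3 can be sketched as single-linkage agglomerative clustering over content-node feature vectors. The two-dimensional features (e.g. scaled DOM depth and text length) and the distance threshold below are invented for illustration:

```python
# Agglomerative (hierarchical) clustering sketch: merge the closest
# clusters of content-node feature vectors until the closest remaining
# pair of clusters is farther apart than a threshold.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(c1, c2, points):
    """Single-linkage distance: the closest pair across two clusters."""
    return min(dist(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, threshold):
    """Returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        d, i, j = min(
            (single_link(clusters[i], clusters[j], points), i, j)
            for i in range(len(clusters)) for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters

# Six content nodes forming two tight groups (e.g. "title" nodes vs.
# "price" nodes); clustering should recover the two groups.
nodes = [(1.0, 1.1), (1.1, 0.9), (0.9, 1.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
clusters = agglomerate(nodes, threshold=1.5)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2], [3, 4, 5]]
```

Nodes that cluster together across data records are then treated as instances of the same record field, which is what turns the clusters into extracted records.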
The method uses existing Web database information to build credibility constraints and uses the logical relationships among the data elements of Web data records to build logical constraints; it incorporates both types of constraints into the traditional conditional random field model to obtain a constrained conditional random field, and performs inference with integer linear programming, using the global attribute label set of the domain's Web database schemas to assign the corresponding semantic label to each data element of a Web data record. This achieves semantic annotation of Deep Web data and unifies the record schemas of multiple Web sites. Experimental results show that the method achieves a higher annotation accuracy, overcoming the lower accuracy of traditional conditional random fields, which cannot jointly exploit existing Web database information and the logical relationships among Web data elements.

5. A duplicate record detection method based on unsupervised learning is proposed, which effectively solves the problem of large-scale duplicate record detection in the Deep Web. The many Web databases of a domain are highly redundant, which makes it difficult to provide high-quality data for analysis and mining. To solve this problem, the method uses a cluster ensemble to select the initial training samples automatically, improving the accuracy of the training samples; it uses extended evidence theory together with an iterative support vector machine classification scheme to build the classification model, improving the model's accuracy; and from the results of the classification model it builds a domain-level duplicate record detection model, achieving duplicate record detection across the Web databases of the domain.
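The constrained inference step in contribution 4 can be sketched in miniature. The labels, scores, and the uniqueness constraint below are invented; the thesis solves this with integer linear programming over a constrained CRF, whereas this toy simply enumerates label assignments:

```python
# Constraint-based label inference sketch for one Web data record:
# pick the globally best label assignment subject to a logical
# constraint (each schema label used at most once), mimicking the
# role of the integer-linear-programming step.

from itertools import permutations

labels = ["title", "author", "price"]
# score[element][label]: per-element label scores, e.g. from a trained
# model. Elements 0 and 1 both look title-like in isolation.
score = [
    {"title": 2.0, "author": 1.2, "price": 0.1},
    {"title": 1.9, "author": 1.2, "price": 0.2},
    {"title": 0.3, "author": 0.4, "price": 2.5},
]

# Unconstrained per-element argmax would label both elements 0 and 1
# as "title"; the one-use-per-label constraint rules that out.
best = max(
    permutations(labels),
    key=lambda assign: sum(score[i][lab] for i, lab in enumerate(assign)),
)
print(best)  # ('title', 'author', 'price')
```

Enumerating permutations is only feasible for tiny label sets; an ILP solver handles the same constraint structure at realistic scale, which is why the thesis formulates inference that way.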
Experimental results show that the method achieves higher duplicate record detection accuracy, that the resulting domain-level detection models perform well in their respective domains, and that it effectively overcomes the difficulty traditional methods have with massive-scale duplicate record detection.
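The unsupervised duplicate-detection pipeline of contribution 5 can be sketched as follows. Everything here is invented for illustration: the similarity vectors are synthetic, confidence-based seeding stands in for the cluster-ensemble sample selection, and a nearest-centroid rule stands in for the iteratively trained SVM:

```python
# Duplicate record detection sketch: reduce each candidate record pair
# to a similarity vector over its fields, seed training samples from
# the high-confidence pairs, then classify the rest by the nearer
# seed-class centroid.

pairs = {
    ("a1", "b1"): (0.95, 0.90),  # near-identical: confident duplicate seed
    ("a2", "b2"): (0.10, 0.05),  # clearly different: confident non-dup seed
    ("a3", "b3"): (0.80, 0.70),  # borderline: left for the classifier
    ("a4", "b4"): (0.20, 0.30),  # borderline: left for the classifier
}

def mean(vectors):
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Step 1: seed training samples from high-confidence pairs only.
dup_seeds = [v for v in pairs.values() if min(v) > 0.85]
non_seeds = [v for v in pairs.values() if max(v) < 0.15]
dup_c, non_c = mean(dup_seeds), mean(non_seeds)

# Step 2: label every pair by the nearer class centroid.
duplicates = sorted(
    p for p, v in pairs.items() if dist(v, dup_c) < dist(v, non_c)
)
print(duplicates)  # [('a1', 'b1'), ('a3', 'b3')]
```

In the full method, newly classified pairs with high confidence would be fed back as additional training samples and the classifier retrained, which is the iterative step this one-shot sketch omits.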