Design and Implementation of an XML Duplicate Object Detection System
|School||Huazhong University of Science and Technology|
|Course||Computer Software and Theory|
|Keywords||Duplicate element detection system; Extensible Markup Language (XML); Similar strings; Multiple filters; Top-down|
With the rapid development of the Internet and information technology, XML documents are used ever more widely as a data storage medium, and the problem of detecting duplicate elements in XML data has attracted considerable attention from researchers in database and Internet applications. The structural diversity of XML data makes it very difficult to determine the similarity between XML elements. To remove duplicate elements from XML data effectively, this paper studies rules for identifying duplicate XML elements and designs and implements a duplicate XML element detection system. It examines the criteria for judging XML elements to be duplicates, the identification of similar strings, and the calculation of element similarity, and concludes that the key issues in duplicate XML element detection are how to handle structural diversity effectively and how to handle the dependencies between parent and child elements. The detection system consists of a document preprocessing module, a similar-string recognition module, and an element similarity calculation module. To implement the detection system, the paper proposes a top-down, multiple-filter detection method. By analyzing the storage structure of XML data, it gives a definition of a duplicate XML element object; document preprocessing resolves the structural diversity problem to some extent; a variety of filter conditions effectively reduce the number of string similarity tests and element similarity computations; and a top-down traversal of the XML elements resolves the dependencies between parent and child elements. A Dirty XML Generator (DXG) tool is also designed and implemented to generate the experimental data.
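The top-down, multiple-filter idea described above can be illustrated with a minimal Python sketch. It is not the implementation from the paper: the threshold value, the helper names (`text_of`, `similar`, `find_duplicates`), the choice of `difflib` as the string similarity measure, and the rule that children are compared only when their parents were grouped together are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

THRESHOLD = 0.8  # hypothetical similarity threshold, not taken from the paper


def text_of(elem):
    """Normalized text of an element: its own text plus that of its leaf children."""
    parts = [elem.text or ""]
    parts += [child.text or "" for child in elem if len(child) == 0]
    return " ".join(" ".join(parts).lower().split())


def similar(a, b, threshold=THRESHOLD):
    """Apply cheap filters first; run the costly comparison only if they pass."""
    if not a or not b:
        return False
    sm = SequenceMatcher(None, a, b)
    # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
    # so a pair that fails either one can be discarded without calling ratio().
    return (sm.real_quick_ratio() >= threshold
            and sm.quick_ratio() >= threshold
            and sm.ratio() >= threshold)


def find_duplicates(group, found=None):
    """Cluster the non-leaf children of `group` by similarity, then descend.

    `group` holds elements already judged to describe the same object, so
    children are compared only when their parents were grouped together.
    """
    if found is None:
        found = []
    by_tag = {}
    for parent in group:
        for child in parent:
            if len(child) > 0:  # assumption: only non-leaf elements are objects
                by_tag.setdefault(child.tag, []).append(child)
    for children in by_tag.values():
        clusters = []
        for child in children:
            for cluster in clusters:
                if similar(text_of(cluster[0]), text_of(child)):
                    cluster.append(child)
                    break
            else:
                clusters.append([child])
        for cluster in clusters:
            if len(cluster) > 1:
                found.append(cluster)
            find_duplicates(cluster, found)
    return found


doc = (
    "<db>"
    "<movie><title>The Matrix</title><year>1999</year>"
    "<actor><name>Keanu Reeves</name></actor></movie>"
    "<movie><title>The Matrx</title><year>1999</year>"
    "<actor><name>Keanu Reves</name></actor></movie>"
    "<movie><title>Casablanca</title><year>1942</year>"
    "<actor><name>Humphrey Bogart</name></actor></movie>"
    "</db>"
)
found = find_duplicates([ET.fromstring(doc)])
for cluster in found:
    print([text_of(e) for e in cluster])
```

On this toy document the two "Matrix" movies are grouped first, and only then are their `actor` children compared with each other. The third movie is rejected by the character-count filter, so the full string comparison is never run for it.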
To demonstrate the correctness of the detection system and the effectiveness of its filters, the DXG tool was used to introduce two types of errors into the XML data: structural errors and string errors. Each filter condition was analyzed separately on the resulting dirty data, and the accuracy and efficiency of the detection system were evaluated. The results show that all of the filter conditions are effective and efficient, and that the detection results are consistent with the dirty data that was introduced.
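As a rough illustration of how dirty data with the two error types might be produced, the following Python sketch copies an element and injects a structural error and string errors into the copy. It is not the DXG tool itself: the function names, the probabilities, and the specific errors chosen (dropping one child, deleting one character) are assumptions.

```python
import copy
import random
import xml.etree.ElementTree as ET


def string_error(text, rng):
    """Delete one randomly chosen character (one simple kind of typo)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def structure_error(elem, rng):
    """Remove one randomly chosen direct child, if the element has any."""
    children = list(elem)
    if children:
        elem.remove(rng.choice(children))


def make_dirty_copy(elem, rng, p_string=0.5, p_struct=0.5):
    """Return a deep copy of `elem` with errors injected; `elem` is unchanged."""
    dup = copy.deepcopy(elem)
    if rng.random() < p_struct:
        structure_error(dup, rng)
    for node in dup.iter():
        if node.text and node.text.strip() and rng.random() < p_string:
            node.text = string_error(node.text, rng)
    return dup


rng = random.Random(0)  # fixed seed so a run can be repeated
src = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
dup = make_dirty_copy(src, rng, p_string=1.0, p_struct=1.0)
print(ET.tostring(dup, encoding="unicode"))
```

Because the generator knows which copies it injected, the pairs it created can serve as ground truth when checking the detection results.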