Research on Key Algorithms for Mining Frequent Patterns in Data Streams and Their Application
|School||Central South University|
|Course||Applied Computer Technology|
|Keywords||data stream mining maximal frequent itemsets closed frequent itemsets Top-K closed frequent itemsets intrusiondetection|
With the rapid development of the computer technology and wide application of information technology, data stream has been widely used in performance detection of business management, abnormal detection and alarming of network flow management as well as the transaction processing in retail industry. Analysis and mining of data stream have become a hot issue for research, of which frequent pattern mining of data streams is one of the most fundamental. Consequently, the research in this field is more challenging.Current intrusion detection system based on data mining technology can neither detect new and unknown attacks, nor meet the actual requirements of application in terms of real-time response and accuracy. Research of data stream frequent pattern mining algorithm with high efficiency and real-time response by applying it into IDS will put the intrusion detection into practice. Therefore, research of IDS based on data stream mining technology is significant both in theory and practical use.In consideration of the existing problems such as complexity in compression storage structure, mass maintenance of nodes and higher consumption of space and time, a typical algorithm for mining maximal frequent itemsets based on prefix pattern tree, called MMFI-DS, is then presented. The compression storage structure designed by this algorithm, SEFI-tree, is simple in structure and has a strong competence in capturing key information of data streams with less expenditure in maintenance of nodes. By adopting the top-down and bottom-up approaches, the shorter supersets of infrequent items and the longer subsets of maximal frequent itemsets can be promptly pruned to reduce calculation of itemsets support and expenditure of the algorithm.In consideration of the existing problems such as larger search space, redundancy in intermediate results and inefficiency in time consumption, a closed frequent itemsets mining algorithm called A-NewMoment based on binary bit is presented. By adopting bitmap technology on the basis of binary bit, the algorithm has designed BV-DFIlist data structure to record data stream information and closed frequent itemsets with presentation of "WSS" and "CSS" search strategies for rapid mining of closed frequent itemsets other than the longest closed frequent itemsets with maximum support produced by frequent1-itemset. Thus, time efficiency is greatly improved by avoiding massive intermediate results. Studying "DNFIPS" for rapid deletion of no closed frequent itemsets from the HTC in storage closed itemsets as well as the "DNSS" for maintenance of all changed closed frequent itemsets can reduce the maintenance cost of closed frequent itemsets and improve time efficiency of this algorithm.In consideration of the existing problems such as inaccuracy of the current algorithm, higher maintenance cost of nodes and inefficiency in time and space, a mining Top-K closed frequent itemsets algorithm MTKCFI-SW is presented. The algorithm has designed a compact prefix tree called CFP-tree to compress effective information in sliding windows for storage of data stream, which features in less information storage and lower maintenance cost for improvement of time efficiency. The CFP-tree, which is capable of promptly capturing newly added data stream information, does not need to fix the size of sliding windows. By adoption of pointer operations, large quantities of infrequent itemsets can be deleted from the CFP-tree to improve time efficiency of this algorithm without traversing the whole CFP-tree. Besides, the adoption of "DD" of mining threshold and pruning threshold also improves the accuracy and time efficiency of the algorithm.For the purpose of improving the real-time online mining effect and detection precision of current intrusion detection system, the dissertation has presented an intrusion detection system model MMFIID-DS based on maximal frequent itemsets over data stream. By designing different pruning strategies and mining of trained normal and abnormal data sets as well as maximal itemsets of the current detection data stream, the system has established normal behavior and abnormal behavior and as well as user behavior patterns, with organic combination of misuse detection and anomaly detection for the purpose of achieving online detection intrusion as well as improving detection precision and response speed of system.