Study on Frequent Pattern Mining Algorithms and Pruning Strategies
|Keywords||data mining association rule analysis frequent patterns mining frequent sequences mining frequent itemsets mining enumeration space pruning strategy maximal frequent patterns mining frequent closed patterns mining|
Frequent patterns mining is a fundamental and essential problem in data mining and can be used in many data mining applications such as association rules analysis, correlations analysis, outlier analysis, classification, clustering, etc. Frequent patterns mining is an important research topic in both theory and application study. In this thesis, we make a study on frequent patterns mining problem in depth. The main research contents and contributions include the following.Firstly, frequent sequences mining problem is studied. Previous mining algorithms are analyzed, which include GSP, SPADE, SPAM and PrefixSpan. Based on this analysis, a novel frequent sequences mining algorithm - FINDER (Frequent Itemset—based exteNsion canDidate genERation) - is proposed. FINDER searches the enumeration space in a DFS way. Complex hash and multi-database scanning are replaced by bitmap representation of itemset, database and sequence in FINDER. FINDER generates a candidate by extending current frequent sequence with a frequent itemset. Thus most of in-frequent extensions are not attempted. Experiments on synthetic datasets show that the performance of FINDER is almost as same as that of SPAM, which is to our best of knowledge the fastest algorithm. FINDER outperforms other algorithms by a factor of 3 to 5.Continually, FINDER is extended with lattice theory, and a parallel algorithm - pFINDER - is gotten. According to lattice theory, the search space is divided into several non-intersect parts in pFINDER, each of which can be enumerated independently. Thus, no remote communication is needed, and pFINDER can scale-up.Additionally, FINDER is improved to mining weighted frequent patterns and iFINDER is gotten. With item renaming, iFINDER changes weighted frequent patterns mining problem into frequent patterns mining. It can be used for interactive mining applications.Frequent patterns mining is an I/O-intensive and computing-intensive task. Pruning strategy is an effective method to improve performance of mining algorithms. This thesis presents two novel pruning strategies, denoted as SEP (sequence extension pruning) and IEP (itemset extension pruning). The correctness of SEP and IEP strategies is confirmed by reasoning and experiment. Both of SEP and IEP can be applied in maximal frequent patterns mining, closed frequent patterns mining and frequent patterns mining.Lastly, SEP and IEP strategies are applied to improve important frequent patterns mining algorithms, such as SPAM, SPADE, MAFIA, CHARM, etc. All the improved frequent sequence mining algorithms (SPAM+ and SPADE+) and frequent itemset mining algorithms (MAFIA+ and CHARM+) outperform previous ones by a factor of up to 10 on synthetic datasets. On large datasets, the performance is improved by 30%-50%.SEP and IEP can be used in different kinds of frequent patterns mining algorithms and improve the performance. Thus, SEP and IEP are independent of underlying algorithms and data structures. SEP and IEP can be shared by different algorithms.