Research on Sample-Based Cloud Frequent Itemset Mining Algorithms
|School||Hefei University of Technology|
|Course||Applied Computer Technology|
|Keywords||data mining; frequent itemsets; Hadoop; MapReduce|
With the development of data collection technology, the era of massive data has arrived. Business competition is fierce, and people are eager to extract useful information from massive data to help them make correct business decisions. However, traditional data analysis and data mining techniques struggle to meet this demand because of their excessive time and space costs. For example, traditional frequent itemset mining must scan the data set many times, which is time-consuming, and must store a large number of candidate itemsets, which consumes a large amount of memory.

At the same time, cloud computing, which offers high concurrency and low-cost processing of massive data, is developing rapidly. In recent years, the Hadoop ecosystem has been its most representative development. Hadoop is mainly composed of two parts: HDFS and MapReduce. It uses cheap commodity machines as compute nodes to build a cloud platform that can process massive data efficiently. Combining data mining with cloud computing means applying this strength to massive-data mining, which will bring new vitality to traditional data mining technology.

This thesis aims to combine frequent itemset mining with cloud computing. The main work is as follows:
(1) First, this thesis studies and analyzes the Hadoop platform in depth. Hadoop has two core parts: the HDFS distributed file system for massive data storage and the MapReduce parallel programming framework for data processing. The two complement each other and together constitute the Hadoop distributed framework.
(2) To further improve the efficiency of frequent itemset mining, a parallel sampling algorithm based on Hadoop is proposed. Using the MapReduce programming framework, the algorithm draws a random sample while scanning the massive data only once. Data cleaning can also be performed during the same sampling pass.
(3) After an in-depth study of traditional frequent itemset mining algorithms, a sample-based cloud frequent itemset mining algorithm is proposed. The algorithm runs on the Hadoop platform to take full advantage of cloud computing for processing massive data. Experimental results show that the algorithm achieves good mining performance.
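The abstract does not give the details of either algorithm, so the following is only a minimal single-machine sketch of the general idea: clean and sample the records in one pass, then mine frequent itemsets from the sample instead of the full data. It is plain Python, not Hadoop code. The function names `sample_mapper` and `frequent_itemsets`, the Bernoulli sampling rule, the whitespace-separated record format, and the brute-force counting of itemsets up to a small size are all assumptions made for illustration, not the thesis's method.

```python
import random
from collections import Counter
from itertools import combinations


def sample_mapper(lines, rate, rng):
    """Map-style step (hypothetical): clean each record and keep it
    with probability `rate`, touching every input line exactly once."""
    for line in lines:
        # Cleaning assumption: items are whitespace-separated; blank
        # records are dropped and duplicate items within a record merged.
        items = frozenset(line.split())
        if items and rng.random() < rate:
            yield items


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force counting (not Apriori): enumerate every itemset of
    size 1..max_size in the sample and keep those whose relative
    support is at least `min_support`."""
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter()
    for t in transactions:
        ordered = sorted(t)  # canonical order so each itemset has one key
        for k in range(1, max_size + 1):
            for combo in combinations(ordered, k):
                counts[combo] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}
```

In a real MapReduce job, the first function would correspond to the map phase running in parallel over input splits, and the counting would be divided between mappers and reducers. Because support is estimated from a sample, the resulting itemsets are approximate and would normally be checked against the full data or mined with a slightly lowered threshold.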