Research and Application of Clustering Algorithms for High-Dimensional Data
|School||Nanjing University of Technology and Engineering|
|Course||Applied Computer Technology|
|Keywords||Cluster analysis; Biclustering; High-dimensional data; Penalty strategy; Biclustering algorithm|
In recent years, the rapid development of bioinformatics, e-commerce, and other fields has produced large volumes of high-dimensional data, and using data mining techniques to extract valuable information from these data plays an important role in scientific research and marketing. Cluster analysis is one such technique. Traditional clustering methods cluster along only one direction of the data matrix, either rows or columns, so they can find only global information. High-dimensional data, however, is characterized by a large amount of local information that traditional methods cannot find. To cluster high-dimensional data better, and in particular to find local information in high-dimensional data spaces, the newer method of biclustering has come into wider use. Biclustering algorithms are better suited to high-dimensional data because they cluster the rows and columns of the data matrix simultaneously, which lets them find local information more effectively. Biclustering removes the bottleneck that traditional clustering meets on high-dimensional data, but research on biclustering algorithms, both domestically and internationally, is still at an early stage, and the algorithms proposed in recent years each have shortcomings, so studying and improving them is particularly necessary. This paper first describes in detail the definition of biclustering and the types and structures of biclusters. It then analyzes the mathematical models and clustering strategies of the biclustering algorithms most widely applied in recent years and discusses their advantages and disadvantages.
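To make the contrast between local and global information concrete, the following is a minimal sketch, not taken from the paper. It assumes the mean squared residue of Cheng and Church as the coherence measure and uses a made-up toy matrix with an additive pattern planted in a few rows and columns. The function name `mean_squared_residue` and all data values are illustrative assumptions.

```python
import numpy as np

def mean_squared_residue(data, rows, cols):
    """Mean squared residue of the submatrix data[rows, cols].

    A value near zero means the selected rows and columns follow a
    coherent additive pattern (row effect + column effect).
    """
    sub = data[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    overall_mean = sub.mean()
    residue = sub - row_means - col_means + overall_mean
    return float((residue ** 2).mean())

# Toy matrix (made up for illustration): uniform noise everywhere,
# with an additive pattern planted in rows 0-2 and columns 0-3.
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 10.0, size=(8, 6))
row_effect = np.array([1.0, 4.0, 7.0])
col_effect = np.array([0.0, 2.0, 3.0, 5.0])
data[np.ix_([0, 1, 2], [0, 1, 2, 3])] = row_effect[:, None] + col_effect[None, :]

# Score of the planted submatrix versus the score of the whole matrix.
local_score = mean_squared_residue(data, [0, 1, 2], [0, 1, 2, 3])
global_score = mean_squared_residue(data, list(range(8)), list(range(6)))
print(f"planted submatrix: {local_score:.6f}, whole matrix: {global_score:.6f}")
```

The planted submatrix is coherent only when both its rows and its columns are selected together, which is the kind of local pattern that clustering rows or columns alone would miss.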
Building on this analysis of biclustering algorithms suited to high-dimensional data, the paper proposes a penalty-strategy-based overlapping biclustering algorithm (Penalty strategy based Overlapping Biclustering Algorithm, abbreviated POBA). The work targets a specific step of the Cheng and Church algorithm: after each iteration, the elements of the bicluster just found are replaced with random numbers. POBA replaces this substitution step with a penalty strategy in the iterative process. The strategy still completes the biclustering of the data matrix while avoiding the interference that random numbers introduce into the greedy search. In addition, POBA introduces a penalty parameter θ that controls the overlap rate of the resulting biclusters, so the algorithm can be adapted flexibly to the needs of different clustering applications. Finally, POBA is designed, implemented, and applied in clustering experiments on public high-dimensional data sets. Analysis of the experimental results verifies the effectiveness of the algorithm and yields guidelines for setting its parameters.
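The abstract does not give POBA's formulas, so the following is only a rough sketch of how a penalty on overlap might stand in for random-number replacement. It assumes, hypothetically, that a candidate bicluster is scored by its mean squared residue plus θ times the fraction of its cells already covered by earlier biclusters. The function `penalized_score`, the `covered` mask, and the data are all illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def mean_squared_residue(sub):
    """Mean squared residue of a submatrix (Cheng and Church coherence score)."""
    residue = (sub - sub.mean(axis=1, keepdims=True)
               - sub.mean(axis=0, keepdims=True) + sub.mean())
    return float((residue ** 2).mean())

def penalized_score(data, covered, rows, cols, theta):
    """Hypothetical penalized score, assumed here for illustration only.

    Returns (score, overlap), where overlap is the fraction of candidate
    cells already covered and score = residue + theta * overlap.
    """
    idx = np.ix_(rows, cols)
    overlap = float(covered[idx].mean())
    return mean_squared_residue(data[idx]) + theta * overlap, overlap

rng = np.random.default_rng(1)
data = rng.normal(size=(6, 6))  # made-up data matrix

# Mark the cells of a previously found bicluster (rows 0-2, columns 0-3).
# The data itself is left unchanged; no random numbers are written into it.
covered = np.zeros(data.shape, dtype=bool)
covered[np.ix_([0, 1, 2], [0, 1, 2, 3])] = True

# A candidate that shares columns 2 and 3 with the earlier bicluster.
candidate_rows, candidate_cols = [0, 1, 2], [2, 3, 4, 5]
score_free, overlap_fraction = penalized_score(
    data, covered, candidate_rows, candidate_cols, theta=0.0)
score_strict, _ = penalized_score(
    data, covered, candidate_rows, candidate_cols, theta=2.0)
print(overlap_fraction, score_free, score_strict)
```

Under this assumed scoring, θ = 0 lets biclusters overlap freely, while a larger θ makes overlapping candidates score worse, so a greedy search that minimizes the score would tend toward less overlap. Because the original values are never overwritten, the search is not disturbed by injected noise.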