Research on Clustering Algorithm Based on Genetic Algorithm and Rough Set Theory
|School||Changsha University of Science and Technology|
|Course||Communication and Information System|
|Keywords||data mining K-means clustering algorithm genetic algorithm rough set incremental categorical data|
With the rapid development of computer technology and database technology, a large amount of data has been produced in various fields, many important information hidden behind these data, people want to analyze them in order to extract useful knowledge. Thus, data mining was proposed. Data mining is one of the most forward lines of database and information decision area. Cluster analysis is an important branch of data mining, and its basic purpose is to discover the natural group characteristics of the data by analyzing the similarity between them.This paper discussed the clustering algorithm and its incremental algorithm, both of which based on the genentic algorithm and rough set theory, then discussed the clustering algorithm for clustering the categorical data. The main research of this paper is as follows:1、It analyzes the advantages and disadvantages of the existing rough K-means clustering algorithm, according to the genetic evolution of the genetic algorithm and the maximum minimum distance algorithm, proposes a optimized method of rough K-means, the algorithm can determine the initial center dynamic and non-random, while the boundary object can be dealt very well. Experimental results show the effectiveness and correctness of the algorithm.2、It analyzes the advantages and disadvantages of the existing non-incremental rough clustering algorithm, based on incremental thinking and neighbors thought, proposes an incremental clustering method. Experiments show that the algorithm can make full use of the previous mining results to improve the utilization of existing information and clustering efficiency, it also can deal with large data sets under dynamic environments.3、An efficient categorical data clustering method is proposed, it extends the K-means algorithm to categorical data domain to overcome the shortcomings of the traditional K-means algorithm which can only deal with numerical data. In accordance with the information of data distribution correlated to each value of each categorical attribute,and at the same time combined with the vertical and horizontal distribution of the data to measure the difference between data object and the class,it defines a new distance metric. Experiments show that this method can find the intrinsic relationship between the different values of the same attribute,and could measure the difference between objects effective.