The Research on Data Preprocessing in Data Mining Based on Rough Sets Theory
|School||Central China Normal University|
|Course||Applied Computer Technology|
|Keywords||data preprocessing; data mining; rough sets theory; data discretization; attribute reduction|
With the advent of the information age, people face more and more data in every field, and the volume of that data keeps growing at a remarkable speed. To improve the efficiency of work and the quality of life, valuable knowledge embedded in databases has to be extracted, and this need has motivated research on knowledge discovery in databases. Databases, however, usually contain redundant, missing, uncertain and inconsistent data, which are a major barrier to extracting knowledge from them. Data preprocessing therefore has to be carried out before knowledge discovery. This thesis studies data preprocessing, focusing on data discretization and attribute reduction.

First, the history, status quo and possible development directions of data mining are introduced, and the main methods and techniques of data mining are reviewed. Second, rough set theory and data preprocessing are introduced, and the general procedure for applying rough set theory in data mining is analyzed. The author's research on data discretization and attribute reduction is then presented in detail: a general mathematical definition of the discretization of continuous data is given, followed by a formula for computing the cutpoints in the attribute-class difference discretization algorithm. So far there are few discretization algorithms that are both supervised and dynamic; such an algorithm is studied and proposed in this thesis. Its advantage is that it considers both the class label and the correlation of every attribute. The attribute reduction algorithm is an enhanced algorithm based on the indiscernibility matrix; it does not find all possible reducts but instead finds one better reduct.
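The abstract does not give the details of the enhanced reduction algorithm, so the following is only a minimal sketch of the general idea of finding a single reduct from a matrix of pairwise attribute differences (usually called a discernibility matrix in the rough set literature). It uses a generic greedy heuristic, not the thesis's method, and the function names `discernibility_entries` and `greedy_reduct` as well as the toy decision table are hypothetical.

```python
from itertools import combinations


def discernibility_entries(rows, decisions):
    """For each pair of objects with different decisions, collect the set of
    condition-attribute indices on which the two objects differ."""
    entries = []
    for i, j in combinations(range(len(rows)), 2):
        if decisions[i] != decisions[j]:
            diff = frozenset(
                a for a, (x, y) in enumerate(zip(rows[i], rows[j])) if x != y
            )
            if diff:  # an empty set marks an inconsistent pair; skip it
                entries.append(diff)
    return entries


def greedy_reduct(rows, decisions):
    """Return one reduct (assumed heuristic): greedily cover all entries,
    then drop any attribute that turns out to be redundant."""
    entries = discernibility_entries(rows, decisions)
    remaining = list(entries)
    chosen = []
    while remaining:
        counts = {}
        for entry in remaining:
            for a in entry:
                counts[a] = counts.get(a, 0) + 1
        # Pick the attribute that appears in the most uncovered entries;
        # ties are broken by the lowest attribute index.
        best = min(counts, key=lambda a: (-counts[a], a))
        chosen.append(best)
        remaining = [e for e in remaining if best not in e]
    # Pruning pass: remove attributes whose absence still covers every entry.
    for a in list(chosen):
        trial = [b for b in chosen if b != a]
        if all(any(b in e for b in trial) for e in entries):
            chosen = trial
    return sorted(chosen)


# Toy decision table: three condition attributes, one decision per object.
rows = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0)]
decisions = [0, 0, 1, 1]
print(greedy_reduct(rows, decisions))  # [0]
```

In this toy table attribute 0 alone separates the two decision classes, so the sketch returns a single-attribute reduct. A greedy search of this kind returns one reduct rather than enumerating all of them, which matches the trade-off the abstract describes, although the thesis's own selection criterion may differ.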