Design and Implementation of a Probabilistic Clustering Algorithm Based on Topic Term in Dataspace
Keywords: dataspace; clustering; probability; topic term
Over the past several decades, relational database management systems have mainly served people's data management applications, and they have worked well. In recent years, however, the rapid growth in both the volume and the variety of data has changed data management requirements considerably, which has motivated research on many dataspace systems. How to cluster data in a dataspace efficiently and accurately, so as to help users manage and explore them, also remains a difficult problem.

To address this, this thesis proposes a novel data clustering model based on the definition of topic terms and on probability theory, called the Probabilistic Clustering Algorithm Based on Topic Term (ProCATT). First, ProCATT classifies the terms in the data into several groups according to a heuristic standard. Next, it transforms the data into a probabilistic vector representation. Finally, it establishes a probabilistic relation matrix M. Extensive experimental results show that the clustering algorithm performs well and outperforms some classical algorithms.

The original algorithm still has some disadvantages, however, so the thesis improves ProCATT into a new algorithm. First, the improved ProCATT classifies the terms in the data into several groups according to a new standard. Next, it changes the data representation from a single vector to several probabilistic vectors. Finally, it establishes a comprehensive relation matrix M, which stores not only the direct relationships among data items but also the indirect relationships among them. Extensive experimental results show that the improved algorithm performs well and outperforms some classical algorithms.

Finally, the thesis is summarized and future work is outlined.
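The abstract does not specify the grouping heuristic, the vector construction, or how the matrix M is filled, so the following is only a minimal sketch of the general shape of such a pipeline, not the ProCATT method itself. Everything in it is an assumption made for illustration: the document-frequency rule in `select_topic_terms`, the relative-frequency vectors, the dot-product relation, and the `weight` applied to two-step (indirect) relations.

```python
from collections import Counter


def select_topic_terms(docs):
    """Hypothetical heuristic: keep terms found in at least two, but not all, items."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, c in df.items() if 2 <= c < n)


def to_probability_vector(doc, terms):
    """Relative frequency of each topic term in one item (all zeros if none occur)."""
    counts = Counter(doc)
    total = sum(counts[t] for t in terms)
    return [counts[t] / total if total else 0.0 for t in terms]


def direct_relation_matrix(vectors):
    """M[i][j] is the dot product of the probability vectors of items i and j."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def comprehensive_relation_matrix(m, weight=0.5):
    """Add two-step (indirect) relations: M + weight * (M x M). The weight is an assumption."""
    n = len(m)
    return [[m[i][j] + weight * sum(m[i][k] * m[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


# Toy data items, each reduced to a list of terms (illustrative only).
docs = [
    ["paper", "cluster", "data"],
    ["cluster", "topic", "data"],
    ["topic", "photo", "data"],
]
terms = select_topic_terms(docs)
vectors = [to_probability_vector(d, terms) for d in docs]
direct = direct_relation_matrix(vectors)
full = comprehensive_relation_matrix(direct)
```

In this toy input, items 0 and 2 share no selected term, so their direct relation is zero. Both are related to item 1, however, so the comprehensive matrix assigns them a nonzero entry, which is the kind of indirect relationship the improved algorithm is described as storing.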