Research on Text Clustering Based on HowNet
|School||Hebei University of Technology|
|Keywords||text clustering; vector space model; HowNet; textual similarity|
K-Means is a classical data mining algorithm whose advantages are a simple form and low time and space cost, and it is widely used in text mining. This paper studies the key techniques and algorithms of text clustering, proposes a new HowNet-based method for calculating text similarity, and improves the K-Means algorithm. The main work is to explore how three text similarity measures affect K-Means: the classical measure based on the vector space model, a measure based on HowNet, and a measure that also takes position information into account. The paper implements K-Means with each of the three. To define the HowNet-based measure, the paper proposes a new way of generating the vector space: the vector for a text is built from the words of that text alone, so its dimension equals the number of words in that single text rather than the number of words in the whole text collection. This reduces the high dimensionality and sparsity of the representation. The paper also discusses the relation between this space and Euclidean space. To define the measure that involves position information, the paper proposes that the similarity of two words should be determined by both the similarity of their meanings and the similarity of their positions. Finally, the paper explores how to correct the similarity of two words.
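The abstract does not give the formulas, so the following is only an illustrative sketch of the general idea, not the paper's method. Each text is represented by its own word list, and two texts are compared through word-to-word similarity adjusted by position. The names `toy_word_sim`, `position_sim`, `text_sim`, and the weight `alpha` are hypothetical, the tiny synonym table stands in for a real HowNet-based word similarity, and the best-match averaging is an assumed aggregation.

```python
from typing import Callable, Sequence


def toy_word_sim(a: str, b: str) -> float:
    """Hypothetical stand-in for a HowNet-based word similarity in [0, 1]."""
    if a == b:
        return 1.0
    # Assumed toy data; a real system would query HowNet here.
    synonyms = {frozenset({"car", "automobile"}): 0.9}
    return synonyms.get(frozenset({a, b}), 0.0)


def position_sim(i: int, len_a: int, j: int, len_b: int) -> float:
    """Assumed position similarity: 1 minus the gap between relative positions."""
    rel_i = i / max(len_a - 1, 1)
    rel_j = j / max(len_b - 1, 1)
    return 1.0 - abs(rel_i - rel_j)


def text_sim(
    a: Sequence[str],
    b: Sequence[str],
    word_sim: Callable[[str, str], float] = toy_word_sim,
    alpha: float = 1.0,
) -> float:
    """Symmetric average of best-match word similarities.

    alpha = 1.0 uses meaning similarity only; smaller values let the
    position similarity scale the meaning similarity down (an assumption).
    """
    if not a or not b:
        return 0.0

    def directed(x: Sequence[str], y: Sequence[str]) -> float:
        total = 0.0
        for i, wx in enumerate(x):
            # Best match for wx among the words of the other text only,
            # so no collection-wide vocabulary vector is ever built.
            total += max(
                word_sim(wx, wy)
                * (alpha + (1.0 - alpha) * position_sim(i, len(x), j, len(y)))
                for j, wy in enumerate(y)
            )
        return total / len(x)

    return 0.5 * (directed(a, b) + directed(b, a))
```

A measure like this could be plugged into the assignment step of a K-Means-style loop in place of cosine similarity over a shared vocabulary. How cluster centres are then represented is a separate design question that this sketch does not address.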