Research on Text Classification Based on Biomimetic Pattern Recognition
|School||Harbin Institute of Technology|
|Course||Computer Science and Technology|
|Keywords||Text Classification; Biomimetic Pattern Recognition; Feature Selection; Hyper Sausage Neuron Network|
With the advent of the Internet era, the amount of electronic data has increased dramatically, and how to obtain, manage, and make full use of text data has become an urgent problem in information science. Text classification (TC), which assigns natural language texts to given topics, is an important research field in information technology. Biomimetic Pattern Recognition (BPR) is based on “matter cognition” rather than “matter classification”, which brings it closer to the way humans recognize things than traditional text classification (or traditional pattern recognition), whose main principle is “optimal separation”. In this paper we therefore apply the BPR principle to text classification.

BPR is a new theory that differs from traditional pattern recognition. Its basic idea rests on the continuity, in the feature space, of the samples that belong to any one class. It identifies samples by optimally covering the high-dimensional geometrical distribution of the sample set in the feature space. This paper studies in depth the mathematical tools and the implementation of BPR, and proposes a novel text classification algorithm based on the Hyper Sausage Neuron (HSN) network.

We further present three improvements to the new classification algorithm. First, based on a study of noise and redundancy in the training data, we integrate a clustering method with the HSN classifier. Second, based on a study of the misidentification of border samples, we propose a k-best identification algorithm built on the HSN network. Third, we give a two-pass (“twice”) feature selection method to address noise in the features. In addition, we present an integration of HSN and SVM.

Experimental results on an English corpus show that the improved HSN classification algorithm performs better than KNN and SVM. On a Chinese corpus the improved HSN classification algorithm also outperforms KNN, and the integration of HSN and SVM performs better than either method alone.
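
The covering idea behind the HSN network can be illustrated with a minimal sketch. It assumes the commonly described form of a hyper sausage neuron: the set of points lying within a fixed radius of the line segment joining two training samples of the same class. The function names, the ordering of samples into a chain, and the radius value are hypothetical choices for illustration, not the algorithm or parameters used in this thesis.

```python
import numpy as np

def point_to_segment_distance(x, a, b):
    """Euclidean distance from point x to the line segment [a, b]."""
    x, a, b = (np.asarray(v, dtype=float) for v in (x, a, b))
    ab = b - a
    denom = float(ab @ ab)
    if denom == 0.0:
        # Degenerate segment: a and b coincide, so the cover is a ball.
        return float(np.linalg.norm(x - a))
    # Project x onto the line through a and b, then clamp to the segment.
    t = float(np.clip((x - a) @ ab / denom, 0.0, 1.0))
    return float(np.linalg.norm(x - (a + t * ab)))

def covered_by_chain(x, chain, radius):
    """True if x lies within `radius` of any segment joining consecutive samples."""
    return any(
        point_to_segment_distance(x, a, b) <= radius
        for a, b in zip(chain[:-1], chain[1:])
    )

def classify(x, class_chains, radius):
    """Return every label whose cover contains x; an empty list means rejection."""
    return [
        label
        for label, chain in class_chains.items()
        if covered_by_chain(x, chain, radius)
    ]

# Toy 2-D feature vectors (assumed data, for illustration only).
chains = {"sports": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]}
print(classify((1.0, 0.3), chains, radius=0.5))
print(classify((1.0, 1.0), chains, radius=0.5))
```

Unlike a separating classifier, this kind of cover can reject a sample that falls inside no class region, which is the "cognition rather than classification" behaviour the abstract describes.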