
Research on Text Classification Based on Biomimetic Pattern Recognition

Author HuangQiHu
Tutor WangYuYing
School Harbin Institute of Technology
Course Computer Science and Technology
Keywords Text Classification; Biomimetic Pattern Recognition; Feature Selection; Hyper Sausage Neuron Network
CLC TP391.1
Type Master's thesis
Year 2008

With the advent of the Internet era, the amount of electronic data has increased dramatically. How to obtain, manage, and make full use of text data has therefore become an urgent problem in information science. Text classification (TC), which assigns natural-language texts to given topics, is an important research field in information technology. Biomimetic Pattern Recognition (BPR) is based on “matter cognition” rather than “matter classification”, which brings it closer to the way humans recognize things than traditional text classification (or traditional pattern recognition), whose main principle is “optimal separation”. We therefore apply the BPR principle to text classification in this paper.

BPR is a new theory that differs from traditional pattern recognition. Its basic idea rests on the continuity, in the feature space, of the samples that belong to any one class. It identifies samples by optimally covering the high-dimensional geometric distribution of the sample set in the feature space. This paper studies in depth the mathematical tools and the implementation of BPR, and proposes a novel text classification algorithm based on the Hyper Sausage Neuron (HSN) network.

Further, we present three improvements to the new classification algorithm. First, a study of the noise and redundancy in the training data leads us to integrate a clustering method with the HSN classifier. Second, a study of the misidentification of border samples leads us to propose a k-best identification algorithm based on the HSN network. Third, we give a twice-feature selection method to address noise in the features. Furthermore, we present an integration of HSN and SVM.

The experimental results on an English corpus show that the improved HSN classification algorithm achieves better performance than KNN and SVM. On a Chinese corpus, the improved HSN classification algorithm also has advantages over KNN, and the integration of HSN and SVM performs better than either of them alone.
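
The covering idea behind an HSN classifier can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: each "hyper sausage" is taken to be the set of points within a fixed radius of the line segment joining two consecutive training samples of one class, samples are chained in the order given, and the names `segment_distance`, `class_distance`, `classify`, and `radius` are hypothetical. It is not the thesis's algorithm, only one common way to realize coverage-based recognition.

```python
import math

def segment_distance(x, a, b):
    """Euclidean distance from point x to the line segment [a, b]."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ax = [xi - ai for ai, xi in zip(a, x)]
    denom = sum(c * c for c in ab)
    if denom == 0.0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        # Project x onto the line through a and b, then clamp to the segment.
        t = max(0.0, min(1.0, sum(p * q for p, q in zip(ax, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(x, closest)

def class_distance(x, samples):
    """Smallest distance from x to the chain of segments joining consecutive samples."""
    if len(samples) == 1:
        return math.dist(x, samples[0])
    return min(
        segment_distance(x, samples[i], samples[i + 1])
        for i in range(len(samples) - 1)
    )

def classify(x, chains, radius):
    """Return the label of the nearest chain if x lies inside its covering, else None.

    Assumption: one shared radius for every class; a rejected sample
    (None) is one that no class covering contains.
    """
    best_label, best_dist = None, float("inf")
    for label, samples in chains.items():
        d = class_distance(x, samples)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= radius else None

# Toy two-dimensional feature vectors (hypothetical data, not from the thesis).
chains = {
    "sports": [[0.0, 0.0], [2.0, 0.0]],
    "finance": [[0.0, 3.0], [2.0, 3.0]],
}
print(classify([1.0, 0.2], chains, radius=0.5))
print(classify([1.0, 1.5], chains, radius=0.5))
```

Unlike a separating classifier, this sketch can reject a sample that falls outside every class covering, which is the behavior that "cognition" rather than "classification" suggests.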
