Research on Chinese Document Classification Algorithms
|Course||Communication and Information System|
|Keywords||Text Categorization; Text Semantic Similarity; DSM; Support Vector Machine|
With the rapid growth of networked information, information processing has become an indispensable tool for obtaining useful information. Automatic text classification is an important research direction in information processing: given a classification scheme, it automatically determines the category of a text from its content. This paper presents a semantics-based classifier model for natural language text. The model first computes the weighted mutual information between terms and categories in the training set to obtain the text feature set. It then uses intelligent word segmentation and statistical methods to represent each text in the vector space model (VSM) with TF-IDF weights. Taking HowNet as the source of concept knowledge, it computes semantic similarity to obtain the semantic information of the text and uses that information to weight the text vector. The training texts are vectorized in this way and serve as the learning vectors for a support vector machine (SVM), whose training yields the support vectors used for classification. A text to be classified is vectorized by the same method, and the SVM then determines its category. A text classification system was built on this model. The system has a high recognition rate and recall rate, high processing speed, and low processor overhead. Experimental results on real corpora from Fudan University and People's Daily show that its classification performance meets practical needs. The contributions are twofold. First, text vectors are weighted by semantic similarity, so that for a given number of features the text feature vector reflects more of the content of the text. Second, a DSM-based knowledge reduction algorithm and an incremental machine learning algorithm are applied to learn document feature vectors, so that as the number of test documents grows, new document feature vectors are gradually acquired.
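The first two stages of the pipeline described above, feature selection by mutual information and TF-IDF vectorization in the VSM, can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The toy corpus, the function names, the prior-weighted form of mutual information, and the smoothed IDF are all assumptions made for illustration, since the abstract does not give the exact formulas. Word segmentation is assumed to have been done already, and the HowNet semantic weighting and SVM training steps are omitted.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (pre-segmented tokens, category label).
TRAIN = [
    (["stock", "market", "bank", "news"], "finance"),
    (["bank", "loan", "market"], "finance"),
    (["match", "goal", "team", "news"], "sports"),
    (["team", "coach", "match"], "sports"),
]

def mutual_information(term, category, docs):
    """Pointwise MI log(P(t,c) / (P(t) P(c))) estimated from document counts."""
    n = len(docs)
    n_t = sum(1 for toks, _ in docs if term in toks)
    n_c = sum(1 for _, c in docs if c == category)
    n_tc = sum(1 for toks, c in docs if c == category and term in toks)
    if n_tc == 0:
        return 0.0  # assumption: no co-occurrence contributes nothing
    return math.log((n_tc * n) / (n_t * n_c))

def select_features(docs, k):
    """Keep the k terms with the highest category-prior-weighted MI (assumed form)."""
    n = len(docs)
    vocab = sorted({t for toks, _ in docs for t in toks})
    cat_counts = Counter(c for _, c in docs)
    scores = {
        t: sum((n_c / n) * mutual_information(t, c, docs)
               for c, n_c in cat_counts.items())
        for t in vocab
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]

def tfidf_vector(tokens, features, docs):
    """L2-normalized TF-IDF vector over the selected features (smoothed IDF)."""
    n = len(docs)
    tf = Counter(tokens)
    vec = []
    for t in features:
        df = sum(1 for toks, _ in docs if t in toks)
        idf = math.log((n + 1) / (df + 1)) + 1.0
        vec.append(tf[t] * idf)
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

features = select_features(TRAIN, k=8)
vector = tfidf_vector(["bank", "market", "news"], features, TRAIN)
print(features)
print(vector)
```

In this toy corpus the term "news" occurs equally often in both categories, so its mutual information with each category is zero and it ranks last. Vectors produced this way for the training texts could then be passed to any SVM implementation, after whatever semantic weighting is applied.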