
Research on Chinese Document Classification Algorithms

Author ZhangBin
Advisor ZuoPuLiu
School Wuhan University
Major Communication and Information Systems
Keywords Text Categorization; Text Semantic Similarity; DSM; Support Vector Machine
CLC TP391.1
Type Master's thesis
Year 2004
Downloads 314
Citations 30

With the rapid growth of networked information, information processing has become an indispensable tool for obtaining useful information. Automatic text classification is an important research direction in information processing: it is the process of automatically determining the category of a text, under a given classification scheme, based on the text's content. This thesis presents a semantics-based classifier model for natural-language text. The model first computes the weighted mutual information between terms and categories in the training set to obtain the text feature set. It then uses intelligent word segmentation and statistical methods to represent each text in the vector space model (VSM) with TF-IDF weights. Using HowNet as the source of concept knowledge, it computes semantic similarity to obtain the semantic information of the text, and uses that information to weight the text vector. The training texts are converted to vectors in this way and used as the learning vectors for a support vector machine (SVM); training yields the support vectors used for text classification. A text to be classified is vectorized by the same method, and the SVM then determines its category. A text classification system built on this model has a high recognition rate and recall rate, a high processing speed, and low processor overhead. Experimental results on real corpora from Fudan University and People's Daily show that the classifier's performance meets practical needs. The work contributes two ideas. First, text vectors are weighted by semantic similarity, so that for a fixed number of features the text feature vector reflects more of the text's content. Second, a DSM-based knowledge reduction algorithm and an incremental machine learning algorithm are applied to the self-learning of document feature vectors, so that new document feature vectors are gradually acquired as the number of test documents grows.
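
The first two stages of the pipeline described in the abstract (feature selection by mutual information, then TF-IDF weighting in the VSM) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. It assumes the texts are already segmented into token lists, it uses pointwise mutual information averaged over category priors as one plausible reading of "weighted mutual information", and the function names and the toy corpus are hypothetical. The HowNet-based semantic weighting and the SVM training step are omitted.

```python
import math
from collections import Counter


def mutual_information(docs, labels, term, category):
    """Pointwise mutual information of a term and a category, from document counts."""
    n = len(docs)
    n_t = sum(1 for d in docs if term in d)
    n_c = sum(1 for y in labels if y == category)
    n_tc = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    if n_tc == 0:
        return float("-inf")  # term never occurs in this category
    return math.log((n_tc * n) / (n_t * n_c))


def select_features(docs, labels, k):
    """Keep the k terms with the highest prior-weighted MI (assumed weighting)."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    prior = {c: cnt / n for c, cnt in Counter(labels).items()}

    def score(term):
        total = 0.0
        for c, p in prior.items():
            mi = mutual_information(docs, labels, term, c)
            if mi != float("-inf"):  # skip categories the term never occurs in
                total += p * mi
        return total

    return sorted(vocab, key=score, reverse=True)[:k]


def tfidf_vector(doc, features, docs):
    """Represent one tokenized text as a TF-IDF vector over the selected features."""
    n = len(docs)
    tf = Counter(doc)
    vec = []
    for t in features:
        df = sum(1 for d in docs if t in d)
        idf = math.log(n / df) if df else 0.0
        vec.append(tf[t] * idf)
    return vec


# Toy, already-segmented corpus (hypothetical data, for illustration only).
docs = [
    ["football", "match", "team"],
    ["football", "goal", "team"],
    ["stock", "market", "team"],
    ["stock", "bank", "team"],
]
labels = ["sports", "sports", "finance", "finance"]

features = select_features(docs, labels, 6)
vector = tfidf_vector(docs[0], ["football", "team"], docs)
```

In this toy corpus the term "team" occurs in every document, so both its mutual information with each category and its IDF are zero, and it is the one term that feature selection drops. The resulting vectors would then be passed to an SVM trainer, which the sketch leaves out.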
