Term Weight-Based Chinese Text Classification Algorithm
|School||Hebei University of Technology|
|Course||Applied Computer Technology|
|Keywords||Information retrieval; text classification; vector space model; feature extraction; feature weighting|
With the rapid development and growing popularity of the Internet, the number of web pages has soared. How to find needed information quickly and efficiently within such vast resources has become a research focus. Because most web page content is text, automatically categorizing that text has become an important research subject. Automatic text classification is an essential first step in information retrieval: under a given classification scheme, it automatically determines the category of a text from its content, so that the text can be retrieved more easily. With such a classification system, information can be organized and managed effectively, which helps users locate it quickly and accurately.

This paper first reviews the state of research on automatic text categorization in China and abroad. It then studies the key technologies involved, including information retrieval models, Chinese word segmentation, feature extraction, feature weighting methods, and the main classification algorithms. On feature weighting, it analyzes the shortcomings of traditional feature weights, in particular of the commonly used TF-IDF method, and proposes an improved weight calculation method. The improved method incorporates a feature evaluation function into the weight calculation, so that each feature's contribution is adjusted according to its ability to discriminate between text categories. Specifically, the weight is computed by combining TF-IDF weighting with the χ² (chi-square) statistic. Experiments show that the improved weight calculation method increases classification accuracy.

Finally, this paper presents a Chinese text categorization system based on the vector space model, describing its overall framework, processing flow, and functional modules. The feature extraction, weighting, and classification algorithms implemented in the system are compared experimentally.
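The abstract does not state how the χ² statistic enters the weight, so the following is only a minimal sketch under stated assumptions: the weight of a term in a document is taken as TF × IDF × χ², where χ² is the largest chi-square score of the term over all categories. The function names (`chi_square`, `term_chi2`, `weighted_vectors`), the smoothed IDF form, and the use of the maximum over categories are illustrative choices, not the author's method. Documents are assumed to be already segmented into word lists.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a term/category 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term (or category) covers all or no documents
    return n * (a * d - b * c) ** 2 / denom


def term_chi2(docs, labels):
    """Assumed choice: score each term by its maximum chi-square over categories."""
    n_docs = len(docs)
    doc_sets = [set(doc) for doc in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        best = 0.0
        for cat in set(labels):
            a = sum(1 for s, y in zip(doc_sets, labels) if y == cat and term in s)
            b = sum(1 for s, y in zip(doc_sets, labels) if y != cat and term in s)
            c = sum(1 for s, y in zip(doc_sets, labels) if y == cat and term not in s)
            d = n_docs - a - b - c
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    return scores


def weighted_vectors(docs, labels):
    """Build one sparse vector per document with weight = TF * IDF * chi-square."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    chi2 = term_chi2(docs, labels)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {
            # Relative term frequency times a smoothed IDF, scaled by chi-square.
            term: (count / len(doc)) * math.log(n_docs / df[term] + 1.0) * chi2[term]
            for term, count in tf.items()
        }
        vectors.append(vec)
    return vectors
```

Under this sketch, a term that appears in every document, regardless of category, receives a χ² score of zero and therefore a zero weight, while a term concentrated in one category is boosted. That is one way to realize the idea of adjusting a feature's contribution by its ability to discriminate between categories.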