Research and Implementation of Text Categorization Based on Latent Semantic Indexing

Author SuXianYu
Advisor ZhangTianWen
School Harbin Institute of Technology
Course Computer Science and Technology
Keywords Text Classification; Latent Semantic Indexing; Partial Least Squares Regression
CLC TP391.1
Type Master's thesis
Year 2008

Latent Semantic Indexing (LSI) is a dimensionality reduction technique whose effectiveness for text classification has been validated experimentally. When LSI reduces the original feature space, it tries to preserve as much of the global information in that space as possible. As a result, it inevitably filters out features that matter little globally but are very important for recognizing particular categories. To address this problem, we improve the traditional LSI model in three ways. First, to overcome the defects of traditional weighting methods based on word frequency, this thesis introduces the concept of document weight into the calculation, so that the new weighting scheme is more conducive to forming the latent semantic space and better suited to the LSI model; it also adds word position information, which makes the word weights more accurate. Second, an analysis of the traditional χ2 statistic shows that it pays too little attention to rare categories and that its error is too high in certain cases; we therefore introduce three indicators (frequency, concentration, and dispersion) so that the new χ2 statistic is more accurate. Finally, the thesis adds category information to the traditional LSI-based classification method and uses partial least squares regression to propose a new text classification method, called Latent Semantic Classification based on Category Information (LSCCI).
This thesis describes the principle and implementation of the latent semantic indexing model in detail, elaborates the derivation of LSCCI, and compares the performance of LSCCI with that of other classical classification models. Experimental data show that LSCCI achieves better classification accuracy. Experiments on English text classification further show that, for rare categories, the model outperforms conventional classification methods.
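
The baseline LSI dimensionality reduction that the abstract starts from can be illustrated with a minimal sketch. It shows only the standard truncated-SVD form of LSI, not the thesis's improved weighting, its modified χ2 statistic, or LSCCI. The function names (`lsi_fit`, `lsi_project`, `cosine`) and the toy term-by-document matrix are assumptions made for illustration and do not come from the thesis.

```python
import numpy as np


def lsi_fit(term_doc, k):
    """Truncated SVD of a term-by-document matrix (rows: terms, columns: documents).

    Returns the k leading left singular vectors, singular values, and
    right singular vectors, which together span the latent semantic space.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def lsi_project(U_k, s_k, doc_vec):
    """Fold a term-space document vector into the k-dimensional latent space.

    Uses the usual fold-in formula d_hat = Sigma_k^{-1} U_k^T d.
    """
    return (U_k.T @ doc_vec) / s_k


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy data (hypothetical): terms 0-1 occur only in documents 0-1,
# terms 2-3 occur only in documents 2-3.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

U_k, s_k, Vt_k = lsi_fit(X, k=2)

# Each document in the latent space, via the same fold-in used for new documents.
docs_k = np.array([lsi_project(U_k, s_k, X[:, j]) for j in range(X.shape[1])])

sim_same_topic = cosine(docs_k[0], docs_k[1])
sim_other_topic = cosine(docs_k[0], docs_k[2])
```

In this toy case the two documents that share vocabulary end up close together in the two-dimensional latent space, while documents with disjoint vocabulary stay apart. The reduction uses no category labels, which is the limitation the abstract cites as the motivation for adding category information.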
