The Analysis on the Basic Techniques for Preprocess of Text Mining and the Study on the Application of Text Mining

Author SunDaoJun
Advisor LvTingJie
School Beijing University of Posts and Telecommunications
Discipline Management Science and Engineering
Keywords Chinese word segmentation; Vector Space Machine (VSM); K nearest neighbor (KNN); Text Mining
CLC TP311.13
Type PhD thesis
Year 2008

This thesis systematically explains and implements the general workflow of text mining. The key techniques involved are text collection, text preprocessing, automatic Chinese word segmentation of the processed documents, selection of training patterns and reduction of support vectors, text training, and text mining. Based on an analysis of the system's requirements, we divide the system into four parts: text collection and preprocessing, Chinese word segmentation, selection of training pattern vectors, and training and classification of the text pattern vectors.

Unlike general text mining, our approach must collect the text, preprocess it, and save the weights of the text. We implement a preemptive multi-threaded web text collector, which collects the text of a specified catalog using a depth-first algorithm. We also implement a text preprocessor that strips markup tags and assigns weights to the web text using a recursive matching method. For the remaining parts, we first introduce a classifier that uses the association between words and categories to select training patterns properly and to reduce support vectors. We then introduce the basic theory of K nearest neighbor (KNN), the application of KNN to text classification, and a KNN software implementation. The extracted patterns and their weights form the input file, through which text training and text classification are carried out.

The author implements the text collector, the preprocessor, and the Chinese word segmenter for text mining, and proposes a new solution for selecting text patterns and for text mining based on this study.
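The KNN classification step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not state the similarity measure or the voting rule, so cosine similarity and a simple majority vote are assumed here, and the function names (`cosine`, `knn_classify`), the toy terms, the weights, and the category labels are all hypothetical. Documents are assumed to be already segmented into words and weighted.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query, training, k=3):
    """Label `query` by majority vote among its k most similar training vectors.

    `training` is a list of (term_weight_dict, label) pairs.
    Similarity measure and voting rule are assumptions, not taken from the thesis.
    """
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: each document is a dict of term -> weight.
training = [
    ({"network": 2.0, "router": 1.0}, "telecom"),
    ({"network": 1.0, "bandwidth": 2.0}, "telecom"),
    ({"stock": 2.0, "market": 1.0}, "finance"),
    ({"market": 2.0, "bank": 1.0}, "finance"),
]
query = {"network": 1.0, "bandwidth": 1.0}
label = knn_classify(query, training, k=3)
print(label)
```

In this sketch the term weights play the role of the saved text weights mentioned above; a pattern-selection step that shrinks `training` before classification would reduce the number of similarity computations per query.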
