Research and Application on Short Message Text Clustering

Author FanChengQing
Advisor JiaHuaDing
School Southwestern University of Finance and Economics
Course Applied Computer Technology
Keywords Short message text; Text clustering; Vector space model; Suffix tree; K-Means
CLC TP391.1
Type Master's thesis
Year 2011

With the development of Internet and communication technologies and the accelerating pace of modern life, instant interactive tools such as mobile phones, forums, online chat, and microblogs have come into wide use, producing large volumes of short message text. This data contains a great deal of knowledge, and analyzing and mining it is important for extracting hot-topic information, monitoring public opinion, understanding what is being discussed, and recommending products. In general text clustering research, the objects being clustered are texts of ordinary length. Most are fairly well formed, words are likely to appear more than once in a text, and texts in the same cluster overlap in content; the more two texts overlap, the more likely they are to belong to the same cluster. The linguistic characteristics of short message text mean that it calls for different natural language processing techniques from those used for ordinary long text. Its most significant feature is that each text is very short, so text features are extremely sparse. This makes linguistic features very difficult to extract and substantially increases the difficulty of subsequent natural language processing. Short message text is also interactive: the amount of data grows continuously over time and becomes very large, which places higher demands on the time efficiency of processing than conventional text does. Because short message text comes mainly from real conversational settings, its expression is extremely terse, with abbreviations, non-standard terms, and misspellings. These introduce a lot of noise and further increase the difficulty of extracting useful information from short message text.
Research on short message text clustering therefore has practical significance, but it is also very challenging. Against the background of short message text mining, this thesis focuses on short message text clustering and carries out a series of studies covering the collection of short message text, preprocessing, feature extraction, similarity measurement, and a comparison of clustering algorithms. Short message text is dynamic, interactive, non-standard, and large in scale, which places demands on clustering in three respects: time complexity, the validity of the clustering results, and the intelligibility of the clustering algorithm. With these requirements in mind, this thesis takes improving the validity of the clustering results and reducing the time complexity of the clustering algorithm as its main goals. The work and results are as follows. First, the theory and techniques of text clustering are studied and compared broadly and in depth, with detailed discussion of three aspects (text representation models, text clustering algorithms, and the evaluation of clustering results), including the state of research, theoretical basis, and technical methods of each. Next, the data sources and characteristics of short message text are summarized, and preprocessing techniques for short message text are studied, including Chinese word segmentation and feature extraction and selection. Then, following the classic text clustering process based on the vector space model, short message texts are represented as vectors, the widely used K-Means clustering algorithm is applied to a short message data set, and the clustering results are analyzed and evaluated.
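The vector space model and K-Means step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the smoothed TF-IDF weighting, the cosine similarity measure, the choice of the first k texts as initial centroids, and the toy pre-segmented messages are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn token lists into unit-length TF-IDF vectors (assumed weighting)."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    vectors = []
    for d in docs:
        vec = [0.0] * len(vocab)
        for term, count in Counter(d).items():
            idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed IDF
            vec[index[term]] = count * idf
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def cosine(a, b):
    # Vectors are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def kmeans(vectors, k, max_iters=20):
    """K-Means with cosine similarity; the first k vectors seed the centroids
    (a deterministic initialisation assumed here for reproducibility)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iters):
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                norm = math.sqrt(sum(x * x for x in mean)) or 1.0
                centroids[j] = [x / norm for x in mean]
    return labels

# Hypothetical short messages, already segmented into words.
messages = [
    ["cheap", "phone", "sale"],
    ["football", "match", "tonight"],
    ["phone", "sale", "discount"],
    ["football", "match", "score"],
]
labels = kmeans(tfidf_vectors(messages), k=2)
print(labels)
```

Because each message contributes only a few non-zero dimensions, two related messages that share no words have a similarity of zero, which is the feature sparsity problem the abstract describes.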
The suffix tree clustering (STC) algorithm, which achieves good clustering results on English text, is then applied to Chinese short message text and adapted to its characteristics by modifying the feature representation, feature extraction, and clustering steps for Chinese. Comparative experiments with the two algorithms on the same short message data set lead to the following conclusion: for short message text clustering, the STC algorithm based on the suffix tree model has a considerable advantage over the K-Means algorithm based on the vector space model in both the validity of the clustering results and time complexity, and it can be applied to clustering Chinese short message text. Finally, based on the experimental results and project requirements, a prototype clustering system for short message text is designed and implemented. The system can crawl short message text from the Web and cluster the resulting data set to discover hot topics; it can also read local short message data sets, cluster them, and display the clustering results intuitively.
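The idea behind suffix tree clustering can be sketched as follows. This is a simplified illustration, not the improved algorithm from the thesis: instead of building a real suffix tree it enumerates shared word phrases in a dictionary, it omits the base-cluster scoring step of STC, and the phrase length limit, the 0.5 overlap threshold, and the toy messages are assumptions. For Chinese text, the token lists would come from a word segmentation step.

```python
from collections import defaultdict

def base_clusters(docs, min_docs=2, max_len=4):
    """Map each phrase (word n-gram) to the set of documents containing it.
    A suffix tree yields the same phrase-to-documents mapping more efficiently;
    a plain dictionary is used here to keep the sketch short."""
    phrase_docs = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for start in range(len(tokens)):
            stop = min(len(tokens), start + max_len)
            for end in range(start + 1, stop + 1):
                phrase_docs[tuple(tokens[start:end])].add(doc_id)
    # Keep only phrases shared by at least min_docs documents.
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= min_docs}

def merge_clusters(bases, threshold=0.5):
    """Merge base clusters whose document sets overlap strongly in both
    directions, then return the connected components as final clusters."""
    items = list(bases.items())
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i][1], items[j][1]
            common = len(a & b)
            if common / len(a) > threshold and common / len(b) > threshold:
                parent[find(i)] = find(j)

    merged = defaultdict(set)
    for i, (_, ids) in enumerate(items):
        merged[find(i)] |= ids
    return sorted(sorted(ids) for ids in merged.values())

# Hypothetical short messages, already segmented into words.
messages = [
    ["cheap", "phone", "sale"],
    ["big", "phone", "sale"],
    ["football", "match", "tonight"],
    ["watch", "football", "match"],
]
bases = base_clusters(messages)
clusters = merge_clusters(bases)
print(clusters)
```

Clusters here are defined by shared phrases rather than by distances between sparse vectors, and the shared phrases themselves can serve as readable cluster labels.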
