Dissertation

Research on Effective Information Extraction from Forum Post Text

Author WangZhou
Advisor HuangBenXiong
School Huazhong University of Science and Technology
Major Communication and Information System
Keywords Forum text; Summary extraction; Similarity; Vector space model
CLC TP391.1
Type Master's thesis
Year 2011

As web technology evolves, the Internet has entered a new round of major development. In this rapidly changing information age, people can obtain more useful knowledge and data from the network. However, as information accumulates and grows explosively, people need a simple and direct way to view the most important information and find answers to most of their questions. Forums have become popular and are now an important branch of network development: many people share information and solve problems on various forums. Effectively extracting information from forum posts and simplifying how it is presented is therefore an increasingly urgent task. In forum text information extraction, summary extraction is the most important task. This study addresses summary extraction for forum text and, taking the characteristics of forums (BBS forums in particular) into account, makes improvements aimed at practical use on this platform rather than applying traditional text summarization alone. A forum is assumed to serve two major categories of function: the first is information dissemination and commenting; the second is information seeking and question answering. For these two functions, this study provides summary extraction and effective reply extraction, respectively. For the first category, summary extraction, the study presents an algorithm based on maximum redundancy relevance and sub-topic cluster analysis that also integrates context-related features. For longer forum posts, the summary is produced in three steps. First, consecutive sentences are clustered into sub-topics using a K-means clustering algorithm with improved selection of the initial points and of the value of K. Second, sentences or paragraphs are selected from the clusters based on the similarity between each sentence and the article. Finally, within each cluster subset, sentences are ranked by a comprehensive score that combines sentence-level and context-related features, giving the final output.
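The second step above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the sub-topic clusters have already been produced by the clustering step, represents text as plain term-frequency vectors, and leaves out the context-related scoring of the third step. The names `bag_of_words`, `cosine`, and `summarize` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased term-frequency vector for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def summarize(sentences, clusters):
    """Pick, from each sub-topic cluster, the sentence closest to the whole post.

    `clusters` is a list of lists of sentence indices. It is assumed to come
    from an earlier clustering step (K-means in the abstract); it is not
    computed here.
    """
    post_vec = bag_of_words(" ".join(sentences))
    picked = []
    for cluster in clusters:
        best = max(
            cluster,
            key=lambda i: cosine(bag_of_words(sentences[i]), post_vec),
        )
        picked.append(best)
    # Emit the chosen sentences in their original order.
    return [sentences[i] for i in sorted(picked)]
```

Calling `summarize(sentences, [[0, 1], [2, 3]])` on a four-sentence post returns one sentence per cluster, in the order the sentences appear in the post.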
Comparison results show that the proposed method outperforms the basic maximum redundancy relevance algorithm and is practical. For the second category, reply extraction, the study uses an improved algorithm that combines a language model with a relevance model. Word relatedness is obtained mainly from a corpus of question-and-answer sets; the similarity between the original post and a reply is calculated with the vector space model, while the language model accumulates word-level similarity. On a large corpus, the results of this model were slightly better than those of the vector space model.
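As a rough illustration of the vector space model baseline mentioned above, the sketch below ranks replies by cosine similarity to the original post. The TF-IDF weighting, the smoothed IDF formula, and the names `tokenize`, `tfidf_vectors`, `cosine`, and `rank_replies` are assumptions made for this example; the abstract does not specify the weighting scheme, and the language-model component is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build TF-IDF vectors with smoothed IDF = log((1 + N) / (1 + df)) + 1."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse weighted vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_replies(question, replies):
    """Return (score, reply) pairs, most similar to the question first."""
    vectors = tfidf_vectors([question] + replies)
    q_vec, reply_vecs = vectors[0], vectors[1:]
    scored = [(cosine(q_vec, v), r) for v, r in zip(reply_vecs, replies)]
    # sorted() is stable, so replies with equal scores keep their input order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

A reply that shares no terms with the original post scores 0.0 under this baseline, which is one reason a word-relatedness model learned from a question-and-answer corpus could help.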
