A Study of Effective Information Extraction from Forum Message Text
|School||Huazhong University of Science and Technology|
|Course||Communication and Information System|
|Keywords||Forum text; Summary extraction; Similarity; Vector space model|
As web technology advances, the Internet has entered a new round of major development. In this rapidly changing information age, people can obtain more useful knowledge and data from the network. However, as information accumulates and grows explosively, users need a simple, direct way to view most of the relevant information and find answers to most of their questions. Forums have become popular and are now an important branch of network development: many people share information and solve problems on forums of various kinds. Effectively extracting information from forum posts and simplifying how it is presented is therefore an increasingly urgent task.

For forum text, summary extraction is the most important extraction task. This study addresses summary extraction from forum text and, rather than applying traditional text summarization unchanged, adapts it to the characteristics of forums, BBS forums in particular, to make it practical on this platform. Forums are assumed to serve two major functions: the first is disseminating information and commenting on it, and the second is obtaining information through questions and answers. For these two functions, the study provides summary extraction and effective-reply extraction, respectively.

For the first category, summary extraction, the study presents an algorithm that builds on the maximum redundancy relevance approach, adds sub-topic cluster analysis, and incorporates context-sensitive features. For longer forum posts, the summary is produced in three steps. First, consecutive sentences are clustered into sub-topics using a K-means algorithm with improved selection of the initial points and of the value of K. Second, sentences or paragraphs are selected from the clusters based on the similarity between each sentence and the article. Finally, the sentences in each cluster subset are scored on sentence-level and context-related features and sorted to produce the final output.
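The three steps can be illustrated with a minimal Python sketch. It is not the study's implementation: the bag-of-words vectors, the farthest-first seeding in `kmeans`, and the function names `bow`, `cosine`, `centroid`, `kmeans`, and `summarize` are all assumptions made for illustration, and the context-feature scoring of the third step is reduced to picking the sentence closest to the whole post.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Sum of vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def kmeans(vectors, k, iterations=10):
    """Plain K-means under cosine similarity.

    Seeding is farthest-first (an assumption, standing in for the
    study's improved initial-point selection).
    """
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(vectors)) if i not in seeds]
        # Pick the vector least similar to every seed chosen so far.
        seeds.append(min(
            candidates,
            key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds),
        ))
    centers = [vectors[s] for s in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centers[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = centroid(members)
    return labels

def summarize(sentences, k):
    """Cluster sentences, then keep one representative per cluster."""
    if not sentences:
        return []
    k = max(1, min(k, len(sentences)))
    vectors = [bow(s) for s in sentences]
    document = centroid(vectors)
    labels = kmeans(vectors, k)
    chosen = []
    for c in range(k):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            # Representative: the member most similar to the whole post.
            chosen.append(max(members, key=lambda i: cosine(vectors[i], document)))
    return [sentences[i] for i in sorted(chosen)]
```

Calling `summarize(sentences, k)` on the sentences of a long post returns one sentence per sub-topic cluster, in their original order. A fuller version would replace the single similarity score with the combined sentence-level and context features the study describes.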
Comparison results show that the proposed method outperforms the basic maximum redundancy relevance algorithm and is practical.

For the second category, reply extraction, the study improves a language model with a relevance-model-based algorithm. Word relatedness is obtained mainly from a question-and-answer corpus. In contrast to computing the similarity between the original post and a reply with the vector space model, the language model accumulates word-level similarity. On a large corpus, this model performed slightly better than the vector space model.
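As a rough illustration of accumulating word-level relatedness learned from a question-and-answer corpus, the following Python sketch estimates how often an answer word co-occurs with a question word and sums those values over all word pairs. The co-occurrence estimate, the normalization by the number of pairs, and the names `train_relatedness`, `relatedness_score`, and `best_reply` are assumptions for illustration, not the study's model.

```python
from collections import Counter, defaultdict

def train_relatedness(qa_pairs):
    """Estimate P(answer word | question word) from Q&A co-occurrence.

    qa_pairs is an iterable of (question, answer) strings; this simple
    relative-frequency estimate is an assumed stand-in for the study's
    corpus-based word relatedness.
    """
    pair_counts = defaultdict(Counter)
    for question, answer in qa_pairs:
        answer_words = set(answer.lower().split())
        for q in set(question.lower().split()):
            pair_counts[q].update(answer_words)
    table = {}
    for q, counts in pair_counts.items():
        total = sum(counts.values())
        table[q] = {a: n / total for a, n in counts.items()}
    return table

def relatedness_score(question, reply, table):
    """Accumulate relatedness over every (question word, reply word) pair."""
    q_words = question.lower().split()
    r_words = reply.lower().split()
    if not q_words or not r_words:
        return 0.0
    total = sum(table.get(q, {}).get(r, 0.0)
                for q in q_words for r in r_words)
    # Normalize so that long replies are not favored automatically.
    return total / (len(q_words) * len(r_words))

def best_reply(question, replies, table):
    """Return the reply with the highest accumulated relatedness."""
    return max(replies, key=lambda r: relatedness_score(question, r, table))
```

Unlike a plain bag-of-words cosine, which is zero whenever the post and the reply share no words, this score can still rank a reply highly if its words were frequently seen answering the question's words in the training pairs.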