Hot Topic Detection Based on Microblog Data |
|
Author | SunLi |
Tutor | WangXiaoJie |
School | Beijing University of Posts and Telecommunications |
Course | Computer Science and Technology |
Keywords | Natural Language Processing; Microblog; New Words Identification; Topic Detection; Heat Evaluation |
CLC | TP393.092 |
Type | Master's thesis |
Year | 2013 |
Downloads | 163 |
Quotes | 0 |
As an emerging form of Internet media, the microblog has gradually become a platform on which most users express their views and share information. Millions of microblogs can be released each day, and this volume of information makes it impractical for users to browse them all. At the same time, microblog topics spread quickly, reach a wide audience, and carry high social influence. Detecting hot topics in microblog data and returning the important related microblogs therefore helps all kinds of users grasp matters of public interest and key information quickly. Moreover, because microblog platforms are built on user relationships, a user receives only the microblogs tied to their own relationships and cannot directly see the hot topics of the entire microblog network, so mining hot topics from microblog data can improve the user experience. Although microblog platforms now offer applications such as hot topic lists, these rely heavily on manual editing and use term frequency as their main measure, so they struggle to reflect the true situation.
This thesis first surveys domestic and international work on topic detection and heat evaluation, then analyzes hot topics in microblog data along with existing research on and applications of microblog hot topics. To address the shortcomings of existing methods, it proposes a hot topic detection method based on the LDA model, which fully exploits the thematic information of the text and does not rely on traditional clustering methods. First, starting from the content features of microblogs, an N-gram model is used to extract repeated strings. Statistical features, including absolute and relative term frequency, mutual information, and adjacency information entropy, are then used to filter out spam strings and extract new microblog words, which improves the accuracy of word segmentation.
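The new-word step above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the thresholds, and the choice of taking the minimum over splits and over the two sides are all assumptions made for illustration, and the relative term frequency feature is omitted. The sketch counts character n-grams, then keeps a repeated string only if its parts are strongly associated (mutual information) and its left and right neighbours vary (adjacency entropy).

```python
import math
from collections import Counter

def ngram_counts(texts, max_n=4):
    """Count every character n-gram of length 1..max_n across all texts."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts

def min_pmi(s, counts, total):
    """Lowest pointwise mutual information over all two-way splits of s."""
    p_s = counts[s] / total
    return min(
        math.log(p_s / ((counts[s[:k]] / total) * (counts[s[k:]] / total)))
        for k in range(1, len(s))
    )

def entropy(counter):
    """Shannon entropy (in nats) of a neighbour-frequency table."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())

def adjacency_entropy(s, texts):
    """Smaller of the left- and right-neighbour entropies of s."""
    left, right = Counter(), Counter()
    for text in texts:
        start = text.find(s)
        while start != -1:
            end = start + len(s)
            # "^" and "$" stand in for text boundaries (an assumption).
            left[text[start - 1] if start > 0 else "^"] += 1
            right[text[end] if end < len(text) else "$"] += 1
            start = text.find(s, start + 1)
    return min(entropy(left), entropy(right))

def new_word_candidates(texts, max_n=4, min_count=2,
                        min_pmi_value=1.0, min_entropy=0.5):
    """Repeated strings that pass all three filters (thresholds are hypothetical)."""
    counts = ngram_counts(texts, max_n)
    total = sum(len(t) for t in texts)
    return sorted(
        s for s, c in counts.items()
        if len(s) >= 2 and c >= min_count
        and min_pmi(s, counts, total) >= min_pmi_value
        and adjacency_entropy(s, texts) >= min_entropy
    )
```

On the toy input `["xabcy", "zabcw", "qabcr"]`, the fragments `ab` and `bc` are rejected because one of their neighbours never varies, while `abc` is kept. Real microblog text would need thresholds tuned on data.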
Next, the LDA model is used to mine the thematic information of the microblog data. Each theme is treated as a topic, which yields a list of candidate topics along with the distribution of topics over words and the distribution of documents over topics. Finally, using the output of the GibbsLDA++ tool, each word is combined with its assigned topic into a single unit, called a single-word unit. The weights of the single-word units corresponding to each word are calculated, the heat of each topic is computed from them, and the hottest topics are identified. The method uses both the time features and the content features of microblogs, so it is more targeted, and it excludes human editing, so the resulting topics are more objective. Its validity is verified by experiments on both new word identification and topic detection.
To give users a more comprehensive understanding of the hot topics, the thesis also proposes a method for returning topic-related microblogs, based on the relevance between microblog content and topics together with word matching. The direct and indirect factors that affect the value of microblog content are then combined to assess the value of each microblog and rank the returned microblogs. This lets users quickly understand, at a small reading cost, the hot events related to the hot topics and the points on which most users' discussion focuses.
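The topic heat step described earlier can be sketched roughly as follows. The abstract does not give the weighting formula, so this sketch assumes the simplest stand-in: the weight of a single-word unit is the number of times that (word, topic) pair occurs, and the heat of a topic is the sum of its unit weights. The input is assumed to be lines of whitespace-separated `word:topic` tokens, one document per line, in the style of a per-token topic assignment file. The names `parse_assignments`, `topic_heat`, and `unit_weight` are hypothetical.

```python
from collections import Counter, defaultdict

def parse_assignments(lines):
    """Yield (word, topic) pairs from 'word:topic' tokens, one document per line."""
    for line in lines:
        for token in line.split():
            word, topic = token.rsplit(":", 1)
            yield word, int(topic)

def topic_heat(lines, unit_weight=None):
    """Rank topics by the summed weight of their single-word units.

    unit_weight(word, topic, count) is a hypothetical hook for a richer
    weighting (for example one using time features); by default the
    weight of a unit is simply its occurrence count (an assumption).
    """
    units = Counter(parse_assignments(lines))
    heat = defaultdict(float)
    for (word, topic), count in units.items():
        weight = unit_weight(word, topic, count) if unit_weight else float(count)
        heat[topic] += weight
    # Hottest topic first; ties are broken by topic id.
    return sorted(heat.items(), key=lambda kv: (-kv[1], kv[0]))
```

Keeping the weight as a pluggable function leaves room for a time-aware weighting without changing the aggregation step.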