
Hot Topic Detection Based on Microblog Data

Author SunLi
Tutor WangXiaoJie
School Beijing University of Posts and Telecommunications
Course Computer Science and Technology
Keywords Natural Language Processing, Microblog, New Words Identification, Topic Detection, Heat Evaluation
CLC TP393.092
Type Master's thesis
Year 2013

As an emerging Internet medium, the microblog has gradually become a platform on which most users express their views and share information. Millions of microblogs can be posted each day, and this volume makes it impossible for users to browse them all. At the same time, microblog topics spread quickly, reach a wide audience, and carry considerable social influence. Extracting hot topics from microblog data and returning the most important related microblogs therefore helps users grasp what the public is paying attention to, which is valuable for all kinds of microblog users who need to find key information quickly. Moreover, because microblog platforms are built on user relationships, users receive only the microblogs of the accounts they are connected to and cannot directly see the hot topics of the entire microblog network, so mining hot topics from microblog data can improve the user experience. Although microblog platforms already offer features such as hot topic lists, these rely heavily on manual editing and use term frequency as their main measure, so they struggle to reflect the true situation.

This thesis first surveys domestic and international work on topic detection and heat evaluation, then analyzes hot topics in microblog data and existing microblog hot topic applications. To address the shortcomings of existing methods, it proposes a hot topic detection method that replaces traditional clustering with an LDA model, which can fully exploit the thematic information in the text. First, starting from the content features of microblogs, an N-gram model is used to extract repeated strings. Statistical features, including absolute and relative term frequency, mutual information, and adjacency information entropy, are then used to filter out junk strings and extract new microblog words, which improves the accuracy of word segmentation.
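The new word filtering step can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the thresholds (`min_count`, `min_mi`, `min_entropy`), and the choice to estimate every probability as a count divided by the total number of characters are all assumptions made for illustration. The relative term frequency feature is left out because the abstract does not say what it is measured against.

```python
import math
from collections import Counter

def count_ngrams(texts, max_n=4):
    """Count every character n-gram of length 1..max_n across the texts."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts

def mutual_information(s, counts, total_chars):
    """Smallest pointwise mutual information over all two-way splits of s.

    Assumption: each probability is estimated as count / total_chars.
    Expects len(s) >= 2 and s present in counts.
    """
    p_s = counts[s] / total_chars
    scores = []
    for k in range(1, len(s)):
        p_left = counts[s[:k]] / total_chars
        p_right = counts[s[k:]] / total_chars
        scores.append(math.log(p_s / (p_left * p_right)))
    return min(scores)

def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbour characters."""
    n = sum(counter.values())
    if n == 0:
        return 0.0
    return sum(-(v / n) * math.log(v / n) for v in counter.values())

def adjacency_entropy(s, texts):
    """Smaller of the left-neighbour and right-neighbour entropies of s."""
    left, right = Counter(), Counter()
    for text in texts:
        start = text.find(s)
        while start != -1:
            if start > 0:
                left[text[start - 1]] += 1
            end = start + len(s)
            if end < len(text):
                right[text[end]] += 1
            start = text.find(s, start + 1)
    return min(entropy(left), entropy(right))

def new_word_candidates(texts, max_n=4, min_count=3, min_mi=1.0, min_entropy=0.5):
    """Keep repeated strings that pass all three (hypothetical) thresholds."""
    counts = count_ngrams(texts, max_n)
    total_chars = sum(len(t) for t in texts)
    kept = []
    for s, c in counts.items():
        if len(s) < 2 or c < min_count:
            continue  # absolute frequency filter
        if mutual_information(s, counts, total_chars) < min_mi:
            continue  # parts of the string do not cohere strongly enough
        if adjacency_entropy(s, texts) < min_entropy:
            continue  # string almost always appears inside a longer string
        kept.append(s)
    return kept
```

A string survives only if it is frequent, its parts occur together more often than chance would predict, and its neighbouring characters vary, which suggests that it stands alone as a word. Surviving strings could then be added to the segmenter's dictionary.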
Next, the LDA model is used to mine the thematic information in the microblog data. Each LDA theme is treated as a topic, which yields a list of candidate topics together with the distribution of each topic over words and the distribution of each document over topics. Finally, using the output of the GibbsLDA++ tool, each word and the topic assigned to it are combined into a single unit, called a single-word unit. The weights of the single-word units corresponding to each word are computed, the heat of each topic is calculated from them, and the hottest topics are identified. The method uses both the time features and the content features of microblogs, which makes it more targeted, and it excludes manual editing, so the resulting topics are more objective. Its validity is verified by experiments on both new word identification and topic detection.

To give users a more comprehensive understanding of the hot topics, the thesis also proposes a method for returning topic-related microblogs, based on the relevance between microblog content and topics as well as on word matching. The direct and indirect factors that affect the value of microblog content are then combined to assess the value of each microblog and rank the returned microblogs. This lets users quickly understand the events behind the hot topics, and the focus of most users' discussion of them, at a small reading cost.
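The heat calculation over single-word units can be sketched roughly as follows. The abstract does not give the thesis's weighting formula, so this sketch makes two assumptions: the input lines use the `word_id:topic_id` layout of GibbsLDA++'s topic assignment (`.tassign`) output, and a unit's weight is its occurrence count multiplied by an optional per-word factor. The function names and the `word_weight` parameter are hypothetical; `word_weight` stands in for whatever time-based weighting the thesis actually uses.

```python
from collections import Counter

def parse_assignments(lines):
    """Yield (word_id, topic_id) pairs from lines such as '12:3 7:0 12:3'."""
    for line in lines:
        for pair in line.split():
            word_id, topic_id = pair.split(":")
            yield int(word_id), int(topic_id)

def topic_heat(lines, word_weight=None):
    """Rank topics by the summed weight of their single-word units.

    Assumption: weight of a unit = its count * word_weight[word_id],
    with a default factor of 1.0 when no weight is supplied.
    """
    units = Counter(parse_assignments(lines))  # (word, topic) -> count
    heat = Counter()
    for (word_id, topic_id), count in units.items():
        factor = word_weight.get(word_id, 1.0) if word_weight else 1.0
        heat[topic_id] += count * factor
    return heat.most_common()  # hottest topic first

# Two tiny "documents", each token already assigned to a topic.
ranking = topic_heat(["0:1 1:1 2:0", "0:1 3:2"])
```

With the default factor of 1.0, a topic's heat is simply the number of tokens assigned to it. Passing `word_weight` shows how a word-level signal, such as a recent burst in usage, could raise the rank of an otherwise small topic.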
