
Blog Distillation with Feed Similarity Distribution

Author GaoZuoJi
Tutor GuoJun
School Beijing University of Posts and Telecommunications
Course Signal and Information Processing
Keywords blog distillation, SVM, class imbalance, Rocchio, TF-IDF, similarity distribution
CLC TP393.092
Type Master's thesis
Year 2010

Blog distillation is the task of finding blogs with a principal, recurring interest in a given topic, so that users can add them to their RSS readers. Internet users often look for blogs whose posts concentrate on a single interest, such as basketball, movies, or political elections, and subscribe to them via RSS to receive the latest information on that interest. Blog distillation can also help people find others who share their interests, or experts in a particular field. This thesis introduces a blog distillation method based on the feed similarity distribution. Compared with traditional methods, it makes the following contributions:

1) A series of experiments investigates the relationship between the number of retrieved documents and the resulting mean average precision (MAP). Based on the results, three baseline runs are designed for the blog distillation task. Baseline C, which is based on the average of similarity scores, achieves a higher MAP while retrieving fewer documents. It took first place in the TREC 2009 Blog Distillation task, ahead of runs from well-known universities including UMass and USI.

2) The similarity distribution of a feed is used as the feature, which shifts the task from "mining the relationship between query and words" to "mining the relationship between relevance and similarity distribution" and thereby makes the problem more fundamental. A quality-quantity curve is introduced to visualize the distribution features of different kinds of blogs. Classifying these features with an SVM yields a high MAP.

3) The class imbalance problem is introduced to the blog distillation task. Classification in machine learning fundamentally assumes that the training set and the test set share the same class distribution. To account for the class imbalance in the training set, the class distribution of the test set is first estimated and fed back to the original training set. After the class balance of the training set has been adjusted accordingly, the classifier predicts the class of each item in the target dataset. The results show high performance, especially on TREC 2007.
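
The idea behind baseline C, ranking a feed by the average similarity score of its posts, can be illustrated with a minimal sketch. The abstract does not say which similarity function is used or how many posts are averaged, so the TF-IDF cosine similarity, the whitespace tokenizer, the `feeds` layout, and every function name below are assumptions made for illustration only, not the thesis implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant) so common terms keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_feeds(query, feeds):
    """Rank feeds by the mean query similarity of their posts.

    feeds: dict mapping feed id -> list of post strings (hypothetical layout).
    """
    posts = [(fid, p.lower().split()) for fid, ps in feeds.items() for p in ps]
    # The query is vectorized together with the posts and is the last vector.
    vectors = tfidf_vectors([tokens for _, tokens in posts] + [query.lower().split()])
    query_vec = vectors[-1]
    sims = {}
    for (fid, _), vec in zip(posts, vectors):
        sims.setdefault(fid, []).append(cosine(query_vec, vec))
    # Feed score = average similarity over all of the feed's posts.
    scores = {fid: sum(s) / len(s) for fid, s in sims.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: one feed focused on a single topic, one covering mixed topics.
feeds = {
    "hoops": ["basketball playoffs recap", "basketball draft rumors"],
    "mixed": ["basketball game tonight", "new movie review", "election results"],
}
ranking = rank_feeds("basketball", feeds)
```

In this toy example the feed whose posts all concern the query topic ranks above the feed that mentions it only once, because averaging penalizes off-topic posts. That matches the notion of a recurring interest described above.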
