
Research on Text Stream Classification by Keywords

Author YangBaoGuo
Advisor ZhangYang
School Northwest University of Science and Technology
Course Computer Software and Theory
Keywords text stream classification; unlabeled documents; concept drift; classifier ensemble; knowledge acquisition
CLC TP391.1
Type Master's thesis
Year 2011

Traditional data stream classification usually requires a large number of fully labeled training examples to build classifiers, and labeling them is expensive and time-consuming. In practice, however, most data streams are unlabeled, which makes traditional data stream methods impractical. To address this problem, semi-supervised data stream classification has attracted increasing attention in recent years. Some researchers have proposed using partly labeled examples, or only a small set of positive examples together with a large number of unlabeled examples, for data stream classification. Although these approaches reduce the cost of manual labeling, they still require users to label some samples.

To further reduce the burden of manual labeling, this thesis proposes a novel approach to text stream classification that uses keywords to classify text streams without any manual labeling. First, the base classifier is built from keywords and unlabeled documents; the documents in the text stream are then classified by an ensemble-based algorithm. In the classifier construction phase, the keywords are semantically expanded and then used to label the initial positive documents. In the classification phase, the final label of an unknown document is predicted by weighted majority voting.

Concept drift in the text stream is also studied in depth. This work focuses on concept drift caused by changes in the user's interests: the keywords provided by the user define the user's current interests and the target concept, so when the user's interests change, concept drift occurs as well. Two common concept drift scenarios are simulated, namely gradual concept drift and abrupt concept shift.
Furthermore, a comparative analysis is conducted between the concept drift scenarios and the non-drift scenario.

Experimental results demonstrate that the proposed method can build an effective classifier from keywords without using any manually labeled examples, achieving results comparable to those of the PU learning method, which builds classifiers from labeled positive and unlabeled documents. Moreover, the classifier ensemble method used here quickly captures and adapts to concept drift in text streams. The results also show that the ensemble-based algorithm performs better than the single-window-based algorithm. Because the proposed method does not require manually labeled documents, it is more practical for real-life applications.
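
The keyword-based seeding and weighted majority voting steps named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the whitespace tokenization, the toy classifiers, the weights, and the tie-breaking rule are all assumptions made for illustration, and the semantic expansion of keywords is assumed to have been done beforehand.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Set, Tuple

# A classifier maps a document to 1 (positive) or 0 (negative).
Classifier = Callable[[str], int]


def seed_positive_documents(documents: Sequence[str], keywords: Set[str]) -> List[str]:
    """Return the documents that contain at least one keyword.

    Hypothetical stand-in for labeling the initial positive documents;
    the keyword set is assumed to be already semantically expanded.
    """
    lowered = {k.lower() for k in keywords}
    return [d for d in documents if lowered & set(d.lower().split())]


def weighted_majority_vote(
    document: str, ensemble: Sequence[Tuple[Classifier, float]]
) -> int:
    """Predict the label backed by the largest total classifier weight."""
    totals: Dict[int, float] = defaultdict(float)
    for classify, weight in ensemble:
        totals[classify(document)] += weight
    # Ties go to the negative class (an arbitrary choice for this sketch).
    return 1 if totals[1] > totals[0] else 0


if __name__ == "__main__":
    docs = ["stock market rallies", "rain expected tomorrow", "market closes lower"]
    print(seed_positive_documents(docs, {"market", "stock"}))

    # Toy ensemble: (classifier, weight) pairs with made-up weights.
    ensemble = [
        (lambda d: int("market" in d), 0.6),
        (lambda d: int("stock" in d), 0.3),
        (lambda d: 0, 0.2),
    ]
    print(weighted_majority_vote("market closes lower", ensemble))
```

In a stream setting, each pair in `ensemble` would typically correspond to a classifier trained on one chunk of the stream, with the weight reflecting how well it fits recent data; how the thesis assigns these weights is not stated in the abstract.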
