Research on Text Stream Classification by Keywords
|School||Northwest University of Science and Technology|
|Course||Computer Software and Theory|
|Keywords||text stream classification; unlabeled documents; concept drift; classifier ensemble; knowledge acquisition|
Traditional data stream classification usually requires a large number of fully labeled training examples to build classifiers, which is expensive and time-consuming. In practice, however, most data streams are unlabeled, which makes traditional data stream methods impractical. To address this problem, semi-supervised data stream classification has attracted increasing attention in recent years. Some researchers have proposed using partially labeled examples, or only a small set of positive examples together with a large amount of unlabeled examples, for data stream classification. Although these approaches reduce the cost of manual labeling, they still require users to label some samples.

To further relieve the burden of manual labeling in text stream classification, this paper proposes a novel approach that uses keywords to classify text streams without manual labeling. First, the base classifier is built from keywords and unlabeled documents; then the documents in the text stream are classified by an ensemble-based algorithm. In the classifier construction phase, the keywords are semantically expanded and then used to label the initial positive documents. In the classification stage, the final label of an unknown document is predicted by the weighted majority voting algorithm.

Concept drift in the text stream is also studied in depth. This work mainly explores concept drift caused by changes in the user's interests: the keywords provided by the user determine the user's current interests and the target concept, so when the user's interests change, concept drift occurs as well. This paper also simulates two common concept drift scenarios, namely gradual concept drift and abrupt concept shift.
Furthermore, a comparative analysis is conducted between the concept drift scenarios and the non-drift scenario.

Experimental results demonstrate that the proposed method can build an excellent classifier from keywords without any manually labeled examples, achieving results comparable to those of the PU learning method, which builds classifiers from labeled positive and unlabeled documents. Moreover, the classifier ensemble method used in this paper can quickly capture and adapt to concept drift in text streams. The experimental results also show that the ensemble-based algorithm performs better than the single-window-based algorithm. Because the proposed method does not require manually labeled documents, it is more practical for real-life applications.
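
The two steps described above, labeling initial positive documents by keywords and combining base classifiers by weighted majority voting, can be illustrated with a minimal Python sketch. It is not the implementation used in this work. The function names, the whitespace tokenization, the `min_hits` threshold, the +1/-1 label encoding, and the tie-breaking rule are all assumptions made for illustration. The keywords are assumed to be already semantically expanded, and the abstract does not specify how the classifier weights are computed, so they are passed in as given.

```python
from typing import Iterable, List, Sequence, Set


def label_seed_positives(
    docs: Sequence[str], keywords: Iterable[str], min_hits: int = 1
) -> List[int]:
    """Return indices of documents that contain at least `min_hits`
    distinct keywords (hypothetical seeding rule, not from the paper)."""
    vocab: Set[str] = {k.lower() for k in keywords}
    seeds: List[int] = []
    for i, doc in enumerate(docs):
        tokens = set(doc.lower().split())  # naive whitespace tokenization
        if len(tokens & vocab) >= min_hits:
            seeds.append(i)
    return seeds


def weighted_majority_vote(votes: Sequence[int], weights: Sequence[float]) -> int:
    """Combine +1 (positive) / -1 (negative) votes from base classifiers.

    Each vote is scaled by its classifier's weight; a non-positive total
    is resolved as -1 (an assumed tie-breaking rule)."""
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    score = sum(w * v for v, w in zip(votes, weights))
    return 1 if score > 0 else -1


docs = [
    "stock market rises sharply",
    "football match tonight",
    "market news and stock tips",
]
seed_indices = label_seed_positives(docs, ["stock", "market"], min_hits=2)
final_label = weighted_majority_vote([1, -1, -1], [0.9, 0.3, 0.4])
```

In this example the first and third documents are selected as initial positives, and the single highly weighted classifier outvotes the two weaker ones. In a streaming setting, one plausible choice (again an assumption) is to derive each weight from the classifier's performance on the most recent chunk, so that classifiers trained on outdated concepts lose influence after a drift.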