Research on Content-Based Spam Filtering Technology
|Southwest Jiaotong University
|Applied Computer Technology
spam filtering; text categorization; feature selection; support vector machine
With the development of the Internet, electronic mail has grown rapidly because of its convenience. At the same time, the growing volume of spam has caused considerable damage to economies and to people's lives, so research on spam filtering is very important. Spam filtering is one of the key anti-spam technologies. Currently there are three main kinds of e-mail filtering technology: black and white lists, manually written rules, and content-based filtering. This research addresses filtering based on the content of e-mail.

This thesis analyses current e-mail filtering technologies, applies the theory of text categorization, and introduces the support vector machine, a machine learning method, into anti-spam. Because only the support vectors contribute to classification, a support vector machine spends much of its time optimizing over non-support vectors, which seriously reduces its efficiency, whereas anti-spam demands real-time performance. To deal with this problem, two improved support vector machine algorithms are applied to anti-spam. The results show that both speed up training and testing without reducing classification precision.

The main contributions of the thesis are as follows:

1. We compare and select benchmark e-mail corpora and complete the preprocessing of the e-mails. We implement the information gain feature selection algorithm and, based on experiments, give an appropriate feature dimension for each corpus of the PU series. Feature weights are calculated by formula, and the e-mail corpus is then represented in the vector space model so that it can be processed by computer.

2. To deal with the high computational complexity of the support vector machine, we propose two improved algorithms: the getborder sequential minimal optimization algorithm and the nearest neighbor cluster sequential minimal optimization algorithm. Experimental results demonstrate the effectiveness of the improved algorithms, which have low computational cost and meet the classification precision required of an e-mail filter.

3. Mistakenly treating a legitimate e-mail as spam can be a more severe error than treating a spam e-mail as legitimate. To address this, we introduce into anti-spam the use of different penalty parameters for processing unbalanced data, and obtain higher precision.
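
The information gain feature selection named in the first contribution can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes each e-mail has already been preprocessed into a set of terms, uses the common presence/absence form of information gain, and the function names (`entropy`, `information_gain`, `select_features`) are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(term) = H(class) - H(class | term present or absent).

    docs   -- list of sets of terms, one set per e-mail (assumed input format)
    labels -- list of class labels, e.g. "spam" / "legit"
    """
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]

    gain = entropy(Counter(labels).values())
    for subset in (present, absent):
        if subset:  # skip an empty split to avoid dividing by zero
            weight = len(subset) / len(labels)
            gain -= weight * entropy(Counter(subset).values())
    return gain


def select_features(docs, labels, k):
    """Return the k terms with the highest information gain.

    Ties are broken alphabetically so the result is deterministic.
    """
    vocabulary = set().union(*docs)
    ranked = sorted(
        vocabulary,
        key=lambda t: (-information_gain(docs, labels, t), t),
    )
    return ranked[:k]
```

Calling `select_features(docs, labels, k)` keeps the `k` most class-discriminating terms. The value of `k` plays the role of the feature dimension that the thesis determines experimentally for each corpus. The retained terms would then be weighted and placed in the vector space model.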