Key Technique Research on Cooperative Spam Filtering
|School||Lanzhou University of Technology|
|Course||Control Theory and Control Engineering|
|Keywords||Spam; Spam Filtering; Crossed N-gram; Cost Sensitive SVM; P2P Network; Similar Text Retrieval; Similar Image Retrieval|
With the rapid development of the Internet, email has become an important means of communication. Mass unsolicited electronic mail, commonly known as spam, has increased enormously and has become a serious threat not only to the Internet but also to society. How to effectively restrain the spread of spam has become an important research topic in related fields. Based on the current domestic situation of spam flooding, this thesis studies the key technologies of spam filtering: feature construction for Chinese email, an email classification algorithm based on the support vector machine (SVM), and a spam sample sharing scheme based on a P2P network. The main work is summarized as follows:

1. To address the bottleneck of Chinese word segmentation in spam filters, a feature construction method based on crossed N-gram is presented. It reduces the impact of word segmentation errors on filter performance. It also overcomes a defect of traditional feature extraction, whose independence assumption does not match the actual situation. The effectiveness of the strategy is demonstrated by experiments with a Naive Bayesian classifier on Chinese email from an open corpus. The experimental results also show that the strategy is robust to noise: in certain scenarios it can identify spam that has been disguised by word processing.

2. With its solid theoretical foundation and high performance in practical applications, the SVM has become a research hot spot in machine learning. In this thesis, the effectiveness of the support vector machine for email classification is verified by detailed experiments.
The performance of the SVM is analyzed with various kernel functions and distributions of training samples. To meet the special requirements of email classification, a cost sensitive support vector machine is presented. By adjusting the parameters of the learning machine, false positives in spam filtering can be effectively reduced, so the cost sensitive SVM is better suited to spam filtering applications. Combined with crossed N-gram feature construction, the cost sensitive SVM performs better at filtering Chinese spam. By analyzing the distribution of misclassified email samples, a modified classification algorithm is proposed that classifies all emails into three categories: normal email, spam, and suspicious email. The algorithm helps the SVM reduce misclassification of normal emails, which improves the practicability of the classifier.

3. To remain effective in application, filters must track the semantic changes of spam. At the same time, filtering of image-format spam still relies on approximate image detection. Both procedures need a relatively complete set of spam samples as a basis. Peer-to-Peer (P2P) technology is an effective strategy for sharing distributed resources.
Building on an in-depth study of fully distributed structured P2P networks and distributed hash table (DHT) routing algorithms, and targeting the defect that structured P2P cannot support fuzzy queries for resources, an extended routing strategy is proposed that combines similar text detection and similar image detection techniques. The strategy is based on the Chord routing algorithm, stores similar texts and similar images together in the structured P2P network, and satisfies the load balancing requirement of a distributed storage system. Highly effective sharing of spam samples is thus realized in the structured P2P network.

Finally, to overcome the limits of anti-spam approaches that rely on a single point or a single type of technology, the above key technologies are combined into a multi-level, multi-point collaborative filtering architecture for more effective filtering of all kinds of spam.
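
The cost sensitive SVM and the three-category decision described in item 2 can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a linear SVM trained by subgradient descent on a class-weighted hinge loss, and it assumes that "suspicious" means a score falling within a band around the decision boundary. The names `cost_normal`, `cost_spam`, `band`, the two toy features, and all hyperparameter values are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_cost_sensitive_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    cost_normal: float = 5.0,  # assumed penalty for misjudging a normal email
    cost_spam: float = 1.0,    # assumed penalty for misjudging a spam email
    lam: float = 0.001,        # L2 regularisation strength
    lr: float = 0.1,           # learning rate
    epochs: int = 50,
) -> Tuple[List[float], float]:
    """Linear SVM with a class-weighted hinge loss (labels: +1 spam, -1 normal).

    Per-sample objective: lam/2 * ||w||^2 + cost(y) * max(0, 1 - y * (w.x + b)).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            cost = cost_spam if yi > 0 else cost_normal
            margin = yi * (dot(w, xi) + b)
            # Subgradient step on the regulariser shrinks w slightly.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: step scaled by the class cost.
                w = [wj + lr * cost * yi * xj for wj, xj in zip(w, xi)]
                b += lr * cost * yi
    return w, b


def classify(w: Sequence[float], b: float, x: Sequence[float],
             band: float = 0.5) -> str:
    """Three-way decision: scores close to the boundary are 'suspicious'."""
    score = dot(w, x) + b
    if score >= band:
        return "spam"
    if score <= -band:
        return "normal"
    return "suspicious"


# Hypothetical toy features: [spam-like term count, normal term count].
X_TRAIN = [[3.0, 0.0], [2.0, 0.0], [2.5, 0.5],
           [0.0, 3.0], [0.0, 2.0], [0.5, 2.5]]
Y_TRAIN = [1, 1, 1, -1, -1, -1]

if __name__ == "__main__":
    w, b = train_cost_sensitive_svm(X_TRAIN, Y_TRAIN)
    for x in ([3.0, 0.0], [0.0, 3.0], [1.5, 1.0]):
        print(x, classify(w, b, x))
```

Setting `cost_normal` higher than `cost_spam` makes a misjudged normal email more expensive during training, which is one common way to trade missed spam for fewer false positives. Widening `band` sends more borderline messages to the suspicious category instead of forcing a spam or normal decision.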