Research on Sample Selection and Its Applications in Pattern Recognition
|School||Nanjing University of Technology and Engineering|
|Keywords||sample selection support vector machine nearest neighbor convex hull kernel nearest neighbor convex hull subspace kernel subspace kernel method pattern recognition|
The digital technologies and computer advances have led to information explosion and massive data collection. The following puzzle is how to analysis and utilize so much data. That is a common challenge for pattern recognition, data mining, neural network and machine learning. In statistical pattern recognition, the computational complexity of many classifiers increases quickly as the number of training samples increases. So when applied to a large data set, those classifiers often become computational intractable. One available way to circumvent this difficulty is to reduce the training set in advance via sample selection. Sample selection not only decreases the computational costs and speeds up the learning processes, but also might avoid the "overfitting" phenomenon and improve the prediction performance of the classifiers.This dissertation mainly studies on the sample selection strategies for some classifiers, which include linear support vector machine, nonlinear support vector machine, nearest neighbor convex hull classifier and kernel nearest neighbor convex hull classifier. These classifiers all involve solving convex quadratic programming problems, which require large memory and enormous time for large-scale problems. Therefore, it is desirable to remove the data from the training set without degrading the prediction accuracy. According to the geometric interpretations and the principles of these classifiers, the convex hull of each class plays an important role in the decisions. Thus, those samples on the edge of the convex hull are expected to be more informative and can be taken as the candidates. Based on this idea, some selection methods are presented in this dissertation. The applicability of them is demonstrated through the above classifiers using the face recognition databases or the object image library. We also compare the proposed data selection strategies with other methods in terms of their ability to reduce the training set size while maintaining the generalization performance of the classifiers.First of all, a novel sample selection method named subclass convex hull sample selection algorithm is presented. That is an iterative method. In one class, at each step only one sample, which is the furthest one to the convex hull of the chosen samples, is picked out. It can be proved that the samples selected be the points on the edge of the convex hull of the original training set. The convex hull of the chosen set can approximate the convex hull of the original set as soon as possible. The proposed selection algorithm is used to serve as the preprocessing step to the linear support vector machine and the nearest neighbor convex hull classifier. The experiments show that a significant amount of training data can be removed without degrading the performance of the resulting classifiers.The second is the kernel subclass convex hull sample selection algorithm. By replacing the dot product in the subclass convex hull sample selection algorithm with kernel functions, the proposed kernelized method is constructed. The experimental results provide the promising evidence that it is possible to successfully employ the proposed algorithm ahead of the nonlinear support vector machine and the kernel nearest neighbor convex hull classifier.Another proposed method is the subspace sample selection algorithm, which reduces the member of each class independently. It is also an iterative procedure. But at each step, the furthest sample to the subspace of the chosen samples is selected. Moreover, those chosen samples are the linear independence edge points of the convex hull of the original training set. The analytic solution of distance computation can be obtained. Therefore the method has faster selecting speed than the subclass convex hull sample selection algorithm. That has been verified by the experiments. The linear support vector machine is applied to synthetic face sets and object image library using the result of this new data reduction algorithm as the training data set. Experiments show that few samples chosen by the proposed mechanism lead to the computation time being significantly reduced without any decrease in the prediction accuracy.Similarly, the kernel trick can also be introduced into the subspace sample selection algorithm. Then, the kernel subspace sample selection algorithm is presented. By the kernel mapping, the samples in the input space can be mapped into a high-dimensional feature space in which the selection processes similar to the subspace sample selection algorithm are performed. In the experiments, computational times for the nonlinear support vector machines with the proposed reduction algorithm are much smaller than the other compared methods. Without sacrificing the classification accuracy, the proposed algorithm exhausts the less time and selects the less training samples.