An Approach for Identifying a Plant Resistance Gene Based on the Random Forest
|School||Harbin Institute of Technology|
|Course||Computer Science and Technology|
|Keywords||Resistance gene; Feature extraction; Under-sampling; Random forest|
Research on plant resistance genes (R-genes) has become one of the most important topics in bioinformatics. Since the first resistance gene was successfully identified, more than 70 R-genes have been verified by confirmatory experiments and applied to molecular breeding, transgenic techniques, and related areas. In addition, a growing number of bioinformatics researchers are dedicated to mining resistance genes and analyzing their functions and biochemical mechanisms. However, some problems remain, such as the low efficiency of current mining approaches and their high false positive rates. In this thesis, we analyze the structure of R-genes and use a machine learning approach to predict resistance genes.

In our approach, we take the protein sequences encoded by R-genes as the research object, converting the R-gene identification problem into a two-class classification problem in machine learning. First, we analyze the conserved domains of resistance proteins and the effect of physical and chemical properties on the protein sequences, and then define a group of 188 valid features to represent each sequence. Second, we use an under-sampling approach based on the K-Means algorithm to rebuild the training sets, aiming to solve the imbalanced learning problem in R-gene classification. Finally, we build a random forest classifier on the new training sets to perform the R-gene classification. The specificity and sensitivity of our approach both exceed 80%, and false positives in R-gene identification are notably reduced. The experimental results show that our algorithm for R-gene classification is sound and effective.
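
The abstract does not spell out how K-Means is used to rebuild the training sets, so the following is only a minimal sketch of one common variant: the majority (non-R-gene) class is clustered into as many clusters as there are minority samples, and the real sample nearest to each centroid is kept. The function names, the toy 2-dimensional data (standing in for the 188-dimensional feature vectors), and the choice of "nearest sample to centroid" are all assumptions made for illustration, not the thesis's actual procedure.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on a list of vectors; returns the k centroids."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct samples (requires k <= len(points)).
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def undersample_majority(majority, n_keep, seed=0):
    """Keep at most n_keep majority samples, one real sample per cluster.

    Assumption: each cluster is represented by the sample nearest to its
    centroid; duplicates are dropped, so fewer than n_keep may be returned.
    """
    centroids = kmeans(majority, n_keep, seed=seed)
    kept = []
    for centroid in centroids:
        nearest = min(
            range(len(majority)),
            key=lambda i: squared_distance(majority[i], centroid),
        )
        if nearest not in kept:
            kept.append(nearest)
    return [majority[i] for i in sorted(kept)]


# Hypothetical toy data: 12 majority samples versus 3 minority samples.
majority = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.8, 5.3), (5.1, 5.0),
    (9.9, 0.1), (10.2, 0.3), (10.0, -0.2), (9.7, 0.0),
]
minority = [(2.0, 2.0), (2.1, 1.9), (1.8, 2.2)]

reduced = undersample_majority(majority, n_keep=len(minority))
balanced_samples = reduced + minority
balanced_labels = [0] * len(reduced) + [1] * len(minority)
```

A balanced set built this way could then be passed to any random forest implementation (for example, scikit-learn's `RandomForestClassifier`) in place of the original, imbalanced training set.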