The Research and Application of Chromatographic Fingerprint Data Processing Algorithms
|School||Dalian University of Technology|
|Course||Computer Software and Theory|
|Keywords||Chromatography; Metabolomics; Data Mining; Data Fusion; Time Series|
Chromatography is commonly used in metabolomics analysis. Chromatographic separation and detection give researchers a chromatographic fingerprint profile of the sample metabolites for subsequent metabolomics analysis. Chromatography can detect hundreds or even thousands of metabolites, while usually only a few dozen samples are available, which leads to the "large P, small N" problem: far more features than samples. This makes metabolomics chromatographic fingerprint data harder to analyze, so data mining methods are introduced into metabolomics analysis.
The analysis of flue-cured tobacco chromatographic data is an important application of plant metabolomics. To meet the demand for storing and analyzing flue-cured tobacco chromatographic data, a tobacco chromatographic fingerprint software system was developed and deployed in a production environment. Data fusion methods are often used to analyze the flavor characteristics of tobacco from different years. However, the climate differs from year to year and influences the flavor characteristics of the tobacco samples. To fuse chromatographic data from different years effectively, this paper proposes a data fusion method based on statistical hypothesis testing and local scaling. It eliminates the climate influence by scaling the selected features that represent that influence. The proposed method is applied to fuse flue-cured tobacco samples collected in Guizhou province in two years. Compared with the existing data bias correction fusion method, the proposed method effectively eliminates the differences caused by climate, and the classification accuracy of both random forest and support vector machine increases.
The other aspect of the work investigates metabolomics time series chromatography data and the time series random forest classification algorithm. A new time series random forest algorithm that incorporates a measure of regular change in the time series is proposed. Compared with the standard time series random forest, the proposed algorithm considers both the discriminative capability and the variation properties of the time series. The proposed algorithm is applied to a time series classification experiment on silkworm and shows an advantage over the standard time series random forest.
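To make the fusion idea for the two-year tobacco data more concrete, the following is a minimal sketch of "hypothesis test, then local scaling", not the thesis's actual implementation. The abstract does not say which test or which scaling is used, so several choices here are assumptions: a two-sided permutation test on the difference of means stands in for the hypothesis test, and "local scaling" is taken to mean rescaling only the selected features of the second year to the mean and standard deviation of the first year. The function names (`permutation_p_value`, `align_to_reference`, `fuse_years`), the threshold `alpha`, and the toy data are all hypothetical.

```python
import random
import statistics


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means (assumed test)."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def align_to_reference(values, reference):
    """Rescale `values` to the mean and standard deviation of `reference`."""
    mu_v, sd_v = statistics.fmean(values), statistics.pstdev(values)
    mu_r, sd_r = statistics.fmean(reference), statistics.pstdev(reference)
    if sd_v == 0:
        return [mu_r for _ in values]
    return [(v - mu_v) / sd_v * sd_r + mu_r for v in values]


def fuse_years(year_a, year_b, alpha=0.05):
    """Fuse two sample-by-feature tables, scaling only features that differ by year."""
    fused_b = [row[:] for row in year_b]
    selected = []
    for j in range(len(year_a[0])):
        col_a = [row[j] for row in year_a]
        col_b = [row[j] for row in year_b]
        # A feature that differs significantly between years is assumed
        # to carry the year (climate) effect.
        if permutation_p_value(col_a, col_b) < alpha:
            selected.append(j)
            for row, v in zip(fused_b, align_to_reference(col_b, col_a)):
                row[j] = v
    return [row[:] for row in year_a] + fused_b, selected


# Toy data (made up): feature 0 shifts strongly between years, feature 1 does not.
year_a = [[10.0, 5.0], [11.0, 5.2], [9.0, 4.9], [10.5, 5.1], [9.5, 5.0], [10.2, 4.8]]
year_b = [[20.0, 5.1], [21.0, 4.9], [19.0, 5.0], [20.5, 5.3], [19.5, 4.8], [20.2, 5.0]]

fused, selected = fuse_years(year_a, year_b)
print("features treated as year-dependent:", selected)
```

Scaling only the selected features leaves the remaining features in their original units, which is one reading of "local" scaling. In practice the fused table would then be passed to a classifier such as random forest or support vector machine, as in the comparison the abstract describes.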