Fundamental Problems of Generalized Grey Analytical Systems and Their Research of Model Population Analysis 

Author  LiHongDong 
Tutor  LiangYiZeng 
School  Central South University 
Course  Analytical Chemistry 
Keywords  Generalized gray analytical systems; Model population analysis; Model assessment; Variable selection; Biomarker; The domain of applicability of models; Chemometrics; Computational biology
CLC  O651 
Type  PhD thesis 
Year  2012 
Technological innovations have revolutionized research in analytical chemistry as well as in modern life science. With the aid of high-throughput analytical instruments, the availability of massive data has reshaped the statistical thinking and theories of data analysis and knowledge discovery. As a result, analytical chemistry and life science have been greatly advanced. Analytical systems in chemistry and life science in the current era often display high complexity. The chemical composition, the contents of the components, the interactions among different components and their dynamic changes over time are largely unknown, which poses a great challenge for analytical chemists. Fortunately, the development of high-throughput instruments makes it possible for analytical chemists to obtain a vast amount of data about the samples by accessing hundreds of thousands of measurement channels, such as wavelengths, mass-to-charge ratios and genes. However, the mechanism underlying the data is largely unknown, and there is no solid physical and/or chemical law, like the Beer-Lambert law, to guarantee a reliable model. For these reasons, analytical systems of this kind are called Generalized Gray Analytical Systems (GGAS). Notably, in most cases the data contain only a very small number of samples measured on a much larger number of variables, which is well known as the "large p, small n" problem and further increases the difficulty of analyzing GGAS. For the analysis of GGAS, we think that there exist three fundamental problems: model assessment, variable selection and the definition of the domain of applicability. These three problems have not been well solved in statistics and chemometrics. So far, most existing methods addressing these issues are based on a single analysis of the data and a single model, neglecting the influence of sample and variable variation on the analytical results. Therefore, the resulting conclusions are questionable.
To address these issues in the analysis of GGAS, we recently proposed Model Population Analysis (MPA) as a general framework for designing new types of chemometrics/bioinformatics algorithms for the reliable analysis of complex data, which are expected to overcome the above-mentioned drawbacks. With the aid of MPA, we have established new types of distribution-based model assessment and variable selection methods and validated them using both simulated and real-world data. In addition, we conducted an exploratory analysis of the domain of applicability of models. This thesis consists of four parts: model population analysis (Chapter 2), model assessment (Chapter 3), variable selection (Chapters 4 to 9) and the domain of applicability of models (Chapter 10). The main contents are briefly introduced as follows: 1. We elucidate the context in which MPA is proposed and introduce its key elements. Any data-based model is subject to the influence of the samples and variables used, and this also applies to variable selection. However, most reported variable selection methods are based on a single model and do not consider the effects of sample and variable variation. Through the analysis of a large number of models built from different selections of samples and variables, we found that variable importance scores have a stable distribution, which reflects the uncertainty of variable importance caused by variation in the data. Therefore, it was expected that new types of chemometrics algorithms for data analysis could be established through the statistical analysis of a population of models. With this understanding, we proposed model population analysis as a general framework for data analysis. The main idea of MPA is to maximize the information available from limited samples by using Monte Carlo sampling to obtain the statistical distribution of a parameter of interest in the sample, variable, parameter or model space.
From this point of view, MPA shares characteristics with Bayesian analysis, whose key is to look at the posterior distribution of a parameter. By emphasizing a distribution rather than a single number, MPA differs markedly from single-model analysis. Briefly, MPA tries to maximize the available information by looking at the data from different angles, in the same spirit as Su Shi's poem: "a mountain takes on different profiles if looked at from different places". (Chapter 2) 2. Following the MPA way of thinking, we proposed to conduct model comparison using the distribution of prediction errors. Model comparison is very important in chemometrics. However, the comparisons reported in most of the literature are based on a single dataset or on cross validation with a fixed sample partition, which runs the risk of drawing a wrong conclusion. By changing the test sets or the sample partition, the distributions of prediction errors of different models were derived and then compared statistically, thus allowing for a reliable comparison. The proposed method was employed to analyze a NIR dataset as well as a metabolomics dataset. The results showed that the method can avoid drawing a false positive conclusion. (Chapter 3) 3. Based on MPA, we proposed subwindow permutation analysis (SPA) for variable selection, which is based on the assumption that a model's predictive performance will be reduced significantly if an informative variable is permuted, whereas permuting an uninformative variable will have little effect. In this method, N submodels are first established and then used to make predictions on test sets. For each variable, two groups of prediction errors, normal prediction errors and permutation prediction errors, are computed. Then, a Mann-Whitney U test is used to test whether these two distributions are significantly different, giving a p-value that measures the importance of that variable.
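The SPA procedure just described can be illustrated with a minimal sketch. It is not the thesis implementation: the nearest-centroid classifier, the function names (`fit_centroids`, `error_rate`, `mann_whitney_p`, `spa`), the 30% test fraction, the two-sided normal-approximation form of the Mann-Whitney U test and the synthetic data are all assumptions made for illustration. For brevity every submodel uses all variables, so only the samples are resampled here.

```python
import math
import random


def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class (stand-in model)."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def error_rate(centroids, X, y):
    """Fraction of samples whose nearest centroid carries the wrong label."""
    wrong = 0
    for x, t in zip(X, y):
        pred = min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
        wrong += pred != t
    return wrong / len(y)


def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney U p-value (normal approximation, tie-corrected)."""
    pooled = sorted((v, g) for g, grp in enumerate((a, b)) for v in grp)
    n1, n2 = len(a), len(b)
    n = n1 + n2
    rank_sum_a, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2          # mean of the tied ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:                            # all values identical
        return 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def spa(X, y, n_models=100, test_frac=0.3, seed=0):
    """Return one p-value per variable; a small p suggests an informative variable."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    n_test = max(1, int(test_frac * n))
    normal, permuted = [], [[] for _ in range(p)]
    for _ in range(n_models):
        idx = list(range(n))
        rng.shuffle(idx)                    # Monte Carlo train/test partition
        test, train = idx[:n_test], idx[n_test:]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        X_te, y_te = [X[i] for i in test], [y[i] for i in test]
        normal.append(error_rate(model, X_te, y_te))
        for j in range(p):
            col = [row[j] for row in X_te]
            rng.shuffle(col)                # permute variable j on the test set only
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X_te, col)]
            permuted[j].append(error_rate(model, X_perm, y_te))
    return [mann_whitney_p(normal, permuted[j]) for j in range(p)]


# Synthetic data (assumption): variable 0 separates the classes, 1 and 2 are noise.
demo_rng = random.Random(1)
labels = [0] * 20 + [1] * 20
data = [[demo_rng.gauss(3.0 * t, 1.0), demo_rng.gauss(0.0, 1.0), demo_rng.gauss(0.0, 1.0)]
        for t in labels]
p_values = spa(data, labels)
print(p_values)
```

In this toy setting the p-value of the informative variable should be far smaller than those of the noise variables, which mirrors how the p-value is used above as an importance measure.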
SPA was applied to the analysis of one type II diabetes dataset and one childhood overweight dataset. The important metabolites identified by SPA were shown to be discriminating and biologically meaningful, indicating that SPA is a promising method for biomarker identification. (Chapter 4) 4. Utilizing the idea of MPA, we designed a variable selection method specific to support vector machines (SVMs), which are based on structural risk minimization and have found wide application. However, there is a lack of literature addressing variable selection for SVMs. In theory, an SVM model with a larger margin has better generalization ability. Based on this property, we proposed margin influence analysis (MIA). For each variable, MIA assigns an importance score by comparing two distributions of margins: one from the models that include this variable and one from those that do not. MIA was applied to analyze two gene expression datasets, and promising results were achieved. (Chapter 5) 5. For the efficient identification of an optimal subset from a large number of variables, we developed the Competitive Adaptive Reweighted Sampling (CARS) method, in which the importance of variable subsets is assessed by the distribution of prediction errors instead of a single run of cross validation. Better variable subset selection is thereby expected. This method was applied to near-infrared datasets, and satisfactory results were observed. (Chapter 6) 6. With the idea of MPA, we developed a method for investigating the combinatorial importance of variables. In this method, only a percentage, e.g. 5%, of the models that include a given variable (those with the lowest prediction errors) are chosen, and the inverse of the mean prediction error of these models is taken as a criterion for assessing variable importance. We analyzed two epidemiological datasets from the Cardiovascular Risk in Young Finns Study: the metabolic syndrome data and the early atherosclerosis data.
The results showed that the proposed method can effectively single out variables that display high importance only when combined with other variables. The identified important variables also make biological sense. (Chapter 7) 7. Borrowing ideas from both MPA and reversible jump Markov chain Monte Carlo (RJMCMC), we developed the Random Frog algorithm, which was shown to be very suitable for searching for an optimal variable combination in a high-dimensional space. We adopted a probability-dependent model acceptance rule and devised a normal-distribution-based dimension-changing mechanism for generating N submodels. Finally, each variable is assigned a selection probability computed from all N submodels, which measures variable importance. The results on two gene expression datasets demonstrated that Random Frog has clear advantages over existing methods. (Chapter 8) 8. With the aid of MPA, we proposed the Variable Complementary Network (VCN) method and used it to investigate variable complementary information, a concept that we originally developed. Life is a system, and biological variables work only in the presence of others, which is evidence for the existence of complementary information; this information, however, is rarely studied. Based on a population of multivariate models, we proposed a formula for quantitatively calculating this information and visualized it using a network graph. The network intuitively reveals how variables complement each other and also provides a method for biomarker discovery. Good results were obtained when the VCN method was applied to one type II diabetes dataset and one postoperative cognitive dysfunction dataset. (Chapter 9) 9. We proposed to look at sample similarity and the domain of applicability of models through the component spectral space and/or the measured variable space.
The component spectral space provides a qualitative tool for investigating sample similarity, because any uncalibrated interfering component will move test samples away from the domain of applicability of models. The measured variable space, in turn, can help assess the efficacy of variables, because any variable that is not relevant to the problem of interest will not benefit a model, which makes variable selection important. From the perspective of the component spectral space, we performed an exploratory investigation of the domain of applicability of models. (Chapter 10)
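As an illustration of the distribution-based model comparison summarized in item 2 (Chapter 3), the following minimal sketch compares two regression models over many random sample partitions instead of a single fixed split. Everything in it is an assumption made for illustration: the two stand-in models (`fit_mean`, `fit_line`), the helper names, the 30% test fraction, the synthetic data, and the use of the win fraction of paired differences as the comparison statistic. The abstract does not specify which statistical comparison the thesis uses.

```python
import math
import random
from statistics import fmean, stdev


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean of y."""
    mean_y = fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least-squares line in one predictor."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def rmsep(predict, xs, ys):
    """Root mean squared error of prediction on a test set."""
    return math.sqrt(fmean([(predict(x) - y) ** 2 for x, y in zip(xs, ys)]))


def error_distributions(xs, ys, fitters, n_splits=200, test_frac=0.3, seed=0):
    """Collect one RMSEP per model for each random train/test partition."""
    rng = random.Random(seed)
    n = len(xs)
    n_test = max(1, int(test_frac * n))
    errors = {name: [] for name in fitters}
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        # Every model sees exactly the same partition, so the errors are paired.
        for name, fit in fitters.items():
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            errors[name].append(
                rmsep(model, [xs[i] for i in test], [ys[i] for i in test]))
    return errors


# Synthetic data (assumption): y depends linearly on x plus Gaussian noise.
demo_rng = random.Random(2)
xs = [i / 10 for i in range(30)]
ys = [2.0 * x + demo_rng.gauss(0.0, 0.3) for x in xs]

errors = error_distributions(xs, ys, {"mean": fit_mean, "line": fit_line})
diffs = [m - l for m, l in zip(errors["mean"], errors["line"])]
win_fraction = sum(d > 0 for d in diffs) / len(diffs)

for name, errs in errors.items():
    print(f"{name}: mean RMSEP = {fmean(errs):.3f}, sd = {stdev(errs):.3f}")
print(f"line beats mean in {win_fraction:.0%} of partitions")
```

Because each partition yields a paired error for both models, the whole distribution of differences can be inspected or tested, rather than relying on a single number from one split.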