Merging Multiple Microarray Datasets to Build Gene Regulatory Networks

Author  LiuZuoZuo 
Advisor  ZhouChunGuang
School  Jilin University 
Course  Applied Computer Technology 
Keywords  Gene Regulatory Network; Bayesian Network; Line Sorting Bayesian Network Structure Learning Algorithm; Multiple Data Fusion
CLC  Q811.4 
Type  Master's thesis 
Year  2010 
Downloads  69 
Citations  2
Gene regulatory network research is an interdisciplinary field in which computer science, mathematics, and informatics meet molecular biology. It uses bioinformatics methods and techniques (data collection, analysis, modeling, simulation, and inference) to study the relationships among the genes in a network.

With the draft map of the human genome completed, human genome research has entered the post-genome era, and the focus of life science research has shifted from sequencing genomes to studying the function of the sequenced genes. A living body is a complex organism whose interacting units together constitute a complete life, so studying genes in isolation can hardly explain the mystery of life. Clarifying the interactions between genes has therefore become one of the hot topics of life science research, and building gene regulatory networks and analyzing them to find these interactions is the focus of this study.

Gene network research began in the 1960s, when Rater described the molecular control of prokaryotic gene systems. In essence, a gene regulatory network is a continuous, complex dynamical system.

At present there are many ways to build gene regulatory networks from microarray datasets. Common ones include: (1) clustering; (2) Boolean networks; (3) neural networks; (4) differential equations; (5) Bayesian networks.

This thesis chooses the Bayesian network model and improves on it. The advantages of the Bayesian network model are:
(1) Its explicit directed acyclic graph model can reveal causal relationships in gene expression under statistical assumptions.
(2) Many advanced algorithms already exist for building Bayesian networks from observational data.
(3) It can handle noisy data and estimate the confidence of different features of the network.

Data fusion, also known as information fusion, arose from the needs of military applications. It is a technology for combining information from multiple sources to obtain more accurate and credible conclusions than any single source can provide. Data fusion methods are now widely used in military and civilian fields, but the fusion of multiple data sources in bioinformatics is still at an early stage. There are two main ideas.

The first takes the results of one experiment as prior knowledge. This method is simple and convenient and its effect is obvious, so it is the mainstream approach to merging multiple datasets. Its disadvantage is that it demands considerable experience: although it reproduces known relationships well, its power to predict unknown relationships is relatively limited.

The second builds a network for each of several similar datasets and then merges the networks. Existing merging methods rely either on the number of times an edge (line) appears (graph based) or on the center of the set of matrices (matrix based). Because each dataset has its own characteristics, the result of such a simple merge does not belong to any one dataset and loses the specificity of the network structure.

Building on a careful analysis and summary of previous work, and combining knowledge from biology and computer science, this thesis proposes a new algorithm based on the fusion of multiple datasets: the Line Sorting Bayesian Network Structure Learning Algorithm (LSBN algorithm). The algorithm first builds a network for each dataset with an existing construction method. Any method may be used as long as its result is a directed acyclic graph; this thesis uses the classic K2 algorithm.
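
The acyclicity restriction on each per-dataset network can be checked with a topological sort. The following is only a minimal sketch under assumptions of our own: the function name `is_dag` and the node labels are hypothetical and are not taken from the thesis, and Kahn's algorithm is used here simply because it is a standard way to test for cycles.

```python
from collections import deque


def is_dag(nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm).

    nodes: iterable of hashable node labels (hypothetical names).
    edges: iterable of (parent, child) pairs.
    """
    nodes = list(nodes)
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Start from nodes without parents and peel the graph layer by layer.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Every node is reached exactly when no cycle blocks the peeling.
    return visited == len(nodes)
```

A network produced by any construction method could be passed through such a check before its edges are handed on as ancillary information.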
Next, the edges of all these networks serve as ancillary information for the LSBN algorithm, which reconstructs the existing network. If dataset A is chosen as the dataset to be reconstructed, the result is a Bayesian network based on dataset A. In this way, merging multiple datasets into the reconstruction improves the precision obtained from a single dataset. Experimental results show that the LSBN algorithm improves the accuracy of the network by adding correct edges and deleting wrong ones, and that it does so with time complexity O(N).

This thesis uses the Saccharomyces cerevisiae cell cycle microarray datasets of Spellman, with reference to the experimental data of Lee. Seven regulators of the G1 phase (Swi6, Mbp1, Mcm1, Ndd1, Fkh2, Swi4, and Ace2) were selected, and the regulator Fhl1 was added, for a total of eight. The relationships among these regulators are available at http://web.wi.mit.edu/young/Regulatory_network/.

To build the networks, samples were drawn at random from Lee's six datasets. The Cln3 and clb2 experiments contain only two samples each, so all of them were used; 10 samples were chosen from each of the other four datasets. Missing values were filled in with the KNNimpute algorithm. The alpha microarray dataset served as the reconstructed data and the other five datasets as ancillary data. Using the bootstrap method, the network was rebuilt many times and the edges with higher frequency were kept. Experimental results show that the algorithm effectively improves network accuracy, from 42% to 54%, a gain of 12 percentage points.
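
The bootstrap step described above (rebuild the network many times, keep the edges that recur) can be illustrated as follows. This is a sketch under assumptions of our own, not the thesis's implementation: the function names, the number of rounds, and the 0.5 threshold are hypothetical, and `learn_edges` stands for any structure learner (such as K2) supplied by the caller.

```python
import random
from collections import Counter


def bootstrap_edge_frequency(samples, learn_edges, n_rounds=100, seed=0):
    """Estimate how often each edge appears across bootstrap resamples.

    samples: list of observations (e.g. one expression profile per entry).
    learn_edges: caller-supplied function mapping a list of samples to an
        iterable of (parent, child) edges; assumed, not defined here.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_rounds):
        # Resample with replacement, keeping the original sample size.
        resampled = [rng.choice(samples) for _ in samples]
        # Count each edge at most once per rebuilt network.
        counts.update(set(learn_edges(resampled)))
    return {edge: count / n_rounds for edge, count in counts.items()}


def select_frequent_edges(frequencies, threshold=0.5):
    """Keep edges whose bootstrap frequency reaches the threshold (assumed value)."""
    return {edge for edge, freq in frequencies.items() if freq >= threshold}
```

The threshold trades the number of retained edges against confidence in each of them; the value actually used in the experiments is not stated here.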
We also found that the accuracy of the LSBN algorithm is slightly better than that of the K2 algorithm applied to a large amount of data alone. This indicates that, when the amount of data is limited, the LSBN algorithm can match or outperform a network built by the K2 algorithm from a large amount of data.

This thesis consists of five chapters:
Chapter Ⅰ: a brief introduction to the background and current status of the field.
Chapter Ⅱ: the relevant background knowledge, including gene regulatory networks, Bayesian networks, and the status of merging multiple datasets.
Chapter Ⅲ: a new Bayesian network structure learning algorithm based on the fusion of multiple datasets, the Line Sorting Bayesian Network Structure Learning Algorithm.
Chapter Ⅳ: the experimental procedures and results.
Chapter Ⅴ: a summary and outlook, analyzing the strengths and weaknesses of the new algorithm and the focus of future research.