The Design and Implementation of Bicluster Data Analyzing Software
|School||Sun Yat-sen University|
|Keywords||data mining; biclustering algorithm; gene expression data; index technology|
DNA microarray technology has made it possible to produce expression data for thousands of genes under multiple conditions. There is growing interest in computational methods that process these data and uncover their inherent correlations. Traditional clustering algorithms usually cluster the data along only one dimension, which leaves them unable to discover many of the coherent relationships in gene expression data. In recent years, more and more researchers have turned to biclustering algorithms, which cluster the data simultaneously along two dimensions, the gene dimension and the condition dimension, to find coherent subspaces in the microarray data. Such a subspace is known as a bicluster.
We design and implement a bicluster data analysis software package that can be used to discover coherent subspaces in data, especially gene expression data. Most importantly, the software can handle the massive numbers of biclusters produced by the RAP and ET-Bicluster algorithms and provides a fast search function over them.
This paper summarizes some typical biclustering algorithms, with emphasis on RAP and ET-Bicluster. RAP generates biclusters directly from real-valued data and enables exhaustive discovery of coherent biclusters, and ET-Bicluster can also deal with noisy data, so we implement both algorithms to analyze gene expression data. Because many users only want the biclusters that include specified genes and experimental conditions, we also modify the algorithms so that this kind of biclustering is computed faster.
RAP and ET-Bicluster discover biclusters exhaustively, which also produces a large quantity of biclusters. To make this large number of biclusters easier to manage, and to speed up searching for the biclusters that include given genes and conditions, we studied several index technologies.
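One common index technology for this kind of query is a bitmap index. The following is a minimal sketch of the idea, not the software's actual implementation: each gene and each condition maps to an integer used as a bit set, where bit i is set when bicluster i contains that item. The function names `build_bitmap_index` and `search`, and the representation of a bicluster as a pair of Python sets, are assumptions made for illustration.

```python
from collections import defaultdict


def build_bitmap_index(biclusters):
    """Build one bitmap per gene and per condition.

    biclusters: list of (genes, conditions) pairs, each a set of labels.
    Bit i of a bitmap is set when bicluster i contains that label.
    """
    gene_bits = defaultdict(int)
    cond_bits = defaultdict(int)
    for i, (genes, conds) in enumerate(biclusters):
        for g in genes:
            gene_bits[g] |= 1 << i
        for c in conds:
            cond_bits[c] |= 1 << i
    return dict(gene_bits), dict(cond_bits)


def search(gene_bits, cond_bits, n_biclusters, genes=(), conds=()):
    """Return the ids of biclusters containing all given genes and conditions."""
    result = (1 << n_biclusters) - 1  # start with every bicluster selected
    for g in genes:
        result &= gene_bits.get(g, 0)  # unknown label matches no bicluster
    for c in conds:
        result &= cond_bits.get(c, 0)
    return [i for i in range(n_biclusters) if (result >> i) & 1]


biclusters = [
    ({"g1", "g2"}, {"c1", "c2"}),
    ({"g2", "g3"}, {"c2"}),
    ({"g1", "g2", "g3"}, {"c1", "c2", "c3"}),
]
gene_bits, cond_bits = build_bitmap_index(biclusters)
hits = search(gene_bits, cond_bits, len(biclusters), genes=["g1"], conds=["c1"])
```

A query then costs one bitwise AND per requested gene or condition, regardless of how many genes each bicluster contains, which is why this layout suits lookups over a very large set of biclusters.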
We focus on indexing techniques and build a bitmap index and a prefix-tree index over the biclusters obtained by the biclustering algorithms. For cases where the index is too large to be read into memory completely, we make it possible to read the index data on demand. We also study how to compress the index, reducing its size as much as possible to save storage space and to speed up access to the index file.
Finally, we study one of the most widely used gene product databases, the Gene Ontology database, and implement functional enrichment analysis of biclusters, which computes the P-value of a bicluster, together with both the Bonferroni and FDR multiple hypothesis testing correction methods.
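The enrichment computation can be illustrated with a short sketch. The abstract does not say which statistical test is used, so this example assumes the hypergeometric test that is commonly applied to Gene Ontology terms, and it assumes that "FDR" refers to the Benjamini-Hochberg procedure. All function names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) for a hypergeometric draw (assumed test, see lead-in).

    N: genes in the background, K: background genes annotated with the term,
    n: genes in the bicluster, k: bicluster genes annotated with the term.
    """
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy numbers: 10 background genes, 4 annotated, bicluster of 3 genes, 2 annotated.
p = hypergeom_pvalue(10, 4, 3, 2)
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
fdr = benjamini_hochberg(raw)
```

Bonferroni controls the chance of any false positive and is therefore the stricter of the two, while the Benjamini-Hochberg adjustment controls the expected proportion of false positives among the terms reported as enriched.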