An Improved Clustering Approach for the Phylogenetic Profiling Method
|School||Northeast Normal University|
|Keywords||Phylogenetic profile; Weights; Hierarchical clustering; K-means clustering; Biological distance; K-means initial sample|
With efficient, automated sequencing technology, the central topic of bioinformatics has shifted from sequencing genes to analyzing genes that have already been sequenced, chiefly the study of gene function and annotation. Because homology-based methods have inherent limitations and accuracy problems, researchers have turned their attention to non-homologous methods. Non-homologous methods classify genes mainly by their attributes and predict function from that classification. Among the many non-homologous methods, the phylogenetic profile method is the most widely applied. It was proposed by Pellegrini in 1999, and many researchers have since improved it in three respects: selection of the reference genomes, construction of the phylogenetic profile, and profile similarity analysis. Building on that foundation, this thesis first constructs a weight-based phylogenetic profile and then uses hierarchical clustering and K-means clustering in alternation for similarity analysis. For the similarity analysis phase we propose two improvements. The first is a new distance for the hierarchical clustering stage. The second is to extract more information from the hierarchical clustering result and pass it to K-means as initial information, so that the hierarchical result is fully used and the K-means result becomes more accurate.
Clustering algorithms at present mainly use the Euclidean distance. Because the samples they handle mostly lie in Euclidean space, clustering with the Euclidean distance generally gives good results. The distance used here is a non-Euclidean distance. Compared with the Euclidean distance, it strengthens the influence of known information on the samples: it considers not only the distance between two samples but also the distance between each sample and a reference sample. With this new distance, samples that are similar to a known reference are given priority.
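The abstract does not give the formula for the new distance, so the following is only a minimal sketch of the general idea, not the thesis's definition. It assumes a hypothetical form in which the Euclidean distance between two profiles is increased by the average of their distances to a known reference profile; the function name `reference_aware_distance` and the weight `alpha` are illustrative assumptions.

```python
import math

def reference_aware_distance(x, y, ref, alpha=0.5):
    """Hypothetical reference-aware dissimilarity (an assumption, not the thesis's formula).

    Combines the Euclidean distance between profiles x and y with the average
    of their Euclidean distances to a known reference profile ref.
    alpha controls how strongly the reference term contributes.
    """
    between = math.dist(x, y)                               # sample-to-sample term
    to_ref = (math.dist(x, ref) + math.dist(y, ref)) / 2.0  # sample-to-reference term
    return between + alpha * to_ref

# Toy profiles: both pairs are equally far apart (Euclidean distance 5),
# but the first pair lies much closer to the reference.
ref = (0.0, 0.0)
near_pair = reference_aware_distance((0.0, 0.0), (3.0, 4.0), ref)
far_pair = reference_aware_distance((10.0, 0.0), (13.0, 4.0), ref)
print(near_pair < far_pair)
```

Under this assumed form, pairs near the reference receive a smaller value and would therefore be merged earlier by an agglomerative procedure. The quantity is a dissimilarity rather than a metric, since the value for two identical profiles is generally nonzero.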
The weakness of K-means clustering is its sensitivity to the initial conditions: the initial number of clusters K and the choice of initial cluster centers both have a significant effect on the final clustering result. Current improvements to the K-means algorithm concentrate mainly on selecting this initial information. Earlier work combined hierarchical clustering with K-means so that the hierarchical method supplies the initial number of clusters K. In this thesis, more useful information is extracted from the hierarchical clustering result, so that K-means is also given its initial cluster centers. Finally, these improvements are tested and verified using the Escherichia coli K12 genome as the test sample. The experiments show that the new algorithm achieves higher accuracy than the previous results.
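The general idea of seeding K-means from a hierarchical clustering result can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's algorithm: it uses average linkage with the plain Euclidean distance, takes K as given, and uses the centroids of the hierarchical clusters as the initial centers. All function names and the toy presence/absence profiles are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def agglomerate(points, k):
    """Average-linkage agglomerative clustering, stopped when k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm started from the supplied initial centers."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center for every point.
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: recompute each center; keep the old one if a cluster is empty.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def seeded_kmeans(points, k):
    """Use hierarchical cluster centroids as the initial K-means centers."""
    seeds = [centroid(c) for c in agglomerate(points, k)]
    return kmeans(points, seeds)

# Toy binary profiles (1 = gene has a homolog in that reference genome).
profiles = [(1, 1, 0, 0), (1, 1, 0, 1), (0, 0, 1, 1), (0, 0, 1, 0)]
labels, centers = seeded_kmeans(profiles, 2)
print(labels)
```

Because the starting centers come from a deterministic procedure rather than a random draw, repeated runs on the same data give the same partition, which is the practical benefit of this kind of seeding.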