FastCluster: a graph theory based algorithm for removing redundant sequences

Peng-Fei Liu; Yu-Dong Cai; Zi-Liang Qian; Sheng-Yu Ni; Liu-Huan Dong; Chang-Hong Lu; Jin-Long Shu; Zhen-Bing Zeng; Wen-Cong Lu

doi:10.4236/jbise.2009.28090

Journal of Biomedical Science and Engineering > Vol.2 No.8, December 2009

FastCluster: a graph theory based algorithm for removing redundant sequences

Peng-Fei Liu, Yu-Dong Cai, Zi-Liang Qian, Sheng-Yu Ni, Liu-Huan Dong, Chang-Hong Lu, Jin-Long Shu, Zhen-Bing Zeng, Wen-Cong Lu
.
DOI: 10.4236/jbise.2009.28090 PDF HTML 6,639 Downloads 11,893 Views Citations

Abstract

In many cases, biological sequence databases contain redundant sequences that make it difficult to achieve reliable statistical analysis. Removing the redundant sequences to find all the real protein families and their representatives from a large sequences dataset is quite important in bioinformatics. The problem of removing redundant protein sequences can be modeled as finding the maximum independent set from a graph, which is a NP problem in Mathematics. This paper presents a novel program named FastCluster on the basis of mathematical graph theory. The algorithm makes an improvement to Hobohm and Sander’s algorithm to generate non-redundant protein sequence sets. FastCluster uses BLAST to determine the similarity between two sequences in order to get better sequence similarity. The algorithm’s performance is compared with Hobohm and Sander’s algorithm and it shows that Fast- Cluster can produce a reasonable non-redundant pro- tein set and have a similarity cut-off from 0.0 to 1.0. The proposed algorithm shows its superiority in generating a larger maximal non-redundant (independent) protein set which is closer to the real result (the maximum independent set of a graph) that means all the protein families are clustered. This makes Fast- Cluster a valuable tool for removing redundant protein sequences.

Keywords

Blast; Graph Theory; Redundant Sequences; CD-HIT

Share and Cite:

Liu, P. , Cai, Y. , Qian, Z. , Ni, S. , Dong, L. , Lu, C. , Shu, J. , Zeng, Z. and Lu, W. (2009) FastCluster: a graph theory based algorithm for removing redundant sequences. Journal of Biomedical Science and Engineering, 2, 621-625. doi: 10.4236/jbise.2009.28090.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1]	G. Grillo, M. Attimonelli, S. Liuni, G. Pesole. (1996) CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases. CABIOS, 12, 1–8.
[2]	L. Holm and C. Sander. (1998) Removing near- neighbour redundancy from llarge protein sequence collections. Bioinformatics, 14, 423–429.
[3]	W. Li and A. Godzik. (2006) Cd-Hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
[4]	W. Li, J L. aroszewski, A. Godzik. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282–283.
[5]	U. Hobohm, M. Scharf, R. Schneider, C. Sander. (1992) selection of representative protein data sets. Protein Sci, 1, 409–417.
[6]	U. Hobohm and C. Sander. (1994) Enlarged representative set of protein structures. Protein Sci, 3, 522–524.
[7]	G. Wang and R. L. Jr. Dunbrack. (2003) PISCES: a protein sequence culling server. Bioinformatics, 12, 1589– 1591.
[8]	S. F. Altschul, T. L. Madden, A. A. Sch?ffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389– 3402.
[9]	S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman. (1990) basic local alignment search tool. J. Mol. Biol., 215, 403–410.
[10]	S. Niskanen and P. R. J. ?sterg?rd. (2003) Cliquer user’s guide, Version 1.0, Communications Laboratory, Helsinki University of Technology, Espoo, Tech. Rep. T48. http://users.tkk.fi/~pat/cliquer.html.

Journals Menu

Follow SCIRP

	+1 323-425-8868
	customer@scirp.org
	+86 18163351462(WhatsApp)
	1655362766

	Paper Publishing WeChat

Journals Menu

Home

About SCIRP

Service

Policies