Genome sequencing and next-generation sequence data analysis: A comprehensive compilation of bioinformatics tools and databases

Abstract

Genomics has become a ground-breaking field in all areas of the life sciences. Advances in genomics and the development of high-throughput techniques have recently enabled whole-genome characterization of a wide range of organisms. In the post-genomic era, new technologies have produced a surge of genomic sequences and supporting data needed to understand genome-wide functional regulation of gene expression and to reconstruct metabolic pathways. However, the availability of this plethora of genomic data presents a significant challenge for storage, analysis, and data management. Analyzing data on this scale requires the development and application of novel bioinformatics tools that provide unified functional annotation, structural search, and comprehensive analysis and identification of new genes in a wide range of species with fully sequenced genomes. In addition, generating systematically and syntactically unambiguous nomenclature systems for genomic data across species is a crucial task. Such systems are necessary for the adequate handling of genetic information in the context of comparative functional genomics. In this paper, we provide an overview of major advances in bioinformatics and computational biology in genome sequencing and next-generation sequence data analysis. We focus on their potential applications for efficient collection, storage, and analysis of genetic data from a wide range of gene banks. We also discuss the importance of establishing a unified nomenclature system through a functional and structural genomics approach.

Jimenez-Lopez, J., Gachomo, E., Sharma, S. and Kotchoni, S. (2013) Genome sequencing and next-generation sequence data analysis: A comprehensive compilation of bioinformatics tools and databases. American Journal of Molecular Biology, 3, 115-130. doi:10.4236/ajmb.2013.32016

Conflicts of Interest

The authors declare no conflicts of interest.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.