Editorial, J Appl Bioinform Comput Biol Vol: 1 Issue: 1
New Era for Biocomputing
Momiao Xiong*
Division of Biostatistics, The University of Texas School of Public Health, Houston, TX 77030, USA
Corresponding author: Momiao Xiong, Division of Biostatistics, The University of Texas School of Public Health, Houston, TX 77030, USA. E-mail: momiao.xiong@uth.tmc.edu
Received: June 18, 2012; Accepted: June 19, 2012; Published: June 21, 2012
Citation: Xiong M (2012) New Era for Biocomputing. J Biocomput 1:1. doi:10.4172/2329-9533.1000e102
Abstract
Fast and cheap Next-Generation Sequencing (NGS) technologies will generate unprecedentedly massive (thousands or even tens of thousands of individuals) and high-dimensional (ten million or even tens of millions of variants) genomic and epigenomic variation data that allow nearly complete evaluation of genomic and epigenomic variation, including common and rare variants, insertions/deletions, CNVs, mRNA by sequencing (RNA-seq), microRNA by sequencing (miRNA-seq), methylation by sequencing (methylation-seq) and ChIP-seq. Analysis of these extremely large and diverse data sets provides powerful tools to comprehensively understand the genome and epigenome.
Keywords: Biocomputing
Fast and cheap Next-Generation Sequencing (NGS) technologies will generate unprecedentedly massive (thousands or even tens of thousands of individuals) and high-dimensional (ten million or even tens of millions of variants) genomic and epigenomic variation data that allow nearly complete evaluation of genomic and epigenomic variation, including common and rare variants, insertions/deletions, CNVs, mRNA by sequencing (RNA-seq), microRNA by sequencing (miRNA-seq), methylation by sequencing (methylation-seq) and ChIP-seq. Analysis of these extremely large and diverse data sets provides powerful tools to comprehensively understand the genome and epigenome [1] and holds promise for shifting the focus of health care from disease to wellness, in which enormous amounts of personal data are recorded and an individual's wellness status is continuously monitored [2]. However, the volume and complexity of sequence data in genomics and epigenomics, and of health care data measured in real time, have begun to outpace the computing infrastructures used to calculate and store genomic, epigenomic and health-monitoring information [3,4]. The emergence of NGS technologies also poses great computational challenges in storing, transferring and analyzing large volumes of sequencing data [5,6] in comparative genomics [7], genome assembly and sequence analysis [8], metagenomics [9,10], proteomics [11], genetic studies of complex diseases [12-14] and biomedical image analysis [15].
Innovative approaches should be developed to address these challenges. One option is to develop novel algorithms and methods to deal with the new types of data. For example, the genomic and epigenomic data generated by NGS technologies first demand a change in the concept of the genome. As Haldane [16] and Fisher [17] recognized in the last century, the genome can be modeled as a continuum. Specifically, the genome is not purely a collection of independent segregating sites; rather, it is transmitted not in points but in segments. Instead of modeling the genome as a few separate loci, modeling it as a continuum, in which the observed genetic variation can be viewed as a realization of a stochastic process along the genome and modeled as a function of genomic location, will enrich the information on genetic variation across the genome. The new data technologies also demand a paradigm shift in genomic and epigenomic data analysis: from standard multivariate data analysis to functional data analysis [18,19], from low-dimensional to high-dimensional data analysis [20,21], from independent to dependent sampling [22], and from single genomic or epigenomic variant analysis to integrated genomic and epigenomic analysis.
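To make the functional, continuum view of the genome concrete, here is a minimal sketch, assuming simulated data and NumPy only, of one common functional-data treatment of genotype profiles: each individual's variant counts are projected onto a smooth basis over genomic position, and a functional PCA is then run on the basis coefficients. The Fourier basis, the helper genotype_to_function and all parameter values are illustrative assumptions, not methods prescribed in this editorial.

```python
import numpy as np

def genotype_to_function(positions, genotypes, n_basis=11):
    """Project one individual's genotype counts (0/1/2 at each variant position)
    onto a Fourier basis by least squares, turning discrete variants into a
    smooth function of genomic location."""
    t = (positions - positions.min()) / (positions.max() - positions.min())
    basis = [np.ones_like(t, dtype=float)]              # constant term
    for k in range(1, (n_basis - 1) // 2 + 1):          # sine/cosine pairs
        basis.append(np.sin(2 * np.pi * k * t))
        basis.append(np.cos(2 * np.pi * k * t))
    B = np.column_stack(basis[:n_basis])                # (n_variants, n_basis)
    coef, *_ = np.linalg.lstsq(B, genotypes.astype(float), rcond=None)
    return coef                                         # functional representation

# Toy data: 200 individuals, 5,000 variants in a 1-Mb region.
rng = np.random.default_rng(0)
pos = np.sort(rng.integers(1, 1_000_000, size=5_000))
G = rng.integers(0, 3, size=(200, 5_000))

coefs = np.array([genotype_to_function(pos, g) for g in G])   # (200, 11)
coefs -= coefs.mean(axis=0)
_, s, _ = np.linalg.svd(coefs, full_matrices=False)           # functional PCA
print("variance explained by first 3 functional PCs:",
      np.round(s[:3] ** 2 / (s ** 2).sum(), 3))
```

The payoff of the functional representation is dimension reduction: downstream analyses work with a handful of smooth coefficients per genomic region rather than with millions of individual loci.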
However, as Schatz et al. [3] pointed out, algorithmic breakthroughs, like scientific breakthroughs, do not happen very often and are difficult to plan. A practical solution is to harness the power of parallel computing. Parallelism is the future of computing. Two popular types of parallel computing are cloud computing and GPU (graphics processing unit) computing.
The cloud is a metaphor for the Internet, and cloud computing is a type of Internet-based computing in which users access computational resources from a vendor over the Internet [3]. The cloud is built on virtualization technology [23]: it divides a server's hardware resources into multiple virtual "computer devices," each running its own operating system in isolation from the others and presented to the user as an entirely separate computer. A typical cloud computation begins by uploading data into cloud storage, carries out the computation on a cluster of virtual machines, writes the results back to cloud storage, and finally downloads them to the user's local computer. Since the pool of computational resources available "in the cloud" is huge, it provides enough computational power to analyze very large amounts of data. Cloud computing has been applied to manage the deluge of "big sequence data" in the 1000 Genomes Project [6], comparative genomics [7], ChIP-seq data analysis [24], translational medicine [25], transcriptome analysis [26], and disease risk management [27].
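As a deliberately minimal sketch of that upload-compute-download cycle, the Python snippet below uses boto3 against an S3-compatible object store; the bucket name, object keys, sample file and the run_variant_calling placeholder are hypothetical and simply stand in for whatever pipeline actually runs on the virtual-machine cluster.

```python
import boto3  # assumes configured credentials for an S3-compatible object store

s3 = boto3.client("s3")
BUCKET = "my-sequencing-project"  # hypothetical bucket name

def run_variant_calling(fastq_path: str, vcf_path: str) -> None:
    """Placeholder for the real analysis step executed on each virtual machine."""
    ...

# 1. Upload raw reads from the local machine into cloud storage.
s3.upload_file("sample01.fastq.gz", BUCKET, "raw/sample01.fastq.gz")

# 2. On a virtual machine in the cluster: pull the data, run the analysis,
#    and write the results back to cloud storage.
s3.download_file(BUCKET, "raw/sample01.fastq.gz", "/tmp/sample01.fastq.gz")
run_variant_calling("/tmp/sample01.fastq.gz", "/tmp/sample01.vcf")
s3.upload_file("/tmp/sample01.vcf", BUCKET, "results/sample01.vcf")

# 3. Back on the local machine: download the finished results.
s3.download_file(BUCKET, "results/sample01.vcf", "sample01.vcf")
```

In practice the middle step runs on many virtual machines at once, one per sample or genomic region, which is where the cloud's parallelism comes from.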
Although cloud computing provides a powerful solution to big-data analysis, it also has limitations: it requires large data transfers over the Internet and raises data privacy and security issues. Complementary to cloud computing is GPU computing [28]. A GPU exploits both the task parallelism and the data parallelism of an application by dividing the resources of the processor in space: the output of the part of the processor working on one stage is fed directly into the input of a different part that works on the next stage, and the hardware in any given stage can exploit data parallelism, in which multiple elements are processed at the same time. The highly parallel GPU has higher memory bandwidth and more computing power than central processing units (CPUs). The GPU follows a single-program multiple-data (SPMD) programming model and processes many elements in parallel using the same program. It consists of hundreds of smaller cores that work together to deliver high computing performance. GPU computing is gaining momentum in biomedical research. It has been applied to network analysis [29], RNA secondary structure prediction [30], gene-gene interaction analysis [13,14,31], biological pathway simulation [32], sequence analysis [33], gene prediction [34], motif identification [35], metagenomics, protein analysis [36], and molecular dynamics simulations [37].
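As a toy illustration of the SPMD, data-parallel style described above, the sketch below applies one and the same short "program" (an allele-frequency calculation) to every variant column of a simulated genotype matrix held in GPU memory. It assumes a CUDA-capable GPU with the CuPy library installed; the matrix sizes and the 1% frequency threshold are arbitrary choices for the example.

```python
import numpy as np
import cupy as cp  # assumes a CUDA-capable GPU with CuPy installed

# Simulated genotype matrix: 1,000 individuals x 100,000 variants (0/1/2 counts).
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(1_000, 100_000), dtype=np.int8)

# Move the data into GPU memory; the same per-column program then runs over
# all 100,000 variants in parallel -- the SPMD, data-parallel style of the GPU.
g_gpu = cp.asarray(genotypes, dtype=cp.float32)
allele_freq = g_gpu.mean(axis=0) / 2.0               # frequency of the counted allele
maf = cp.minimum(allele_freq, 1.0 - allele_freq)     # minor allele frequency
n_rare = int(cp.asnumpy((maf < 0.01).sum()))         # reduce on the GPU, copy back

print(f"{n_rare} of {maf.size} variants have minor allele frequency below 1%")
```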
NGS technologies raise great expectations for new genomic and epigenomic knowledge that will translate into meaningful therapeutics and insights into health, but the immense complexity of biomedicine leaves clinically meaningful new discoveries hidden within a deluge of high-dimensional data and an enormous number of analyses. Developing new analytic paradigms and novel statistical methods, and exploiting the power of parallel computing for sequence-based genomic and epigenomic data analysis, will be essential to overcome the serious limitations of the current paradigm and statistical methods for genomic and epigenomic data analysis. The Journal of Biocomputing provides an excellent platform for presenting newly discovered algorithms and communicating novel ideas within the biocomputing community. We can expect that the emergence of NGS technologies, new developments in parallel computing, and the publication of the Journal of Biocomputing will stimulate the development of innovative algorithms and novel paradigms for big genomic, epigenomic and clinical data analysis, and will open a new era for biocomputing.
References