Opinion | Volume 1, Issue 1
Department of Oncology, Stem Cells and Nanomedicine, Fluorotronics Inc., USA
Correspondence: Farid Menaa, Department of Oncology, Stem Cells and Nanomedicine, Fluorotronics Inc., 2453 Cades Way, Vista, CA 92081, USA
Received: May 01, 2014 | Published: May 24, 2014
Citation: Menaa F. Next-generation sequencing or the dilemma of large-scale data analysis: opportunities, insights, and challenges to translational, preventive and personalized medicine. J Investig Genomics. 2014;1(1):27-29. DOI: 10.15406/jig.2014.01.00005
Over the past years, the advent of next-generation sequencing (NGS) technology, also known as high-throughput sequencing (HTS), has raised immense hopes. Indeed, NGS has (r)evolutionized the fields of molecular biology, genetics and genomics, enabling the cost-effective and rapid generation of DNA and RNA sequence data with exquisite accuracy and resolution for translational, preventive and personalized medicine. Nevertheless, despite tremendous advances in broadening NGS applications from research to the clinic, NGS still presents enormous challenges in terms of data storage, processing, quality-control management and interpretation, which slow its translation from the bench-top to the bed-side.
In this expert-opinion article, I first summarize the main doubts about NGS technology, based on my experience in the field, doubts that could in fact open up new opportunities for innovative research and development. I then highlight the general technological and methodological characteristics of NGS, as well as recent advances and challenges in clinical investigations and applications toward the development of theranostics. Finally, I briefly question the relevance of integrating NGS with other platforms, such as next-generation proteomics (NGP), to optimize prognostic, diagnostic and therapeutic options.
Keywords: next-generation sequencing, dna-seq, rna-seq, next-generation proteomics, high-throughput screening, theranostics, molecular diagnosis, genetic-based prognosis, personalized medicine, translational medicine, innovative technologies
Abbreviations: NGS, next-generation sequencing; HTS, high-throughput sequencing; NGP, next-generation proteomics; SNPs, single-nucleotide polymorphisms; CNVs, copy number variations
The demand for faster and cheaper sequencing than the Sanger method, which was used to sequence the first human genome, has driven the development of NGS technologies.1,2 NGS allows the analysis of whole genomes, including those representing complex disease states such as cancers3‒6 or hematological disorders.7 In many cases, NGS, and thus integrative genomics, can address unmet clinical needs in diagnosis, prediction of prognosis, monitoring of disease status and personalized treatment decisions.8‒10 In fact, NGS is (r)evolutionizing the field of genetics, bringing both hopes and doubts to many people, including scientists, researchers, health practitioners and patients. The high volume of data produced by NGS nucleic acid sequencing instruments (DNA-seq, RNA-seq) enables several new kinds of experiments, opens new questions amenable to study, and raises new challenges for efficient translational medicine.11 Indeed, how do we get from a collection of several million short sequence reads to genome-scale results? Can we accurately translate these data from bench-top to bed-side? Can we reliably use them for preventive and personalized medicine? How important is it to consider ethnic genetic diversity when interpreting genomic data? How long shall we keep the data and samples of a given patient, considering possible spontaneous or induced genomic alterations? What are the risks for a patient in having his or her whole genome sequenced? What are the ethical challenges? How can static genomic information be interpreted in the dynamic molecular world? Which genetic alterations can be considered driver (disease) events? Shall we focus on homogeneous cell/tissue subgroups, or shall we use heterogeneous biological material (e.g. pools of cells, whole tissues) to identify genetic aberrations as therapeutic targets? What are the best sequencing platform and methodology to use? What is the cost of performing an accurate NGS analysis of one's genome? Are meta-analyses combining epidemiological and genomic studies required? Which reference(s) shall one use to interpret NGS data? What is the confidence level of obtaining true information? How can all the gaps be filled to obtain more reliable information? Can we rely on genome sequencing data alone? Should we not consider combining DNA-NGS, RNA-NGS and next-generation proteomics (NGP) data analyses to obtain a more comprehensive view of molecular dynamics and to develop more accurate theranostics?
While much discussion focuses on rapidly sequencing human genomes at low cost, the grail of personalized genomics, a vast amount of research must still be performed at the systems level to fully understand the relationship between the biochemical processes in a cell and how the instructions for those processes are encoded in the genome. Systems biology and a plethora of “omics” disciplines have emerged to measure multiple aspects of cell biology as DNA is transcribed into RNA, RNA is translated into protein, and proteins interact with other molecules to carry out biochemistry. DNA NGS is being used to perform quantitative assays in which DNA sequences serve as highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data, and quantitative information is obtained by tabulating the frequency, positional information and variation observed in the alignments. Data tables from samples that differ by experimental treatment, environment or population are then compared in different ways to make discoveries (e.g. mutations, single-nucleotide polymorphisms (SNPs), copy number variations (CNVs), methylation and/or acetylation sites) and draw experimental conclusions.
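To make the tabulation step concrete, here is a minimal sketch in Python: the reference, the pre-aligned reads and the allele-frequency threshold are all hypothetical toy values, and a real pipeline would work from BAM alignments with proper quality filtering.

```python
from collections import Counter, defaultdict

# Hypothetical toy inputs: a short reference and reads already aligned
# to it, each given as (0-based alignment start, read sequence).
reference = "ACGTACGTACGTACGT"
aligned_reads = [
    (0, "ACGTACGT"),
    (4, "ACGAACGT"),  # mismatch vs. reference at position 7 (T -> A)
    (4, "ACGAACGT"),  # a second read supporting the same variant
]

# Tabulate base counts at every covered reference position.
base_counts = defaultdict(Counter)
for start, seq in aligned_reads:
    for offset, base in enumerate(seq):
        base_counts[start + offset][base] += 1

# Report positions where a non-reference allele exceeds a frequency
# threshold, i.e. candidate SNPs.
MIN_ALLELE_FREQ = 0.2  # arbitrary illustrative cutoff
for pos in sorted(base_counts):
    counts = base_counts[pos]
    depth = sum(counts.values())
    for base, n in counts.items():
        if base != reference[pos] and n / depth >= MIN_ALLELE_FREQ:
            print(f"pos {pos}: {reference[pos]}>{base} "
                  f"({n}/{depth} reads, freq {n / depth:.2f})")
```

On these toy reads, the script reports the single T>A candidate at position 7; the same counting logic underlies frequency, coverage and variant tables at genome scale.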
In practice, the NGS data analysis process involves three stages. In the first stage, primary data analysis, image data are converted to sequence data. The sequence data (reads) can be in the familiar “ACTG” sequence space, or in the less familiar color space (SOLiD) or flow space. Primary analysis also provides quality values for each base, which are used in subsequent phases of analysis, much as Phred quality values were used in Sanger sequencing. In the middle stage, secondary data analysis, datasets are created: sequences from the primary data are aligned to reference data (e.g. complete genomes, or subsets of genomic data such as expressed genes or individual chromosomes) to create application-specific datasets for each sample. Presently, there is a large and growing list of alignment programs that can be used for secondary data analysis. In the final stage, tertiary data analysis, the datasets are compared to create experiment-specific results. This phase may involve a simple activity, such as viewing a dataset in a genome browser and using the frequency of tags to identify promoter sites or patterns of variation. Other experiments, such as digital gene expression, include tertiary analyses in which datasets are compared to each other, as is done with microarray data. These kinds of analyses are the most complex: expression measurements need to be normalized between datasets, and statistical comparisons need to be made to assess differences. Currently, the software for primary analysis is provided by the instrument manufacturers and handled within the instrument itself, and many good tools already exist for tertiary analysis. Between the primary and tertiary analyses, however, lies a gap; nevertheless, emerging studies have reported advanced strategies and shown that NGS brings more robustness, resolution and inter-laboratory portability than microarray platforms.12 Thereby, robust mutation detection can be achieved by NGS assays if the data can be processed in a way (e.g. using artificial amplicon data sets) that does not let large genomic alterations land in the unmapped data (i.e. the trash).13
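As a small illustration of the per-base quality values produced at the primary-analysis stage, the following Python sketch decodes Phred+33-encoded FASTQ quality strings and applies a simple mean-quality filter; the reads and the cutoff are hypothetical.

```python
# Decode Phred+33 quality strings (the FASTQ convention used by most
# modern platforms) and filter reads by mean base quality, a typical
# quality-control step downstream of primary analysis.

def phred_scores(quality_string, offset=33):
    """Convert an ASCII-encoded quality string into Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def mean_quality(quality_string):
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores)

# Hypothetical reads: (name, sequence, quality string).
reads = [
    ("read_1", "ACGTACGT", "IIIIIIII"),  # Q40 across the whole read
    ("read_2", "ACGTTGCA", "!!!!!III"),  # first five bases are Q0
]

MIN_MEAN_Q = 20  # common, but arbitrary, QC threshold
passing = [name for name, _, qual in reads if mean_quality(qual) >= MIN_MEAN_Q]
print(passing)  # ['read_1'] -- read_2 fails with mean quality 15
```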
In RNA-Seq, determining relative gene expression means that sequence data from multiple samples must go through the entire process of primary, secondary and tertiary analysis. To do this work, researchers must puzzle through a diverse collection of early-version algorithms that are combined into complicated workflows with steps that produce complicated file formats. Command-line tools such as MAQ, SOAP, MapReads and BWA have specialized requirements for formatted input and output, and leave researchers with large data files that still require additional processing and formatting for tertiary analyses. Moreover, once reads are aligned, datasets need to be visualized and further refined for additional comparative analysis. Solutions to these challenges, which close the gaps between primary, secondary and tertiary analyses by providing complete workflow systems that include data collection, processing and analysis for expression studies, are being developed. NGS is an attractive option for analyzing a transcriptome because the vast number of reads that can be obtained, along with their sequences, provides a highly sensitive way to evaluate the RNA population inside a cell.14 In addition to rRNA, tRNA and mRNA, assays are also measuring non-coding RNAs and multiple classes of small RNAs (e.g. miRNAs), though not without risks of bias.15‒17 As deeper information is obtained, largely through NGS, it becomes clear that even mRNA is more complicated than previously thought: recent reports indicate that 92‒97% of human genes undergo alternative splicing.18,19 A common goal for these assays is to map the structure of genes in terms of their start sites, 5’ and 3’ ends, exons, splice junctions, polyA sites and alternative forms, and to quantify the relative abundance of different molecules under different conditions or developmental stages. When considered in an NGS context, transcriptome analysis breaks into categories of experiments defined by different procedures and analysis paths. Despite the widespread utilization of NGS, a major bottleneck in the implementation and capitalization of this technology therefore remains the data processing steps.13 Further, the brisk evolution of sequencing technologies has flooded the market with commercially available sequencing platforms whose unique chemistries and diverse applications stand as another obstacle restricting the potential of NGS for clinical applications.2 Importantly, large consortium-based sequencing studies (e.g. candidate gene studies, genome-wide association studies, and whole-genome admixture-based approaches that account for ancestral genetic structure, complex haplotypes, gene-gene interactions, and rare variants to detect and replicate novel pharmacogenetic loci) are using next-generation whole-genome sequencing to provide a diverse genome map of different admixed populations, which can be used for future pharmacogenetic studies.10 It is therefore time to work together more closely and to move forward with awareness and holistic knowledge of NGS capabilities and applications in the clinical realm. Interestingly, strategies for implementing genomic medicine in the clinical setting with more accuracy are emerging through best practices for integrating genomic findings into the electronic health record (e.g. the eMERGE network).20
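As an illustration of the tertiary-analysis step for a digital gene expression experiment, the sketch below normalizes raw tag counts to counts per million (CPM) and computes log2 fold changes between two samples; the gene names, counts and pseudocount are hypothetical, and a real analysis would add replicates and statistical testing.

```python
import math

# Hypothetical raw tag counts per gene: (sample_1, sample_2).
raw_counts = {
    "geneA": (150, 300),
    "geneB": (90, 85),
    "geneC": (10, 0),
}

# Library sizes, i.e. total tags per sample, used for normalization.
totals = [sum(counts[i] for counts in raw_counts.values()) for i in (0, 1)]

PSEUDOCOUNT = 0.5  # avoids division by zero and log of zero
for gene, counts in raw_counts.items():
    # Counts per million: scale each count by its sample's library size.
    cpm = [(c + PSEUDOCOUNT) / total * 1_000_000
           for c, total in zip(counts, totals)]
    log2fc = math.log2(cpm[1] / cpm[0])
    print(f"{gene}: CPM {cpm[0]:.0f} vs {cpm[1]:.0f}, log2FC {log2fc:+.2f}")
```

Normalizing before comparison is the crucial point: without it, differences in sequencing depth between libraries would masquerade as expression changes.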
Finally, in complement to DNA-seq and RNA-seq, and because the causes of most disorders are multi-factorial, another system-level approach to consider, for a more comprehensive understanding of human biological complexity, is to integrate a view of proteome dynamics, possibly using mass spectrometry (MS)-based proteomics.21 It is never too early to think about the biggest challenges on the way to more accurate, more efficient and safer molecular-based medicine!
NGS constitutes a major breakthrough in genomic research. Despite the advantages of NGS platforms in terms of throughput and cost-effectiveness, the assembly of the reads produced by current next-generation sequencers remains a major obstacle to faster translation to the clinic. Could a layered-architecture approach, in which a general assembler handles the sequences generated by the different available sequencing platforms, be a potent solution (see the sketch below)? And should we already be thinking about implementing or integrating NGS with other technologies to work jointly at the different system levels?
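As a purely illustrative sketch of what such a layered architecture could look like, the Python outline below separates platform-specific read adapters from a platform-agnostic downstream core. All class names are hypothetical, and the "assembler" is a stand-in that merely quality-filters reads rather than performing real assembly.

```python
from abc import ABC, abstractmethod

class PlatformAdapter(ABC):
    """Adapter layer: hides platform-specific chemistry and file formats."""

    @abstractmethod
    def reads(self):
        """Yield (sequence, phred_qualities) tuples in a neutral form."""

class FastqAdapter(PlatformAdapter):
    """One concrete adapter; SOLiD color space, flow space, etc. would
    each get their own adapter translating into the same neutral form."""

    def __init__(self, records):
        self._records = records  # e.g. parsed (sequence, quality) pairs

    def reads(self):
        for seq, qual in self._records:
            yield seq, [ord(c) - 33 for c in qual]  # Phred+33 decoding

class AssemblerCore:
    """Downstream layer: consumes normalized reads from any adapter and
    never needs to know which platform produced them."""

    def assemble(self, adapter: PlatformAdapter):
        # Toy quality gate standing in for real assembly logic.
        return [seq for seq, quals in adapter.reads()
                if sum(quals) / len(quals) >= 20]

contigs = AssemblerCore().assemble(FastqAdapter([("ACGT", "IIII")]))
print(contigs)  # ['ACGT']
```

The design point is that only the thin adapter layer changes when a new platform appears, which is precisely what would make a general assembler portable across the current diversity of sequencing chemistries.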
Acknowledgments: None.
Conflicts of interest: The author declares that there is no conflict of interest.
©2014 Menaa. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and building upon the work non-commercially.