MOJ

eISSN: 2374-6920

Proteomics & Bioinformatics

Abstract

Human microbiome communities consist of a variety of bacteria, fungi and archaea, which are an integral part of the human body. These communities vary greatly from one part of the body to another and help maintain a healthy environment. Most members of the microbiome interact with the host and with each other via metabolic products, but how they interact with the various human metabolic pathways remains largely unknown. Slight changes in the microbial communities could be an important indicator of potential disease and could serve as a biomarker. For this study, microbial DNA short-read sequences from two separate human saliva specimens were retrieved from the HMP (Human Microbiome Project).¹ After appropriate quality control, both sequence sets were aligned against the KEGG database to identify genes and the known metabolic pathways in which these communities are involved. The Human Metabolic Reconstruction pipeline² was developed primarily as an analytical process for reconstructing the metabolic pathways involved in the human microbiome. The statistical significance of each metabolic pathway was then analyzed and compared between the control and test specimens. Finally, 3M³ comparative visual analytics and manual curation capabilities were developed using Oracle Apex rapid web development technology.⁴ This case study uncovered several surprising differences between the two specimens. With the help of the KEGG BRITE metabolic hierarchy,⁵ pathway abundance and coverage could be visualized more effectively together with orthology, enzyme and chemical function. Comparative metagenomic studies of this type, performed on a large patient cohort, can help to discover effective biomarkers for the diagnosis and prognosis of various diseases.

Keywords: metagenomics, human, disease, HUMAnN, pipeline, visualization, annotation, microbial community, metabolic, reconstruction, KEGG pathways

Abbreviations

KEGG, Kyoto encyclopedia of genes and genomes; HMP, human microbiome project; MG-RAST, metagenomics RAST; 3M, metagenomic metabolic manual annotation; DIAG, data intensive academic grid; HUMAnN, HMP unified metabolic analysis network

Introduction

The human mouth is the first site to receive food and saliva, and hence it is a potential target for external and internal microbial communities. All of these microbial communities compete for resources and survival. The exact nature of these communities, and how their interaction with the human system relates to disease progression, is still not completely understood. To address questions such as “Who are they?” and “What do they do?”, a comparative metagenomic analysis was performed. Microbial short-read sequences from two separate human saliva specimens^6,7 were retrieved from the HMP (Human Microbiome Project).¹ The major objective of this study was to build a generic metagenomic pipeline for pathway data analysis together with a data visualization platform. Understanding the functional nature of the microbial communities, such as “How abundant are they?” and “What are they capable of doing?”, is essential to the conclusions of this study.

With the falling cost of Next Generation Sequencing (NGS), it is now possible to sequence and analyze entire site-specific microbial communities at once in a cost-effective way. Metabolic reconstruction² is an NGS sequence based computational process for analyzing the metabolic pathways and interactions within these localized communities. Reconstruction generates the statistical significance of the abundance and coverage of the metabolic pathways involved. Without RNA expression data it is difficult to identify the pathways that are actually active, but metagenomic sequence does make it possible to understand the metabolic capability of these microbial communities. Orthologs from distantly related species rarely show high sequence identity at the nucleotide level, so alignment accuracy was improved by running the NCBI blastx tool, a translated search, against the KEGG (genes.pep) protein reference database.⁵ Using the known functions of the genes involved and the related annotation from KEGG, further investigation was performed to uncover the enzymes and related chemical functions.

Materials and methods

Both HMP specimen sequences shown in Table 1 demonstrated similar species abundance, which helped to eliminate the most common features and to focus on critical variations within the pathways. The NGS pipeline consists of six stages: sequence acquisition, cleansing, alignment, reconstruction, visualization and annotation. As a result of the reconstruction process, the HUMAnN pipeline⁸ provided a pathway abundance score and a pathway coverage score for differential analysis. Both statistical indicators were then used for visual analytics. Most backend processing, including HMP sequence acquisition, KEGG BRITE hierarchy alignment and HUMAnN reconstruction, was done on the DIAG computing platform,² while storage, visualization and annotation were accomplished with the customized 3M application hosted on the Oracle Apex portal⁴ (Figure 1).

HMP Specimen#	HMP Sequence	Trimmed Filtered	Blast Input	Blast Output	HUMAnN Output
4473347	5005604	4053766	440500	50875	355
4473378	8351741	4397314	675002	79669	253

Table 1 Sequence data volume

Figure 1 End to End Data Processing Flow.

Quality control

Incomplete and noisy NGS sequences within the specimens have an adverse impact on data analysis and final interpretation.⁸ Hence the short-read sequences in FASTQ format were trimmed of repeated and low-quality reads using TrimBWAstyle.pl.^9,10 Incomplete sequences were filtered out based on length with the help of remove_bad_seqs.py.¹⁰ Qualified sequences longer than 75 bp were selected for further analysis. As a result of this rigorous cleansing operation, around 81% of the short reads from Specimen#1 were retained, while only 53% of the short reads from Specimen#2 were retained.
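The length filter described above can be illustrated with a minimal Python sketch. It is not the remove_bad_seqs.py script used in the study; the function names (`read_fastq`, `filter_by_length`, `retained_fraction`) are hypothetical, and the sketch assumes well-formed four-line FASTQ records with no wrapped sequence lines.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple

MIN_LENGTH = 75  # reads must be longer than this many bases (threshold from the text)

# One FASTQ record: header line, sequence, separator line, quality string.
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records (assumes unwrapped, well-formed input)."""
    cleaned = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(islice(cleaned, 4))
        if len(record) < 4:
            return  # end of input, or a truncated trailing record
        yield record  # type: ignore[misc]


def filter_by_length(
    records: Iterable[FastqRecord], min_length: int = MIN_LENGTH
) -> Iterator[FastqRecord]:
    """Keep only records whose sequence is strictly longer than min_length."""
    for record in records:
        if len(record[1]) > min_length:
            yield record


def retained_fraction(kept: int, total: int) -> float:
    """Fraction of reads that survived quality control."""
    return kept / total if total else 0.0
```

Applied to the counts in Table 1, `retained_fraction(4053766, 5005604)` and `retained_fraction(4397314, 8351741)` give about 0.81 and 0.53, matching the retention rates reported for the two specimens.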

Alignment

The KEGG database⁵ is a unique, pre-curated database that links each gene to its organism. This linkage was critical for the orthologous functions and enzymatic reactions studied during metabolic reconstruction. Hence the HUMAnN pipeline required a blast alignment of the reads against the annotated genes in the KEGG genes.pep protein database.⁵ The blastx hits on the KEGG reference database provided identity/similarity scores along with the organism and gene identifiers needed for the metabolic reconstruction. The reconstruction also required the aligned data in tab-delimited format, so the blastx parameters were set accordingly. As a result of quality control and alignment against the KEGG database, the overall sequence volume was reduced significantly (Table 1).
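As an illustration of what the tab-delimited alignment data might look like downstream, the following Python sketch parses blastx hits. It assumes the standard 12-column BLAST tabular layout (`-outfmt 6` in BLAST+, `-m 8` in legacy BLAST); the paper does not state the exact blastx parameters, and the function names, the e-value cutoff and the `organism:gene` form of the subject identifier are assumptions made for the example.

```python
import csv
from typing import Dict, Iterable, List, Tuple, Union

# Column order of the standard 12-column BLAST tabular output.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

Hit = Dict[str, Union[str, float]]


def parse_blast_tab(lines: Iterable[str], max_evalue: float = 1.0) -> List[Hit]:
    """Parse tab-delimited hits and keep those at or below max_evalue.

    max_evalue is an illustrative cutoff, not the one used in the study.
    """
    hits: List[Hit] = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < len(BLAST_COLUMNS) or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        hit: Hit = dict(zip(BLAST_COLUMNS, row))
        for numeric in ("pident", "evalue", "bitscore"):
            hit[numeric] = float(hit[numeric])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


def split_kegg_id(subject_id: str) -> Tuple[str, str]:
    """Split an assumed 'organism:gene' subject identifier into its two parts."""
    organism, _, gene = subject_id.partition(":")
    return organism, gene
```

Keeping the organism and gene identifiers separate makes it straightforward to join each hit to KEGG orthology annotation in a later step.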

Reconstruction

The HUMAnN (HMP Unified Metabolic Analysis Network) pipeline¹¹ is designed to determine the coverage (presence/absence) and the relative abundance of metabolic pathways and modules. Reconstruction typically begins with the aligned gene identifiers and proceeds through the following five steps. Visit http://huttenhower.sph.harvard.edu/humann for additional details on the HUMAnN pipeline.²

Reads are weighted by the quality of their matches to calculate the abundance of each orthologous gene family.
The MinPath algorithm is used to assign gene families to metabolic pathways and modules.
False-positive pathways identified by MinPath are filtered out based on taxonomic composition.
Gaps in a pathway caused by sequencing errors or low-abundance genes are filled.
Finally, a coverage score and an abundance score are assigned to each metabolic pathway and module.
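The first and last of these steps can be sketched in a few lines of Python. This is a deliberately simplified illustration and not HUMAnN's actual scoring: it omits MinPath, taxonomic filtering and gap filling, takes coverage to be the fraction of a pathway's gene families that were detected, and takes abundance to be the mean gene family abundance. All function names and the placeholder identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def gene_family_abundance(
    hits: Iterable[Tuple[str, str, float]]
) -> Dict[str, float]:
    """Sum per-read weights for each gene family.

    hits holds (read_id, gene_family_id, match_weight) triples. Each read's
    weights are normalised to sum to 1 so a read counts once in total.
    """
    per_read: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for read_id, family, weight in hits:
        per_read[read_id].append((family, weight))

    abundance: Dict[str, float] = defaultdict(float)
    for matches in per_read.values():
        total = sum(weight for _, weight in matches)
        if total <= 0:
            continue  # no usable weight for this read
        for family, weight in matches:
            abundance[family] += weight / total
    return dict(abundance)


def pathway_scores(
    abundance: Dict[str, float], pathways: Dict[str, Set[str]]
) -> Dict[str, Tuple[float, float]]:
    """Return {pathway_id: (coverage, mean_abundance)} under the simplification above."""
    scores: Dict[str, Tuple[float, float]] = {}
    for pathway_id, families in pathways.items():
        if not families:
            continue
        values = [abundance.get(family, 0.0) for family in families]
        coverage = sum(1 for value in values if value > 0) / len(values)
        mean_abundance = sum(values) / len(values)
        scores[pathway_id] = (coverage, mean_abundance)
    return scores
```

Separating the two scores in this way mirrors the distinction used throughout the Results: coverage answers whether a pathway appears to be present, while abundance answers how strongly it is represented.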

Pathway Hierarchy

Interpretation of metabolic pathway functionality is not possible without integrating the reconstruction data with the pathway hierarchy retrieved from KEGG BRITE.⁵ Hence curated information including pathway ID (koID), modules, orthologs (KID) and enzymes (EC#) was also uploaded into the 3M visualization application.³ The 3M application was also enhanced so that the pathway hierarchy can be customized as needed. Depending on the experimental requirements, the reconstruction step can be repeated for each specimen, such as control or test, and the pathways involved can be analyzed visually by relative significance (Figure 2).

Figure 2 End to End Data Processing Flow.

Visualization

The comparative metagenomic visualization process began after uploading the information for both specimens into the custom-built Metagenomic Metabolic Manual (3M) Annotation application.³ This web application can be accessed with the (dev/dev) credentials at the Apex URL http://apex.oracle.com/pls/apex/f?p=689131. For a cost-effective solution and ease of maintenance, the visualization and annotation screens were developed on the Oracle Apex⁴ rapid development framework. Access to the Apex web development framework is free for non-production applications like 3M. Since this web development technology is also included in the Oracle database license, it is a sustainable, cost-effective alternative for web development.

Results

With the help of the pathway significance scores and the curated KEGG pathway hierarchy, the user can visualize the data from various perspectives. The following are some of the observations made while visualizing the metagenomic specimen data obtained from reconstruction. For simplicity, the user’s selection of control and test specimen data is retained for the duration of the active user session.

Top 10 Pathways

Visualization of the top 10 pathways showed nearly identical pathway abundance in the two specimens; the major differences were “One carbon pool by folate” in Specimen1 and “DNA Replication” in Specimen2.

Side by side comparison

As far as pathway coverage is concerned, the gap between the two specimens widens drastically. In particular, the Sulfur relay system (ko04122) in Specimen1 showed the largest gap in coverage, which might reflect a dominant sulfur metabolism function in the Specimen1 community.

Highest differences

Significant differences between the specimens were observed in the C5-Branched dibasic acid metabolism (ko00660) pathway. A similar difference was also noticed in the Lipoic acid metabolism (ko00785) pathway, while the Flagellar assembly (ko02040) and Sulfur relay system (ko04122) pathways were indicators of the behavior of the microbial communities.

Least differences

Ranking by least difference in pathway coverage identified 10 pathways without any significant difference. The abundance comparison also found the Alzheimer's disease (ko05010) pathway, which is unrelated to the microbial community. During the annotation phase, this pathway can easily be identified and eliminated with 3M annotation. Since both specimens were collected from human saliva samples,⁷ the possibility of contamination by human genomic sequences cannot be ruled out.

Missing pathways

No pathways were missing from either specimen in the abundance analysis. From the pathway coverage point of view, however, three pathways were missing from Specimen1 but found in Specimen2: D-Glutamine and D-glutamate metabolism (ko00471), SNARE interactions in vesicular transport (ko04130) and Spliceosome (ko03040). Conversely, 37 pathways were missing from Specimen2, most of them related to human metabolic pathways.
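A missing-pathway comparison of this kind reduces to a set difference over the pathways with non-zero coverage. The Python sketch below shows the idea; the function name is hypothetical, the coverage values are made up for illustration, and it is an assumption that a coverage score of zero is what marks a pathway as missing in the 3M application.

```python
from typing import Dict, List, Tuple


def missing_pathways(
    coverage_a: Dict[str, float], coverage_b: Dict[str, float]
) -> Tuple[List[str], List[str]]:
    """Return (missing from A but present in B, missing from B but present in A).

    A pathway counts as present when its coverage score is above zero.
    """
    present_a = {pathway for pathway, score in coverage_a.items() if score > 0}
    present_b = {pathway for pathway, score in coverage_b.items() if score > 0}
    return sorted(present_b - present_a), sorted(present_a - present_b)


# Made-up coverage scores keyed by KEGG pathway ID, for illustration only.
specimen1 = {"ko04122": 0.9, "ko02040": 0.7, "ko00471": 0.0}
specimen2 = {"ko04122": 0.2, "ko00471": 0.6, "ko03040": 0.4}

missing_from_1, missing_from_2 = missing_pathways(specimen1, specimen2)
```

With these toy inputs, `missing_from_1` holds ko00471 and ko03040, and `missing_from_2` holds ko02040.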

Correlation

The 3M visualization application was also designed to calculate the correlation coefficient r and compare it against a t-table to test for a possible relationship between the Specimen1 and Specimen2 scores.^6,7 For the 147 metabolic pathways with matching pairs of abundance scores, the calculated coefficient was r=0.9754, indicating a significant correlation between the pathway abundance of the two specimens. For pathway coverage, the 100 metabolic pathways with matching pairs gave a coefficient of r=0.8798, indicating a weaker correlation than for pathway abundance.
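The calculation described here can be sketched in Python as follows. The paper does not give the formulas used inside the 3M application, so this is an assumption: it uses the standard Pearson coefficient and the usual conversion of r to a t statistic, t = r·sqrt((n−2)/(1−r²)) with n−2 degrees of freedom. The function names are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between paired scores."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three matching pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least three pairs")
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)  # perfect correlation
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

For the values reported above, `t_statistic(0.9754, 147)` is roughly 53 and `t_statistic(0.8798, 100)` is roughly 18. Both are far beyond the critical values found in a standard t-table at these degrees of freedom, which is consistent with the correlations being described as significant.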

Pathway hierarchy

Side by side comparison of the pathway abundance hierarchy also revealed a significant number of differences between the pathways of Specimen1 and Specimen2.^6,7 A few notable pathways observed only in Specimen1 were related to human metabolic pathways. A similar conclusion was reached from the pathway coverage hierarchy for both specimens.

Manual annotation

The 3M visualization application³ was enhanced to let the user curate the reconstruction data as well as the pathway hierarchy. The annotation capabilities include uploading HUMAnN pipeline data, modifying and deleting uploaded data, maintaining KEGG pathway related orthology, and maintaining the relationships between modules and the related KEGG orthology and enzyme functions. With the help of the “HUMAnN Data” tab, the user can inactivate the abundance and coverage scores of specific unrelated pathways. Thus the user can effectively eliminate outlier pathways from further analysis.

Discussion

Even though both human saliva specimens contain a similar representation of microbial communities, the specimens differ in their comparative metagenomic potential. The top three phyla, Firmicutes, Proteobacteria and Bacteroidetes, were predominant in both specimens. Most of the pathways mentioned in the Results section represent genomic potential and may not actually be expressed in the microbiome. Only proteomic and transcriptomic studies may be able to establish the actual pathway abundance and coverage within the cellular environment. This information might help us in the future to understand the functional nature of the community and possibly to prepare comparative metagenomic profiles and biomarkers indicative of potential diseases such as periodontal disease.¹²

Interpretations

Further study of the top ten pathways identified an abundance of the most common pathways, such as Ribosome, lipopolysaccharide, Valine, Leucine, Isoleucine, Thymine, Alanine and Peptidoglycan biosynthesis pathways. The abundance of the Sulfur relay system (ko04122) pathway in Specimen1 also points to possible microbial community functions of sulfur metabolism, cell proliferation, apoptosis and DNA repair. The abundance of Flagellar assembly (ko02040) in Specimen1 indicates that it contains abundant bacterial communities that use flagella for locomotion. Analysis of the missing pathways remains inconclusive, but it suggests that not enough Specimen2 short-read sequences were acquired to perform metabolic reconstruction. Pathway abundance correlates well (r=0.9754), whereas pathway coverage is less correlated (r=0.8798). Surprisingly, only three pathways were missing from the coverage of both specimens, which strongly suggests the existence of a pathway pattern within human saliva specimens. Such pathway profiles of an ecosystem can be very useful for clinical diagnosis and prognosis.

Future enhancements

NCBI blastx alignment improved the accuracy of alignment, but it also introduced a performance cost within the reconstruction pipeline. In the future the pipeline can be enhanced with bowtie2.
Most of the alignment jobs were submitted to the DIAG HPC cluster¹³ via the “qsub” command. In the future, pipeline throughput can be enhanced with the Hadoop distributed data and processing platform.
For simplicity, the 3M visualization application supports only two specimens (test and control) for the comparative study. In the future the 3M application can be enhanced to perform multi-specimen analysis.
The reconstruction pipeline integrates the backend (DIAG pipeline) and frontend (Oracle Apex) 3M applications, which exchange data via tab-delimited files. In the future, direct database connectivity to the visualization tool will reduce manual overhead.
The 3M application can be enhanced to perform a series of iterative automated and manual annotations until the desired result quality is reached.

Conclusion

The microbial distribution shown in Table 2 contains various probiotic taxa such as Firmicutes, Proteobacteria, Actinobacteria, Bacteroidetes and Fusobacteria. No significant source of pathogenic species such as Streptococcus pyogenes was found in either of the human saliva specimens. Various studies¹² have linked periodontal (gum) disease to Streptococcus pyogenes bacteria within sub-gingival plaque communities. Invasion of periodontal pathogenic species into the bloodstream has been associated with tooth decay, chronic vascular disease and stroke.¹² As periodontal disease progresses, the patient’s treatment options become more limited. Thus complicated and costly procedures can be avoided by detecting a shift in the microbial community profile early in disease progression.

Species	Specimen1 Population SRR062371-4473347.3	Specimen2 Population SRR062402-4473378.3
Firmicutes	2022849	3359405
Proteobacteria	1000866	1077421
Bacteroidetes	658657	2514329
Actinobacteria	534319	265861
Fusobacteria	83655	217717
Chordata	36145	41367
Fibrobacteres	31905	44676
Cyanobacteria	27592	46302
Ascomycota	24399	42809
Spirochaetes	23074	42138

Table 2 Top 10 Metagenomic species distribution

During the specimen collection phase of Specimen2, HMP was not able to filter out human cells from the microbial cells. Once again, this study emphasizes the importance of microbial specimen cell preparation and filtration procedures. The lessons learned and pathway gaps identified in this study can be further curated with the 3M annotation functionality and reprocessed via the HUMAnN pipeline. Finally, this study is a small step towards the potential application of comparative metagenomics to define reference ranges for healthy human subjects. Such reference databases should help researchers distinguish the pathogenic state of human microbial communities. Further research into the functional nature of microbial communities may lead to future diagnostic, prognostic and personalized medicine biomarkers.

Acknowledgements

I would like to express sincere gratitude to Prof. Joshua Orvis of Johns Hopkins, who encouraged me to think outside the box and take on challenging tasks like this one. I am also thankful to the DIAG team for making such great computational resources available for most of the analytical data processing. Without support from the HUMAnN team at the Huttenhower Lab, Department of Biostatistics, Harvard School of Public Health, this study would not have been possible. Finally, special thanks to HMP, MG-RAST, KEGG and Oracle.