Advancing our understanding of the soil microbial communities using QIIME software: a 16S data analysis pipeline

doi:10.15406/jabb.2017.04.00106

Journal of

eISSN: 2572-8466

Applied Biotechnology & Bioengineering

Research Article Volume 4 Issue 3

Advancing our understanding of the soil microbial communities using QIIME software: a 16S data analysis pipeline

Chandan Badapanda,^1,2

Verify Captcha

Regret for the inconvenience: we are taking measures to prevent fraudulent form submissions by extractors and page crawlers. Please type the correct Captcha word to see email ID.

Ruchi Rani,² Ganesh Chandra Sahoo¹

¹Department of Biotechnology, OPJS University, India
²Department of Bioinformatics, Xcelris Labs Limited, India

Correspondence: Chandan Badapanda, Xcelris Labs Limited, Ahmedabad, Gujarat, India, Tel 91-79-66092177, Fax 91-79-66309341

Received: October 30, 2017 | Published: December 20, 2017

Citation: Badapanda C, Rani R, Sahoo GC. Advancing our understanding of the soil microbial communities using QIIME software: a 16S data analysis pipeline. J Appl Biotechnol Bioeng. 2017;4(3):610-616. DOI: 10.15406/jabb.2017.04.00106

Download PDF

Abstract

QIIME (Quantitative Insights Into Microbial Ecology) is one of the most popular open-source bioinformatics suite for performing 16S rRNA as well as Internal Transcribed Spacer (ITS) data analysis. The challenge that has frustrated microbiologists for decades is how to access the microorganisms that cannot be cultured in the laboratory and NGS (Next Generation Sequencing) platforms provide an additional set of tools to study uncultured species. Here, QIIME pipeline is implemented on soil samples to discover the microbial communities that exist in the soil where Rhazya stricta (Apocynaceae) is grown. The data sets of five soil samples were downloaded from NCBI-SRA for analysis through QIIME. The NGS technologies are very promising for resolving the immense soil 16S rRNA gene bacterial diversity and the pipeline implemented in this work can also be used for other bacterial diversity studies apart from the soil sample. Also our aim is to make use of the data which is often buried in supplementary data or SRA database that can be used to derive meaningful conclusion in metagenomics study through integrated bioinformatics approaches.

Keywords: 16S rRNA, data analysis, QIIME, bioinformatics, NGS, metagenomics, ITS

Introduction

Rhazya stricta (Apocynaceae) is an important medicinal plant found commonly in South Asia (Pakistan, India and Afghanistan) and the Middle East (i.e. Saudi Arabia, Qatar, United Arab Emirates (UAE), Iran and Iraq). It is used in local folk medicine to treat many diseases such as diabetes mellitus, certain inflammatory conditions, and helminthiasis because leaves of Rhazya stricta contains more than 100 alkaloids with diverse pharmacological properties.¹ Furthermore, as a perennial, it is likely that the associated microflora in the rhizosphere and on the leaves is controlled by the plant itself via these gene products and compounds. Plants use an array of secondary metabolites to defend themselves against harmful organisms and to attract others that are beneficial.²

The soil microbiota harbours thousands of bacteria, archaeal, and eukaryotic taxa and these microorganisms play critical roles in regulating soil fertility, plant health and controlling biogeochemical cycling of elements essential for life.³ To identify cultured fungal isolates associated with Rhazya stricta in the rhizosphere and soil in the vicinity of the plant from different sites of Saudi Arabia, a 18S and DNA barcoding study was performed.² A few studies on genomic and transcriptomics of the plant Rhazya stricta were also done^4,5 but so far to our knowledge no report has been published on the 16S data analysis profile of the soil sample where Rhazya stricta is grown. The microflora of desert soil is highly dependent on the temperature, moisture and the presence of organic carbon.⁶ In this article, QIIME (Quantitative Insights into Microbial Ecology)⁷ Software Package is implemented to address two things

To decipher the microbial community present in the hot condition of Arabic soil where Rhazya stricta is grown
How the QIIME pipeline can be used to analyze 16S rRNA gene sequences from any microbial communities

Method

Soil near Rhazya stricta plants were taken as site for collecting the samples and in the manuscript we have renamed the samples as Sample A, Sample B, Sample C, Sample D and Sample E. Rhazya stricta (Apocynaceae) is a small, desert shrub found in Saudi Arabia and some other Asian countries. The total DNA was isolated from different soil sites around Rhazya stricta plant located in Saudi Arabia (N21°26.4’, E39°31.8’), these sites contain extensive Rhazya stricta plants. Soil DNA isolation was done by Power Soil DNA isolation kit as per the details under the NCBI Bioproject Id PRJEB8340. The Sequencing was carried out using Illumina MiSeq platform using 2 x150bp sequencing chemistry. An analysis of the deep sequences generated from the 16S V4 rRNA region amplicons was carried out to identify the taxonomic positions as well as the phylogenetic across five samples through the software package QIIME.⁷

Data analysis

For the study of diversity of microbial community present in rhizosphere of Rhazya stricta, QIIME-1.8.0⁷ software was used. Trimmomatic (V0.36)⁸ was used for quality filtration and filtration was done on quality value >25 to get high quality (HQ) data. Paired end data were given as input in QIIME for the identification of non-chimeric sequences which is subsequently used in the downstream analysis process. Operational taxonomy unit (OTU) were assigned to similar sequences, for this UCLUST algorithm⁹ was used at sequence similarity threshold of 97% and greengenes database¹⁰ was used as the reference database for picking up OTUs, followed by picking up the representative OTUs. Taxonomy was assigned to each sequence, using BLAST tool¹¹ assignment method and the greengenes database was used as the reference database at 90% similarity. Diversity between samples (“beta diversity”) was measured with UniFrac distances and evaluated using Principal Coordinate Analysis (PCoA) plots. The detail of the 16S data analysis is also previously described by our group¹² and the workflow for 16S rRNA data analysis is provided in Figure 1.

Figure 1 A bioinformatics workflow for 16s data analysis.

Results and discussion

A total of 5710165 16S rRNA sequences from the five soil samples (Sample A, Sample B, Sample C, Sample D and Sample E) were taken for the QIIME analysis. The details of Sample A, Sample B, Sample C, Sample D and Sample E with their SRA submitted ID, amount of high quality data is provided in Table 1.

Sample Name	SRA ID	Raw Reads	HQ reads
Sample A	ERR747886	952991	945600
Sample B	ERR747887	1612304	1599742
Sample C	ERR747888	1506361	1496742
Sample D	ERR747889	1525294	1514090
Sample E	ERR747891	154901	153991

Table 1 Data statistics of Sample A, Sample B, Sample C, Sample D and Sample E

A total of 142971 OTUs were assigned at 97% identity for all 5 samples taking Greengenes database¹⁰ as a reference database through QIIME pipeline.⁷ Through QIIME, Actinobacteria, Proteobacteria followed by Acidobacter were found to be the most abundant phyla across five samples (Sample A, Sample B, Sample C, Sample D and Sample E). At phylum level, 22.9%, 24.3%, 29.5% and 16.2% of OTUs were assigned with Actinobacteria for Sample A, Sample B, Sample D and Sample E respectively, whereas 34.3% of OTUs were assigned with Actinobacteria for Sample C. Actinobacteria have significant biogeochemical roles in terrestrial soils as they have the ability to produce biologically-active secondary metabolites and almost 16,500 compounds have been reported to show antibacterial property against pathogenic bacteria.^13,14 Secondly, Proteobacteria was found to be 22.2%, 17.2%, 19.7% and 23.4% in Sample A, Sample B, Sample D and Sample E where as Sample C was enriched with 38.3% with Proteobacteria. Actinomycetes (phylum-Actinobacteria) is found to be the most abundant bacterial phylum in desert soil.⁶ Apart from above, various other phyla such as Chloroflexi, Firmicutes, Gemmatimonadetes, Verrucomicrobia, Cyanobacteria were also identified to be present in all the five samples. A similar pattern of abundant phyla were reported previously from the desert soil and a variety of soil samples from different regions of the globe.^3,15

Archaeal taxa were relatively rare across many biomes but abundantly present in hot desert soils and Thaumarchaeota being the principal representative archaeal group in nearly all soil samples reported previously.^3,16 Two archaeal phyla, Euryarchaeota and Crenarchaeota were found to be present in all the five samples (Sample A, Sample B, Sample C, Sample D and Sample E). Euryarchaeota was found to be most abundant group with 9.82%, 6.81%, 0.14%, 13% and 8.58% OTUs in Sample A, Sample B, Sample C, Sample D and Sample E respectively. Crenarchaeota were found to be second most abundant group representing 2.40%, 1.64%, 0.05%, 0.89% and 1.94% OTUs in Sample A, Sample B, Sample C, Sample D and Sample E respectively. All Crenarchaeota belong to Thaumarchaeota and all Euryarchaeota belong to Thermoplasmata at class level and found to be present in all the five samples (Sample A, Sample B, Sample C, Sample D and Sample E). The similar pattern was also observed in hot desert soil samples reported previously.¹⁷

A heatmap is generated across Sample A, Sample B, Sample C, Sample D and Sample E in Figure 2 showing the comparative analysis at Phylum level through MetaPhIAn.¹⁸ In heatmap, each row corresponds to an OTU, and each column corresponds to a sample. The higher the abundance of an OTU in a sample, the more intense is the color at the corresponding position in the heatmap. By default, the OTUs (rows) were clustered by UPGMA hierarchical clustering, and the samples (columns) were presented in the order in which they appear in the OTU table. Figure 3 represents the heatmap across Sample A, Sample B, Sample C, Sample D and Sample E and their comparison at genus level. For interactive visualization of results Krona graph was generated at class level for Sample A, Sample B, Sample C, Sample D and Sample E. From Figure 4, it is observed that Sample C was enriched with 33.2% of Actinobacteria and 28.6% of Alphaproteobacteria at class level and found to be dominant as compared to other samples. In Sample D, 12.9% of OTUs were enriched with Thermoplasta and 6.9% were enriched with Thermoleophila. The Krona graph¹⁹ was generated in Figure 4 for Sample A, Sample B, Sample C, Sample D and Sample E respectively at class level.

Figure 2 A heatmap showing the abundance of each phylum within each microbial community is drawn through MetaPhIAn for Sample A, Sample B, Sample C, Sample D and Sample E.

Figure 3 An OTU table heat map showing taxonomy assignment for each OTU. The OTU heatmap displays raw OTU counts per sample, where the counts are colored based on the contribution of each OTU to the total OTU count present in that sample (blue: contributes low percentage of OTUs to sample; red/yellow: contributes high percentage of OTUs). Heatmap generated at genus level.

Figure 4 Represents the Krona graph for taxonomy assignment at class level for

Sample A
Sample B
Sample C
Sample D and
Sample E respectively.

Diversity analysis

Alpha diversity

Alpha diversity within the five soil samples were used to depict evenness or richness of lineages present in all the samples.^3,7,12 In all samples the total observed species was found in a range of 18000 to 108000. Sample A was found to have highest number of observed species (108333) whereas Sample D was found having lowest number of observed species (15010). No significant differences were observed in the alpha diversity indices across five samples and Table 2 represents the alpha diversity of all five samples using different matrices (shannon, PD Whole tree (Phylogenetic diversity whole tree), chao1, observed species).

Sample Name	Shannon	PD whole tree	Chao1	Observed Species
Sample A	12.92365	2188.898	166462	108333
Sample B	11.65946	844.2823	45335.98	33407
Sample C	12.55557	2197.672	105938.6	74910
Sample D	11.3913	424.9066	24184.47	15010
Sample E	8.902856	430.9604	30922.92	18534

Table 2 Alpha diversity calculations for Sample A, Sample B, Sample C, Sample D and Sample E

Beta diversity

Beta diversity depict the dissimilarity between the samples.^3,7,12 We had calculated the bray-curtis distances among five samples. Table 3 having the distances between the samples. From the PcoA analysis, it can be inferred that Sample C found to be most distant sample from Sample A, Sample B, Sample D and Sample E. Unweighted UniFrac PcoA (principal coordinates analysis) plots were used to show the relationship between soil samples.¹³ PcoA plot for Sample A, Sample B, Sample C, Sample D and Sample E is provided in the Figure 5 and phylogenetic analysis among five samples is shown in Figure 6. From the above analysis (beta diversity, PcoA and Phylogenetic study), it can be inferred that sample C is having different taxonomic composition as compared to sample A, B, D, E.

	Sample B	Sample A	Sample D	Sample E	Sample C
Sample B	0	45504.04	51291.19	60148.3	133838.3
Sample A	45504.04	0	36322.29	29534.51	122924.5
Sample D	51291.19	36322.29	0	53226.45	128100.9
Sample E	60148.3	29534.51	53226.45	0	121025.2
Sample C	133838.3	122924.5	128100.9	121025.2	0

Table 3 Beta diversity between Sample A, Sample B, Sample C, Sample D and Sample E

Figure 5 Principle Coordinate Analysis (PCoA) Plot from unweighted UniFrac distance of Sample A, Sample B, Sample C, Sample D and Sample E.

Figure 6 A visualization of bootstrap-supported hierarchical cluster tree using unweighted uniFrac distance matrix that was used for the PcoA plot for the Sample A, Sample B, Sample C, Sample D and Sample E. The figure shows the tree with internal nodes colored, red for 75-100% support, yellow for 50-75%, green for 25-50%, and blue for <25% support. The bar at the bottom of the tree indicates a length corresponding to 0.1 nucleotide substitutions per site.

Conclusion

16S rRNA sequence analysis performed in this study contributed to the thermo tolerance microbiota present in the soil of Rhazya stricta, an arid land, perennial evergreen shrub commonly found in the Arabian Peninsula and Indian subcontinent. At phylum level, Actinobacteria, Proteobacteria followed by Acidobacter were found to be the most abundant phyla across Sample A, Sample B, Sample C, Sample D and Sample E.

The observed species calculated from alpha diversity varies from 15010 to 108333 OTUs between five samples. From the PcoA analysis, it can be inferred that Sample C was found to be the most distant sample from Sample A, Sample B, Sample D and Sample E. This result is also in line with the phylogenetic analysis as Sample C was found to be the out group as compared to Sample A, Sample B, Sample D and Sample E.

The QIIME pipeline implemented for the bacterial diversity analysis of five samples in this article, can also be used for any other bacterial data analysis from any environmental sample. In conclusion, our aim is to make use of the data which is often buried in supplementary data or SRA database or other public resources and can be used to derive meaningful conclusion in metagenomics study through integrated bioinformatics approaches.