Suitability and advantages of abundance-weighted algorithms for the holistic comparison of sample systems

doi:10.15406/jabb.2017.02.00050

Journal of Applied Biotechnology & Bioengineering

eISSN: 2572-8466

Review Article Volume 2 Issue 6

Jia Shi Zhu,¹

Lu Qun Ni,²,³ Yi Sang Yao,² Yu Ling Li,⁴,⁵ Wei Dong Xie²

¹Department of Mathematics, University of California San Diego, USA
²Department of Applied Biology and Chemistry Technology, The Hong Kong Polytechnic University, Hong Kong
³Division of Life Sciences and Health, Tsinghua University Graduate School at Shenzhen, China
⁴Qinghai Academy of Animal Husbandry and Veterinary Sciences, Qinghai University, China
⁵State Key Laboratory of Plateau Ecology and Agriculture, Qinghai University, China

Correspondence: Jia-Shi Zhu, Hong Kong Polytechnic University, Hong Kong, Tel 1(858)705-3790

Received: February 26, 2017 | Published: April 17, 2017

Citation: Lu-Qun Ni, Jia-Shi Zhu, Yi-Sang Yao, et al. Suitability and advantages of abundance-weighted algorithms for the holistic comparison of sample systems. J Appl Biotechnol Bioeng. 2017;2(6):230-235. DOI: 10.15406/jabb.2017.02.00050

Abstract

Numerous microcosmic research techniques have been available for examinations of individual chemicals (DNA, RNA, proteins, carbohydrates and other small-molecule chemicals) in sample systems. However, the results generated from many of the microcosmic studies have fueled speculations, hypotheses and non-conclusive debates in the study of systematics, similar to blind men each touching a portion of an elephant. To address this shortcoming, macrocosmic holistic techniques exhibit advantages in profiling and systematically comparing genetic diversity and genetic distance between sample systems, transcriptome and proteome expressions and metabolome/chemical constituent fingerprints. Although binary abundance-unweighted algorithms have been widely used in such holistic comparisons, most of the research data do not fall into an “all-or-none” qualitative category, leading to unsuitability of abundance-unweighted algorithms as analytic tools in these types of holistic comparisons. Using an RAPD molecular polymorphism study as the example, this paper reviews the prerequisites and limitations of the binary abundance-unweighted algorithms in holistic comparisons and the advantage of using abundance-weighted algorithms, which are mathematically general and suitable for any holistic analyses.

Keywords: abundance-weighted similarity computation, abundance-weighted cluster construction, polymorphism of molecular markers, holistic similarity, phylogenetic analysis, natural Cordyceps sinensis

Abbreviations

ITS, internal transcribed spacer; CAPS, cleaved amplified polymorphic sequence; DAF, DNA amplified fingerprints; ISSR, inter-simple sequence repeat; RAPD, random amplified polymorphic DNA; RFLP, restriction fragment length polymorphism; SCAR, sequence characterized amplified regions; SSCP, single-strand conformation polymorphism; SSR, simple sequence repeat; HS, Hirsutella sinensis

Introduction

Although numerous microcosmic research techniques have been developed and widely used to examine individual chemicals (DNA, RNA, proteins, carbohydrates and other small-molecule chemicals) in sample systems, macrocosmic holistic techniques exhibit advantages in systematic comparisons of genetic diversity and genetic distance between sample systems, transcriptome and proteome expressions and metabolome/chemical constituent fingerprints. In holistic research projects, binary abundance-unweighted algorithms have been widely used, although they are limited to two-sample comparisons. However, most of the data generated from holistic studies do not fall into an “all-or-none” qualitative category; thus, abundance-unweighted algorithms may not serve as suitable analytic tools for these types of research. Using an RAPD molecular polymorphism study as the example, this paper reviews the prerequisites and limitations of the binary abundance-unweighted algorithms and the advantages of abundance-weighted algorithms in holistic studies.

Since 1999, PCR-based nrDNA Internal Transcribed Spacer (ITS) sequencing has been used to determine the taxonomic status of the fungal specimens examined in the natural Cordyceps sinensis insect-fungi complex, as an auxiliary molecular method for examining the anamorph-teleomorph connection of Ophiocordyceps sinensis.^1-4 Scientists have reported the molecular heterogeneity of C. sinensis-associated fungi of the genera Hirsutella, Paecilomyces and Tolypocladium as well as Geomyces pannorum, Cladosporium macrocarpum, Phaeosphaeria pontiformis and Neosetophoma samarorum (in total, more than 90 species spanning at least 37 genera).^5-21 SNP mass spectrometry genotyping and amplicon sequencing, with or without molecular cloning, have identified the mutant ITS sequences of at least 17 genotypes of O. sinensis fungi, with multiple, scattered point mutations or DNA segment substitution hereditary variations, from the natural C. sinensis insect-fungi complex.^{1-4,10-13,19,22-28} The successful production of artificial C. sinensis enabled the molecular examination of the fungal strains that were used as inoculation agents and the correlation of the genotype results with the finished product, artificial C. sinensis.²⁸ That study identified a single fungal genotype, Genotype #1 H. sinensis, in the 3 anamorphic fungal strains that were used as the inoculation agents, but only the teleomorphic Genotype #4 of O. sinensis in the fruiting body and caterpillar body of artificial C. sinensis. The study therefore presented the paradox of “planting melon seeds and harvesting beans” and provided strong evidence against the sole O. sinensis anamorph hypothesis for Genotype #1 H. sinensis, which was proposed a decade earlier by the same corresponding author and colleagues.²⁹

The microcosmic mycological and molecular results have fueled speculations, paradoxes, hypotheses and non-conclusive debates in mycological systematics studies, similar to blind men each touching a portion of an elephant (盲人摸象 in Chinese). To address this shortcoming, the macrocosmic molecular marker polymorphism approach has been used as a component of overall molecular systematics strategies to profile the natural C. sinensis insect-fungi complex as a holistic entity and to compare the holistic polymorphic similarities of the systems without requiring precise examinations of the DNA sequences or the individual taxonomies of the component fungi. These macrocosmic molecular techniques for holistic comparisons include AFLP (Amplified Fragment Length Polymorphism), CAPS (Cleaved Amplified Polymorphic Sequence), DAF (DNA Amplified Fingerprints), ISSR (Inter-Simple Sequence Repeat), RAPD (Random Amplified Polymorphic DNA), RFLP (Restriction Fragment Length Polymorphism), SCAR (Sequence Characterized Amplified Regions), SSCP (Single-Strand Conformation Polymorphism) and SSR (Simple Sequence Repeat).^1-4,30,31 Among these methodologies, RAPD molecular marker polymorphism analysis is the most frequently used technique for comparing overall similarities or dissimilarities (genetic distances) and exploring the phylogenetic (cluster) relationship between the test systems,^2,29,32-41 although it has been suggested that ISSR may be more sensitive than RAPD^42-44 and that costly metagenomic approaches may demonstrate advantages in qualitative studies of microbial genetic diversity and molecular ecology.^45-47 Among the key factors in study design for molecular marker polymorphism comparisons, the importance of unbiased selection of a plurality of random primers for random amplification of the genomic DNA templates isolated from the examined systems has been addressed.^1-4,40,41 The use of only a few random primers without reporting the objectivity and representativeness of the selection could lead to bias in the data analysis and thus biased conclusions.^1-4,40,41,48 When C. sinensis insect-fungi complex samples and fungal strain samples were not profiled as a whole, the resulting holistic comparisons, interpretations and conclusions were inaccurate.²⁹ Because holistic analysis of molecular marker polymorphisms relies upon similarity computation and phylogenetic (cluster) tree construction, selection of analytical algorithms becomes the next question in study design for precise analyses and comparisons of the holistic polymorphisms between study systems.

Computational biology algorithms for polymorphism similarity analysis

Numerous studies have consistently used the PCR amplicon abundance-unweighted algorithm known as the Nei-Li Coefficient equation (1) ⁴⁹ or similar ⁵⁰ for similarity computations. The Nei-Li equation:

$S = \frac{2 N_{A B}}{N_{A} + N_{B}}$ (1)

where
S = similarity,
N_AB = number of matched DNA bands in lanes A and B amplified through PCR, and
N_A and N_B = total number of DNA bands in lanes A and B, respectively.

This equation was designed as an abundance-unweighted algorithm to analyze “all-or-none” data, to compare pure systems in pairs and, in particular, to analyze the loss of restriction sites due to mutations.⁴⁹ The proper use of this and similar algorithms has 2 computational biology prerequisites:

(a) All matched DNA pairs in the electrophoretic lanes being compared must have essentially the same densities, and
(b) All DNA amplicons must be well separated by electrophoresis from the adjacent DNA moieties with similar molecular weights and conformations.^40,41
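
To make the binary nature of equation (1) concrete, the following minimal sketch computes the Nei-Li similarity from lists of band sizes. The function name `nei_li_similarity` and the band sizes are hypothetical and chosen only for illustration; they are not taken from Figure 1 or from any cited software. Note that band density never enters the computation.

```python
def nei_li_similarity(bands_a, bands_b):
    """Nei-Li similarity, equation (1): S = 2*N_AB / (N_A + N_B).

    bands_a, bands_b: iterables of band sizes (e.g. in bp) scored as present
    in lanes A and B. Band density is ignored entirely.
    """
    a, b = set(bands_a), set(bands_b)
    total = len(a) + len(b)
    if total == 0:
        raise ValueError("at least one lane must contain a band")
    return 2 * len(a & b) / total


# Hypothetical band sizes (bp), invented for illustration only.
lane_a = [300, 570, 900, 1700]
lane_b = [300, 570, 1200, 1974]
print(nei_li_similarity(lane_a, lane_b))  # 2*2 / (4+4) = 0.5
```

Because each band counts as one unit regardless of its intensity, a faint band and a dominant band contribute equally to S in this sketch.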

However, in reality, as shown in Figure 1, the matched DNA bands in paired electrophoretic lanes may have similar or dissimilar densities, and some amplicons are not well separated from the adjacent DNA moieties. Analysis using the Nei-Li equation (1)⁴⁹ neglects the proper weights of the high-density amplicons in the similarity computation, such as the 570-, 1,700- and 1,974-bp bands in the left panel of Figure 1 and the 973-bp band in the right panel. As a result, the multiple low-density amplicon bands are over-weighted in the similarity computation and the high-density amplicon bands are correspondingly under-weighted. Table 1 compares the similarities computed using the abundance-unweighted Nei-Li equation (1) and the abundance-weighted ZUNIX equation (2). The DNA amplicons were amplified by PCR using random primers S31 (left panel of Figure 1) or OPB-01 (right panel). The genomic DNA used as the PCR template was isolated from the ascocarp portion (AC) of wild C. sinensis and from Hirsutella sinensis (Hs) mycelia.

Figure 1 Agarose gel electrophoresis examining RAPD molecular marker polymorphisms of the ascocarp portion (AC) of wild Cordyceps sinensis and Hirsutella sinensis (HS) using primers S31 and OPB-01.⁴¹

Primers	Similarity by abundance-unweighted Nei-Li equation (1)	Similarity by abundance-weighted ZUNIX equation (2)	% Change (unweighted vs. weighted)
S31	0.5	0.23	-54%
OPB-01	0.29	0.83	+189%

Table 1 Comparisons of the polymorphism similarities computed using different algorithms.
Note: The molecular marker polymorphisms of the ascocarp portion (AC) of Cordyceps sinensis and the mycelial culture of Hirsutella sinensis (Hs) were examined by PCR amplification using random primers S31 and OPB-01 and by agarose gel electrophoresis, as shown in Figure 1. The overall polymorphism similarities were computed using either the abundance-unweighted Nei-Li equation (1) or the abundance-weighted ZUNIX equation (2).

Because the aforementioned computational biology prerequisites are not satisfied, the Nei-Li equation (1)⁴⁹ is not suitable for similarity computations in RAPD studies of C. sinensis: it treats every band as simply present or absent and therefore discards the molecular information carried by the abundance/density of the DNA amplicons. New arithmetic methods for similarity computation are required to quantify the molecular information provided by the abundance/density and migration speed of the DNA amplicons in gel electrophoresis, which is lost or partially lost during RAPD polymorphism analysis using the Nei-Li equation (1).⁴⁹ Accordingly, the ZUNIX equations were formulated for abundance-weighted similarity computation (the software for the ZUNIX algorithms can be downloaded for free from www.ebioland.com/ZUNIX.htm⁴⁰). The ZUNIX equation (2) was formulated to compare the polymorphisms of 2 electrophoretic lanes, where d_ik ≥ 0 (i = 1,2 and k = 1,2, …, m) is the density of band k in lane i, and defines the measure of similarity as follows:⁴⁰

$S = \frac{\sum_{k=1}^{m} 2 \min\{d_{1k}, d_{2k}\}}{\sum_{k=1}^{m} (d_{1k} + d_{2k})}$ (2)

where the similarity of the 2 densities d_1k and d_2k is the common portion of their values. The ZUNIX equation (2) defines similarity as the total density of all common parts present in the samples being compared divided by the total density of all bands across the samples. This algorithm is mathematically general, has no specific prerequisites and governs all conditions,⁴⁰ in contrast to the abundance-unweighted Nei-Li equation (1),⁴⁹ which applies only to the special cases that satisfy the 2 strict prerequisites described above. The second ZUNIX equation (3) is suitable for comparing the DNA amplicons in 3 electrophoretic lanes, where d_ik ≥ 0, i = 1,2,3 and k = 1,2, …, m, and is shown below.⁴⁰

$S = \frac{\sum_{k=1}^{m} 3 \min\{d_{1k}, d_{2k}, d_{3k}\}}{\sum_{k=1}^{m} (d_{1k} + d_{2k} + d_{3k})}$ (3)

Extending further, the third ZUNIX equation (4) is suitable for comparing DNA amplicons in n lanes (more than 3), where d_ik ≥ 0, i = 1,2, …, n and k = 1,2, …, m, and is given by:⁴⁰

$S = \frac{\sum_{k=1}^{m} n \min\{d_{1k}, d_{2k}, \ldots, d_{nk}\}}{\sum_{i=1}^{n} \sum_{k=1}^{m} d_{ik}}$ (4)

The ZUNIX equations (2) (3) (4) arithmetically consider the following:

The unmatched DNA (or protein or other chemical) bands and their densities (abundance)
Differences in the density (abundance) of the matched DNA (or protein or other chemical) bands (or peaks or areas under the curves) and
The ability to compare multiple samples.
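
As an illustration of equations (2), (3) and (4), the sketch below implements the n-lane form directly from the formulas: n times the per-band minimum density, summed over bands, divided by the total density of all bands in all lanes. It is not the ZUNIX software; the function name `zunix_similarity`, the input layout (one list of densities per lane, with 0 for an absent band) and the sample densities are assumptions made for this example.

```python
def zunix_similarity(lanes):
    """Abundance-weighted similarity following equations (2)-(4).

    lanes: list of n equal-length lists; lanes[i][k] is the density d_ik >= 0
    of band position k in lane i (0 means the band is absent from that lane).
    """
    n = len(lanes)
    if n < 2:
        raise ValueError("need at least two lanes")
    m = len(lanes[0])
    if any(len(lane) != m for lane in lanes):
        raise ValueError("all lanes must cover the same band positions")
    if any(d < 0 for lane in lanes for d in lane):
        raise ValueError("densities must be non-negative")
    total = sum(sum(lane) for lane in lanes)
    if total == 0:
        raise ValueError("total density is zero")
    # zip(*lanes) walks the band positions; min(column) is the density
    # shared by every lane at that position.
    common = sum(n * min(column) for column in zip(*lanes))
    return common / total


# Hypothetical densities, invented for illustration only.
lane_1 = [10, 2, 0, 4]
lane_2 = [5, 2, 3, 0]
lane_3 = [5, 1, 3, 4]
print(round(zunix_similarity([lane_1, lane_2]), 3))          # 14/26 -> 0.538
print(round(zunix_similarity([lane_1, lane_2, lane_3]), 3))  # 18/39 -> 0.462
```

In the two-lane example, the unmatched bands (positions 3 and 4) add to the denominator only, and the matched band with unequal densities (position 1) contributes only its shared portion, which reflects the first two considerations listed above.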

The abundance-weighted ZUNIX equations (2) (3) (4) define similarity as the total density (abundance) of all common parts present in the matched DNA bands of the samples being compared divided by the total density (abundance) of all bands across the samples.⁴⁰ The ZUNIX equations are mathematically general, with no specific prerequisites, and govern all conditions, including the special cases under the strict prerequisites set forth by the Nei-Li equation (1).⁴⁹ The ZUNIX equations capture the molecular information buried in the amplicon DNA bands (both the abundance/density and the migration speed in gel electrophoresis) in the RAPD (or ISSR, SSCP, or other holistic analysis techniques) gel images, which is partially or even largely lost when the abundance-unweighted Nei-Li equation (1) is used incorrectly.⁴⁹ Consequently, misuse of the Nei-Li equation (1)⁴⁹ in C. sinensis holistic polymorphism studies, when the sample systems do not meet the specific computational biology prerequisites, may lead to inaccurate calculations of overall similarities and questionable conclusions.^35,44 Yao et al.⁴¹ employed the abundance-weighted ZUNIX equations (2) (3) (4) to analyze polymorphic data (1,418 amplicon bands) obtained from 20 RAPD gel images using 20 random primers and reported similarities of 0.546-0.686 between H. sinensis and the stroma, caterpillar body and ascocarps of natural C. sinensis. These results are inconsistent with the sole O. sinensis anamorph hypothesis for Genotype #1 H. sinensis, which was proposed by Wei et al.²⁹ and Guo et al.⁵¹ The mathematically general, abundance-weighted ZUNIX equations can also be used to calculate the similarities of mass spectrometry proteomic polymorphisms of multiple C. sinensis samples and the similarities of HPLC fingerprints and other chemical chromatographic and electrophoretic profiles.^40,52
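
The statement that the ZUNIX equations include the Nei-Li equation (1) as a special case can be checked for two lanes with a short derivation. As a sketch, assume (for illustration only) the idealized situation in which every band that is present has the same density c > 0 and every absent band has density 0, which is the equal-density prerequisite of equation (1) taken literally. Then min{d_1k, d_2k} equals c for each band matched in both lanes and 0 otherwise, so that

$\sum_{k=1}^{m} 2 \min\{d_{1k}, d_{2k}\} = 2 c N_{AB}, \qquad \sum_{k=1}^{m} (d_{1k} + d_{2k}) = c (N_{A} + N_{B})$

and equation (2) becomes

$S = \frac{2 c N_{AB}}{c (N_{A} + N_{B})} = \frac{2 N_{AB}}{N_{A} + N_{B}}$

which is equation (1). When the densities are not all equal, the two measures generally differ, as the values in Table 1 illustrate.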

Abundance-weighted algorithms for phylogenetic tree construction

In previous RAPD and ISSR studies of C. sinensis, phylogenetic analysis primarily used the binary, PCR amplicon abundance-unweighted UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm to construct phylogenetic trees.^{29,33,34,38,43,44} As with similarity computation, constructing phylogenetic trees using abundance-unweighted algorithms in holistic analysis ignores the differences between high- and low-density DNA amplicon bands and between the complete and incomplete separation of DNA amplicon moieties by agarose gel electrophoresis, as well as their impact on the weights used in exploring phylogenetic similarity and dissimilarity, leading to computational errors in constructing phylogenetic trees. Ni et al.⁴⁰ demonstrated that improper use of the abundance-unweighted algorithm produced an inaccurate phylogenetic tree, reflecting its inability to capture and analyze all of the molecular information buried in the DNA bands (both the abundance/density and the migration speed in agarose gel electrophoresis) in RAPD gel images, whereas the abundance-weighted algorithms corrected such analytical errors. Thus, the choice of clustering algorithm, with or without consideration of the abundance/densities and incomplete separation of the DNA moieties (or RNAs, proteins, or other chemicals), greatly affects the data analysis and study conclusions, and the abundance-unweighted algorithm is not suitable for studies of C. sinensis, which contains multiple intrinsic fungi.^40,41

As indicated by Ni et al.,⁴⁰ PAUP 4.0B requires semi-quantitative scoring prior to phylogenetic tree construction when using the abundance-weighted algorithm, which may slightly reduce the sensitivity in handling fully quantitative abundance/density data. Therefore, Ni et al.⁴⁰ performed their clustering analysis using software with full quantitative capacity, such as Cluster3.0, JMP9 and SPSS, and demonstrated that the fully quantitative algorithms placed H. sinensis in a separate clade from the main C. sinensis cluster at a large rescaled distance. Both the semi-quantitative and fully quantitative methods, however, exhibited advantages in capturing all molecular information and accurately constructing phylogenetic trees in the C. sinensis molecular and proteomic polymorphism studies.^40,41,52 Other advantages of software for fully quantitative clustering include ease of use and accurate quantitation, but the fully quantitative algorithms provided by the aforementioned software do not include bootstrap value calculation, whereas the semi-quantitative clustering algorithm provided by PAUP 4.0B does calculate the bootstrap value (usually Bootstrap=1000). Yao et al.⁴¹ employed the abundance-weighted algorithms to analyze 1,418 DNA amplicon bands obtained from 20 RAPD gel images using 20 random primers and constructed a phylogenetic tree showing an isolated H. sinensis leaf that is greatly distant from the C. sinensis clade. This large phylogenetic distance, together with the similarities of 0.546-0.686 between H. sinensis and the stroma, caterpillar body and ascocarp samples of C. sinensis determined using the abundance-weighted algorithms, is inconsistent with the sole O. sinensis anamorph hypothesis for H. sinensis. The types of algorithms and software should be carefully considered when designing RAPD, ISSR, SSCP and other holistic profile comparison studies, although both fully and semi-quantitative abundance-weighted clustering algorithms can be used in general for C. sinensis molecular and proteomic polymorphism studies.^40,41,52
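
To show how fully quantitative density data can feed a clustering step, the sketch below builds a merge history by average-linkage agglomerative clustering on distances derived from the two-lane similarity of equation (2). This is an illustrative stand-in, not the procedure implemented in Cluster3.0, JMP9, SPSS or PAUP 4.0B: the use of 1 − S as the distance, the function names `pair_similarity` and `average_linkage`, and the sample profiles are all assumptions made for this example, and no bootstrap values are computed.

```python
from itertools import combinations


def pair_similarity(x, y):
    """Two-lane abundance-weighted similarity, equation (2).

    Assumes the two profiles have a non-zero combined density.
    """
    return 2 * sum(min(a, b) for a, b in zip(x, y)) / (sum(x) + sum(y))


def average_linkage(profiles):
    """Agglomerative clustering on distances 1 - S (an assumed distance).

    profiles: dict mapping a unique sample name to its list of band densities.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of sample names.
    """
    dist = {
        frozenset((p, q)): 1 - pair_similarity(profiles[p], profiles[q])
        for p, q in combinations(profiles, 2)
    }

    def linkage(pair):
        # Mean pairwise distance between the members of two clusters.
        a, b = pair
        return sum(dist[frozenset((p, q))] for p in a for q in b) / (len(a) * len(b))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical density profiles, invented for illustration only.
samples = {
    "sample_1": [10, 8, 0, 2],
    "sample_2": [9, 8, 1, 2],
    "sample_3": [0, 1, 12, 6],
}
for a, b, d in average_linkage(samples):
    print(a, b, round(d, 3))
```

With these made-up profiles, sample_1 and sample_2 merge first at a small distance, and sample_3 joins only at a much larger distance, which is the kind of isolated leaf described above for H. sinensis.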

The mathematically general, abundance-weighted algorithms have also been employed in a proteomic polymorphism study,⁵² in which the proteins of the C. sinensis insect-fungi complex and of mycelial cultures of H. sinensis and Paecilomyces hepiali were extracted and separated through SELDI-TOF mass spectrometry. In this study, Dong et al.⁵² analyzed more than 1,900 protein bands, computed the holistic similarities between the samples and constructed a phylogenetic tree. Again, the abundance-weighted algorithms exhibited advantages in capturing all of the proteomic information buried in the mass spectrometry peaks/bands (both the peak heights or abundance and the migration speed in SELDI-TOF mass spectrometry) and in accurately calculating the similarities of the proteomic polymorphisms of the samples. This study reported profound, dynamic, asynchronous changes in holistic proteomic polymorphisms in the stroma and caterpillar body of natural C. sinensis during maturation, reflecting dynamic changes in gene expression in the different compartments of C. sinensis at different stages of maturation. This study also reported significant differences in holistic proteomic polymorphisms between samples of natural C. sinensis and mycelial cultures of H. sinensis and P. hepiali, indicating that neither of the fungal culture preparations may serve as a stand-alone therapeutic regimen to replace natural C. sinensis.

Conclusion

These mathematically general, abundance-weighted algorithms can be used without computational biology prerequisites or limitations for overall similarity computation and phylogenetic (cluster) tree construction in holistic comparison studies of genetic diversity and genetic distance, transcriptome and proteome expression, metabolomes and chemical constituent fingerprints, using techniques including gel electrophoresis, two-dimensional gel electrophoresis, capillary electrophoresis, mass spectrometry and chromatography.