Core Pseudomonas genome from 10 pseudomonas species

doi:10.15406/mojpb.2020.09.00282

MOJ Proteomics & Bioinformatics

eISSN: 2374-6920

Short Communication, Volume 9 Issue 3


Xue Ting Tan,^1,2 Avettra Ramesh,^1,2 Victor CC Wang,^1,2 Nur Jannah Kamarudin,^1,2 Shermaine SM Chew,^1,2 Madhurya V Murthy,^1,2 Nikita V Yablochkin,¹ Karthiga Mathivanan,^1,2 Maurice HT Ling^1,2,3


¹Department of Applied Sciences, Northumbria University, United Kingdom
²School of Life Sciences, Management Development Institute of Singapore, Singapore
³HOHY PTE LTD, Singapore

Correspondence: Maurice HT Ling, School of Life Sciences, Management Development Institute of Singapore, 501 Stirling Road, Singapore 148951, Republic of Singapore, Singapore

Received: June 27, 2020 | Published: July 17, 2020

Citation: Tan XT, Ramesh A, Wang VCC, et al. Core Pseudomonas genome from 10 pseudomonas species. MOJ Proteomics Bioinform. 2020;9(3):68-71. DOI: 10.15406/mojpb.2020.09.00282


Abstract

The core genome of a set of organisms is the set of homologous genes shared by all of them, and it has many applications. The Pseudomonas genus is highly diverse and includes both plant and animal pathogens. Hence, the core genome of the Pseudomonas genus can be useful. However, current studies present contradictory results, with the reported core genome of the Pseudomonas genus marginally larger than that of Pseudomonas aeruginosa. In this study, we attempt to identify a core Pseudomonas genome from 10 publicly available annotated genomes by intersecting homologous coding sequences using BLAST. Our results suggest a 218-gene core genome, which is 3.46% of the coding sequences of P. aeruginosa. Of the 218 genes, 136 were mapped to official gene symbols and were enriched in 8 Gene Ontology biological process terms related to central metabolism.

Introduction

The core genome of a set of related genomes is the set of orthologous genes shared by all of those genomes,¹ which may be from different strains of a species² or different species of a genus.³ Hence, the core genome represents the intersection of the set of genomes under study. It follows that phylogenetically related genomes tend to share more genes and are likely to have a larger core genome.⁴ This is different from the pan-genome, which is the entire set of all genes from the genomes under study.⁵ There are many applications of core genomes. For example, the core genome is crucial for measuring genomic distance within a species, which can then be used for disease surveillance and outbreak monitoring.^6,7 It can also be used to study speciation events⁸ and the evolutionary history of an organism.⁹
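The distinction between core genome and pan-genome can be illustrated with plain set operations. The following is a minimal sketch only; the genome labels and gene names are hypothetical toy values and are not taken from this study.

```python
# Toy gene sets; the genome labels and gene names are hypothetical.
genomes = {
    "genome_A": {"gyrB", "rpoD", "recA", "toxA"},
    "genome_B": {"gyrB", "rpoD", "recA", "phzM"},
    "genome_C": {"gyrB", "rpoD", "nifH"},
}

# Core genome: genes present in every genome (set intersection).
core_genome = set.intersection(*genomes.values())

# Pan-genome: genes present in at least one genome (set union).
pan_genome = set.union(*genomes.values())

print("core genome:", sorted(core_genome))
print("pan-genome size:", len(pan_genome))
```

In this toy case, adding a more distantly related genome can only keep the core genome the same size or shrink it, while the pan-genome can only stay the same or grow.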

The Pseudomonas genus is one of the most diverse bacterial genera,¹⁰ inhabiting a wide variety of environments¹¹ and including pathogens of both plants and animals.¹² For example, Batrich et al.,¹³ found a variety of Pseudomonas species demonstrating antibiotic resistance and metal tolerance near Lake Michigan. Hence, it is useful to elucidate the core genome of the Pseudomonas genus for further applications. A study by Hesse et al.,¹⁴ examined 166 Pseudomonas type strains to deduce a core genome of 794 genes, while Freschi et al.,¹⁵ focused on identifying the Pseudomonas aeruginosa core genome and used 1,311 P. aeruginosa genome sequences to obtain a 665-gene P. aeruginosa core genome. However, there is a contradiction: if the core genome of P. aeruginosa is 665 genes,¹⁵ it is not likely for the core genome of the entire Pseudomonas genus to be larger, at 794 genes,¹⁴ because genes shared across the genus should also be shared within each of its species. This may be due to the low stringency criteria for identifying orthologs used by Hesse et al.,¹⁴ which is 30% identity at 50% coverage, as compared to Freschi et al.,¹⁵ which is 50% identity at 85% coverage. This suggests that the core genome of the Pseudomonas genus warrants further study.
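To illustrate why the two ortholog criteria can yield different core genome sizes, the sketch below applies both identity and coverage cut-offs to a handful of hypothetical alignment hits. The hit values, the `Hit` type and the `is_ortholog` helper are assumptions made for illustration, as is the use of inclusive (greater than or equal) comparisons; neither cited study is claimed to be implemented this way.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    gene: str
    identity: float  # percent identity of the alignment
    coverage: float  # percent of the sequence covered by the alignment


def is_ortholog(hit: Hit, min_identity: float, min_coverage: float) -> bool:
    """Accept a hit only if it meets both cut-offs (inclusive, by assumption)."""
    return hit.identity >= min_identity and hit.coverage >= min_coverage


# Hypothetical hits, not data from any of the cited studies.
hits = [
    Hit("geneA", 92.0, 98.0),
    Hit("geneB", 55.0, 70.0),
    Hit("geneC", 35.0, 60.0),
    Hit("geneD", 28.0, 90.0),
]

# 30% identity at 50% coverage (the lower stringency described above).
relaxed = {h.gene for h in hits if is_ortholog(h, 30.0, 50.0)}
# 50% identity at 85% coverage (the higher stringency described above).
strict = {h.gene for h in hits if is_ortholog(h, 50.0, 85.0)}

print(sorted(relaxed))
print(sorted(strict))
```

Every hit accepted under the stricter criteria is also accepted under the relaxed criteria, so the relaxed criteria can only call the same number of orthologs or more.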

Here, we attempt to identify a core Pseudomonas genome from 10 publicly available annotated genomes. Our results suggest a 218-gene core genome, which is 3.46% of the coding sequences of P. aeruginosa.

Materials and methods

Genome data set: The genomes of 10 Pseudomonas species were obtained from NCBI, namely (i) Pseudomonas aeruginosa (Accession CP045002.1; P1), (ii) Pseudomonas mandelii (Accession NZ_CP005960.1; P2), (iii) Pseudomonas balearica (Accession CP045858.1; P3), (iv) Pseudomonas chlororaphis (Accession NZ_CP027716.1; P4), (v) Pseudomonas fluorescens (Accession NZ_CP048607.1; P5), (vi) Pseudomonas fulva (Accession NZ_CP023048.1; P6), (vii) Pseudomonas orientalis (Accession NZ_CP018049.1; P7), (viii) Pseudomonas psychrophila (Accession NZ_CP049044.1; P8), (ix) Pseudomonas putida (Accession NZ_CP026115.2; P9), and (x) Pseudomonas synxantha (Accession NZ_CP027754.1; P10).

Determining core genome by intersecting genomes: The core genome of Pseudomonas was determined as the intersection of the 10 Pseudomonas genomes. Operationally, the intersection of 2 genomes, such as P. aeruginosa (P1) and P. mandelii (P2), was determined by constructing a BLAST database from the coding sequences of P. aeruginosa and using the coding sequences of P. mandelii as queries in BLASTN¹⁶ version 2.10.0. The expectation value (E-value) in BLAST is defined as the per-search expected false positive rate¹⁷ and was set to less than 1E-9,¹⁸ a threshold that has been used in pan-genomics¹⁹ and homology²⁰ studies. Only the top match was taken for each query sequence. The result represented the core genome of P. aeruginosa and P. mandelii (denoted as P1P2). Subsequently, the coding sequences of P. balearica (P3) were used to construct a BLAST database for sequence comparison with P1P2 under the same E-value threshold. The result represented the core genome of P. aeruginosa, P. mandelii and P. balearica (denoted as P1P2P3). This process was repeated until all 10 Pseudomonas genomes were intersected; the final result represented the core genome and was denoted as P1P2P3P4P5P6P7P8P9P10.
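The progressive intersection described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the scripts used in this study: it assumes the BLAST results are available in the standard 12-column tabular format (query ID in column 1, E-value in column 11), that "top match" means the hit with the lowest E-value per query, and that the query sequences with a qualifying hit are the ones carried forward to the next step. The function names and the toy hit lines are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

E_VALUE_THRESHOLD = 1e-9  # threshold stated in the text


def queries_with_hit(tabular_lines: Iterable[str],
                     threshold: float = E_VALUE_THRESHOLD) -> Set[str]:
    """Return query IDs whose best (lowest E-value) hit is below the threshold."""
    best: Dict[str, float] = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, e_value = fields[0], float(fields[10])
        if query_id not in best or e_value < best[query_id]:
            best[query_id] = e_value
    return {q for q, e in best.items() if e < threshold}


def progressive_intersection(initial_queries: Iterable[str],
                             tables: Iterable[Iterable[str]]
                             ) -> Tuple[Set[str], List[int]]:
    """Intersect the query set with each successive BLAST result table."""
    current = set(initial_queries)
    sizes: List[int] = []
    for table in tables:
        current &= queries_with_hit(table)
        sizes.append(len(current))
    return current, sizes


# Toy tabular results for two successive comparisons (hypothetical values).
step1 = [
    "g1\ts1\t99.0\t100\t0\t0\t1\t100\t1\t100\t1e-50\t180",
    "g2\ts2\t80.0\t50\t5\t0\t1\t50\t1\t50\t1e-3\t40",
    "g3\ts3\t95.0\t90\t2\t0\t1\t90\t1\t90\t2e-20\t90",
]
step2 = [
    "g1\tt1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-12\t70",
    "g3\tt2\t70.0\t30\t9\t0\t1\t30\t1\t30\t0.5\t20",
]

core, sizes = progressive_intersection({"g1", "g2", "g3", "g4"}, [step1, step2])
print(sorted(core), sizes)
```

In this toy run, g2 and g4 are dropped at the first step (weak hit and no hit, respectively) and g3 at the second, which mirrors the progressive reduction reported for the real genomes.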

Determining functions of core genome: The functional properties of the core genome were determined by gene set enrichment analysis^21–23 for biological processes using PANTHER^24,25 on the official gene symbols.

Results and discussion

The number of coding sequences (CDS) ranges from 4274 in P. balearica to 6305 in P. aeruginosa (Table 1). Using genome intersection, a 218-gene core genome was identified, which amounts to 3.46% of the P. aeruginosa CDS (Table 2). A study on 23 Corallococcus genomes²⁶ suggests that the size of the pan-genome⁵ can be estimated to be 8127N^0.5481 genes, where N is the number of genomes. Using this estimation,²⁶ the size of the pan-genome of the 10 Pseudomonas species is estimated to be 28,750 CDS or genes. Inglin et al.,²⁷ examined 98 complete genomes of the genus Lactobacillus and found the core and pan-genome to be 266 genes and 20,800 genes, respectively. This amounts to the core genome being 1.28% of the pan-genome. We evaluate the use of this core genome to pan-genome ratio in this case. Applying this ratio, where the size of the core genome is 1.28% of the pan-genome, to our estimated 28,750-gene Pseudomonas pan-genome, we would expect a core genome of 368 genes, which is 68% more than that identified in this study. The difference may be due to the higher stringency of the E-value threshold used in this study (E-value < 1E-9), which is commonly used as a threshold for pan-genomics¹⁹ and homology²⁰ studies, as compared to Inglin et al.,²⁷ who used an E-value of less than 1E-5. As the expected and observed core genome sizes are of the same order of magnitude, this suggests that estimating the size of the pan-genome²⁶ from the number of genomes, and the size of the core genome from the size of the pan-genome by ratio,²⁷ may be a useful heuristic (Tables 1 and 2).
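The arithmetic behind this heuristic can be reproduced with a short sketch. The constants are those quoted in the text; the function and variable names are hypothetical. Evaluating the formula directly gives values close to, though not necessarily identical with, the rounded figures quoted above.

```python
def estimate_pan_genome_size(n_genomes: int,
                             scale: float = 8127.0,
                             exponent: float = 0.5481) -> float:
    """Power-law estimate of pan-genome size, 8127 * N^0.5481, as quoted in the text."""
    return scale * n_genomes ** exponent


# Estimated pan-genome size for the 10 genomes in this study.
pan_estimate = estimate_pan_genome_size(10)

# Core-to-pan-genome ratio from the Lactobacillus figures quoted in the text.
core_to_pan_ratio = 266 / 20800

# Expected core genome size under that ratio, and its excess over the observed 218 genes.
expected_core = core_to_pan_ratio * pan_estimate
observed_core = 218
relative_excess = expected_core / observed_core - 1

print(f"estimated pan-genome: {pan_estimate:.0f} genes")
print(f"core/pan ratio: {core_to_pan_ratio:.2%}")
print(f"expected core genome: {expected_core:.0f} genes")
print(f"excess over observed: {relative_excess:.0%}")
```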

Label	Organism	Accession number	Number of CDS
P1	P. aeruginosa	CP045002.1	6305
P2	P. mandelii	NZ_CP005960.1	6139
P3	P. balearica	CP045858.1	4274
P4	P. chlororaphis	NZ_CP027716.1	5886
P5	P. fluorescens	NZ_CP048607.1	5914
P6	P. fulva	NZ_CP023048.1	4541
P7	P. orientalis	NZ_CP018049.1	5248
P8	P. psychrophila	NZ_CP049044.1	4737
P9	P. putida	NZ_CP026115.2	5561
P10	P. synxantha	NZ_CP027754.1	6135

Table 1 Number of Coding Sequences (CDS) in each organism

CDS Set	Number of CDS	Percentage of P1 CDS
P1	6305	100.00%
P2	6139	97.37%
P1P2	1320	20.94%
P1P2P3	1294	20.52%
P1P2P3P4	796	12.62%
P1P2P3P4P5	575	9.12%
P1P2P3P4P5P6	402	6.38%
P1P2P3P4P5P6P7	344	5.46%
P1P2P3P4P5P6P7P8	237	3.76%
P1P2P3P4P5P6P7P8P9	230	3.65%
P1P2P3P4P5P6P7P8P9P10	218	3.46%

Table 2 Progressive reduction of number of CDS
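The percentage column of Table 2 expresses each CDS count relative to the 6305 CDS of P. aeruginosa (P1). The sketch below recomputes that column from the counts in the table; the variable names are hypothetical.

```python
# CDS counts copied from Table 2.
cds_counts = {
    "P1": 6305,
    "P2": 6139,
    "P1P2": 1320,
    "P1P2P3": 1294,
    "P1P2P3P4": 796,
    "P1P2P3P4P5": 575,
    "P1P2P3P4P5P6": 402,
    "P1P2P3P4P5P6P7": 344,
    "P1P2P3P4P5P6P7P8": 237,
    "P1P2P3P4P5P6P7P8P9": 230,
    "P1P2P3P4P5P6P7P8P9P10": 218,
}

# Every percentage is relative to the number of CDS in P. aeruginosa (P1).
reference = cds_counts["P1"]
percentages = {
    label: round(100 * count / reference, 2)
    for label, count in cds_counts.items()
}

for label, pct in percentages.items():
    print(f"{label}\t{cds_counts[label]}\t{pct:.2f}%")
```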

Of the 218-gene core genome identified, 136 (62.4%) genes were mapped to official gene symbols for gene set enrichment analysis.^21–23 Our results show an enrichment in eight biological process ontological terms; namely, (i) guanosine-containing compound metabolic process (GO:1901068), (ii) glutamine family amino acid metabolic process (GO:0009064), (iii) purine nucleotide metabolic process (GO:0006163), (iv) purine-containing compound biosynthetic process (GO:0072522), (v) tRNA aminoacylation for protein translation (GO:0006418), (vi) small molecule biosynthetic process (GO:0044283), (vii) response to nutrient levels (GO:0031667), and (viii) aerobic respiration (GO:0009060).

The first five enriched terms (GO:1901068, GO:0009064, GO:0006163, GO:0072522, and GO:0006418) represent central metabolic processes for growth, which is similar to the core genome of Comamonas.²⁸ Small molecule biosynthetic process (GO:0044283) is often related to response to nutrient levels (GO:0031667), and both are also found in the core genome of Acidithiobacillus.²⁹ Aerobic respiration is expected as Pseudomonas are generally aerobic.^30,31 Hence, the biological processes of the Pseudomonas core genome identified in this study are supported by current studies in other bacterial genera.

In conclusion, this study identified a 218-gene core genome of Pseudomonas, which is linked to central metabolic processes and nutrient metabolism.

Data availability

The data files for this study can be downloaded at https://bit.ly/CorePseudomonasGenome as a zip file containing four folders: (i) FASTA Files, which contains the 10 Pseudomonas genomes; (ii) BLAST Files, which contains the results from BLASTN; (iii) Intersection Files, which contains the progressive genomic intersections after BLAST, where P1P2P3P4P5P6P7P8P9P10.fasta is the core genome of the 10 Pseudomonas species; and (iv) Core Genome, which contains the description and GSEA results of the core genome.