XGlycScan: an open-source software for n-linked glycosite assignment, quantification and quality assessment of data from mass spectrometry-based glycoproteomic analysis

doi:10.15406/mojpb.2014.01.00004

MOJ

eISSN: 2374-6920

Proteomics & Bioinformatics

Short Communication Volume 1 Issue 1


Paul Aiyetan,

Bai Zhang, Zhen Zhang, Hui Zhang

Department of Pathology, Johns Hopkins University School of Medicine, USA

Correspondence: Paul Aiyetan, Department of Pathology, Johns Hopkins University School of Medicine, 1550 Orleans Street, CRBII, Room 3M 01 - 07, Baltimore, MD 21231, USA, Tel +4432874306, Fax +4432876388

Received: May 08, 2014 | Published: June 2, 2014

Citation: Aiyetan P, Zhang B, Zhang Z, et al. XGlycScan: an open-source software for n-linked glycosite assignment, quantification and quality assessment of data from mass spectrometry-based glycoproteomic analysis. MOJ Proteomics Bioinform. 2014;1(1):11-16. DOI: 10.15406/mojpb.2014.01.00004


Abstract

Mass spectrometry-based glycoproteomics has become a major means of identifying and characterizing loci to which N-linked glycans were previously attached (glycosites). In the bottom-up approach, several factors, including but not limited to sample preparation, mass spectrometry analysis, and protein sequence database searching, result in formerly N-glycosylated peptide spectrum matches (PSMs) of varying lengths. Given that multiple PSMs map to a glycosite, we reason that identified PSMs are varying-length peptide species of a unique set of glycosites. Because the spectra associated with these PSMs are typically summed separately, true glycosite-associated spectra counts are lost or obscured. These varying-length peptide species also complicate protein inference, as shorter peptide sequences are more likely to map to more proteins than longer peptides or the actual glycosite sequences. Here, we present XGlycScan. XGlycScan maps varying-length peptide species to glycosites to facilitate accurate quantification of glycosite-associated spectra counts. We observed that this reduced the variability in reported identifications across mass spectrometry technical replicates of our sample dataset. We also observed that mapping identified peptides to glycosites provided an assessment of search-engine identification. Inherently, the glycosites reported by XGlycScan reduce the complexity of protein inference. We implemented XGlycScan in the platform-independent Java programming language and have made it available as open source. XGlycScan's source code is freely available at https://bitbucket.org/paiyetan/xglycscan/src and its compiled binaries and documentation can be freely downloaded at https://bitbucket.org/paiyetan/xglycscan/downloads. The source code and downloads for the graphical user interface version can be found at https://bitbucket.org/paiyetan/xglycscangui/src and https://bitbucket.org/paiyetan/xglycscangui/downloads, respectively.

Keywords: bioinformatics, peptide, glycopeptides, glycosite, protein identification, proteomics, quality assessment

Introduction

Glycoproteins play major roles in many biological systems.^1–4 They are synthesized through co-translational and post-translational modification processes known as glycosylation.⁵ Of the major glycosylation processes observed in humans, the N-linked type is the most predominant.⁵ N-linked glycosylation is the transfer of oligosaccharides onto an asparagine (N) residue in an N-x-[ST] sequence motif of nascent polypeptides.⁶ Characterizing the sequences of these polypeptides (or peptides), glycopeptides and glycoproteins in complex biological mixtures has evolved to rely primarily on shotgun approaches.^7–13 These involve, but are not limited to, sample preparation, mass spectrometry, protein database search, and protein inference.¹⁴ The shotgun approach rests on the premise that the presence of a protein in a complex biological mixture can be inferred from peptide sequences identified by mass spectrometry. The effects of these processes, in addition to the physicochemical properties of peptides and proteins, on reported identifications are well documented. As a significant consequence, peptide spectrum matches (PSMs) of varying lengths about specific glycosites are typically reported as identified. Given that varying-length PSM species map to a glycosite, we reason that identified PSMs are peptide species of a unique set of glycosites. When multiple peptides of varying lengths map to a glycosite, the actual spectra count of that glycosite is distributed across these peptide species (Figure 1B), which complicates true quantification of the referenced glycosite. Varying-length peptide species of a glycosite also increase the number of proteins to which the peptides may map, which in turn complicates protein inference (Figure 1C).^15,16

XGlycScan focuses on the identified peptide sequences to which glycans were attached, in contrast to other tools for automated glycopeptide analysis, which remain inadequate.¹⁷ Many of these focus predominantly on the structural composition of attached glycans.^17–27 Although some others characterize the peptide sequence together with its attached glycans,^28–44 few if any of these tools describe glycopeptide sequences (which typically are peptide species of varying lengths) within the context of 'peptide species of a set of unique glycosites'.

XGlycScan Implementation

We first introduced the concept behind XGlycScan in Unipep⁴⁵ as non-redundant N-linked glycopeptide generation. Here, we present XGlycScan as a platform-independent, open-source, and freely available analytical tool (attributes recommended for an ideal automated glycopeptide analysis tool¹⁷) that resolves the glycosites to which mass spectrometry-identified PSMs map and accurately quantifies their abundance.

Algorithmically, for every input mzIdentML⁴⁶ peptide identification report file,

XGlycScan computes the false discovery rate (FDR) of identified PSMs using Elias and Gygi's method.⁴⁷

XGlycScan filters PSMs at a user-specified FDR cutoff.

Filtered PSMs (that is, PSMs that pass the specified cutoff) are mapped to reference glycosites in the protein sequence database.

XGlycScan evaluates all PSMs mapped to a glycosite to find the PSM that best represents the referenced locus (Figure 1A). Glycosite-mapped PSMs are evaluated based on:

Number of tryptic ends,

Number of missed cleavages, and

A user-specified PSM scoring metric, which by default in this implementation is the SEQUEST XCorr.⁴⁸

Thereafter, XGlycScan computes a true spectra count for each mapped glycosite (Figure 1B) and performs the other computations described below.
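The filtering and evaluation steps listed above can be sketched in a few lines of Python. This is an illustration only and not XGlycScan's Java source: the `PSM` class, the function names, and the decoy flag are hypothetical. The FDR estimate shown (decoy hits divided by target hits at each score threshold) is one common target-decoy form, since the exact variant used by XGlycScan is not stated here. The order in which the three evaluation criteria are applied is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """Hypothetical container for one peptide spectrum match."""
    peptide: str
    score: float            # e.g. SEQUEST XCorr; higher is assumed better
    is_decoy: bool = False  # True if matched to a decoy sequence
    tryptic_ends: int = 2   # 0, 1 or 2 termini consistent with trypsin
    missed_cleavages: int = 0


def target_decoy_fdr(psms):
    """Pair each PSM with the FDR estimated at its score threshold.

    PSMs are ranked best score first; at each rank the estimate is
    (decoy hits so far) / (target hits so far).
    """
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    result = []
    for psm in ranked:
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        result.append((psm, decoys / max(targets, 1)))
    return result


def filter_psms(psms, cutoff=0.01):
    """Keep target PSMs whose estimated FDR does not exceed the cutoff."""
    return [p for p, fdr in target_decoy_fdr(psms)
            if not p.is_decoy and fdr <= cutoff]


def best_representative(site_psms):
    """Pick one PSM to represent a glycosite.

    Assumed priority: more tryptic ends, then fewer missed cleavages,
    then higher score.
    """
    return max(site_psms,
               key=lambda p: (p.tryptic_ends, -p.missed_cleavages, p.score))
```

Calling `filter_psms` on the PSMs of one input file and then `best_representative` on the PSMs mapped to each glycosite mirrors the order of the steps above.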

Figure 1A The core sequential steps in XGlycScan. Filtered mass spectrometry peptide spectrum matches (glycopeptides) are mapped to putative N-linked glycosites of the searched database. These glycosite-mapped peptides are evaluated to select the best representation of each mapped locus. True site quantification (in terms of spectra count) is then re-computed.

Figure 1B An illustration of the distribution of associated spectra counts of a putative N-linked glycosite YNQSEAGSHTLQGMNGCDMGPDGR. In a typical database search identification and spectra count estimation, the true quantification of this glycosite may be complicated because it is distributed among the site-mapped, varying-length peptide species reported (YN#QSEAGSHTL-3, YN#QSEAGSHTLQGM-1, YN#QSEAGS-2, and YN#QSEAGSHTLQGMNGCDMGPDGR-3). XGlycScan maps these PSMs to the glycosite, evaluates them for the best representation of the glycosite, and re-computes an actual spectra count to derive YNQSEAGSHTLQGMNGCDMGPDGR with a true spectra count of 9 (3 + 1 + 2 + 3). The # sign denotes the site to which an N-glycan was previously attached.
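The arithmetic in Figure 1B can be reproduced with a short Python sketch. It assumes that a peptide species maps to a glycosite whenever its sequence (with the `#` mark removed) is contained in the glycosite sequence. That containment test is a simplification chosen for illustration, and the function names are hypothetical rather than taken from XGlycScan.

```python
from collections import Counter


def strip_site_marks(peptide):
    """Remove the '#' that marks the formerly glycosylated asparagine."""
    return peptide.replace("#", "")


def glycosite_spectra_counts(peptide_counts, glycosites):
    """Sum spectra counts of peptide species onto the glycosite containing them."""
    totals = Counter()
    for peptide, count in peptide_counts.items():
        sequence = strip_site_marks(peptide)
        for site in glycosites:
            if sequence in site:  # simplified mapping: substring containment
                totals[site] += count
    return totals


# Peptide species and spectra counts taken from Figure 1B.
peptide_counts = {
    "YN#QSEAGSHTL": 3,
    "YN#QSEAGSHTLQGM": 1,
    "YN#QSEAGS": 2,
    "YN#QSEAGSHTLQGMNGCDMGPDGR": 3,
}
glycosites = ["YNQSEAGSHTLQGMNGCDMGPDGR"]
totals = glycosite_spectra_counts(peptide_counts, glycosites)
print(totals["YNQSEAGSHTLQGMNGCDMGPDGR"])  # 9
```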

Figure 1C Line plot of the number of mapped proteins as a function of peptide length. The figure shows the number of proteins to which possible digest products of an HLA class I histocompatibility antigen, alpha chain glycosite (YN#QSEAGSHTLQGMNGCDMGPDGR) map. The x-axis indicates the theoretical enzyme cleavage position relative to the canonical downstream tryptic cleavage site. The y-axis indicates the number of proteins to which the derived peptide sequences map. The number of database (NCBI RefSeq) proteins to which digest peptides map can vary by many orders of magnitude depending on peptide length. As the number of possible proteins increases, so does the complexity of correct protein inference.

We have used SEQUEST XCorr in this implementation of XGlycScan as a placeholder for one of the many scoring metric options we plan to incorporate in subsequent software iterations. Anticipating situations where input mzIdentML files are generated by different search engines, we implemented Elias and Gygi's FDR computation to provide a uniform method of FDR estimation across input files. However, to allow some flexibility, users may specify whether or not to compute FDR on input PSM identifications. We may consider offering the option of using the search-engine-derived FDR in a subsequent software iteration. XGlycScan's computation results are reported in the xGlycScan.tables sub-directory of a user-specified output location. These include:

spectraCount.matrix: This tab-delimited file reports the re-computed, actual spectra count for mapped glycosites in each input file. The rows of this spectra count table represent the unique glycosites identified and the columns represent the individual input files.

identification.indeces: This tab-delimited file reports, for each input file, the total (Identified peptides) and unique (Unique peptides) numbers of peptide spectrum matches, and the total (Identified glycosites) and unique (Unique glycosites) numbers of glycosites. The specificity of identification in each input file is also reported here. By mapping PSMs back to reference glycosites, XGlycScan performs a quality assurance function: PSMs reported by the database search engine that fail to map should raise suspicion of spurious peptide-to-spectrum assignments or questionable antecedent processes. XGlycScan defines its quality metric for search-engine-identified PSMs (specificity) as a function of the ratio of mapped glycosites to total PSMs reported from database searches (Figure 2C).

Figure 2A Bar chart of identifications. Identified peptides (blue), identified glycosites (red), unique peptides (green), unique glycosites (purple). Identified peptides are the total peptide spectrum matches (PSMs) identified in an MS/MS run that pass the user-specified filtering threshold, which defaults to a false discovery rate (FDR) less than or equal to 0.01. Identified glycosites are the total PSMs matching predefined N-linked glycan attached loci (glycosites). Unique peptides and unique glycosites are as the names imply.

Figure 2B Line plot of identification variances across technical replicates. Glycosite quantitation shows consistently less variation within technical replicates than peptide (PSM) quantitation.

Figure 2C Line plot of XGlycScan-defined specificities across input files. Specificity is defined as a function of the ratio of mapped glycosites to total PSMs reported in the respective input mzIdentML file.

identification.coef: This tab-delimited file reports the number of glycosites unique to each input file and the ratio of this number to the total unique glycosites identified across all input files.

iDOverlapCount.matrix: This tab-delimited square matrix file reports, for every pair-wise comparison of input files, the absolute number of glycosites found in common.

iDOverlapPercent.matrix: Similar to iDOverlapCount.matrix, this tab-delimited square matrix file covers every pair-wise comparison of input files, but reports the number of glycosites found in common as a percentage of the union of unique glycosites between the paired inputs.
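The per-file summary values described in this list reduce to simple set and ratio computations. The Python sketch below is a minimal illustration under stated assumptions: specificity is taken to be the plain percentage of filtered PSMs that map to a glycosite (the text only says "a function of the ratio"), and all function names and the toy glycosite labels are hypothetical.

```python
def specificity(identified_glycosites, identified_peptides):
    """Percentage of filtered PSMs that map to a reference glycosite (assumed form)."""
    if identified_peptides == 0:
        return 0.0
    return 100.0 * identified_glycosites / identified_peptides


def unique_coefficients(sites_by_file):
    """Per file: (glycosites seen only in that file, ratio to all unique glycosites)."""
    all_sites = set().union(*sites_by_file.values())
    result = {}
    for name, sites in sites_by_file.items():
        others = set().union(
            *(s for n, s in sites_by_file.items() if n != name))
        only_here = len(sites - others)
        ratio = only_here / len(all_sites) if all_sites else 0.0
        result[name] = (only_here, ratio)
    return result


def overlap_matrices(sites_by_file):
    """Pair-wise shared glycosite counts, and the same as a percentage of the union."""
    counts, percents = {}, {}
    for a, sites_a in sites_by_file.items():
        for b, sites_b in sites_by_file.items():
            shared = len(sites_a & sites_b)
            union = len(sites_a | sites_b)
            counts[a, b] = shared
            percents[a, b] = 100.0 * shared / union if union else 0.0
    return counts, percents


# Toy input: two runs sharing two of four glycosites.
sites_by_file = {
    "run1": {"site1", "site2", "site3"},
    "run2": {"site2", "site3", "site4"},
}
counts, percents = overlap_matrices(sites_by_file)
print(counts["run1", "run2"], percents["run1", "run2"])  # 2 50.0
```

Dividing by the union rather than by either file's own total makes the percentage matrix symmetric, which matches the description of a square pair-wise matrix.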

Other computation results are reported in the following sub-subdirectories:

glycs: Contains tab-delimited outputs of identified peptides, associated mapped glycosites (N-Linked glycosylation site), modifications, search engine rank, charge, m/z and scan number found in each input mzid file.

gmaps: Reports the mapped protein accession, location, formatted glycosite sequence, number of tryptic ends, best peptide identification value, theoretical (unmodified) peptide mass, and associated mass spectrometry scan id (of the best PSM identification) derived from each mzIdentML input file. These maps are derived for PSMs that pass the user-specified false discovery rate (FDR) cutoff.

groups: An optional output that is written only when a phenotype input file is provided. It contains a group-based computation of spectra counts (see spectraCount.matrix) and identification indices (see identification.indeces). Groupings are based on the group information provided in the optional input phenotype file. This is expected to be a 2-column tab-delimited file with a header line. The left column is expected to list the input mzIdentML (.mzid) file names and the right column the group or phenotype to which each file belongs. In the group-based spectraCount.matrix output file in the “groups” output directory, the columns are the specified file groups and the rows are the unique glycosites identified. Likewise, the group-based identification.indeces tab-delimited file reports, for each group, the total (Identified peptides) and unique (Unique peptides) numbers of peptide spectrum matches, and the total (Identified glycosites) and unique (Unique glycosites) numbers of glycosites.

values: Contains the computed P-value, FDR (False Discovery Rate) and Q-values for all peptide spectrum matches in each input mzIdentML file.
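The phenotype file format described for the groups output is simple enough to sketch. The Python example below reads a 2-column tab-delimited file with a header line and sums per-file glycosite spectra counts within each group. The header names, the spectra counts, and the choice to sum (rather than, say, average) within a group are assumptions made for illustration; only the file names and group labels come from Table 1.

```python
import csv
import io
from collections import defaultdict


def read_phenotype_file(handle):
    """Map each input file name (left column) to its group (right column)."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


def group_spectra_counts(counts_by_file, groups):
    """Sum glycosite spectra counts over all files belonging to the same group."""
    grouped = defaultdict(lambda: defaultdict(int))
    for file_name, site_counts in counts_by_file.items():
        group = groups[file_name]
        for site, count in site_counts.items():
            grouped[group][site] += count
    return {g: dict(c) for g, c in grouped.items()}


# Hypothetical phenotype file content; header names are made up.
phenotype_text = (
    "file\tgroup\n"
    "061413_TCGA_G11_1.mzid\tQEXACT_G11\n"
    "061413_TCGA_G11_2.mzid\tQEXACT_G11\n"
    "061413_TCGA_G14_1.mzid\tQEXACT_G14\n"
)
groups = read_phenotype_file(io.StringIO(phenotype_text))

# Made-up spectra counts per input file.
counts_by_file = {
    "061413_TCGA_G11_1.mzid": {"YNQSEAGSHTLQGMNGCDMGPDGR": 9},
    "061413_TCGA_G11_2.mzid": {"YNQSEAGSHTLQGMNGCDMGPDGR": 7},
    "061413_TCGA_G14_1.mzid": {"YNQSEAGSHTLQGMNGCDMGPDGR": 4},
}
grouped = group_spectra_counts(counts_by_file, groups)
print(grouped["QEXACT_G11"])  # {'YNQSEAGSHTLQGMNGCDMGPDGR': 16}
```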

We implemented XGlycScan entirely in the Java programming language to ensure compatibility with a wide range of operating system platforms. To likewise ensure compatibility with a wide range of search engine outputs, XGlycScan by default accepts as input files in the mzIdentML data format^46,49 defined by the proteomics community (PSI, Proteomics Standards Initiative); the input files used here are listed in Table 1. XGlycScan uses the jmzIdentML Java API⁵⁰ to read and access the objects defined in the input file(s). Result outputs are written to a tables directory in the user-specified output location. See the documentation in the README file or at https://bitbucket.org/paiyetan/xglycscan/wiki/Home for details.

S. No.	File name	Sample group	Sample	Instrument
1	061413_TCGA_G11_1.mzid	QEXACT_G11	G11	Q Exactive™
2	061413_TCGA_G11_2.mzid	QEXACT_G11	G11	Q Exactive™
3	061413_TCGA_G11_3.mzid	QEXACT_G11	G11	Q Exactive™
4	061413_TCGA_G14_1.mzid	QEXACT_G14	G14	Q Exactive™
5	061413_TCGA_G14_2.mzid	QEXACT_G14	G14	Q Exactive™
6	061413_TCGA_G14_3.mzid	QEXACT_G14	G14	Q Exactive™
7	061413_TCGA_G5_1.mzid	QEXACT_G05	G05	Q Exactive™
8	061413_TCGA_G5_2.mzid	QEXACT_G05	G05	Q Exactive™
9	061413_TCGA_G5_3.mzid	QEXACT_G05	G05	Q Exactive™
10	TCGA_114C_24-1436-01A-01_13-2061-01A-02_36-2537-01A-01_G_ JHUZ_20130228_RUN1_NOFRACTION_130408174702.mzid	ORBIT_G11	G11	Orbitrap Velos™
11	TCGA_114C_24-1436-01A-01_13-2061-01A-02_36-2537-01A-01_G_ JHUZ_20130228_RUN2_NOFRACTION_130408192810.mzid	ORBIT_G11	G11	Orbitrap Velos™
12	TCGA_114C_24-1436-01A-01_13-2061-01A-02_36-2537-01A-01_G_ JHUZ_20130228_RUN3_NOFRACTION_130408210853.mzid	ORBIT_G11	G11	Orbitrap Velos™
13	TCGA_114C_29-1696-01A-01_29-1771-01A-01_13-2066-01A-02_G_ JHUZ_20130228_RUN1_NOFRACTION.mzid	ORBIT_G14	G14	Orbitrap Velos™
14	TCGA_114C_29-1696-01A-01_29-1771-01A-01_13-2066-01A-02_G_ JHUZ_20130228_RUN2_NOFRACTION.mzid	ORBIT_G14	G14	Orbitrap Velos™
15	TCGA_114C_29-1696-01A-01_29-1771-01A-01_13-2066-01A-02_G_ JHUZ_20130228_RUN3_NOFRACTION.mzid	ORBIT_G14	G14	Orbitrap Velos™
16	TCGA_114C_OVARIAN-CONTROL_25-2396-01A-01_36-2545-01A-01_G_ JHUZ_20130228_RUN1_NOFRACTION_130408112404.mzid	ORBIT_G05	G05	Orbitrap Velos™
17	TCGA_114C_OVARIAN-CONTROL_25-2396-01A-01_36-2545-01A-01_G_ JHUZ_20130228_RUN2_NOFRACTION_130408130450.mzid	ORBIT_G05	G05	Orbitrap Velos™
18	TCGA_114C_OVARIAN-CONTROL_25-2396-01A-01_36-2545-01A-01_G_ JHUZ_20130228_RUN3_NOFRACTION_130408144538.mzid	ORBIT_G05	G05	Orbitrap Velos™

Table 1 Sample group information

XGlycScan's current implementation defines reference glycosites as peptide sequences about the canonical N-x-[ST] motif⁶ bounded by the immediate upstream and downstream trypsin cleavage sites. Given that some recent studies are beginning to provide evidence for glycosites that do not contain the canonical motif,¹¹ we plan, as part of future maintenance of XGlycScan and as the evidence for such non-canonical motifs becomes stronger, to integrate this information into the definition of reference glycosites.
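The definition of a reference glycosite given above can be illustrated with a short Python sketch that scans a protein sequence for the motif and returns the tryptic peptide enclosing each match. Two details are assumptions that the text does not state: the x position is taken to exclude proline, and trypsin is taken to cleave after K or R except when the next residue is P. The function names and the flanking residues in the example sequence are made up, and this is not XGlycScan's actual code.

```python
import re


def tryptic_boundaries(sequence):
    """Cut positions after K or R (assumed: not when followed by P), plus both ends."""
    cuts = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(sequence))
    return cuts


def reference_glycosites(sequence):
    """Return (1-based asparagine position, enclosing tryptic peptide) per motif hit."""
    cuts = tryptic_boundaries(sequence)
    sites = []
    # A lookahead is used so that overlapping motifs are all reported.
    for match in re.finditer(r"(?=N[^P][ST])", sequence):
        pos = match.start()
        start = max(c for c in cuts if c <= pos)  # immediate upstream cleavage site
        end = min(c for c in cuts if c > pos)     # immediate downstream cleavage site
        sites.append((pos + 1, sequence[start:end]))
    return sites


# Toy protein: the Figure 1B glycosite with made-up flanking residues.
toy_protein = "MKYNQSEAGSHTLQGMNGCDMGPDGRLLK"
print(reference_glycosites(toy_protein))
# [(4, 'YNQSEAGSHTLQGMNGCDMGPDGR')]
```

The second asparagine in the toy sequence (in NGC) is not reported because it is not followed by S or T at the third position.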

Demonstrating functionality

To demonstrate XGlycScan's functionality, we searched sample glycoproteome profile RAW files, derived by tandem mass spectrometry on two Thermo Scientific instruments, in Proteome Discoverer version 1.3. The mass spectrometers were the Orbitrap Velos™ and the Q Exactive™. The MS/MS data were generated as part of the National Cancer Institute's (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) proteome characterization study. A full description of the sample preparation and mass spectrometry protocols is publicly available and may be downloaded from the CPTAC data portal (https://cptac-data-portal.georgetown.edu/cptacPublic/). Also available for download are the sampled .RAW files from the Orbitrap Velos™ instrument. The .RAW files derived from the Q Exactive™ instrument are available on request.

We searched using the SEQUEST⁴⁸ search engine embedded in Proteome Discoverer, against the NCBI RefSeq protein database (September 16, 2013 version). We specified the following search parameters: full tryptic digestion with a maximum of 1 missed cleavage, a precursor mass tolerance of 10 ppm, a fragment mass tolerance of 0.06 Da (Daltons), and ion series weights on b and y ions. We specified deamidation (+0.984016 Da) of asparagine (N) as a dynamic modification and oxidation (+15.994915 Da) of methionine (M), iTRAQ 4plex modification (+144.102 Da) of the peptide N-terminus (any residue), iTRAQ 4plex modification (+144.102 Da) of lysine (K), and carbamidomethylation (+57.021 Da) of cysteine (C) as static modifications. We allowed a maximum of 6 modifications per peptide. We converted our search result MSF files to the proteomics community defined mzIdentML standard format using M2Lite.⁵¹ The respective files and associated sample group information are listed in Table 1.

As input parameters in XGlycScan's configuration file, we specified our input file type as “MZIDENTML”, quantification type as “SPECTRA_COUNT”, protein sequence search database type as “REFSEQ”, compute false discovery rate (FDR) as “TRUE”, FDR filter or cutoff as “0.01”, evaluation value type as “SEQUESTXCORR”, and use top ranked as “FALSE”. Please see documentation at https://bitbucket.org/paiyetan/xglycscan/wiki/Home for more details.

Figure 2A summarizes the glycosite identifications reported by XGlycScan. Interestingly, XGlycScan evaluation appears to reduce the variability in the number of identifications across mass spectrometry technical replicates relative to PSM counts (Figure 2B). We observed specificities between 93 and 96 percent across all input identification files (Figure 2C). A markedly low specificity should raise concern about any of the preceding steps: sample preparation, mass spectrometry analysis, or database peptide assignment.

Ultimately, by reducing redundancy in N-linked glycan attached loci, XGlycScan is expected to reduce the complexity of protein identification, as fewer and more precise glycopeptide sequences should map to fewer proteins, as illustrated in Figure 1C.

Figure 3A shows a typical XGlycScan command-line session. Figure 3B shows a graphical user interface program initiation session.

Figure 3A A typical XGlycScan command-line session.

Figure 3B A graphical user interface program initiation session.

Software availability

XGlycScan's source code is available as open source at https://bitbucket.org/paiyetan/xglycscan/src and its compiled binaries and documentation can be freely downloaded at https://bitbucket.org/paiyetan/xglycscan/downloads. The source code and downloads for the user-friendly graphical user interface version can be found at https://bitbucket.org/paiyetan/xglycscangui/src and https://bitbucket.org/paiyetan/xglycscangui/downloads, respectively. These are made available under the BSD 3-Clause open source license.

Acknowledgements

XGlycScan's development was supported by the National Institutes of Health, National Cancer Institute, Clinical Proteomic Tumor Analysis Consortium (CPTAC, U24CA160036) and the Early Detection Research Network (EDRN, U01CA152813), National Heart, Lung, and Blood Institute, Programs of Excellence in Glycosciences (PEG, P01HL107153). We do acknowledge members of the Center for Biomarker Discovery and Translation, Department of Pathology, Clinical Chemistry Division at the Johns Hopkins University School of Medicine.