The internal oligopeptide sequences missing in crystals are disordered domains

doi:10.15406/mojpb.2018.07.00215

MOJ

eISSN: 2374-6920

Proteomics & Bioinformatics

Research Article Volume 7 Issue 1

Natasha Kelkar,¹ Sohan P Modak²

¹Institute of Bioinformatics and Biotechnology, S. P. Pune University, India
²Open Vision, India

Correspondence: Sohan P Modak, Open Vision, 759/75, Deccan Gymkhana, Pune 411004, India

Received: February 13, 2018 | Published: February 23, 2018

Citation: Kelkar N, Modak SP. The internal oligopeptide sequences missing in crystals are disordered domains. MOJ Proteomics Bioinform. 2018;7(1):00215. DOI: 10.15406/mojpb.2018.07.00215

Abstract

Polypeptide sequences in pdb format are invariably shorter than those in FASTA format. The missing residues are mostly internal oligopeptide strings, along with a few C- and N-terminal residues. We have compared the panorama of secondary structure domains generated from both formats by folding in silico and find that the missing oligopeptides are mostly from intrinsically disordered domains.

Keywords: protein crystals, FASTA format, pdb format, protein secondary structure, disordered domain, α helix, β sheet, internal missing oligopeptides

Introduction

Prior to their maturation as a biological structure or function, nascent polypeptides fold to form three-dimensional structures composed of α helices, β sheets and disordered regions. The amino acid sequence of the processed polypeptide is stored in FASTA format (www.rcsb.org) and is almost always longer than the sequence in the crystal structure, which is stored in pdb format and can be retrieved in PyMol. In crystals, residues are absent at the C terminus, at the N terminus and at intra-polypeptide locations. Indeed, a large number of protein crystals in the database exhibit internal missing strings.¹ Crystallographers generally consider that the missing residues are due to low electron density that is undetectable in low-resolution crystallography. While some of the gaps at the N and C termini can be attributed to post-translational processing, the presence of missing internal oligopeptides may lead to misinterpretation of the secondary structure domains in the immediate vicinity of the gaps as well as in the flanking segments. While studying the phylogeny of proteins,² we considered the possibility that the extent of evolutionary conservation of residues defining individual secondary structure domains may be one of the determinants. As we came across cases of missing internal residues, here we analyze their structure and significance.

Materials and methods

Amino acid sequences of 9 proteins were downloaded from RCSB PDB (www.rcsb.org) in FASTA and crystal (pdb) format.³ These are: (1) SAICAR synthase from Saccharomyces cerevisiae, strain ATCC 204508/S288c (PDB Id: 1A48),⁴ (2) SAICAR synthase complexed with ADP, AICAR and succinate from the same strain (PDB Id: 2CNQ) (www.rcsb.org), (3) Lipoate-protein ligase A from Streptococcus agalactiae (PDB Id: 2P0L) (www.rcsb.org), (4) P450 pyr hydroxylase from Sphingopyxis macrogoltabida (PDB Id: 3RWL),⁵ (5) Hydroxymethylbilane synthase from Escherichia coli (strain K12) (PDB Id: 2YPN),⁶ (6) UDP-n-acetylmuramoyl-L-alanine:D-glutamate ligase from Escherichia coli (strain K12) (PDB Id: 1UAG),⁷ (7) Glycinamide ribonucleotide synthetase from Escherichia coli (strain K12) (PDB Id: 1GSO),⁸ (8) Folypolyglutamate synthetase from Lactobacillus casei (PDB Id: 1FGS)⁹ and (9) mitochondrial helicase suv3 from Homo sapiens (PDB Id: 3RC3).¹⁰
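As an illustration of what a crystal-derived sequence is, the following minimal sketch reads the residues that were actually resolved from the ATOM records of a pdb-format file. It is not the procedure used in this study (the sequences here were retrieved from the database); the function name `observed_sequence`, the choice of chain A and the three truncated demo records are assumptions made only for this example. The column positions follow the fixed-width ATOM record layout of the pdb format.

```python
# Hypothetical helper: one-letter sequence of residues that have a C-alpha atom.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def observed_sequence(pdb_text, chain="A"):
    """Return the sequence of residues in `chain` that appear in ATOM records."""
    seen = set()
    residues = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Columns 13-16 hold the atom name, column 22 the chain identifier.
        if line[12:16].strip() != "CA" or line[21] != chain:
            continue
        key = line[22:27]  # residue number plus insertion code
        if key in seen:    # skip alternate conformations of the same residue
            continue
        seen.add(key)
        residues.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(residues)

# Invented demo records (coordinates omitted); residues 3 and 4 are absent.
demo_pdb = "\n".join([
    "ATOM      1  CA  MET A   1",
    "ATOM      2  CA  LYS A   2",
    "ATOM      3  CA  GLY A   5",
])
print(observed_sequence(demo_pdb))
```

Residues that are unresolved in the crystal simply have no ATOM records, so they drop out of the returned string, which is why the crystal-derived sequence is shorter than the FASTA sequence.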

The amino acid sequences in the two formats were aligned, and residues missing at the N-terminal, C-terminal and internal regions were detected. Sequences of the 9 proteins were folded with JPred 4 (http://www.compbio.dundee.ac.uk/jpred4) and PSSPred.¹¹⁻¹⁵ From the output we designated residues forming secondary structure domains in different shades, namely light gray (α helix), dark gray (β sheet/loop) and medium gray (disordered domain). The sequences derived from the crystals (pdb format) were similarly shaded.
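The alignment step can be sketched as follows. This is a minimal illustration using Python's standard `difflib` module, not the alignment tool used in this study (which is not named here); the function name `missing_segments` and the toy sequences are assumptions, with only the heptapeptide KAEQGEH taken from Table 2 and the flanking residues invented.

```python
from difflib import SequenceMatcher

def missing_segments(full_seq, crystal_seq):
    """List stretches of full_seq that are absent from crystal_seq.

    Returns tuples of (location, 1-based start position, missing residues).
    Only pure deletions are reported; substitutions are ignored in this sketch.
    """
    matcher = SequenceMatcher(None, full_seq, crystal_seq, autojunk=False)
    gaps = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag != "delete":
            continue
        if i1 == 0:
            location = "N-terminal"
        elif i2 == len(full_seq):
            location = "C-terminal"
        else:
            location = "internal"
        gaps.append((location, i1 + 1, full_seq[i1:i2]))
    return gaps

# Toy FASTA-like sequence (invented flanks around KAEQGEH) and its
# crystal-derived counterpart lacking "MG", "KAEQGEH" and the final "C".
full_seq = "MGSTLVPRKAEQGEHYWDINC"
crystal_seq = "STLVPRYWDIN"
for gap in missing_segments(full_seq, crystal_seq):
    print(gap)
```

For short or repetitive sequences the placement of a gap can be ambiguous, so for real proteins a dedicated pairwise aligner would be a safer choice than this generic string comparison.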

Results

The sequences of the nine proteins in both formats are shown in Figure 1. In contrast to the sequences derived from the FASTA files, the crystal-derived sequences lacked some amino acids at the termini as well as at internal locations of the polypeptide. Upon folding these in silico with JPred4, we find (Figure 1) that each polypeptide gave rise to a panorama of α-helix, β-sheet and disordered/random-coil domains (see Materials and methods). Since the folding pattern obtained with PSSPred was nearly the same with respect to the number and positions of the different structural domains, we have restricted this presentation to JPred4 for proteins 1-9.

Figure 1 Panorama of secondary structures from the mature protein sequence (FASTA) and the crystal structure sequence folded with JPred4: The different secondary structure domains are coloured according to a gray scale. The crystal-derived sequence is also colour coded in gray scale according to secondary structure, and the arrows indicate the missing regions. The oligopeptide region missing in the crystal-derived sequence is highlighted in the FASTA sequence.

Table 1 shows the number of amino acid residues missing in crystal-derived sequences. 2P0L, 3RWL and 3RC3 also exhibit long missing oligopeptides at the termini. Indeed, all crystal-derived sequences contain one or more internal missing oligopeptides, 3 to 33 residues long. Table 2 describes the distribution of missing residues in crystals based on their physicochemical properties and number. These were highlighted in the sequences from the mature protein (FASTA file) in Figure 1. We find that, in 10 cases, most of the residues missing from the internal oligopeptide are hydrophilic. In the remaining 6, the ratio of hydrophobic residues to the total number of missing residues is more than 0.5.

Protein	PDB Id	Missing residues in crystal-derived sequence
		N terminal	C terminal	Internal strings
Saicar synthase	1A48	1	0	7
Saicar synthase	2CNQ	1	0	3
Lipoate-protein ligase A	2P0L	3	16	3
P450 pyr hydroxylase	3RWL	15	0	7
Hydroxymethylbilane synthase	2YPN	2	0	17
UDP-n-acetylmuramoyl-L-alanine:D-glutamate ligase	1UAG	0	0	5, 4
Glycinamide ribonucleotide synthetase	1GSO	0	0	6, 3
Folypolyglutamate synthetase	1FGS	0	0	32, 5, 7, 6, 12
Mitochondrial helicase suv3	3RC3	12	0	14, 11, 33

Table 1 Missing oligopeptide in the crystal structure derived sequence

Protein	PDB Id	Missing oligopeptide	Proline residues	Glycine residues	Charged residues	Polar uncharged	Hydrophobic	Total amino acids
Saicar synthase	1A48	KAEQGEH	0	1	4	1	2	7
Saicar synthase	2CNQ	EQG	0	1	1	1	1	3
Lipoate-protein ligase A	2P0L	ERK	0	0	3	0	0	3
P450 pyr hydroxylase	3RWL	QKGGDGG	0	4	2	1	4	7
Hydroxymethylbilane synthase	2YPN	TRGDVILDTPLAKVGGK	1	3	5	2	10	17
UDP-n-acetylmuramoyl-L-alanine:D-glutamate ligase	1UAG	GADER	0	1	3	0	2	5
"	"	HQQG	0	1	1	2	1	4
Glycinamide ribonucleotide synthetase	1GSO	DGLAAG	0	2	1	0	5	6
Folypolyglutamate synthetase	1FGS	KT	0	0	1	1	0	2
"	"	IGGDT	0	2	1	1	3	5
"	"	HQKLLGH	0	1	3	1	3	7
"	"	ILADKD	0	0	3	0	3	7
"	"	ALPEAGYEALHE	1	1	4	0	7	12
Mitochondrial helicase suv3	3RC3	GPSADGDVGAELTR	0	3	4	2	8	14
"	"	PSINEKGEREL	1	1	5	5	4	11

Table 2 Distribution of missing residues in crystal structure

Note: only one aromatic residue (tyrosine) was seen in the internal missing oligopeptide strings, in 1FGS.

The secondary structures predicted for the internal missing oligopeptides and their flanking tripeptides from both FASTA and PyMol (crystal) formats are shown in Table 3. Surprisingly, 10 out of 16 internal missing oligopeptides form disordered domains (DD). Among the remaining 6, two disordered domains adjoin a terminal residue from a β sheet, one adjoins an α helix and 3 are from a putative helix. In the tripeptides flanking the internal missing strings, we find that, on the N-terminal side, 10 out of 16 form DD, 3 form β sheets, 2 are from an α helix and 1 forms a junction between a β sheet and random coil. Among the C-terminal tripeptides, 7 are from disordered domains, 3 form a junction between a disordered domain and an α helix, 2 form β sheet-DD junctions, and 2 each are from a β sheet and an α helix. Thus, clearly, all missing strings are part of original disordered domains.

Protein name	PDB Id	Missing oligopeptide	Secondary structure of residues after folding FASTA sequence in JPred4
			Missing oligopeptide	Tripeptide flanking the missing oligopeptide region
				N terminal	C terminal
SAICAR synthase	1A48	KAEQGEH	random coil	random coil	random coil
SAICAR synthase	2CNQ	EQG	random coil	random coil	random coil
Lipoate-protein ligase A	2P0L	ERK	random coil	random coil	random coil and α helix
P450 pyr hydroxylase	3RWL	QKGGDGG	random coil	random coil	random coil
Hydroxymethylbilane synthase	2YPN	TRGDVILDTPLAKVGGK	β sheet and random coil	β sheet	random coil and α helix
UDP-n-acetylmuramoyl-L-alanine:D-glutamate ligase	1UAG	GADER	β sheet and random coil	α helix	β sheet
"	"	HQQG	random coil	β sheet	β sheet
Glycinamide ribonucleotide synthetase	1GSO	DGLAAG	random coil	β sheet and random coil	random coil
"	"	DDE	random coil	random coil	random coil and β sheet
Folypolyglutamate synthetase	1FGS	KT	random coil	random coil	random coil and α helix
"	"	IGGDT	α helix and random coil	α helix	random coil
"	"	HQKLLGH	α helix and random coil	random coil	α helix
"	"	ILADKD	random coil	β sheet	α helix
"	"	ALPEAGYEALHE	α helix and random coil	random coil	random coil
Mitochondrial helicase suv3	3RC3	GPSADGDVGAELTR	random coil	random coil	random coil
"	"	PSINEKGEREL	β sheet and random coil	random coil	β sheet and random coil

Table 3 Predicted secondary structure of the internal missing oligopeptide and the flanking residues
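The kind of lookup summarised in Table 3 can be sketched as below, assuming the prediction is available as a per-residue string in which H marks α helix, E marks β sheet and - marks random coil. That three-letter encoding, the function names `describe` and `gap_context`, and the toy prediction string are assumptions for illustration only; the entries of Table 3 were not produced with this code.

```python
# Assumed per-residue encoding of a secondary structure prediction.
STATE_NAMES = {"H": "α helix", "E": "β sheet", "-": "random coil"}

def describe(states):
    """Name every state present in a stretch, e.g. 'β sheet and random coil'."""
    present = [name for code, name in STATE_NAMES.items() if code in states]
    return " and ".join(present)

def gap_context(prediction, start, end, flank=3):
    """Describe a missing stretch and its flanking tripeptides.

    `start` and `end` are 0-based positions in the full-length prediction,
    with `end` exclusive. Returns (missing, N-terminal flank, C-terminal flank).
    """
    n_flank = prediction[max(0, start - flank):start]
    c_flank = prediction[end:end + flank]
    return describe(prediction[start:end]), describe(n_flank), describe(c_flank)

# Toy prediction: a β strand, a 7-residue coil, then a helix.
prediction = "EEEE" + "-" * 7 + "HHHH"
print(gap_context(prediction, 4, 11))
print(gap_context(prediction, 3, 8))
```

A stretch that spans two states is reported with both names, which corresponds to the junction entries in the table.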

Comparing the panoramas of secondary structures derived from the crystal structure with those computed by folding sequences from both mature proteins and crystals with JPred4, we find (Table 4) that crystals underestimate the number of disordered domains as well as the number of residues therein. Indeed, a combined analysis of the 9 proteins reveals that the ratio (number of secondary structure domains : number of amino acid residues) is comparable for α helices and β sheets, but substantially lower for disordered domains in crystals than in the in silico folded mature protein. Similar results were obtained by folding with PSSPred (not shown).

Protein name	PDB Id	Source	α helix [no. of motifs (no. of amino acids)]			β sheet [no. of motifs (no. of amino acids)]			Random coil [no. of motifs (no. of amino acids)]
			mature protein	crystal structure		mature protein	crystal structure		mature protein	crystal structure
			JPred4	derived	JPred4	JPred4	derived	JPred4	JPred4	derived	JPred4
Saicar synthase	1A48	Saccharomyces cerevisiae ATCC 204508	6 (82)	7 (116)	7 (85)	10 (56)	15 (106)	8 (53)	17 (168)	20 (84)	16 (160)
Saicar synthase	2CNQ	Saccharomyces cerevisiae ATCC 204508	6 (82)	6 (127)	5 (82)	10 (56)	15 (69)	10 (57)	17 (168)	17 (106)	16 (163)
Lipoate-protein ligase A	2P0L	Streptococcus agalactiae	10 (105)	9 (118)	8 (93)	10 (57)	11 (55)	10 (59)	21 (126)	20 (93)	9 (114)
P450 pyr hydroxylase	3RWL	Sphingopyxis macrogoltabida	14 (76)	14 (234)	14 (177)	6 (36)	12 (40)	6 (34)	9 (214)	24 (130)	20 (193)
Hydroxymethylbilane synthase	2YPN	Escherichia coli K12	8 (112)	11 (112)	8 (113)	11 (69)	13 (76)	11 (62)	20 (132)	21 (106)	20 (119)
UDP-n-acetylmuramoyl-L-alanine:D-glutamate ligase	1UAG	Escherichia coli K12	15 (161)	20 (161)	20 (152)	17 (83)	20 (89)	20 (88)	32 (193)	38 (178)	33 (188)
Glycinamide ribonucleotide synthetase	1GSO	Escherichia coli K12	11 (127)	16 (128)	12 (130)	20 (97)	16 (99)	12 (99)	12 (99)	32 (207)	33 (192)
Folypolyglutamate synthetase	1FGS	Lactobacillus casei	16 (192)	15 (172)	13 (166)	14 (67)	16 (62)	13 (70)	31 (169)	29 (159)	27 (157)
Mitochondrial helicase suv3	3RC3	Homo sapiens	31 (441)	26 (444)	26 (366)	13 (60)	16 (69)	14 (65)	43 (176)	42 (164)	37 (246)

Table 4 Secondary structure from the mature protein sequence and the crystal structure derived sequence, both folded using JPred4, and from the original crystal structure
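Each entry in Table 4 pairs a number of motifs with a number of residues. A minimal sketch of how such pairs can be tallied from a per-residue prediction string is given below. The H/E/- encoding, the function name `count_domains` and the toy string are assumptions for illustration, and every uninterrupted run of one state is counted as one motif, which may differ from the counting rule used for the table.

```python
from itertools import groupby

def count_domains(prediction):
    """Map each state to (number of uninterrupted runs, total residues)."""
    summary = {}
    for state, run in groupby(prediction):
        n_segments, n_residues = summary.get(state, (0, 0))
        summary[state] = (n_segments + 1, n_residues + len(list(run)))
    return summary

# Toy prediction: one β strand, two coils and two helices.
toy_prediction = "EEEE" + "-" * 7 + "HHHH" + "--" + "HH"
for state, (n_segments, n_residues) in count_domains(toy_prediction).items():
    print(state, n_segments, n_residues)
```

Running the same tally on the full-length and the crystal-derived predictions gives the two sets of counts whose domain-to-residue ratios can then be compared.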

We find that only 2 crystals reveal histidine-rich oligopeptides at the N or C terminus and none in the internal missing oligopeptide strings (data not shown).

Table 2 lists the relative distribution of Proline, Glycine, charged and hydrophobic residues in the internal missing strings. Thus, there is a high concentration of flexible (Glycine, 20/107) and charged residues (43/107) in these strings, while rigid Proline is of rare occurrence (3/107). Similarly, there is only 1 aromatic residue in the missing strings (not shown).
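The tallies behind Table 2 can be sketched as follows. The residue classes used here are an assumption inferred from the first rows of the table, where histidine appears to be counted as charged and glycine and proline appear to be counted among the hydrophobic residues; the aromatic residues are kept as a separate class. The function name `composition` and the class sets are hypothetical and are not a classification stated in this article.

```python
# Assumed residue classes (inferred, not stated in the article).
CHARGED = set("DEKRH")
POLAR_UNCHARGED = set("STNQC")
HYDROPHOBIC = set("AVLIMGP")
AROMATIC = set("FWY")

def composition(peptide):
    """Count residues of each assumed class in a one-letter peptide string."""
    return {
        "proline": peptide.count("P"),
        "glycine": peptide.count("G"),
        "charged": sum(aa in CHARGED for aa in peptide),
        "polar_uncharged": sum(aa in POLAR_UNCHARGED for aa in peptide),
        "hydrophobic": sum(aa in HYDROPHOBIC for aa in peptide),
        "aromatic": sum(aa in AROMATIC for aa in peptide),
        "total": len(peptide),
    }

counts = composition("KAEQGEH")  # internal missing string of 1A48 (Table 2)
print(counts)
print(round(counts["hydrophobic"] / counts["total"], 2))
```

Under these assumed classes the counts for KAEQGEH agree with its row in Table 2 (no proline, 1 glycine, 4 charged, 1 polar uncharged, 2 hydrophobic, 7 in total); other rows may not be reproduced if a different classification was used.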

Discussion

According to Djinovic-Carugo & Carugo,¹ most crystallographic data reveal internal missing strings of oligopeptides. Here we describe in detail such strings in 9 proteins and analyze their position in the overall panorama of secondary structure domains of the polypeptide sequence. To study this aspect, we adopted the strategy of folding in silico two sequences for the same protein: one representing the post-translationally processed polypeptide and the other derived from the crystal. The issue here is that when the crystal structure is obtained at low resolution, a number of residues fail to be detected due to low electron density. Therefore, by comparing the two amino acid sequences of the same protein, we identify the residues missing in the crystal-derived sequence.

Comparing the panoramas of secondary structure domains obtained after folding the sequences in silico allowed us to detect the secondary structure domains to which the missing residues belong, and we conclude that most are disordered domains. This is further supported by the fact that these are rich in the flexible amino acid glycine and poor in the rigid proline. We find that, out of the 16 cases, the proportion of hydrophobic residues is less than 0.5 in 10 cases and less than 0.6 in the remaining ones. These disordered oligopeptides contain a high concentration of charged residues and nearly 20% glycines. We conclude that the apparent loss, or lack of detectability, in crystals of large internal oligopeptide strings involves highly disordered domains, which probably accounts for the difficulty crystallographers face in designating a signature domain to the missing internal string. In fact, since these strings are not actually absent in polypeptides, the inability to detect them leads to an incomplete crystal structure. Clearly, in most cases the problem can be solved by comparing the in silico folded amino acid sequences of mature proteins to those derived from crystals.

Finally, one must consider the structural and functional relevance of the apparently missing segments. To that effect, we are now assessing the propensity of various triads involved in defining functionally important sites for enzyme-substrate interactions as well as other protein-protein binding. Another possible approach is to examine the missing oligopeptides, alone and with their flanking regions, in Ramachandran plots. The question, therefore, remains as to how one should solve the crystal structure beyond the offerings of crystallography. In any case, it is unlikely that proteins exist in crystalline form in vivo; they probably exhibit a metastable state with variable mobility of flexible regions depending on the intracellular environment.

Indeed, polymeric structures, namely, micelles, membranes and globular proteins exhibit a hydrophobic core and hydrophilic exterior that are differentially sensitive to perturbations by osmotic pressure, ionic strength and temperature and exhibit differential movements such that the kinetic energy between the two domains is conserved.