An Overview of the Protein Thermostability Prediction: Databases and Tools

doi:10.15406/jnmr.2016.03.00072

Journal of Nanomedicine Research

eISSN: 2377-4282

Review Article Volume 3 Issue 6

Maryam Mahmoudi,¹ Seyyed Shahriar Arab,¹

Javad Zahiri,² Yasaman Parandian²

¹Department of Biophysics, Tarbiat Modares University (TMU), Iran
²Bioinformatics and Computational Omics Lab (BioCOOL), Tarbiat Modares University (TMU), Iran

Correspondence: Seyyed Shahriar Arab, Department of Biophysics, School of Biological Sciences, Tarbiat Modares University (TMU), Tehran, Iran

Received: March 12, 2016 | Published: July 5, 2016

Citation: Mahmoudi M, Arab SS, Zahiri J, Parandian Y (2016) An Overview of the Protein Thermostability Prediction: Databases and Tools. J Nanomed Res 3(6): 00072. DOI: 10.15406/jnmr.2016.03.00072

Abstract

Thermophilic proteins are characterized by high thermal stability, while mesophilic proteins are stable at lower temperatures. Both types of protein have numerous applications in protein engineering, drug design and industrial processes. Studies have shown that thermal stability in thermophilic proteins is strongly related to structural and sequence properties, and several computational studies have been undertaken to identify these properties in heat-resistant proteins. This paper reviews studies of protein thermostability prediction and introduces the tools and databases related to thermal stability.

Keywords: Protein thermostability, Thermophilic proteins, Mesophilic proteins, Databases, Computational methods, Bioinformatics

Introduction

Environmental temperature plays an important role in the life of the cell.¹ Organisms fall into four classes according to their optimal growth temperature: hyperthermophiles (>80 °C), thermophiles (45-80 °C), mesophiles (20-45 °C) and psychrophiles (<20 °C).² Thermal stability is defined as the ability of a material to resist irreversible changes in its physical or chemical structure or, for proteins, as the stability of the spatial structure of polypeptide chains at high temperatures.³ Studies have shown that the thermal stability of thermophilic proteins is related to a set of sequence and structural properties.⁴ A few of these properties are introduced in this paper. Differences in amino acid composition between mesophilic and thermophilic proteins have also been studied.³,⁵⁻⁷ For instance, the research of Zhang and Gromiha shows that Lys, Arg, Glu and Pro are more frequent, and Ser, Met, Asp and Thr less frequent, in thermophilic than in mesophilic proteins⁶,⁸ (Figure 1). The stability of secondary structure elements such as the alpha-helix is considered a necessary factor for thermal stability.⁶ Studies suggest that certain characteristics increase the thermal stability of proteins: an increased number of hydrogen bonds,⁷ salt bridges and ion pairs,⁹ aromatic clusters,⁸ sidechain-sidechain interactions and electrostatic interactions of charged residues,⁹ and hydrophobic interactions.⁵
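The four growth-temperature classes listed above can be written as a small lookup function. This is only an illustrative sketch: the function name is hypothetical, and because the quoted ranges share their end points (45 °C and 20 °C), the handling of boundary values is an assumption made here, not something stated in the cited work.

```python
def growth_class(optimal_temp_c: float) -> str:
    """Map an optimal growth temperature (in °C) to one of the four classes.

    Boundary handling is an assumption: exactly 80 °C and 45 °C are treated
    as thermophile, and exactly 20 °C as mesophile.
    """
    if optimal_temp_c > 80:
        return "hyperthermophile"
    if optimal_temp_c >= 45:
        return "thermophile"
    if optimal_temp_c >= 20:
        return "mesophile"
    return "psychrophile"


if __name__ == "__main__":
    for temp in (95, 60, 37, 4):
        print(temp, growth_class(temp))
```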

Figure 1 Comparison of amino acid composition in thermophilic and mesophilic proteins.⁶,⁸

Protein’s Thermal Stability Prediction Methods

Protein thermal stability can be predicted from sequence or from structure. Both approaches, with their advantages and limitations, are discussed below. Table 1 gives an overview of thermal stability prediction methods.

Sequence/Structure Feature	Algorithm	Reference
Amino acid sequence	Support vector machine	¹⁰
Primary structure	LogitBoost	¹²
Amino acid sequence, residue and dipeptide composition	Neural network	⁸
Primary, secondary and tertiary structure information	Decision tree	¹¹
Amino acid distribution and dipeptide composition	Support vector machine	¹³
Amino acid composition-based similarity distance	KNN-ID	¹
Dipeptide composition	Statistical methods	¹⁴
Amino acid sequence	Genetic algorithm	¹⁵
Thermodynamic parameters	Statistical potentials	¹⁶

Table 1 An overview of protein thermostability prediction studies.

Sequence based prediction

This approach uses protein sequence information, for instance the amino acid distribution and dipeptide composition, to discriminate thermophilic from mesophilic proteins. Studies have revealed differences in amino acid and dipeptide composition between the two groups. For example, Lys, Arg, Glu and Pro are more frequent in thermophilic than in mesophilic proteins.⁸,¹⁰ These studies also show that the dipeptides EE, KK, RR, PP, KI, VV, VE, KE and VK occur more often in thermophilic proteins, while QQ, AA, EQ, LL, NN and QT occur less often.⁶ In addition, charged, hydrophobic and aromatic amino acids are more frequent in thermophilic proteins than in mesophilic ones.³ Moreover, a correlation between a protein’s amino acid composition and its biological function has been demonstrated.¹ Sequence analysis therefore provides valuable information for predicting protein thermostability, particularly when structural information is not available.
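The two sequence features named above, amino acid composition and dipeptide composition, can be computed as relative frequencies. The sketch below is one minimal way to do this; the function names and the toy sequence are hypothetical, and ignoring non-standard residue letters is an assumption rather than a choice taken from the cited studies.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq: str) -> dict:
    """Return the relative frequency of each standard amino acid in seq."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)  # non-standard letters ignored
    return {a: (counts[a] / total if total else 0.0) for a in AMINO_ACIDS}


def dipeptide_composition(seq: str) -> dict:
    """Return the relative frequency of each of the 400 possible dipeptides."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Keep only overlapping pairs made of two standard residues.
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    counts = Counter(valid)
    total = len(valid)
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product(AMINO_ACIDS, repeat=2)
    }


if __name__ == "__main__":
    toy_sequence = "MKKEEVR"  # synthetic example, not a real protein
    aac = amino_acid_composition(toy_sequence)
    dpc = dipeptide_composition(toy_sequence)
    print("Lys fraction:", round(aac["K"], 3))
    print("EE fraction:", round(dpc["EE"], 3))
```

The resulting 20-element and 400-element vectors are the kind of fixed-length input that the classifiers in Table 1 require, whatever the length of the original sequence.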

Structure based prediction

Structure-based studies of protein thermostability prediction use secondary and tertiary structure information to discriminate thermophilic from mesophilic proteins. Important features considered in these studies include the amount of secondary structure, ion pairs, hydrogen bonds, disulfide bonds and accessible surface area.¹¹ Thermal stability is directly related to the stability of the protein structure.¹¹ Because both structural and sequence features affect thermal stability, using the two together leads to more accurate prediction. Structural information is not always available, however, and this limits structure-based thermostability prediction.

Protein’s thermal stability prediction procedure: Several machine learning methods have been applied to predict protein thermostability. Here, we briefly review these methods. Figure 2 illustrates the general procedure. First, a dataset of thermophilic and mesophilic proteins is collected from the relevant databases. The proteins are then analyzed in terms of their sequence and structural characteristics; the goal of this stage is to select the features that matter most for thermostability prediction. Considering structural and sequence features together can produce more precise results. Next, the dataset is divided into training and test sets. The training set is used to train the machine learning algorithm, while the test set is used to evaluate the model.
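The procedure described above (collect a labeled dataset, extract features, split into training and test sets, train, evaluate) can be sketched end to end in a few lines. Everything in the sketch is an assumption made for illustration: the sequences are synthetic, the features are simply the fractions of Lys, Arg, Glu and Pro, and a nearest-centroid classifier stands in for the algorithms of Table 1 because it needs no external library.

```python
import random

# Residues the text reports as enriched in thermophilic proteins.
FEATURE_RESIDUES = "KREP"

# Synthetic toy sequences (not real proteins), labeled by hand.
TOY_DATA = [
    ("KKEERRPPKEVL", "thermophile"),
    ("EEKRPKKEELVR", "thermophile"),
    ("RKEPEKKRIEVE", "thermophile"),
    ("PKEEKRRKLEEV", "thermophile"),
    ("SSTTQQNNASMD", "mesophile"),
    ("QTSNAADSMTQL", "mesophile"),
    ("NSQTDMSTAAQS", "mesophile"),
    ("TDSSQNMATQSA", "mesophile"),
]


def features(seq):
    """Feature extraction: fraction of each selected residue in the sequence."""
    return [seq.count(r) / len(seq) for r in FEATURE_RESIDUES]


def split_dataset(data, test_fraction=0.25, seed=0):
    """Shuffle reproducibly and split into (training set, test set)."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


def fit_centroids(train):
    """Training: compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for seq, label in train:
        vec = features(seq)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, seq):
    """Prediction: return the label of the closest centroid (squared distance)."""
    vec = features(seq)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vec, centroids[label])),
    )


def accuracy(centroids, test):
    """Evaluation: fraction of test sequences whose label is predicted correctly."""
    correct = sum(predict(centroids, seq) == label for seq, label in test)
    return correct / len(test)


train_set, test_set = split_dataset(TOY_DATA)
model = fit_centroids(train_set)
print(f"test accuracy on toy data: {accuracy(model, test_set):.2f}")
```

The toy classes are deliberately easy to separate, so the score says nothing about real proteins; the point is only the order of the steps and the fact that the test set is never used during training.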

Figure 2 Thermal stability prediction procedure.

Prediction algorithms based on machine learning methods

The following section introduces a few machine learning algorithms that have been used to distinguish thermophilic from mesophilic proteins.

Support vector machines (SVMs): The support vector machine is a machine learning method for classifying data into two classes; many kinds of kernel function can be used with this algorithm.¹⁷
Artificial Neural Networks (ANN): The ANN concept is inspired by the neural structure of the brain. In this type of model, the system learns from data (a large number of inputs) and can solve a wide variety of tasks. ANN software packages can be downloaded from OpenNN (available online: http://www.cimne.com/flood/download.asp).¹⁵
Decision Tree: The decision tree is a popular machine learning algorithm in bioinformatics and computational biology. It uses a tree-like graph or model of decisions and their possible consequences to classify input instances.
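To make the decision-tree idea concrete, the sketch below fits the simplest possible tree, a single split on one feature (often called a decision stump). It is an illustration only: the function name, the feature (a hypothetical charged-residue fraction) and the toy values are assumptions, and the trees used in the cited studies combine many such splits over many features.

```python
def fit_stump(values, labels):
    """Fit a one-level decision tree on a single numeric feature.

    values: feature value per protein.
    labels: True for thermophilic, False for mesophilic.
    The rule learned is "predict thermophilic if value > threshold".
    Returns (threshold, training_errors), or None if all values are equal.
    """
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for threshold in candidates:
        errors = sum((v > threshold) != y for v, y in zip(values, labels))
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best


if __name__ == "__main__":
    # Hypothetical charged-residue fractions for six toy proteins.
    toy_values = [0.10, 0.12, 0.15, 0.30, 0.33, 0.35]
    toy_labels = [False, False, False, True, True, True]
    print(fit_stump(toy_values, toy_labels))
```

A full decision tree repeats this search recursively on each side of the chosen split, typically scoring splits with an impurity measure rather than the raw error count used here.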

Performance measures

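Assessing a prediction tool is a critical task. Commonly used measures for assessing prediction performance include accuracy, sensitivity, specificity, strength, Matthews correlation coefficient (MCC), precision, F-measure and area under the ROC curve (AUC); Table 2 gives the expressions for several of them. These measures are based on four basic parameters: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).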

Expression	A brief description
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$	Proportion of correct predictions
$\text{Sensitivity} = \frac{TP}{TP + FN}$	Proportion of positives correctly predicted
$\text{Specificity} = \frac{TN}{TN + FP}$	Proportion of negatives correctly predicted
$\text{Precision} = \frac{TP}{TP + FP}$	Positive predictive value
$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$	Harmonic mean of precision and sensitivity

Table 2 Commonly used measures for performance assessment in protein thermostability prediction.

True positive (TP): The number of thermophilic proteins correctly predicted as thermophilic.
True negative (TN): The number of mesophilic proteins correctly predicted as mesophilic.
False positive (FP): The number of mesophilic proteins incorrectly predicted as thermophilic.
False negative (FN): The number of thermophilic proteins incorrectly predicted as mesophilic.
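Given the four counts defined above, the measures of Table 2 follow directly. The sketch below computes them, with thermophilic taken as the positive class as in the definitions. The function name and the example counts are hypothetical, returning 0.0 when a denominator is zero is a convention chosen here, and the MCC line uses the standard textbook formula, which Table 2 itself does not list.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the Table 2 measures, plus MCC, from the four basic counts."""

    def ratio(numerator, denominator):
        # Convention assumed here: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    f_measure = ratio(2 * precision * sensitivity, precision + sensitivity)
    # Matthews correlation coefficient (standard definition, not in Table 2).
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts for 50 thermophilic and 50 mesophilic test proteins.
    for name, value in classification_metrics(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.3f}")
```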

Databases

To build a model capable of predicting protein thermal stability, a dataset is first created from the relevant databases. This dataset contains structural and sequence information on thermophilic and mesophilic proteins. Table 3 describes a few databases that have been used in studies of protein thermal stability prediction. As Table 3 shows, the PGT and ProTherm databases are used specifically for thermal stability prediction, the PDB is used to extract structural information, and UniProt provides sequence information on thermophilic and mesophilic proteins.

Category	Database	Note	Ref. Num
General Databases	UniProt	The Universal Protein Resource (UniProt) provides a stable, comprehensive, freely accessible, central resource on protein sequences and functional annotation. This DB is used to extract the sequence information of thermophilic and mesophilic proteins. Availability: http://www.uniprot.org.	18
General Databases	PDB	The Protein Data Bank contains information on the 3D structures of large biological molecules, including proteins and nucleic acids. This DB is used to extract structural information of thermophilic and mesophilic proteins. Availability: http://www.rcsb.org.	19
Specific Databases	ProTherm	ProTherm is a thermodynamic database that contains experimentally determined thermodynamic parameters of protein stability. This DB is specifically used to predict the thermal stability. Availability: http://gibk26.bse.kyutech.ac.jp/jouhou/Protherm/protherm.htm	20
Specific Databases	PGT	PGT is the Prokaryotic Growth Temperature database (PGTdb). This DB is specifically used to predict the thermal stability. Availability: http://pgtdb.csie.ncu.edu.tw	2

Table 3 List of databases used in protein thermostability prediction.
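Sequence databases such as UniProt commonly distribute sequences as FASTA text, so building the dataset usually starts with reading such a file. The sketch below is a minimal reader under that assumption; the function name is hypothetical, and the two records are synthetic placeholders rather than real database entries.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into a {header: sequence} dictionary.

    A record starts with a line beginning with '>' (the header); all
    following lines up to the next header are joined into one sequence.
    """
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Synthetic records for illustration only.
    toy_fasta = """>toy_thermophile_1 hypothetical protein
KKEERRPPKE
VLEEKR
>toy_mesophile_1 hypothetical protein
SSTTQQNNASMD
"""
    for name, sequence in parse_fasta(toy_fasta).items():
        print(name, len(sequence))
```

Labels (thermophilic or mesophilic) are not part of the sequence file; in practice they would have to be assigned from the source organism, for example by looking up its growth temperature in a resource such as PGTdb.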

Conclusion

Because thermostable proteins and enzymes are now widely used in industry, protein engineering and other theoretical and experimental studies play a significant role in characterizing protein thermal stability. Given the high cost of laboratory procedures, theoretical methods that predict thermal stability with high accuracy can be very helpful. So far, most computational studies identifying thermophilic and mesophilic proteins have been based solely on protein sequence. Because both structural and sequence features affect thermal stability, using the two together leads to more accurate prediction.