Research Article Volume 1 Issue 2
1Department of CSE, Jawaharlal Nehru Technological University, India
2Department of CSE, Acharya Nagarjuna University, India
3Department of CSE, University College of Engineering, India
Correspondence: Pokkuluri Kiran Sree, Department of CSE, Jawaharlal Nehru Technological University, Hyderabad, India, Tel +919493050794
Received: April 11, 2014 | Published: June 30, 2014
Citation: Sree PK, Babu IR, SSSN Usha Devi N. Cellular automata in splice site prediction. MOJ Proteomics Bioinform. 2014;1(2):49-52. DOI: 10.15406/mojpb.2014.01.00013
Splice site prediction is one of the important problems in bioinformatics. Splicing is the process by which introns are removed from the pre-mRNA transcript and exons are joined before translation. The position where an intron is spliced out is called a splice site. Identifying splice junctions plays a vital role in understanding genes: for an efficient study of eukaryotic genes, the first step is to predict splice sites accurately, since accurate splice site prediction leads to accurate prediction of gene structure. A candidate site falls into one of three categories: acceptor site (AS), donor site (DS), or neither of these. The proposed classifier, AIS-SSMACA, takes a DNA sequence as input and predicts its category (AS/DS/Neither).
Keywords: splicing junction, cellular automata, multiple attractor
AS, acceptor site; DS, donor site
The donor site lies at the start of an intron, i.e. at its 5' end (towards the left). At the donor site, introns frequently start with the dinucleotide GT. The acceptor site lies at the end of an intron, i.e. at its 3' end (towards the right). At the acceptor site, introns frequently end with the dinucleotide AG. Intron/exon borders are called acceptors (scanning from left) and exon/intron borders are called donors (scanning from right), as shown in Figure 1.
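The GT and AG signals described above can be located with a simple scan. The following is a minimal sketch, not part of AIS-SSMACA; the function names and the toy sequence are assumptions made for illustration. It only lists candidate positions, since most GT and AG occurrences are not real splice sites.

```python
def find_motif(seq, motif):
    """Return the 0-based start index of every occurrence of motif in seq."""
    seq, motif = seq.upper(), motif.upper()
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


def candidate_sites(seq):
    """List GT (possible donor) and AG (possible acceptor) positions.

    These are candidates only: most GT and AG occurrences are not splice sites.
    """
    return {"donor_GT": find_motif(seq, "GT"), "acceptor_AG": find_motif(seq, "AG")}


# Toy sequence (hypothetical, not taken from the data set).
print(candidate_sites("AAGGTAAGT"))
# {'donor_GT': [3, 7], 'acceptor_AG': [1, 6]}
```

A scan like this cannot handle sites that lack the canonical dinucleotides, which is the gap a trained classifier is meant to close.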
Many researchers have proposed methods for predicting these splice sites, but a classifier with higher accuracy is still needed. We have reviewed the methodologies of the following well-known splice site prediction techniques: NNtree,1 Netgene2,2 HSPL,3 NNSplice,4 SpliceView5 and genesplicer.2
The datasets are extracted from the Irvine Primate Splice junction database6 (http://archive.ics.uci.edu/ml/machine-learning-database). The data set consists of 3190 DNA sequences, each of length 60. Of the 3190 sequences, 25% belong to the donor site category, 25% belong to the acceptor site category and 50% belong to neither of these.
The main aim of the learning algorithm is to encode the DNA in multiples of three and produce an AIS-SSMACA with n attractors, k cells and m classes. Since the input is of fixed length (60 bp), n is fixed at 4, k at 3 and m at 3. At the end of the execution of the learning algorithm we have a set of basins that represent the classes.
Learning algorithm
Input: DNA sequence
Output: AIS-SSMACA tree with n attractor basins.
Step 1: Read the input DNA sequence and process the sequence in the multiples of three. (Three neighborhood CA is used).
Step 2: Encode the input in the multiples of three.
Step 3: Choose a high fitness rule and apply it on the input to construct an n-attractor, k-cell, 3-class AIS-SSMACA.
Step 4: Store all the basins constructed, repeat steps 1, 2, 3 till n-attractors are stored.
Step 5: Stop.
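The mechanics of the steps above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the paper does not give its encoding, rule vector or fitness function, so the one-bit-per-base encoding (`PURINE_BIT`), the rule vector `RULES = (204, 204, 240)` and the majority-vote labelling in `learn` are all hypothetical. The rule vector was chosen because it gives a 3-cell, null-boundary, three-neighborhood CA with exactly 4 fixed-point attractors, matching n = 4 and k = 3. Rule selection by fitness (Step 3) is not shown.

```python
from collections import Counter, defaultdict
from itertools import product

# Assumption: one bit per base (purine = 1, pyrimidine = 0), so each triplet
# fills a k = 3 cell binary CA. The paper does not state its encoding.
PURINE_BIT = {"A": 1, "G": 1, "C": 0, "T": 0}

# Assumption: rule 204 copies the cell itself, rule 240 copies the left
# neighbour. Together they give exactly 4 fixed points on 3 cells.
RULES = (204, 204, 240)


def encode(seq):
    """Split seq into triplets and encode each one as a 3-bit state."""
    usable = len(seq) - len(seq) % 3
    return [tuple(PURINE_BIT[b] for b in seq[i:i + 3].upper())
            for i in range(0, usable, 3)]


def step(state, rules=RULES):
    """One synchronous update of a null-boundary, three-neighborhood CA."""
    padded = (0,) + tuple(state) + (0,)
    return tuple(
        (rule >> (padded[i] << 2 | padded[i + 1] << 1 | padded[i + 2])) & 1
        for i, rule in enumerate(rules)
    )


def attractor(state, rules=RULES):
    """Iterate until a state repeats; return the smallest state on the cycle."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state, rules)
    return min(seen[seen.index(state):])


def all_basins(rules=RULES):
    """Group every possible state by the attractor it falls into."""
    basins = defaultdict(list)
    for state in product((0, 1), repeat=len(rules)):
        basins[attractor(state, rules)].append(state)
    return dict(basins)


def learn(training, rules=RULES):
    """Label each attractor with the majority class of the triplets in its basin."""
    votes = defaultdict(Counter)
    for seq, label in training:
        for state in encode(seq):
            votes[attractor(state, rules)][label] += 1
    return {att: counts.most_common(1)[0][0] for att, counts in votes.items()}


# Hypothetical toy training data, only to show the data flow.
print(len(all_basins()))
print(learn([("GGGAAG", "DS"), ("CCCTTC", "AS")]))
```

With this rule vector every state reaches its attractor in at most one step, so each of the 4 basins holds 2 of the 8 possible states.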
Testing algorithm
The main aim of the testing algorithm is to distribute the corresponding input into the generated basins. During this process, the fitness and diversity of each intermediate node are calculated for efficient development of the desired tree. Once the DNA sequence identifies a basin uniquely, the class associated with that basin is reported.
Input: DNA sequence
Output: DNA Class (Acceptor/Donor/Neither)
Step 1: Read the input DNA sequence and process the sequence in the multiples of three.
Step 2: Encode the input in the multiples of three (As shown per discussion in 5.4)
Step 3: Distribute the input into the generated AIS-SSMACA basins till the entire sequence falls into an attractor of the tree.
Step 4: Report the basin and corresponding class.
Step 5: Stop.
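The testing steps above can be sketched as a basin lookup. As before, this is a hedged illustration rather than the authors' method: the encoding (`PURINE_BIT`), the rule vector `RULES`, the `basin_class` table and the majority vote over triplets are assumptions, and the fitness and diversity calculations mentioned in the text are not modelled.

```python
from collections import Counter

# Assumptions (not from the paper): one bit per base and a fixed rule vector
# that gives a 3-cell CA with 4 fixed-point attractors.
PURINE_BIT = {"A": 1, "G": 1, "C": 0, "T": 0}
RULES = (204, 204, 240)


def encode(seq):
    """Split seq into triplets and encode each one as a 3-bit state."""
    usable = len(seq) - len(seq) % 3
    return [tuple(PURINE_BIT[b] for b in seq[i:i + 3].upper())
            for i in range(0, usable, 3)]


def step(state, rules=RULES):
    """One synchronous update of a null-boundary, three-neighborhood CA."""
    padded = (0,) + tuple(state) + (0,)
    return tuple(
        (rule >> (padded[i] << 2 | padded[i + 1] << 1 | padded[i + 2])) & 1
        for i, rule in enumerate(rules)
    )


def attractor(state, rules=RULES):
    """Iterate until a state repeats; return the smallest state on the cycle."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state, rules)
    return min(seen[seen.index(state):])


def classify(seq, basin_class, default="Neither"):
    """Send each triplet to its attractor and report the majority class.

    basin_class maps attractor -> class label (a hypothetical table, as a
    learning phase might produce). Unlabelled attractors vote for default.
    """
    states = encode(seq)
    if not states:
        return default
    votes = Counter(basin_class.get(attractor(s), default) for s in states)
    return votes.most_common(1)[0][0]


# Hypothetical basin labels, only to show the lookup.
basin_class = {(1, 1, 1): "DS", (0, 0, 0): "AS"}
print(classify("GGGAAAGGG", basin_class))
```

Because classification is a fixed number of CA steps plus a dictionary lookup per triplet, its cost grows linearly with sequence length.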
This section shows the output of the proposed classifier. AIS-SSMACA takes a DNA sequence as input and reports the splice sites in both strands of the sequence. For Input 1, shown below, AIS-SSMACA identifies donor sites, one in the forward strand and one in the reverse strand. For Input 2, it identifies an acceptor site in the forward strand. For Input 3, it identifies that the sequence belongs to neither the donor nor the acceptor category.
Input 1 CCCAAGGCCAACCGCGAGAAGATGACCCAGGTGAGTGGCCCGCTACCTCTTCTGGTGGCC
Output:
# Sequence Sequence_human_Kiran_Splice_123jntuh=60bps
Sequence_human_Kiran_Splice_123jntuh, Human Splice Prediction
Donor Site Prediction
START END SCORE EXON INTRON
Donor Site Prediction in Reverse Strand
START END SCORE EXON INTRON
Acceptor Site Prediction
Nil
Acceptor Site Prediction in Reverse Strand
Nil
Input 2 CTCCCTGATGCCCTCAGAATCTCCCCACAGGCCGCCTGATCTTTGACAACTTGAAGAAAT
Output:
# Sequence Sequence_human_Kiran_Splice_83jntuh=60bps
Sequence_human_Kiran_Splice_83jntuh, Human Splice Prediction
Donor Site Prediction
Nil
Donor Site Prediction in Reverse Strand
Nil
Acceptor Site Prediction
START END SCORE INTRON EXON
Acceptor Site Prediction in Reverse Strand
Nil
Input 3 CCAGCAGGCTGAGGGCCAGAGCGGCCAGCCCTGGGAGCTGGCACTGGGTCGCTTTTGGGA
Output:
# Sequence Sequence_human_Kiran_Splice_89jntuh=60bps
Sequence_human_Kiran_Splice_89jntuh, Human Splice Prediction
Donor Site Prediction
Nil
Donor Site Prediction in Reverse Strand
Nil
Acceptor Site Prediction
Nil
Acceptor Site Prediction in Reverse Strand
Nil
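The outputs above report predictions for both the forward and the reverse strand. A common way to obtain the reverse strand is to take the reverse complement of the input, sketched below. The helper names are assumptions for illustration, and the paper does not describe how AIS-SSMACA derives the reverse strand.

```python
# Translation table for Watson-Crick complements (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def forward_start(reverse_start, motif_len, seq_len):
    """Map a 0-based start on the reverse strand back to forward coordinates."""
    return seq_len - reverse_start - motif_len


# Toy example (hypothetical): GT at forward index 2 appears as AC at
# index 0 of the reverse complement.
seq = "AAGT"
rev = reverse_complement(seq)
print(rev)                                     # ACTT
print(forward_start(rev.index("AC"), 2, len(seq)))  # 2
```

Any forward-strand scanner or classifier can then be run unchanged on the reverse complement, with positions mapped back through `forward_start`.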
Extensive experiments were conducted to establish the superiority of the AIS-SSMACA classifier over the existing approaches reviewed in section two: NNtree,1 Netgene2,2 HSPL,3 NNSplice,4 SpliceView5 and genesplicer.2 The analysis of the basic tree-building parameters (number of nodes, height of the tree and classification time) is reported in Table 1.
Method | Sensitivity | Number of nodes | Height of the tree | Classification time (ms)
AIS-SSMACA | 0.9695 | 4 | 3 | 400
NN Tree | 0.9348 | 5 | 3 | 515
C4.5 | 0.9012 | 12 | 12 | 668
Table 1 Performance of AIS-SSMACA
The most important strength of AIS-SSMACA is that it predicts acceptor and donor sites even when the acceptor input does not contain AG and the donor input does not contain GT. Across the 796 trained DNA sequences, the average height of the constructed AIS-SSMACA tree is 3. The number of nodes constructed to decide the class of a DNA sequence is 4. The average time to report the class of a DNA sequence is 0.4 seconds (400 ms), as shown in Table 1.
We have three categories of classes to be identified: SeA relates to acceptor site prediction, SeD relates to donor site prediction and SeN relates to neutral (neither) prediction. The sensitivity for identifying the acceptor class is highest for AIS-SSMACA (0.9695) and lowest for NNSplice (0.9256), due to the increased error rate in NNSplice. The sensitivity for identifying the donor class is high for genesplicer (0.9562) and lowest for Netgene2 (0.8568). The sensitivity for neutral prediction is high for AIS-SSMACA (0.9620) and lowest for NNSplice (0.9006). In an ideal splice site prediction the value of SeA+SeD+SeN is 3. AIS-SSMACA maintains a good balance among SeA, SeD and SeN, producing a value of 2.8827, which is the highest among the compared methods, as shown in Figure 2 and Table 2. After AIS-SSMACA, genesplicer shows the next best balance among SeA, SeD and SeN, with a value of 2.8742.
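The per-class sensitivities discussed above can be computed from true and predicted labels. The sketch below assumes the standard definition Se = TP / (TP + FN), evaluated one class at a time; the paper does not spell out its formula, and the label lists are hypothetical toy data rather than results from the experiments.

```python
def sensitivity(true_labels, predicted_labels, cls):
    """Fraction of sequences of class cls that were predicted as cls."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp + fn == 0:
        raise ValueError(f"no true examples of class {cls!r}")
    return tp / (tp + fn)


# Hypothetical labels, only to show the calculation.
truth = ["AS", "AS", "DS", "DS", "Neither", "Neither", "Neither", "AS"]
pred = ["AS", "DS", "DS", "DS", "Neither", "AS", "Neither", "AS"]

se = {cls: sensitivity(truth, pred, cls) for cls in ("AS", "DS", "Neither")}
print(se)
print(sum(se.values()))  # an ideal classifier would give 3.0
```

Summing the three values rewards classifiers that do well on every class, since a weak class pulls the total down even when the other two are near 1.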
Methods | SeA | SeD | SeN | SeA+SeD+SeN
AIS-SSMACA | 0.9695 | 0.9512 | 0.9620 | 2.8827
NNtree,1 | 0.9348 | 0.9256 | 0.9306 | 2.7910
Netgene2,2 | 0.9312 | 0.8568 | 0.9263 | 2.7143
HSPL,3 | 0.9494 | 0.9456 | 0.9503 | 2.8453
NNSplice,4 | 0.9256 | 0.9587 | 0.9006 | 2.7849
Genesplicer,2 | 0.9396 | 0.9562 | 0.9784 | 2.8742
SpliceView,5 | 0.9489 | 0.9491 | 0.9300 | 2.8280
Table 2 Comparison of AIS-SSMACA with other methods
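The totals in Table 2 can be recomputed from the per-class values. The sketch below copies the SeA, SeD and SeN figures from Table 2 and ranks the methods by their sum; the dictionary layout and function names are assumptions made for illustration.

```python
# (SeA, SeD, SeN) per method, copied from Table 2.
TABLE2 = {
    "AIS-SSMACA": (0.9695, 0.9512, 0.9620),
    "NNtree": (0.9348, 0.9256, 0.9306),
    "Netgene2": (0.9312, 0.8568, 0.9263),
    "HSPL": (0.9494, 0.9456, 0.9503),
    "NNSplice": (0.9256, 0.9587, 0.9006),
    "Genesplicer": (0.9396, 0.9562, 0.9784),
    "SpliceView": (0.9489, 0.9491, 0.9300),
}


def balance_scores(table):
    """Return SeA + SeD + SeN for each method, rounded to 4 decimals."""
    return {method: round(sum(values), 4) for method, values in table.items()}


def ranking(table):
    """Return the methods ordered from highest to lowest total."""
    scores = balance_scores(table)
    return sorted(scores, key=scores.get, reverse=True)


for method in ranking(TABLE2):
    print(f"{method:12s} {balance_scores(TABLE2)[method]:.4f}")
```

The recomputed totals agree with the last column of Table 2, with AIS-SSMACA (2.8827) first and genesplicer (2.8742) second.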
We have successfully developed a classifier, AIS-SSMACA, that predicts splice sites with an accuracy of 96.06%, which is promising for human DNA sequences of length 60 bp. It can predict acceptor and donor sites even when the acceptor input does not contain AG and the donor input does not contain GT. The average number of nodes, height of the tree and classification time needed to predict splice sites are 4, 3 and 400 ms respectively. In future we wish to extend this work to splice site prediction for various species and for sequences of different lengths.
None.
The author declares no conflict of interest.
©2014 Sree, et al. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work non-commercially.