Mini Review Volume 5 Issue 5
Lawrence Berkeley National Lab, DOE Joint Genome Institute, USA
Correspondence: Georgios A Pavlopoulos, Lawrence Berkeley National Lab, DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA
Received: April 04, 2017 | Published: June 2, 2017
Citation: Pavlopoulos GA. How to cluster protein sequences: tools, tips and commands. MOJ Proteomics Bioinform. 2017;5(5):158-160. DOI: 10.15406/mojpb.2017.05.00174
Proteins are the key molecules that facilitate most biological processes within a cell. Therefore, the discovery, annotation and characterization of them, is of great importance. In System Biology, protein clustering by sequence at a large-scale in order to detect homology, orthology, families, common domains or functional similarities is becoming a great challenge, especially when living in the -Omics era where the exponential growth of sequences produced is indisputable. Despite the great plethora of applications with different strengths to serve this purpose that is available today, a steep learning curve to get familiar with such tools is often required. Users often quit when they get lost in the README files prior to any analysis. To help the community overcome this hesitance, this article describes tools and ways to cluster proteins into groups or families and emphasizes on their basic commands that can be executed in a simple Unix terminal. Notably, both graph-based and sequence-based approaches are described.
Keywords: protein clustering, protein sequences, plasmids, genome
The protein landscape changes continuously as new and hypothetical proteins appear every day. IMG1 today hosts 55,482 Bacterial genomes, 1,580 Archaeal, 258 Eukaryotic, 1,222 Plasmids, 7,521 Viruses, 1,196 genome fragments and 14,265 private and public met genomes and meta transcriptomes. With a very approximate estimation, this corresponds to ~70Million non-redundant proteins at 100% similarity for the isolate side and ~3billion non-redundant proteins for the met genome/metatranscriptome side (coming from scaffolds of length ~500). Release 15-Feb-2017 of UniProtKB/TrEMBL2 contains 77,483,538 sequence entries. This number corresponds to 1,465,039 (2%) Archaeal proteins, 49,717,238 (64%) Bacterial proteins, 22,299,253 (29%) Eukaryotic proteins, 2,918,867 (4%) Viral proteins and 1,083,141 (<1%) others. Moreover, Uniparc3 contains 148,791,725 protein entries. The UniProt Archive (UniParc) is a comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world. Protein families can be characterized by molecules which share significant sequence similarity.4 Notably, this biological problem is very difficult to solve and most available clustering techniques fail in the case of eukaryotic proteins, which contain large numbers of protein domains.5 Nevertheless, ongoing efforts in detecting the best and more accurate protein clustering are still a very active research field. PFAM6 version 31.0 for example, a database of a large collection of protein families, organizes proteins in families by similar domains and includes 16,712 entries. Several tools today, follow various methodologies and strategies to perform protein clustering.7 Outstanding tools such as the CD-HID,8 UCLUST,9 kClust10 and the newly developed MMSEQ/LinClust11 follow a k-mer and dynamic programming-based sequence alignment approach whereas tools such as the MCL12 clustering algorithm and others a network topology based clustering.13–18 In the second case, prior to clustering, a pairwise similarity matrix is required. While such similarities can be calculated in various ways, BLAST+19 and LAST20 are the most widely used. In this article, in order to encourage users getting familiar with several tools and avoid troubleshooting, simple command lines to perform such analyses are provided.
Input File
Prior to any clustering, organization of protein sequences organized in a FASTA file format is required.
Sequence-based clustering
CD-HIT: It clusters proteins into clusters that meet a user-defined similarity threshold. Each cluster has one representative sequence. The input is a protein dataset in fasta format and the output produces two files: a fasta file of representative sequences (centroids) and a text file (.clstr) with a list of clusters. The basic command to cluster the proteins in the example. fasta file with similarity down to 50% is:
cd-hit -example. fasta -o clusters -c 0.5 -n 2.
UCLUST: Given a user defined identity threshold, the UCLUST algorithm divides a set of sequences into clusters and is more effective in clustering proteins down to 50% similarity. The steps are:
uclust --sort example. fasta --output seqs_sorted. fasta
uclust --input seqs_sorted.fasta --ucresults. uc --id 0.5
KCLUST: It is a method to cluster large protein sequence databases such as UniProt within days. It can cluster proteins down to 20%-30% maximum pairwise sequence identity. For example, to cluster a set of proteins proteins down to 50% identity, the basic command is:
kClust -iexample.fasta -d tmp –s 0.5.
KCLUST will create a /tmp folder with the clustering results in it. The headers.dmp file will contain the index for each protein, and the clusters.dmp file the clustering results in a two-column format. The first column contains the index of centroid sequence for each cluster whereas the second column the index of every member of this cluster.
USEARCH: It is a unique sequence analysis tool with thousands of users worldwide. USEARCH offers search and clustering algorithms that are often orders of magnitude faster than BLAST. A typical command to run USEARCH would be:
usearch -cluster_fastexample.fasta -id 0.5 -centroids out.fasta –ucclusters.uc
In this case, example. fasta is the input file whereas out. fastacontains the centroids for each cluster and clusters. uc the final clusters.
MMSEQ/LINCLUST: It claims to be able to cluster billion proteins down to 50% sequence identity in two days. It runs in linear time and has been tested against UCLUST and CD-HIT. The basic commands are:
mmseqs createdb example.fasta DB
mkdir tmp
mmseqs cluster/linclust DB clu tmp --min-seq-id 0.5 --target-cov 0.5
mmseqs createtsv DB DB clu clu.tsv
All results are stored in clu.tsv file. The first column represents the centroid of each cluster whereas the second column every member of this cluster. Thus by using command: cut -f1 clu.tsv | wc -l
one can count the number of clusters.
Graph-based clustering
Another approach to cluster proteins into groups is to take advantage of the topological features of a constructed network containing all pair wise similarities in a binary form. To do that, BLAST+ and LAST tools are recommended. BLAST+ does not scale for huge datasets whereas LAST and mega blast21 tools do. BLAST+ is more sensitive though. For smaller datasets and for much higher quality results at the cost of a much slower speed, ssearch36 or glsearch36 implementation of Smith-Waterman algorithm is recommended.
Similarities
LAST: Step 1: Database building. Assuming that one has a fast a file called example. fasta. First users need to build a database (DB) with the following command
Last db -p -C 2 DB example. fasta.
Step 2: A pair wise similarity matrix must be created and used as an input for graph-based clustering. A typical representation would be a tab-delimited format such as:
Protein A Protein B 0.3
Protein A Protein C 0.2
Protein B Protein D 0.3…..
By using last, in order to create such matrix holding information about proteins down to 50% identity the command must look like:
lastal –m200 -pBLOSUM62 -P 0 -f blasttab DB example.fasta | awk '{if($1!~/^#/ && $3>=50) print $1"\t"$2"\t"($3/100)}' >hits.list
The hits.list will be a 3-column tab delimited file (1st column: source, 2nd column: target, 3rd column: identity).
Another way to create an adjacency similarity matrix is to use BLASTP instead of LAST.
Step 1: Users must create a blast database using the command: makeblastdb -in example.fasta -dbtype 'prot' -out DB
Step 2: The binary adjacency list can be created by querying the DB database like: blastp -query example.fasta -db DB –out fmt '6 std qlenslen' -out output_tmp.list -evalue 1.0e-05
Hits are firstly stored in a BLASTAB file format called output_tmp.list
Step 3: Similarly to the previous example, users can create the similarity matrix, which will be used as a graph clustering input like:
awk '{if($1!~/^#/ && $3>=50) print $1"\t"$2"\t"($3/100)}' output_tmp.list>hits.list
Note that if one wants to take into account the best hits only, then he/she might want to filter the file by best similarity first like:
sort -nrk 3 hits.list | sort -k 1,2 –u >best_hits.list
Network clustering
While many graph clustering algorithms exist, MCL algorithm is by far the most widely used for protein clustering. Notably, MCL is inflation-value sensitive and often memory greedy when it comes to Millions of sequences. To Run MCL with inflation value 1.8, a typical command is:
mcl hits.list –I 1.8 --abc -o clusters.txt
Clusters are stored in clusters.txt. Each line is a cluster and the members in each line are tab separated. Thus, the number of lines corresponds to the number of clusters. To filter singletons and keep the clusters with two or more members for example, users can filter as: awk '{if(NF>=2) print}' clusters.txt >clusters_after_threshold.txt
Similarly to MCL, the highly scalable SPICi density-based clustering algorithm22 can be utilized like:
spici -ihits.list -o out.spici
Density is by default set to 0.5. To keep members of clusters greater than a number, the –s parameter must be used.
Cluster centroids
For the network-based approach, a potential centroid of a cluster could be determined by the number of hits within a cluster above a certain similarity.
In this article, the very basic commands from various tools to perform protein sequence clustering are presented. While this guide might be of a great help for many users, the commands should not be used blindly as all tools are sensitive upon parameterization and results can vary significantly. USEARCH, UCLUST and CD-HIT should perform nearly the same algorithm. They all cluster the shorter sequences to the longer if the global alignment identity >0.5% is fulfilled. However, kClust, MMSeqs and LAST compute sequence identity in a local way. In order to directly compare clustering between each other, Rand Index, Variation of Information, F-Score and other metrics can be used. Such metrics are extensively explained elsewhere.23
This work was partially supported by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, under Contract No. DE-AC02-05CH11231.
The author declares no conflict of interest.
©2017 Pavlopoulos. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work non-commercially.