Proteomics & Bioinformatics

Editorial

When Marc Wilkins coined the word "Proteomics" in 1994,¹ describing a global approach to identify and quantify all proteins in a cell, a tissue, or an organism at a certain state, the idea was not really new. As early as the 1970s, Norman Anderson used 2D gel electrophoresis to visualize the proteins in human plasma.²,³ At that time Edman sequencing was the only method to identify proteins, comprehensive databases of any genome were not available, and each sequenced gene could be published. Even in 1994 the techniques to identify proteins were still immature, and only the genome of yeast was available. At the same time, human genome sequencing became a major program of the US government, described at http://www.genome.gov/12011239. The sequence of the first individual human genome brought a big surprise: instead of the more than 1 million proteins originally expected, only around 22,000 coding sequences were found in the human genome when it was finally finished around 2005. The rest of the genome was named "junk DNA", but nature does not generate energy-wasting junk; its functions have yet to be discovered.⁴ In the meantime many more genomes from different species have been sequenced, as well as thousands of individual human genomes.

Both Genomics and Proteomics made big promises: to revolutionise the development of drugs, to understand diseases, and to identify new biomarkers.

Proteomics wanted to answer the major question Genomics cannot answer: how biology works. But Proteomics is only one part of the bigger picture. To understand the biology, all components have to be known, together with their interactions and their role at a certain state of a biological system. This includes the genes, the RNA, the metabolites, the glycans, the lipids, and the proteins together with their specific modifications and interactions, as well as the spatial distribution of these in the system: an almost endless list.

With the techniques of the early 1990s these questions could never be answered, but they led to a rapid development of new methods allowing comprehensive analysis at the biologically relevant level. One major example is the still ongoing development of mass spectrometry, from the invention of electrospray,⁵ through static nano-electrospray,⁶ to state-of-the-art nano-liquid chromatography coupled to high-resolution mass spectrometers⁷,⁸ with fully automatic fragmentation of peptides to identify proteins in databases. These developments made possible the first draft of the human Proteome, published in 2014.⁹,¹⁰ But this draft Proteome does not give comprehensive answers to biological questions. Interestingly, it revealed some proteins that are not predicted from the genome, while some predicted proteins have not been found yet. It did not, however, tell anything about the actual active form of the proteins, as the post-translational modifications¹¹ were not covered. Neil Kelleher predicted over 1 billion active protein forms, called "Proteoforms",¹² in the human organism. For example, the modifications of Histone H3 alone allow over 50,000 different Proteoforms; few of these have been seen to date, and many may not exist at all. But the published draft Proteome opens new avenues to look into biology and to draw knowledge that leads not only to deep understanding but may also fulfil the promises given at the publication of the human Genome and at the rise of Proteomics.

Big Pharma companies jumped on the bandwagon of Genomics and later Proteomics, soon recognizing that neither resulted in new blockbuster drugs nor explained the development of diseases in a way that allowed fighting them. At the end of the 1990s many big Pharma companies invested in Proteomics facilities, closing most of them down in the early 2000s. Old-school drug development continued, with a single protein as a target and hundreds of thousands to millions of small molecules tested against it in vitro, mainly as inhibitors.
For a while the drug companies complained of having too many targets to follow; then they suddenly ran out of targets, and only very few new drugs were approved. Many drug developments also failed for safety reasons, such as toxic side effects, or because of a lack of efficacy. The approach "one target/one disease/one drug" has failed in most cases. Now the big Pharma industry is rethinking and looking into the phenotypic approach: the effect of small molecules, peptides, antibodies, natural compounds, and proteins, to name only a few, on a whole biological system, such as a cell, an animal model, or an organ. This is the point where Proteomics comes back into the game. To understand the biological effect of a developmental drug, the targets have to be identified in an unbiased manner and their role in the biological system has to be well understood. Chemical Proteomics¹³,¹⁴ allows this unbiased identification of the protein targets in complex biological systems, as well as of the off-targets and the reasons for toxicity. Interesting developments have been published in recent years and are drawing the attention of big Pharma, which may include them in its drug development pipelines. The dramatic teratogenic side effect of Thalidomide¹⁵ could be explained using Chemical Proteomics approaches, 50 years after the drug was taken off the market.

Another promising but so far largely unsuccessful field is the identification of relevant biomarkers for the early diagnosis of diseases. While over 25,000 papers on protein biomarkers have been published, only very few biomarkers have made their way into the clinic. There are many reasons for this failure, such as study design, cohort selection, the number of individuals, the definition of the disease, ethnicity, and the final validation of the proposed biomarkers. Here a long-used method has made its way into Proteomics: targeted mass spectrometry, also known as multiple or selected reaction monitoring (MRM or SRM), or the Western blot of mass spectrometry.¹⁶ MRM/SRM allows the detection and quantification of proteins in complex biological mixtures, such as serum or urine, based on proteotypic tryptic peptides that are representative of the targeted protein. With the now publicly available data from the draft human Proteome and from many other projects, these peptides can be selected for almost all human proteins.¹⁷ With state-of-the-art instruments, hundreds and even thousands of proteins can be detected and quantified in a single run in complex samples, overcoming the limitations of antibodies. Proposed biomarkers can now be validated in larger cohorts, and the first commercial companies already offer these services.
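
The idea of choosing tryptic peptides that represent a single protein can be illustrated with a minimal Python sketch. It is only an approximation under stated assumptions: the two protein sequences and the names `toy_A` and `toy_B` are invented for illustration, trypsin is modelled by the common simplified rule (cleavage after K or R unless the next residue is P, no missed cleavages), and the length window of 7 to 25 residues is an arbitrary choice. Real proteotypic peptide selection also depends on whether a peptide is reliably observed by mass spectrometry, which this sketch does not model.

```python
import re
from collections import defaultdict


def tryptic_digest(sequence):
    """In-silico trypsin digest: cleave after K or R unless followed by P.

    Simplified rule, no missed cleavages (an assumption of this sketch).
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def unique_peptides(proteins, min_len=7, max_len=25):
    """Return, per protein, the peptides found in no other protein.

    `proteins` maps a protein name to its amino acid sequence.
    The length window is a hypothetical filter, not a fixed standard.
    """
    owners = defaultdict(set)  # peptide -> names of proteins containing it
    for name, seq in proteins.items():
        for pep in tryptic_digest(seq):
            owners[pep].add(name)

    result = defaultdict(list)
    for pep, names in owners.items():
        if len(names) == 1 and min_len <= len(pep) <= max_len:
            result[next(iter(names))].append(pep)
    return dict(result)


# Invented toy sequences; they share the peptide LGSPEELVK.
toy_proteins = {
    "toy_A": "MKTAYIAKQRLGSPEELVKAAGTDLSR",
    "toy_B": "MKLGSPEELVKVVNDTWSGR",
}

candidates = unique_peptides(toy_proteins)
for protein, peptides in candidates.items():
    print(protein, peptides)
```

In this toy case the shared peptide LGSPEELVK is rejected because it cannot distinguish the two proteins, while short fragments such as MK are removed by the length filter.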

But there are still many open questions that require both technical and computational bioinformatics developments.

Of the more than 600 known protein modifications, only a few, such as phosphorylation,¹⁸ have been tackled on a larger scale. The challenge of analysing highly complex modifications, such as glycosylation, is not solved yet, and many other modifications have not even been tackled or have proved inaccessible to current techniques. Here lies a big challenge for the future: understanding biology requires understanding how the active proteins are modified and what the modifications affect.

The other big challenge is the overwhelming amount of information already available. A lot of hidden knowledge lies here, difficult to excavate and to turn into action. The data and information are spread over many databases, from literature to sequences. Combining these in an easily searchable manner will help researchers to plan meaningful experiments leading to new knowledge in understanding biology and finally to new insights into diseases.

The identification of proteins from mass spectrometric data, especially MS/MS data, also requires new approaches. The commonly used MOWSE¹⁹ has shown many drawbacks; mutations and multiple modifications may go undetected using this approach. New approaches, such as de novo sequencing followed by database searches, may overcome these limitations.²⁰

After a downturn of Proteomics in the early 2000s, due to unfulfilled promises, the field is on a new and promising rise. Many mistakes have been made; researchers should learn from them and avoid repeating them.

Marvin Vestal once stated (literally): "Genomics took 20 years, Proteomics will take 200 years", a very optimistic view.