Arrival of the Fittest: Solving Evolution's Greatest Puzzle
Author: Andreas Wagner
In chapter 3, we saw how nature continues to create ever-novel sequences of chemical reactions, by combining and recombining metabolic enzymes through horizontal gene transfer. But that is not how metabolic enzymes themselves first appeared. As the last few examples showed, nature creates new proteins, including every one of the known five-thousand-plus enzymes, by altering the amino acid sequence of their protein ancestors. That’s also how it created the countless proteins that regulate genes, ferry materials, contract muscles, transport oxygen, import nutrients, export waste, communicate between cells, and perform a thousand other tasks. Entire books could be written—have been written—that describe a few such innovations in great detail.
This book is not among them.
You cannot understand what made all these innovations possible through anecdotes (an antifreeze protein here, an opsin there) any more than you can draw a map of the United States with satellite images of a few counties. The task requires comparing many old proteins and the new ones they brought forth. Thousands of them.
This task is made easier if one can read the DNA of genes or the amino acid strings they encode—the genotypes of proteins.
23
Among the first to learn to read both was the British biochemist Frederick Sanger, one of the few scientists to win two Nobel Prizes: the first for deciphering the amino acid sequence of insulin, the second for techniques to read the letter sequence of DNA. His discoveries came decades earlier than our ability to read the genotypes of metabolisms, and we therefore know many more protein genotypes and phenotypes.
24
They hail from organisms that live in Arctic wastelands and tropical jungles, on mountaintops and in ocean depths, in our gut and in boiling hot springs, in barren deserts and in fertile soil, in filthy sewers and in pristine rivers.
Without organization, this giant heap of protein facts would be like a million shuffled words in a madman’s dictionary, but once organized, it becomes part of a library just like the gigantic metabolic library from chapter 3. The volumes in this universal library are protein genotypes, texts written in a twenty-letter alphabet, where each letter corresponds to one amino acid. The universal protein library is the collection of all proteins that life has created, and all proteins that it could create. It is sometimes also called a protein space or a sequence space, because each text corresponds to a single sequence of amino acids.
25
The size of this library is no less staggering than that of the metabolic library, as an already familiar calculation helps us see. Recall that there are 20 × 20, or 400, possible two-letter texts using one of twenty possible amino acids. Similarly, there are 20 × 20 × 20, or 8,000, texts of three amino acids, 160,000 texts of four amino acids, and so on. Short texts like this are called peptides, but most proteins comprise much longer texts (polypeptides), and the number of possible amino acid texts explodes with their length, such that the number of proteins with merely a hundred amino acids is already greater than a 1 with 130 trailing zeroes. But the library is larger than even this unimaginably large number, because proteins like sucrase have more than a thousand amino acids, and some human proteins are many times longer. (Among them is a behemoth called titin, a 30,000-amino-acid-long protein spring in our muscles.)
26
The universal library of proteins is another library of hyperastronomical size.
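The counting argument above is easy to check with exact integer arithmetic. What follows is a minimal Python sketch; the names `library_size` and `order_of_magnitude` are illustrative choices and do not come from the book.

```python
AMINO_ACIDS = 20  # size of the protein alphabet, as stated in the text


def library_size(length: int) -> int:
    """Number of distinct amino acid texts with exactly `length` letters."""
    return AMINO_ACIDS ** length


def order_of_magnitude(n: int) -> int:
    """Number of digits minus one, which equals floor(log10(n)) for n >= 1."""
    return len(str(n)) - 1


if __name__ == "__main__":
    for length in (2, 3, 4, 100):
        size = library_size(length)
        print(f"length {length}: about 10^{order_of_magnitude(size)} texts")
```

For a length of 100 the exact count has 131 digits, so it is indeed greater than a 1 with 130 trailing zeroes.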
The similarity to metabolism does not end with the size of this library. Like the metabolic library, the protein library is a high-dimensional cube, with similar texts near one another. Each protein text perches on one vertex of this hypercube, and just like in the metabolic library, each protein has many immediate neighbors, proteins that differ from it in exactly one letter and that occupy adjacent corners of the hypercube.
27
If you wanted to change the first of the amino acids in a protein comprising merely a hundred amino acids, you would have nineteen other amino acids to choose from, yielding nineteen neighbors that differ from the protein in the first amino acid. By the same process, the protein has nineteen neighbors that differ from it in the second amino acid, nineteen neighbors that differ from it in the third, the fourth, the fifth, and all the way through the hundredth amino acid. So all in all, our protein has 100 × 19 or 1,900 immediate neighbors. A neighborhood like this is already large, and it would be even larger if you changed not one but two or more amino acids. Clearly, this can’t be bad for innovation: With one or a few amino acid changes, evolution can explore many proteins.
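The neighborhood count can be reproduced in a few lines. This is a sketch under stated assumptions: the alphabet is written with the standard one-letter amino acid codes, and the function names `one_letter_neighbors` and `neighbor_count` are hypothetical helpers introduced only for illustration.

```python
from math import comb

# One-letter codes of the twenty standard amino acids.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def one_letter_neighbors(protein: str):
    """Yield every text that differs from `protein` in exactly one position."""
    for i, original in enumerate(protein):
        for letter in ALPHABET:
            if letter != original:
                yield protein[:i] + letter + protein[i + 1:]


def neighbor_count(length: int, changes: int = 1) -> int:
    """Number of texts differing from a given text in exactly `changes` positions."""
    # Choose which positions change, then pick one of 19 other letters for each.
    return comb(length, changes) * (len(ALPHABET) - 1) ** changes
```

Here `neighbor_count(100)` gives the 1,900 immediate neighbors from the text, and `neighbor_count(100, 2)` gives 1,786,950 texts that differ in exactly two amino acids, which shows how quickly the neighborhood grows.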
In another parallel to the metabolic library, you would get lost wandering through this library’s maze unless you had an unrolling skein of wool to gauge how far you traveled. Once again, a notion of distance serves this purpose. It is the number of amino acids by which two proteins differ. It tells you how far you need to walk (how many amino acids you need to change) to travel from one protein text to any other.
28
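The distance just described, the number of positions at which two texts differ, can be sketched as follows. The sketch assumes the two proteins have the same length, which the definition in the text does not spell out; the function name `distance` is an illustrative choice.

```python
def distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length amino acid texts differ."""
    if len(a) != len(b):
        # Assumption of this sketch: only equal-length texts are compared.
        raise ValueError("this sketch only compares texts of equal length")
    return sum(x != y for x, y in zip(a, b))
```

Two immediate neighbors in the library have distance 1, and a text has distance 0 only from itself.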
The texts in this library are important, but even more important is the meaning each one carries. Our eyes cannot read this meaning, the words, sentences, and paragraphs of a protein’s chemical language, but life is fluent in this language. And it can tell whether a protein is meaningful or embodies jumbled chemical ramblings.
Cells take a hard-nosed view on which proteins are meaningful: those that help them live. A protein is meaningful only if it is useful, and defective mutant proteins that do not fold properly have lost their meaning. If a protein’s “meaning” feels too anthropocentric a word, it is worth reminding ourselves how “meaning” is defined by semiotics, an offshoot of linguistics that explores the meaning of meaning: whatever a sign—which could be anything from a road sign to a book’s text—points to. If that sign is a protein’s amino acid text, then the meaning it encodes is the protein’s phenotype and the function it serves inside a cell.
29
We still do not know how many meaningful books a universal library of books would contain, but decades of research allow us to estimate this number for proteins, because most useful proteins fold into a specific shape. If you blindly took a random protein from a random shelf in the library, the odds that it folds are at least one in ten thousand. That may not seem much, but keep in mind how vast the library is, containing more than 10¹³⁰ proteins of a hundred amino acids. Even if only one in ten thousand of them folds, you are still left with 10¹²⁶ proteins, a 1 trailed by 126 zeroes, much greater than the number of hydrogen atoms in the universe. The number of meaningful proteins is itself large beyond imagination.
30
Evolution explores the protein library through huge populations of organisms. Their proteins change, one amino acid at a time, with the occasional copying errors that alter a DNA string’s letters—A to C, T to G, or in any other way—as this string replicates generation after generation. To understand how such change creates texts with new and useful meaning, we need to map the protein library like we mapped the metabolic library. This is less difficult than it seems: Thanks to decades of work by armies of protein scientists, we know the folds and functions of tens of thousands of proteins and their place in the library. What’s more, the technologies of twentieth-century molecular biology allow us to take any volume off its shelf—to manufacture any protein—and study its fold and function in the laboratory.
The simplest question about innovability in proteins is one we encountered before. How hard is it to find a protein with any one meaning, one whose function helps an organism to survive? If there is only one of it in the library, even the eons elapsed since the Big Bang would not suffice to find it. Since meaningful proteins exist in huge numbers, just about every problem that life solved with a protein innovation must have more than one solution. But how many?
In 2001, Anthony Keefe and Jack Szostak from Harvard University set out to answer this question for a family of proteins whose invention was as crucial as any in life’s history: the proteins that can bind the ATP that we encountered as life’s battery in chapter 2. Proteins that carry out work—they transport materials, contract muscles, build new molecules—cleave ATP, and in doing so, harness its energy for this work.
31
To use ATP’s energy, a protein first needs to bind ATP. If only one protein in the vast protein library were able to bind ATP, then searching blindly for it would be futile. Its discovery would require a miracle. To find out how rare ATP-binding proteins are in the library, Keefe and Szostak used a chemical technology that can create many different proteins, each one with a different and completely random amino acid string, a process equivalent to pulling random volumes from the shelves of the protein library. The random proteins these researchers generated were all eighty amino acids long. Because there are more than 10¹⁰⁴ such proteins, no experiment could create all of them, but this one created an impressive number, about 6 trillion, or 6 × 10¹² random proteins.
Keefe and Szostak found that four of them, unrelated to one another, can bind ATP. Four new ATP-binding proteins out of six trillion doesn’t sound like too many, but when the proportions are extrapolated to the number of potential candidates, the number is much larger. It comes out to more than 10⁹³ proteins (a 1 with 93 zeroes) that can bind ATP. The problem of binding ATP has astronomically many solutions.
32
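To get a feel for the scale of the experiment described above, the following Python sketch draws random eighty-letter texts and computes what share of the eighty-letter library six trillion samples would cover. It is only an illustration: the real experiment used a chemical technology rather than software, the assumption that every amino acid is equally likely at each position is mine, and the names `random_protein` and `sampled_fraction` are hypothetical.

```python
import random
from fractions import Fraction

# One-letter codes of the twenty standard amino acids.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
LENGTH = 80              # protein length used in the experiment, per the text
SAMPLED = 6 * 10 ** 12   # about six trillion random proteins, per the text


def random_protein(rng: random.Random, length: int = LENGTH) -> str:
    """Pull one random volume off the shelf (uniform letters: an assumption)."""
    return "".join(rng.choices(ALPHABET, k=length))


def sampled_fraction(sampled: int = SAMPLED, length: int = LENGTH) -> Fraction:
    """Exact share of all texts of this length covered by `sampled` distinct texts."""
    return Fraction(sampled, len(ALPHABET) ** length)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    print(random_protein(rng))
    print(float(sampled_fraction()))
```

Even six trillion volumes cover a vanishingly small share of the shelves, which is why the result has to be extrapolated rather than counted directly.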
John Reidhaar-Olson and Robert Sauer from the Massachusetts Institute of Technology took a different tack on the same problem. They focused on a regulator protein that can shut down genes in a virus that infects bacteria. The DNA of this virus, bacteriophage lambda, encodes proteins that help it replicate and kill its host bacterium. But this virus can also remain dormant inside the bacterium, using this off switch to shut down its genes until the time is ripe to replicate and kill the host. This time usually comes when the host falls on hard times: starved, poisoned by antibiotics, or irradiated with too much ultraviolet light. The virus then starts to replicate, and its children abandon the cell, rats scurrying from the proverbial sinking ship.
33
Reidhaar-Olson and Sauer explored a neighborhood of the protein library near this viral off switch, creating many random amino acid sequences in this neighborhood, and asked which of them produced a switch that works, one that can shut down the viral genes. From this information, they calculated that more than 10⁵⁰ texts in the library encode a working off switch. When they tried a similar approach on a different protein, an enzyme needed to synthesize amino acids, they found that some 10⁹⁶ amino acid strings can do this enzyme’s job.
34
Nature’s antifreeze proteins gave us a hint, and laboratory experiments like these prove it: Problems like that of binding ATP, shutting down a virus, or catalyzing a chemical reaction don’t have just one solution. Or even a million solutions. They have astronomically many solutions, each embodied by a different volume in the protein library.
35
To imagine the sheer number of these solutions is difficult, but that says more about the limits of our imagination than about life’s innovability.