Here Is a Human Being
Authors: Misha Angrist
As I watched Kevin’s technicians, Ryan Campbell and Linda Hong, go through the steps of sample preparation and bridge amplification, it struck me as an intense, monastic exercise: it demanded concentration and meticulousness, but was also repetitive and trance-inducing.
The next day we were ready to sequence. The lab techs made the DNA single-stranded again so that they could attach a primer specifically for sequencing. Campbell, a tall and shaggy guy, opened the blue Plexiglas door to the Genome Analyzer IIx, itself a big blue box. He complained about how difficult it was to get his hands into the machine to load the flow cell. He checked to see if the reagent tubes were pumping—the presence of bubbles meant the chemicals weren’t smoothly flowing in and out of the GAIIx. Inside were a glass prism and a camera that would image my DNA. The sequencer would take test photos and use that information to allow the user to focus the camera in three dimensions. “If it’s not focused, you’re screwed,” said Kevin. “Once it is, then you can go to town: you can start the sequencing cycles.” The first cycle of sequencing would incorporate a single fluorescent nucleotide in billions of reactions across the flow cell, followed by high-resolution imaging of the entire flow cell—this was what was meant by massively parallel sequencing. After the first cycle we looked at how it was going: the computer attached to the sequencer reported that we were getting about 180,000 clusters per tile—not bad. The machine would now repeat this seventy-five more times, one base at a time, generating a series of images with each one showing a single-base extension at a specific cluster. On the screen the images of clusters looked exactly like the images of polonies I had seen on the jerry-rigged computers in George’s lab in 2006: a starry night. After seventy-five cycles, the Shianna team would repeat the process another seventy-five times, starting this time from the other end of each DNA molecule. So-called paired end sequencing was more accurate and had become an industry standard. This process would take about three days.
After that, the sequencing was done; now it was up to three computer programs named after birds to read it. The first, Firecrest, converted the images into numbers. The second, Bustard, took each number and assigned a base—A, G, T, or C—to it. The third, Gerald, filtered the results and aligned them to the reference sequence put together by the Human Genome Project. Sequence deemed to be of high quality was kept; low-quality data were thrown away. Another program, the Burrows-Wheeler Aligner, would carry out more refined sequence alignments.
The Shianna lab managed to get about 80 billion of my base pairs at high quality: mine would be a 26x genome (80 billion base pairs ÷ 3 billion base pairs per genome = a genome that has been read 26 times over). When one filtered out duplicate PCR reactions and other meaningless genomic detritus, it was about 23x. Not the shiniest genome ever sequenced and certainly a level of quality that would be scoffed at before too long, but pretty good given the 2009 state of the art. Generating my sequence at 25-fold redundancy took about a week from start to finish. “The GAIIx is so easy,” said Kevin. “Suddenly we’re a genome center, just like that.”
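The coverage arithmetic above is simple enough to sketch in a few lines of Python. This is only an illustration: the function name is hypothetical, and the round 3-billion-base genome size is the figure quoted in the text, not a precise measurement.

```python
def fold_coverage(total_bases_sequenced: float, genome_size: float = 3e9) -> float:
    """Average number of times each position in the genome was read.

    genome_size defaults to the round 3 billion bases used in the text
    (an approximation, assumed here for illustration).
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases_sequenced / genome_size


# 80 billion high-quality bases, as reported for this genome.
raw = fold_coverage(80e9)
print(f"{raw:.1f}x")  # 26.7x, which the text truncates to 26x
```

The same division explains why removing duplicates lowers the figure: fewer usable bases in the numerator, same genome in the denominator.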
A few days later I went to the lab and pulled up a chair. Reggae pulsated in the background along with the white noise of the sequencers’ fans. I was there to discuss the gestalt of my sequencing runs with lead technician Jason Smith. He came to the Shianna lab in 2006; he was Kevin’s first employee. He had a ruffle of close-cropped, dark blond hair, glasses, and a beard. The serious academic look belied an easygoing nature, although he spoke faster than any other southerner I’ve ever met. Raised in Greenville, North Carolina, he had studied biology at North Carolina State. When I asked him how he became a next-generation sequencing guy, he shrugged and said it was mostly by chance. “I enjoy working with computers.”
I brought along the spreadsheet Jason had emailed to me, a blur of numbers and acronyms. “I always look at the error rate first,” he said. “Generally the most we can tolerate is two percent. Yours is a little high—right at two percent—because these defective reagents are causing a faster decay of the intensity.” He explained that the 2 percent figure was itself inflated because it included legitimate SNPs that, until they could be identified, would appear to the computer to be errors. Because Illumina read lengths were still a fairly short seventy-five base pairs, aligning them to a reference genome was trickier than with old-fashioned Sanger sequencing and its 800-base-pair reads, especially if there were too many errors (recall the jigsaw puzzle analogy). When I was sequenced, Illumina’s specs called for a 1.5 percent error rate or better.
Like most people, I had about 3–4 million SNPs, so the true error rate—bases that were actually “wrong” and not actual variants in my genome—would be well below 1 percent. And, like most everyone else, about 80 percent of my reads aligned to the reference sequence, that is, the composite genome that came out of the Human Genome Project circa 2003 and has been periodically tweaked and updated ever since.
What about the other 20 percent that didn’t align? I wondered. Is one-fifth of our DNA from extraterrestrials? Jason said the unaligned sequence was a combination of any number of things: true sequencing errors, uncalled bases the software couldn’t recognize (labeled simply “N”), or multiple SNPs in the 75-base-pair chunks spit out by the machine and which, again, the computer would think were errors and be reluctant to align with the reference sequence.
The software would reject bases it couldn’t read based on their low intensities. Such bases were said to have “Failed Chastity,” Chastity being the whimsical name of the filtering program. The remaining “chaste” data were now ready for SNP calling, the process of distinguishing errors from real variants.
In the lab Kevin asked me to hit the delete key on the computer that sat next to the machine that had sequenced my genome. The computer informed me that the 1.1-terabyte file was too big for the recycle bin and did I really want to permanently delete it? I said yes and within a minute it was gone. “Congratulations, you just deleted twelve thousand dollars,” said Kevin. “And a week’s worth of work.”
I moved into the interpretation realm down the hall. When I knocked on Dongliang Ge’s doorjamb he immediately put down his lunch and jumped out of his chair uttering apologies. He was short and well groomed: black hair parted on one side, pressed khakis, and a button-down shirt. He was good-natured, perhaps a little high-strung, and spoke with a bit of a stutter, which probably had something to do with English not being his first language. He trained as a biostatistician and genetic epidemiologist in Beijing. On his Duke faculty website was a quote from Albert Einstein: “Out of clutter, find simplicity. From discord, find harmony.”
His crowning achievement, at least in his young career thus far, was a tool designed to extract the simplicity and harmony from human genomes. He had developed a suite of Java-based software tools that provided a user-friendly interface to annotate, visualize, and help interpret the reams of data emerging from whole-genome association and sequencing studies.
These programs were so useful to the lab that they had “saved our asses,” according to David. Dongliang began to tutor me on the use of the Sequence Variant Analyzer, which let one look at the genome at any level of resolution: from an individual SNP to an entire set of twenty-three chromosomes.
“I want to know what are the genes related to the SNPs in your genome,” he explained. “What is the ontology? What are the pathways? We can’t get that from text files of sequence. We want to know the likely functional impact of variants in your protein-coding sequences. An amino acid change may not matter at all or it could be of great biological consequence.”
He gave me the caveats. “Of course, you are but a single, healthy individual. We can’t do real statistical testing just on you. We can look for places where genes appear to stop prematurely. We can look at large structural rearrangements of your DNA. But we will probably not know what is causal.”
“That’s okay,” I told Dongliang. I had long since consented to uncertainty.
“Let’s load your genome,” he said. Even he seemed a little excited; I was the first sequenced genome he knew personally. Before we could do that, however, we had to dial up the warp drive. Dongliang pressed the button on the CPU under his desk and a loud whirring noise could be heard. His PC was essentially a server: it had forty-eight gigabytes of RAM and a massive hard drive.
In time, everyone in the Goldstein lab would have this type of computing power, which, by the time you read this, might not seem like such a big deal.
Dongliang explained to me that we could look at three files, each of which we would then filter so as to cull my genome of noise and things we were unlikely to be interested in, that is, the vast majority of the genome. The first file would contain all of my 3.6 million SNPs. The second would list the approximately 700,000 insertions and deletions (“indels”) that were less than 75 base pairs, or about the length of a single read on the Illumina sequencer; any insertions and deletions longer than that could not be observed directly—they could only be “predicted” with lower confidence. These larger, less certain disruptions were in the third file. Another file harbored all of the 26x sequencing reads in total, but as I found out rather quickly, it was simply too large and overwhelming to browse unless you knew exactly what you were looking for … or had way too much time on your hands.
The loading took about ten minutes. “We will first perform quality control on your SNPs. We have a consensus score that reflects a number of parameters, including how well the short reads can be aligned to the reference and how confidently we can call SNPs.” A bar graph popped up on the screen. “Your curve is the typical normal curve,” he said. “That’s good.”
“How does the program know what’s a real SNP versus what’s an error?” I asked.
“If we see it in at least three separate reads, then we are confident it’s not a mistake,” said Dongliang. This criterion immediately got rid of four hundred thousand false SNPs. He did the same for small indels. “Now we can explore,” he said. On the screen appeared an animated ideogram of the X chromosome: a horizontal oblong pinched slightly off-center (the centromere, where chromosomes pair with each other) and marked with black and gray irregular vertical stripes representing the way chromosomes take up stain under a microscope.
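Dongliang’s three-read rule can be sketched as a simple threshold filter. What follows is a minimal illustration, not the Sequence Variant Analyzer’s actual logic: the record layout, the function name, and the example positions are all hypothetical, and only the cutoff of three reads comes from the conversation above.

```python
from typing import List, NamedTuple


class CandidateSNP(NamedTuple):
    chrom: str
    pos: int               # made-up coordinates, for illustration only
    ref: str               # base in the reference genome
    alt: str               # base observed in the reads
    supporting_reads: int  # separate reads that show the alt base


MIN_SUPPORTING_READS = 3  # "at least three separate reads"


def filter_snps(candidates: List[CandidateSNP],
                min_reads: int = MIN_SUPPORTING_READS) -> List[CandidateSNP]:
    """Keep only candidates seen in at least min_reads separate reads."""
    return [c for c in candidates if c.supporting_reads >= min_reads]


candidates = [
    CandidateSNP("chrX", 1000, "A", "G", 12),
    CandidateSNP("chrX", 2000, "C", "T", 2),   # too few reads: treated as likely error
    CandidateSNP("chrX", 3000, "G", "A", 3),   # exactly at the threshold: kept
]
kept = filter_snps(candidates)
print([c.pos for c in kept])  # [1000, 3000]
```

A variant that appears in only one or two reads is more plausibly a sequencing mistake than a real difference, which is why a fixed read-count cutoff can discard hundreds of thousands of candidates at once.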
He zoomed in and pulled up the Factor VIII gene. Since he had been working with hemophilia samples, he was familiar with this gene: it makes an essential blood-clotting protein, and mutations in it cause classic hemophilia.
He studied the different-colored vertical dashes that represented variations along the rectangular diagram of the gene. “Well, you have nothing in any of the protein-coding exons for Factor VIII. Nor do you have any functional indels. That’s good,” he said matter-of-factly, “because otherwise you would be expected to have hemophilia.”
I imagined old-school clinical geneticists and genetic counselors cringing at the casual, cavalier nature of the discussion we were having. What if Dongliang had seen something unusual? Had he been melodramatic, he might well have said, “Oh my God! You’d better go see a hematologist … stat!” The Watson and Venter genomes had been curiosities—privileged men who had played major roles in the elucidation of the human genome. Maybe there was a sense among some that they were somehow entitled to their genomes and deserved to be first in line. But now, with people like me and the PGP-100 on the horizon, the hordes of mostly healthy barbarians were starting to arrive.
“Now let’s compare your Factor VIII to one of the hemophilia genomes,” Dongliang said. “This individual has a fourteen-base-pair indel in a coding exon. This is a very bad thing.” The hemophilia patient’s Factor VIII gene was shaded blue and black like mine … until two-thirds along the length of the gene; there the Sequence Variant Analyzer had shaded it green. “Green is everything after the indel,” said Dongliang, “which causes a premature stop.”
Our genetic code operates in triplets called codons: the consecutive bases AAA code for the amino acid lysine. TAC codes for tyrosine. AGG codes for arginine, and so on—there are sixty-four different triplets (three-letter combinations of A, C, T, and G), sixty-one of which code for the twenty amino acids used to make a human (the code is redundant, so most amino acids are coded for by more than one codon). Three of the triplets, however, do not code for an amino acid. They are called stop codons. They signal the end of the coding part of the gene and, therefore, of the protein. You can imagine the codons I mentioned above lined up consecutively in a gene: AAA TAC AGG. In the corresponding protein (the average protein is about four hundred amino acids long), one would see the chain of amino acids lysine-tyrosine-arginine resulting from these three codons. Now imagine that a cellular accident occurred and the three bases before the last one were deleted. In the gene we would now have: AAA TA - - - G, with the dashes representing the deleted bases. As the cellular machinery transcribed this DNA sequence into RNA, which would then serve as the template to make the protein coded for by this gene, it would read the mistaken sequence that had suffered the deletion: AAA TAG. The first amino acid, lysine (AAA), would be read just fine. But the codon “TAG” does not code for an amino acid; it is a stop codon. The protein would simply end there. This is exactly what happened in the case Dongliang showed me.
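The deletion example above can be walked through in code. This is a toy sketch: the codon table holds only the four triplets named in the passage (the real genetic code has sixty-four), and the function name is invented for illustration.

```python
from typing import List

# Only the codons mentioned in the passage; the full table has 64 entries.
CODON_TABLE = {
    "AAA": "lysine",
    "TAC": "tyrosine",
    "AGG": "arginine",
    "TAG": "STOP",
}


def translate(dna: str) -> List[str]:
    """Read the sequence three bases at a time, halting at a stop codon."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    for i in range(0, usable, 3):
        residue = CODON_TABLE[dna[i:i + 3]]
        if residue == "STOP":
            break
        protein.append(residue)
    return protein


normal = "AAATACAGG"                 # AAA TAC AGG
mutant = normal[:5] + normal[8:]     # delete the three bases before the last one

print(mutant)             # AAATAG
print(translate(normal))  # ['lysine', 'tyrosine', 'arginine']
print(translate(mutant))  # ['lysine']
```

The mutant chain stops after a single amino acid because the deletion happens to splice together a TAG, not because any of the original codons was a stop signal.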
“This individual has lost all of his Factor VIII exons after exon fourteen. That causes hemophilia A. I think your genome, on the other hand, is normal,” he said. “At least for Factor VIII.”