Life's Greatest Secret (36 page)

Authors: Matthew Cobb

Venter’s group then used a different approach, which was initially opposed by the publicly funded researchers but ended up dominating the field and has since been used in sequencing the genomes of many organisms. Known as shotgun sequencing, the technique involves identifying the bases on many short pieces of DNA and then assembling them into huge long sequences. Sequencing short stretches of DNA is easier, but it leads to a substantial difficulty: which of the resultant hundreds of thousands of short sequences follows on from which – how to reassemble the puzzle? This was particularly problematic when it came to dealing with the immense stretches of apparently functionless DNA to be found between genes, which could consist of featureless repetitions of two bases, such as ACACACAC….
To resolve this problem, Venter and his Celera colleagues enlisted computer scientists to develop algorithms for assembling the sequence, and they were able to prove the validity of their approach with the Drosophila genome. Despite hostility from many scientists around the world, Venter was probably right to argue that this method would make it possible to complete the project. Nevertheless, problems remained – even with the cleverest algorithms in the world, it is not possible to join up all of the bits of sequences. To get over this problem, recalcitrant parts of the genome were amplified in bacteria to try and bridge the gap. This does not always work – some sections of the human and the Drosophila genomes have still not been joined up, fifteen years after the sequences were published.
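The reassembly puzzle described above can be illustrated with a toy sketch. It is not Celera's algorithm; it is a minimal greedy overlap merge, written under the assumption that reads are error-free and all come from the same strand. The function names, the `min_len` threshold and the example sequences are invented for illustration. The second example shows why featureless repeats such as ACACACAC are troublesome: the reads fit together in more than one way, and the greedy merge silently picks the shortest.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap.

    Stops when no pair overlaps by at least `min_len` bases and returns
    the remaining fragments (contigs).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Hypothetical reads with unambiguous overlaps: the source is recovered.
unique_reads = ["ATGGCGTA", "GCGTACCT", "ACCTTGA"]
print(greedy_assemble(unique_reads))

# Hypothetical reads spanning an AC repeat. Both come from GGACACACACTT,
# but they overlap by 6, 4 or 2 bases equally plausibly, so the greedy
# merge collapses the repeat and returns a shorter sequence.
repeat_reads = ["GGACACAC", "ACACACTT"]
print(greedy_assemble(repeat_reads))
```

With the first set of reads the sketch rebuilds the 14-base source sequence. With the second it returns a 10-base sequence even though the reads were taken from a 12-base one, which is the kind of ambiguity that real assemblers need extra information to resolve.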
Despite the continuing clashes, the completion of the draft human genome was announced by President Bill Clinton in 2000, even though it was in fact nowhere near finished. Collins and Venter stood on either side of Clinton in the White House, while the ambassadors of the UK, Japan, Germany and France were in the audience. Meanwhile Tony Blair, along with Fred Sanger, Max Perutz and other British scientists, appeared at the end of a video feed from Downing Street. In a jarring counterpoint to the celebration of human ingenuity and the power of evolution that was on display, Clinton claimed that ‘Today we are learning the language in which God created life’. [41] Collins and Blair, both devout Christians, presumably concurred.
The draft sequence was published in 2001 in two versions: the Celera genome appeared in Science, and the publicly funded version was published in Nature. [42] The International Human Genome Sequencing Consortium sequence is now taken as the definitive version, and was initially a mosaic of information from more than 100 individuals who contributed DNA to the publicly funded project (one of whom was Jim Watson) and the five individuals used in the Celera effort (one of whom was Craig Venter). It continues to be updated as genes are more effectively annotated, and functions or similarities can be more reliably located to particular stretches of DNA.
However, there is no such thing as ‘the’ human genome. On average, each of our genomes differs by about one base pair in a thousand, so by about 3 million base pairs in total. Most of those differences are not in coding DNA, and those differences that are in coding sequences are generally silent – they do not alter our amino acid sequences. Nevertheless, the overall structure of the human genome, its mixture of coding and non-coding sequences, and the way in which the coding sequences are expressed in time and space, form part of what it is to be human. And as the publicly funded researchers intended, the human genome is a public good, open to all and freely accessible on the Internet, out of reach of the patent lawyers – in 2013, after years of argument, the US Supreme Court finally ruled that no human genes could be patented, striking down patents that had been awarded to Myriad Genetics for use in diagnostic testing for the BRCA1 gene (mutations in this gene can increase the risk of breast cancer). [43] That situation may change: in 2014, the Australian courts supported Myriad Genetics’s claim that human gene sequences could be patented. [44]
*
Since the beginning of the twenty-first century and the triumph of the Human Genome Project, genome sequencing has been transformed from a highly complex international affair, immensely costly in terms of people and money, into something that can be undertaken by relatively small groups of researchers, interested in the most obscure organisms. Behind this change has been the appearance of what are called next-generation sequencing techniques based on robotics and powerful computers that were developed after the human genome sequence was completed. [45] The best sequencers available at the turn of the century used Sanger sequencing to simultaneously sequence about 100 stretches of DNA, each stretch producing reads that were up to 800 bases long. Next-generation sequencing is very different; it uses a variety of techniques to enable the machine to detect each base as it is incorporated into a new chain of DNA during DNA replication. The technology is continually being upgraded; as of 2014, hundreds of thousands of short strings of DNA – each 75–125 bases long – can be simultaneously sequenced, meaning that millions of bases can be detected in a second (when I was hand-sequencing in the 1990s, I was happy if I did 400 bases in a day). These fragments are randomly selected from the genome, and by carrying out this process millions of times, the entire genome can be covered. Computer algorithms are then used to assemble the sequence, meaning that next-generation sequencing is as much about mathematics as it is about molecular biology.
As the price of sequencing machines and computers has dropped, so too has the price of genomes. The human genome cost the public purse around $3bn – more or less a dollar a base pair – and used more than 1,000 sequencing machines. In 2010, the Chinese employed next-generation sequencing to analyse the 2.3 billion base pairs in the genome of the giant panda for a mere $900,000 – less than 0.04 cents per base, or 1/2,500 of the cost for the human genome. The whole project took less than a year, and used the equivalent of just thirty machines. [46]
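The cost comparison above can be checked with a few lines of arithmetic. The sketch below uses the figures quoted in the text; the human genome size of roughly 3 billion base pairs is an assumption implied by ‘a dollar a base pair’, and the variable and function names are invented for illustration.

```python
# Figures quoted in the text.
HUMAN_COST_USD = 3_000_000_000
PANDA_COST_USD = 900_000
PANDA_BASE_PAIRS = 2_300_000_000

# Assumption: about 3 billion base pairs, implied by "a dollar a base pair".
HUMAN_BASE_PAIRS = 3_000_000_000


def cost_per_base(total_usd: float, base_pairs: int) -> float:
    """Average sequencing cost in US dollars per base pair."""
    return total_usd / base_pairs


human_usd_per_base = cost_per_base(HUMAN_COST_USD, HUMAN_BASE_PAIRS)
panda_usd_per_base = cost_per_base(PANDA_COST_USD, PANDA_BASE_PAIRS)
panda_cents_per_base = panda_usd_per_base * 100
ratio = human_usd_per_base / panda_usd_per_base

print(f"Human: ${human_usd_per_base:.2f} per base")
print(f"Panda: {panda_cents_per_base:.4f} cents per base")
print(f"Human cost per base is about {ratio:,.0f} times the panda figure")
```

Under these assumptions the panda comes out at roughly 0.039 cents per base, and the human genome at about 2,556 times that, consistent with the rounded ‘less than 0.04 cents’ and ‘1/2,500’ in the text.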
Most of the genomes from multicellular organisms thus far completed have been published in one of the leading scientific journals. That will inevitably change. According to the Genomes Online Database, at the end of 2014 there were more than 700 projects to sequence non-human vertebrate genomes alone. The genomes of the rattlesnake, the turkey vulture and Nancy Ma’s night monkey, along with hundreds of others, are all no doubt fascinating and will provide insight into evolution and medicine, but they – along with the hundreds of arthropods and the thousands of fungi that are being sequenced – are unlikely to get the same kind of attention as the platypus and the panda. The leading journals will focus on genomes with a high commercial or scientific impact, and which therefore promise a high rate of citation in the future. For the remainder of the natural world there will be electronic-only genome journals – already, most bacteria that are sequenced receive a brief one-page announcement with a link to the online database where the information is stored. [47]
More sequencing developments are just around the corner: in 2014, Oxford Nanopore Technologies delivered early models of its nanopore sequencer to researchers around the world for beta testing. The device is the size of a mobile phone and plugs into a computer via the USB port. Unlike next-generation sequencing, which relies on computing power and parallel processing, this technology is claimed to create continuous DNA sequences of up to 10,000 base pairs on your desktop. If it lives up to the hype, DNA sequencing will become commonplace, and could even be done in the field on wild-caught samples, to identify particular genetic variants. Already, next-generation sequencing is being used on oceanic research expeditions. [48]
Meanwhile, the market leader in next-generation sequencing machines, Illumina, has announced that its latest device will be able to sequence the equivalent of sixteen human genomes in three days, bringing the price for sequencing a whole human genome down to less than $1,000. There is a catch: the company insists that to get access to their technology, the user will have to buy ten machines at a total cost of at least $10m. [49] Whatever the coming years hold, the price of sequence data will continue to fall, and the number of sequences will continue to grow.
Eventually, personalised medicine based on our individual genetic make-up will become widely available. The president of Illumina, Francis de Souza, has predicted that in 2015, an astonishing 228,000 human genomes will be sequenced in the name of medicine – the British government is currently supporting a project to sequence 100,000 genomes with the aim of improving the diagnosis, prevention and treatment of disease. [50] De Souza’s ambition is to move Illumina technology into the hospitals and to carve out a chunk of a diagnostic market that he estimates at $20bn. Whatever the hype associated with such claims, our understanding of the significance of small genetic differences between individuals – what is known as intraspecific variation – is growing as governments and research agencies around the world realise that there will be health benefits, as well as insight into the history and demography of human populations. [51]
Already, the analysis of the genomic variations found in particular cancers has opened the road to new, more precise treatments. For example, the breast cancer drug Herceptin is targeted solely at women with cancers that have a genetic profile called HER2-positive, while patients with lung cancer whose tumours show mutations in the EGFR gene can be treated with drugs called Iressa and Tarceva. [52]
Studies of genetic variation have led to radical new drug treatments that will transform the health of millions of people around the globe. At the beginning of the century, it was noticed that some families with extremely high levels of cholesterol had a particular form of a gene known as PCSK9. It then appeared that some people with very low levels of cholesterol had a mutation in this gene. In an extremely short period, drugs were developed to target the PCSK9 protein, and these should become available in 2015. [53] Scientists are now trawling through data from populations around the world, looking for genetic variants that correlate with particular health conditions and which could provide an insight into new drug development. Sequencing is beginning to transform medicine.
These technical developments highlight the ingenuity of molecular biologists, engineers and computer scientists, but they have created an intriguing new problem. We now have thousands of genomes sequenced, and the rate at which they are being completed has grown exponentially, outstripping our ability to analyse them. [54] In July 2011, only 36 eukaryotic genomes had been sequenced; a year later, another 140 had been added; by 2014, 5,628 eukaryotic genome sequences had been either begun or completed, and 36,000 prokaryotic genomes had been sequenced. [55]
At the time of writing, the largest known genome is that of the loblolly pine tree, which comes in at a whacking 22 billion base pairs – about seven times the size of the human genome. [56] In contrast, the microbe Nasuia deltocephalinicola has a genome of just 112,000 base pairs – this organism is found uniquely in the guts of leafhopper insects, so that much of its metabolic work is done by its host. [57] With fewer chemical reactions to process, its genome has gradually become reduced in size over the 260 million years that the microbe has been living in the insect, losing unnecessary protein-coding genes much as a parasitic animal loses unnecessary anatomical structures. Scientists have calculated that such symbionts could get by with as few as 93 protein-coding genes, which would probably fit into a genome of merely 70,000 base pairs. [58]
Producing a genomic sequence is now relatively simple, at least compared with the effort involved in the pioneering studies. The problem begins when you try to understand what the genome actually does. One of the main tasks when a genome has been completed is to annotate it, identifying genes and their exons and introns, and above all finding genes that have equivalents in other organisms, preferably with some kind of known function. Often the only basis for identifying the function of a gene is because its DNA sequence is similar to a gene in a different organism where a function has been demonstrated. This has led to a new discipline called genomics, which involves obtaining genomes and understanding their nature and evolution. It includes a new set of techniques, collectively called bioinformatics, which combine computing and population genetics to make inferences about the patterns of evolution and enable us to determine which genes have a common origin or function. Training biologists in the techniques of computer science will be an important part of twenty-first-century scientific education.
One of the most far-reaching scientific consequences of sequencing came with the work of Carl Woese, who realised in the 1960s that he could use the RNA found in ribosomes (rRNA), which is common to every organism on the planet, to study patterns of evolution. Woese began studying variation in the nucleotide sequence of the 16S rRNA subunit in a range of bacteria, and by the mid-1970s he had sequenced part of this rRNA from around thirty species – the work was extremely slow and arduous. In 1977, Woese published two papers with George Fox in which they claimed that prokaryotes – single-celled organisms with no nucleus – were not a single group with a common evolutionary history. Basing their analysis on the rRNA sequences – a far more rigorous approach than the mixture of morphological, physiological and ecological data that had previously been employed – Woese and Fox proposed to split the bacteria into two groups: the Eubacteria (or true bacteria) and the Archaebacteria. [59] The data showed clearly that the Archaea, as they were later called, were no more closely related to bacteria than they were to eukaryotes, like you and me. Eventually this led Woese to propose that life evolved into what he called three domains: Bacteria, Archaea and Eukaryota.
