The Deeper Genome: Why there is more to the human genome than meets the eye, by John Parrington, Oxford University Press, 272 pp, £18.99, ISBN: 978-0199688739
For a clinician, there is something very satisfying about situations where the findings from an obscure, now almost forgotten, scientific experiment continue to play a role in one’s day-to-day practice. For example, when I am preparing to intubate a patient, the image of a donkey being kept alive with a pair of bellows sometimes comes to mind. It is easier to intubate (pass a breathing tube through the vocal cords to artificially ventilate a patient’s lungs) if a muscle-relaxing agent such as atracurium is administered prior to the procedure. The structure of atracurium, and of other similar synthetic medications, is based on that of a substance called curare, the toxin used in poisoned arrows by Amazonian Indians. D-tubocurarine (the active compound in curare) was first demonstrated to have a physiological effect on large mammals in 1814, when Charles Waterton administered it to a female donkey at the Royal Veterinary College in London. Waterton, the son of a Yorkshire squire, spent twenty years travelling in the Amazonian rainforest at the start of the nineteenth century. During the latter part of this period he spent several months searching for the most potent wourali (as he called curare) available, finally sourcing his specimens from the Macushi tribe in the south of Guiana, near the border with Brazil.
We now know that D-tubocurarine binds to receptors on the junctions between nerves and muscles, and interrupts the signals that lead to muscle contractions. This makes breathing, and all other voluntary movement, impossible. Therefore when the D-tubocurarine was injected into the donkey she stopped breathing, although her heart continued to beat. At this point an incision was made into her windpipe, bellows were inserted, and she was kept alive by the efforts of Waterton and his fellow vivisectionists until the effects of the toxin wore off two hours later. Animal lovers will be pleased to hear that Wouralia (as she was subsequently named) survived for a further twenty-five years at Walton Hall, near Wakefield.
Back in the twenty-first century, once atracurium has been administered to my patient, there will be a few minutes in which to insert a breathing tube through their now relaxed (and therefore open) vocal cords and attach them to the ventilator. Another class of agents acting on the neuromuscular junction, the cholinesterase inhibitors, has the opposite effect, overactivating the junction and leading to excessive contraction of the muscles. Because they counteract the action of atracurium, they are used as an antidote if too much muscle relaxant has been given. In excessive doses, however, this overactivation of the muscle junction will itself eventually lead to death: sarin gas, as used by Saddam Hussein in northern Iraq in the 1980s, and possibly during the civil war in Syria in 2013, is an extremely potent cholinesterase inhibitor.
These historical examples are intellectually appealing for a number of reasons. Their elucidation of biological processes allows us to make a connection between the events taking place at a microscopic cellular level and the resulting effects visible in individuals. They remind us of the power that we have to harness our understanding of nature for honourable, and dishonourable, ends. Finally, by allowing us to engage with the actions of individuals who are now long dead, we can feel that in our daily lives we are in some way continuing their narrative of scientific curiosity, discovery and exploration.
A large part of James Watson and Francis Crick’s success after their discovery of the double helix structure of DNA in 1953 was related to the first of these reasons. They not only made a significant scientific discovery but also went on to coin the vocabulary and imagery for a principle which was simple to grasp and could easily be applied to our understanding of evolution and disease. Watson and Crick’s so-called central dogma was that DNA makes RNA, which in turn makes protein. DNA (deoxyribonucleic acid) is a double helix with a backbone of deoxyribose sugars and phosphate groups, each sugar carrying one of four nucleobases (A, T, C or G), and is organised (in most humans) into forty-six chromosomes. Each protein-coding sequence of DNA (a gene) acts as the template for a complementary RNA (ribonucleic acid) sequence in a process called transcription. This RNA in turn uses a triplet code (codons) to specify the sequence of amino acids which go on to make up a protein, in a process called translation.
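For readers who like to see the mechanism spelled out, the sketch below shows the two steps in miniature in Python; the twelve-base “gene” is invented for illustration, and the codon table contains only a handful of the sixty-four codons of the real genetic code.

```python
# A toy model of the central dogma: DNA is transcribed into RNA, and the RNA
# is translated, three bases (one codon) at a time, into protein.
# The "gene" and the codon table below are illustrative only.

CODON_TABLE = {
    "AUG": "Met",   # methionine (the usual start codon)
    "UUU": "Phe",   # phenylalanine
    "AAA": "Lys",   # lysine
    "UAA": "STOP",  # one of the three stop codons
}

def transcribe(dna: str) -> str:
    """Transcription, much simplified: the RNA copy of the coding strand
    has the same sequence, with uracil (U) in place of thymine (T)."""
    return dna.upper().replace("T", "U")

def translate(rna: str) -> list[str]:
    """Translation: read successive codons until a stop codon is reached."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "?")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

gene = "ATGTTTAAATAA"                 # a made-up twelve-base "gene"
print(translate(transcribe(gene)))    # ['Met', 'Phe', 'Lys']
```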
Turning to evolution, Watson and Crick knew from Darwin that natural selection acts on phenotypes, that is, the set of observable characteristics of an individual. With the central dogma, phenotypes could be seen as the protein end product of a DNA code. Thus when mutations occur in DNA, these lead to modified or new proteins and a resulting change in phenotype, which is selected for or against and may eventually lead to new species. For example, a mutation in brown bears at some point between 200,000 and 125,000 years ago led to a subgroup having lighter and thicker coats, which allowed enhanced survival in higher latitudes and led to the spread of this mutated gene throughout the population. We now know this subgroup as polar bears.
As regards our understanding of pathophysiology, mutations in the DNA which disrupt the RNA sequence can result in a misfolded or truncated protein, which can in turn lead to disease. Individuals with two mutated copies of the CFTR (cystic fibrosis transmembrane conductance regulator) gene produce a faulty protein which is unable to regulate the movement of chloride in and out of cells, leading to an accumulation of mucus in the lungs and pancreas, and thus to the complications of cystic fibrosis.
Once DNA had been identified as the fundamental building block underlying the development of an organism, disease and evolution, the logical next step was to sequence the entire genome. Identifying the full sequence of all of our DNA, it was hoped, would give us deep insight into human traits and human disease. If this sounds hyperbolic, consider that in 2000, when the first rough draft of the human genome was sequenced, President Clinton announced: “We are here to celebrate the completion of the first survey of the entire human genome. Without a doubt, this is the most important, most wondrous map ever produced by humankind.” Tony Blair, never one to be left behind, similarly asserted that “every so often in the history of human endeavour, there comes a breakthrough that takes humankind across a frontier into a new era … today’s announcement is such a breakthrough, a breakthrough that opens the way for massive advancement in the treatment of cancer and hereditary diseases. And that is only the beginning.”
What has happened in the fifteen years since the sequencing of the first human genome and the wave of optimism and excitement that accompanied it? Craig Venter was the CEO of Celera Genomics, the private company that competed with the publicly funded Human Genome Project; together the two made the announcement in 2000 of the mapping of the genome (in Celera’s case, Venter’s own). At the time he confidently stated that “it is my belief that the basic knowledge we’re providing to the world will have a profound impact on the human condition and the treatment for disease”. But by 2010 he told an interviewer at Der Spiegel that “we have learned nothing from the genome other than probabilities. How does a one or three percent increased risk for something translate into the clinic? It is useless information.” Similarly, when I was talking to a scientific colleague about genome-wide association studies (GWAS for short), of which over two thousand have been published in the last ten years, she asked me, in a genuinely inquiring way, “has anyone ever learnt anything at all from them?” So what went wrong for genomics?
The problem is not related to a lack of capital. The original Human Genome Project took thirteen years to complete and cost an estimated $3 billion; since 2005 a further $250 million has been spent on GWAS. In 2012 a project called ENCODE (ENCyclopedia of DNA Elements) published its findings, having spent $288 million over nine years gathering data on the activity of DNA inside different cell types. Although the cost of sequencing a complete human genome has now fallen to an estimated $1,000, the enthusiasm of scientific funding bodies for genomics research means that considerable sums are being spent studying the DNA of ever larger populations. The 1000 Genomes Project, which sampled individuals from around the world, published its final data set earlier this year. In the UK, Generation Scotland is planning to sequence thirty thousand full genomes, and the 100,000 Genomes Project is now recruiting in England. Even once the sequence data (a list of three billion As, Ts, Cs and Gs) has been made available, the challenge of storing and processing it is formidable. In 2014 I talked to a genomics researcher who told me that each Illumina HiSeq X (the most up-to-date genome sequencer) would be able to output as many human genome sequences each year as already existed in the world (1,800 per machine if running at full capacity), and that each machine would generate more terabytes (that’s 1,000,000,000,000 bytes) of data annually than his institute had generated in all its research so far.
There are some, like Sydney Brenner, Crick’s collaborator at Cambridge, who believe that this obsession with volume has corrupted science: “it has created the idea that if you just collect a lot of data, it will just work out”. John Parrington is more optimistic about the possibilities of genomics and big data science. In The Deeper Genome he has written a lucid account of the history of our understanding of genetics and heritability, and of the place of genomics in modern science and ultimately in society.
The Deeper Genome is highly readable, largely because it tells the stories of a number of maverick, obsessive and often obstinate scientists who have driven the field. Tellingly perhaps, these lives shine more brightly in the book than big data itself, which, like the picture of the iceberg on the front cover, looms menacingly throughout, its potential as yet unrevealed and unclear. As well as the usual suspects of Wallace, Darwin, Mendel, Watson and Crick, we meet engaging characters like the Frenchman Jacques Monod. A research biochemist at the Sorbonne during World War Two, Monod continued to examine the differential growth of bacteria in glucose and lactose whilst simultaneously acting as chief of staff for the French Resistance, hiding important documents inside the femurs of the giraffe skeletons outside his laboratory. He continued his subversive activities after the war, and in 1960 smuggled one of his collaborators, Agnes Ullmann, out of Hungary hidden underneath a bathtub in a compartment of a camping trailer after she was implicated in the failed 1956 revolution.
Mark Ptashne was a junior member of Watson’s laboratory at Harvard when the latter moved from Cambridge to the United States. Ptashne, when not conducting lecture tours of North Vietnam as bombs fell on the country, was dismissive of his research colleagues, who “weren’t willing to take the kinds of risks that were necessary [to isolate a transcription factor] … psychic risks”. We learn about Dr Giles Brindley, who in 1983 presented at the American Urological Association meeting in Las Vegas on a new medication he had been working on. To overcome scepticism from his colleagues about its efficacy, he injected himself with the drug just before he was due to talk and “over the course of the lecture demonstrated to his audience visible evidence” of the effectiveness of the compound. The substance in question was an injectable forerunner of sildenafil (better known by its trade name, Viagra), the drug later said to be the first to go in development from bench to bedroom.
But by 2013 the colourful Dan Graur, an academic at the University of Houston who showed his audience a photograph of dollar bills taped together in the shape of a toilet roll to convey his view of just how successful he believed the ENCODE project to have been, is a lonesome figure. He is dwarfed by the largely faceless 442 scientists involved in ENCODE, or by the many hundreds who worked on the Human Genome Project.
As well as finding scientists more engaging when they appear troubled, uncertain or just slightly mad, we find science easier to understand through simplifying analogies and metaphors. Parrington shows that our understanding of how the activity of individual cells translates into the physiology of an entire organism has been shaped by the technology and social mores of researchers at the time. Biochemists at the tail end of the Industrial Revolution described the cell as a factory. Over time, as understanding of the complexity of cell activity grew, this was refined into the idea of the cell nucleus as a central office of managers coordinating the factory floor of the cytoplasm.
When the structure and activity of DNA were first described in the 1950s, the molecule was seen as a written blueprint for how an organism would construct itself. In a nod to this, the Wellcome Collection in London has a large bookshelf with a full printout of the human genome (in minuscule As, Ts, Gs and Cs) spread over a hundred volumes. Later, with the rise of the computer age, the genome started to be compared to a linear sequence of code for “writing” protein, and by the late 1980s the contents of a human genome were being compared to a compact disc: all the information about an individual might in the future be stored in digital format. By the 2000s, the genome was increasingly being described as a modifiable software program, its instructions for the proteins that run cells and organs executing in parallel.
A significant challenge to the explanatory power of the central dogma has come from the realisation that all genomes contain a substantial proportion of DNA that is not protein-coding. This is not a new concept. An awareness of so-called “junk DNA” began in the early 1970s, and by 1976 we find Richard Dawkins stating with characteristic confidence that “the simplest way to explain the surplus DNA is to suppose it is a parasite, or at best a harmless but useless passenger”. Perhaps not unexpectedly, he goes on to use the existence of this “junk” DNA as an argument against a religious interpretation of the origins of life: why would God create such a messy mechanism? With time this “junk” DNA has been integrated into the computing metaphor. Elon Musk states that “trying to read our DNA is like trying to understand software code - only with 90% of the code riddled with errors. It’s very difficult in that case to understand and predict what the software code is going to do”.
We now know that about 2 per cent of the genome codes directly for proteins, and that we have only 22,333 genes (rather than the 100,000 or so posited before the full human genome became available). In a further challenge to the central dogma, there does not always appear to be a linear relationship between the complexity of an organism and how many genes it has. It seems obvious that a tiny virus such as influenza might have a mere eleven genes, compared with 14,889 in a fruit fly and 7,444 more than that in a human, but why would a grape need 30,434?
It is the question of what the remaining 98 per cent of the genome actually does, and the complex relationship between genetic material and the resulting organism, that is the focus of the greater part of Parrington’s book. Part of the reason studies of human genomes have yielded so little insight so far is that mutations in the protein-coding sequences of the genome are rare, and much of human variation thus lies within the large, and still poorly understood, non-coding sequences. The genome-wide association studies mentioned previously look at variation within and between populations at locations throughout the genome called SNPs (single nucleotide polymorphisms). At such a location the nucleotide A (adenine) might be replaced by a C (cytosine), with a certain proportion of the population carrying each variant (known as an allele). GWAS sample the variation in thousands or millions of SNPs across the genomes of large numbers of individuals. Statistical software is then used to identify SNPs whose variation shows a statistically significant association with particular traits (such as height or weight) or diseases (such as risk of heart disease or cancer).
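To make the logic concrete, here is a minimal, purely illustrative sketch in Python of the kind of single-SNP case/control test that sits at the heart of such a study; the allele counts are invented, and real analyses add corrections for multiple testing, population structure and a great deal else.

```python
# A schematic single-SNP association test, of the sort a GWAS repeats
# across thousands or millions of SNPs. The counts below are invented.

def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of allele counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Allele counts (A versus C) at one hypothetical SNP in cases and controls.
allele_counts = [
    [1200, 800],   # cases:    1200 'A' alleles, 800 'C' alleles
    [1050, 950],   # controls: 1050 'A' alleles, 950 'C' alleles
]

print(f"chi-square = {chi_square_2x2(allele_counts):.1f}")
# A statistic this large (~23 on one degree of freedom) would be convincing
# for a single test, but because a GWAS tests so many SNPs at once the
# conventional genome-wide threshold is a p-value below 5e-8, not 0.05.
```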
However, many SNPs which are found to associate with traits or diseases are located in non-coding regions of the genome, and their functional significance is therefore hard to interpret. Furthermore, each of these SNPs appears to contribute towards only a small fraction (Craig Venter’s “one or three percent increased risk”) of the variation in a particular trait or susceptibility to a disease. It is the uncertainty about how these variants might exert an effect, and the apparently small effects of each individual genetic variant, which have led to the disappointment with these studies expressed by my colleague.
Nevertheless, the complexity of the non-coding part of the genome is slowly beginning to be understood. This is due in part to the massive acceleration in the performance of computing and of high-throughput sequencing. For example, the most recent project looking at RNA activity in cells, called FANTOM5, examined the activity of 260,000 regulatory regions in 1,800 human cell samples, an enterprise that would have been unimaginable fifteen years ago.
In The Deeper Genome, Parrington takes us on a tour of these non-coding sequences, detailing first the promoter and enhancer regions. Promoter sequences are the sections of DNA to which the RNA-writing machinery binds. Enhancer sequences regulate how frequently promoters are used. While it was previously thought that enhancers must lie close to promoters, it is now clear that they may lie thousands (or even millions) of bases up or downstream of the promoters. Here the analogy of DNA being like a digital code read linearly, as in a CD, falls apart. It appears that DNA may have a far richer and more complicated three-dimensional landscape than previously thought, as chromosomes bend backwards and forwards to up- and downregulate the processing of RNA and eventually protein. Similarly, it is possible that strands of many different chromosomes are simultaneously pulled together towards a single central locus that regulates the production of protein from different sequences of DNA in synchrony. There is as yet no way of observing these changes in real time, meaning that any attempt to understand these structures involves their destruction, in a twenty-first century reworking of Heisenberg’s uncertainty principle. The technology determines to a large extent what we are able to understand. Returning to the Wellcome Collection’s bookshelf, it is as if we had ripped all the pages out of the books and created a large, complex, multipart origami sculpture from them. The technology available at present means that in order to understand the organisation of our origami piece we have to first unfold our constituent sheets of paper. Then all we have to go by to deduce the original structure are the patterns of creases on the unfolded individual sheets, without the sequence of those folds being clear, and bearing in mind that there are three billion base pairs in each human genome to take into account.
Yet there is another order of complexity: DNA, and the histone proteins around which it is wound to make up the chromosomes, can be modified by the addition of methyl or acetyl groups in ways which up- or downregulate the activity of a gene. These modifications can either wind or unwind the DNA strands, making them more or less accessible to the transcriptional machinery that makes RNA. The presence of these methyl and acetyl groups appears to be determined to a certain extent by the circumstances of each individual, so that environmental changes (for example in utero or in early childhood) may have long-lasting consequences for how each cell functions: so-called epigenetics.
Finally, it is apparent that far more RNA is produced than is eventually translated into protein. So far four classes of non-coding RNA have been described: silencing RNAs (which act by binding to and destroying the messenger RNAs that would otherwise have made protein); microRNAs (which can increase or decrease the capacity of RNA to make protein); piRNAs (which regulate the movement of DNA around a genome); and long non-coding RNAs, which appear to act by bringing different parts of the genome together to create the 3D networks of functionality described above.
The ENCODE project examined the proportion of the genome that appeared to act as promoters or enhancers, to be acetylated or methylated, or to code for RNA, and, using activity as a proxy for functionality, came to the conclusion that 80 per cent of the genome has a function. With over 20,000 genes and up to 70,000 functional RNAs (a number that remains highly speculative) working together in a complicated, constantly modified 3D network, Ewan Birney, the spokesperson for the ENCODE project, was perhaps right to claim that “it’s like a jungle in there”. Whilst the metaphor of a jungle is descriptive, it has little of the explanatory power and appeal of Watson and Crick’s “secret of life”.
The dissemination of this new version of how the genome works has possibly suffered from the lack of charismatic figureheads, who have largely been replaced by scores of computer experts working together on unimaginably large data sets, in large buildings that hum quietly and anonymously in the way that one imagines Google and Facebook offices do. It has also suffered from the lack of a clear simplifying language for the processes taking place, and from the absence of metaphors that provide insight into the biology. But what is probably most damaging is the lack of impact that this newfound understanding appears to have had so far on our ability to treat human disease. Even for well-understood conditions like cystic fibrosis, where individuals with the disease (in the Western world at least) often know the exact location of the mutation on their CFTR gene, successful therapies have proved elusive. The latest trial showed a 4 per cent improvement in lung function in the gene therapy intervention group, hardly enough to excite the parents of newborn children with this life-limiting disease. For more common conditions like heart disease, cancer or schizophrenia, progress has been even less marked, even in understanding the genetic contribution to disease, let alone in developing therapies. Susceptibility appears to be determined by the interaction of a large number of poorly understood variations in non-coding regions of the genome, making the design of effective interventions challenging, to say the least.
The Deeper Genome concludes by speculating that what might be needed is a paradigm shift in how we see biological systems. Critics of the reductionist method, which dissects biological systems into their constituent parts to illuminate the whole, argue that this approach has reached its limit. However, as Parrington points out, it is one thing to state the problem, and quite another to find a way to understand and interpret its complexity. As Goethe said, “if we want to attain a living understanding of nature, we must become as flexible and mobile as nature herself”.
For now, no one is quite clear how we will get there.
1/3/2016
Thomas Christie Williams is a paediatrician and a clinical lecturer at the University of Edinburgh. He was previously an archaeologist.