Filling in the Book of Life’s Missing Chapters

Improvements in sequencing techniques have allowed researchers to completely map out the human genome, setting the stage for a more robust understanding of disease and treatment.


When the Human Genome Project unveiled initial findings in the early 2000s, many experts and public commentators touted it as having written “the book of life”. After all, the Project’s main aim was to sequence our DNA, the core molecule that contains all the information and instructions underpinning human biology.

In the years since, scientists have used findings from the Project to deepen our understanding of health and disease, and to improve medical and healthcare technologies. However, the field also grew increasingly aware that the human genome remained incompletely mapped, and that these gaps could contain crucial information needed to finally solve hard-to-treat diseases.

After successfully implementing new technology to demonstrate the first “ultra-long” reads surpassing 100,000 bases, colleagues Dr. Karen Miga and Dr. Adam Phillippy decided to team up to finish sequencing the human genome, work that would entail assembling some of the most difficult regions of the genome.

“The reason the genome was finished in 2022 is because the technologies finally advanced to the point of making it possible. It would have been totally impossible to do this in the early 2000s, no matter how much money you were to throw at the problem, because the sequencing technology was just not advanced enough,” Phillippy said.

In a recent paper published in Science, a global team of researchers led by Phillippy, head of the Genome Informatics Section at the National Human Genome Research Institute (NHGRI), and Miga, Assistant Research Scientist at the University of California, Santa Cruz (UCSC), report their success in filling out the book of life’s missing chapters.

A tale of technological innovation

Held back by the technological limits of its time, the Human Genome Project was only able to sequence about 90 percent of the human genome. In particular, the Project was only able to map out the loosely condensed and gene-rich regions of the chromosome, called euchromatin.

In contrast, heterochromatin, the parts of the chromosome that are very tightly packed, was inaccessible, and remained so for several years after the initial readout of the Project.

“A lot of the implications will be evident years from now when we better understand the role and function of these elements”, Phillippy stated. “But one important note is that each of these sequences are absolutely critical for the normal function of cells. Satellite DNAs make up the centromeres, which are critical for cell division. The rDNA code for the ribosomes, which are arguably the most important molecular components of a living cell.”

It would take nearly a decade before innovation in the field started producing techniques that could access these hard-to-reach areas.

“PacBio was the first technology that could produce sequencing reads greater than 10,000 bases, and my group was an early adopter of this technology in the early 2010s,” Phillippy said, adding that while these new “long” reads could yield gapless sequences of bacterial genomes, they still fell short for the more complex vertebrate genomes.

In 2015, Oxford Nanopore Technologies pushed its read lengths even further, and in 2018 Phillippy and his team showed it was possible to produce reads longer than 100,000 bases. Finally, in the spring of 2020, as PacBio’s HiFi sequencing technology was introduced, Phillippy’s team made further progress.

“My postdoc Sergey Nurk had developed some experimental methods for assembling this new datatype and showed me some fantastic early results assembling the CHM13 HiFi data that the Eichler lab had generated.”

Armed with promising results that even the COVID-19 pandemic could not impede, Phillippy organised a workshop at the NHGRI that would funnel expertise from different parts of the world. “Amazingly, and completely unexpectedly, we had gapless assemblies for all 23 chromosomes by the end of the summer,” Phillippy recalled.

Revealing the repeats

In their Science paper, the research team combined not just the latest PacBio and Nanopore technologies, but also other sequencing platforms from other companies. They also developed their own methods to assemble the reads into even longer sequences, polish their assemblies and validate their overall findings.

The result is a complete, gapless, end-to-end sequence of all 22 human autosomes and the X chromosome. In total, Phillippy’s team mapped out more than 3.05 billion base pairs of nuclear DNA and over 16,000 base pairs of the mitochondrial genome. Along the way, they were able to correct prior sequencing errors in 238 million base pairs, many of which were found in the tightly packed chromosome regions.

What was hiding in the previously unmapped regions of the genome?

“The short answer is repeats,” said Phillippy, referring to stretches of DNA that occur over and over again throughout the genome. Some of these repetitive elements are known to be important in maintaining the three-dimensional structure of the chromosome and in regulating gene expression, while others can contain additional copies of functional genes.

“All told, we uncovered around 200 million new bases of sequence,” Phillippy said. “Before the advent of long-read sequencing and the assembly methods developed by our consortium, we simply didn’t have the tools necessary to assemble them.”

Making sense of these previously missing features is important because “segmental duplications are a rich source of variation between humans and our nearest primate relatives, and so may hold clues to what makes us uniquely human”.

‘Every genome is unique’

In the same way that technological developments enabled Phillippy’s team to completely sequence the genome, he expects their findings to also set up future research teams for similar breakthroughs.

“Our work shows what will be possible within the next 20 years, and that is the cheap and routine sequencing of complete human genomes,” he said.

In the long run, a gapless genome sequence will pave the way for an even better appreciation of how diseases arise and what underpins their pathology and progression. This knowledge will eventually help in the development of more effective treatments, ultimately improving human health.

“It is indeed a great milestone, and a testament to how far the field has come in the past 20 years. But it will take continued hard work to realise the promise of improved human health,” explained Phillippy, adding that further studies are needed to identify culprit alleles or other similar molecules and elucidate the exact role they play in diseases.

Much work still needs to be done on the genome sequencing front too. Although the sequence is complete, Phillippy’s team was able to map out just one person’s genome, and they are looking to expand their project to diverse populations all over the world.

“Every genome is unique, and the complete mapping of additional genomes from around the world will result in even bigger improvements and an even greater understanding of the genome,” he said.