Towards complete and error-free genome assemblies of all vertebrate species

Rhie A; McCarthy SA; Fedrigo O; Damas J; Formenti G; Koren S; Uliano-Silva M; Chow W; Fungtammasan A; Kim J; Lee C; Ko BJ; Chaisson M; Gedman GL; Cantin LJ; Thibaud-Nissen F; Haggerty L; Bista I; Smith M; Haase B; Mountcastle J; Winkler S; Paez S; Howard J; Vernes SC; Lama TM; Grutzner F; Warren WC; Balakrishnan CN; Burt D; George JM; Biegler MT; Iorns D; Digby A; Eason D; Robertson B; Edwards T; Wilkinson M; Turner G; Meyer A; Kautt AF; Franchini P; Detrich HW; Svardal H; Wagner M; Naylor GJP; Pippel M; Malinsky M; Mooney M; Simbirsky M; Hannigan BT; Pesout T; Houck M; Misuraca A; Kingan SB; Hall R; Kronenberg Z; Sović I; Dunn C; Ning Z; Hastie A; Lee J; Selvaraj S; Green RE; Putnam NH; Gut I; Ghurye J; Garrison E; Sims Y; Collins J; Pelan S; Torrance J; Tracey A; Wood J; Dagnew RE; Guan D; London SE; Clayton DF; Mello CV; Friedrich SR; Lovell PV; Osipova E; Al-Ajli FO; Secomandi S; Kim H; Theofanopoulou C; Hiller M; Zhou Y; Harris RS; Makova KD; Medvedev P; Hoffman J; Masterson P; Clark K; Martin F; Howe K; Flicek P; Walenz BP; Kwak W; Clawson H; Diekhans M; Nassar L; Paten B; Kraus RHS; Crawford AJ; Gilbert MTP; Zhang G; Venkatesh B; Murphy RW; Koepfli KP; Shapiro B; Johnson WE; Di Palma F; Marques-Bonet T; Teeling EC; Warnow T; Graves JM; Ryder OA; Haussler D; O'Brien SJ; Korlach J; Lewin HA; Howe K; Myers EW; Durbin R; Phillippy AM; Jarvis ED

doi:10.1038/s41586-021-03451-0



Rhie A ¹ , McCarthy SA ² , Fedrigo O ³ , Damas J ⁴ , Formenti G ³ , Koren S ¹ , Uliano-Silva M ⁵ , Chow W ⁶ , Fungtammasan A ⁷ , Kim J ⁸ , Lee C ⁸ , Ko BJ ⁹ , Chaisson M ¹⁰ , Gedman GL ¹¹ , Cantin LJ ¹¹ , Thibaud-Nissen F ¹² , Haggerty L ¹³ , Bista I ² , Smith M ⁶ , Haase B ³ , Mountcastle J ³ , Winkler S ¹⁴ , Paez S ³ , Howard J ¹⁵ , Vernes SC ¹⁶ , Lama TM ¹⁷ , Grutzner F ¹⁸ , Warren WC ¹⁹ , Balakrishnan CN ²⁰ , Burt D ²¹ , George JM ²² , Biegler MT ¹¹ , Iorns D ²³ , Digby A ²⁴ , Eason D ²⁴ , Robertson B ²⁵ , Edwards T ²⁶ , Wilkinson M ²⁷ , Turner G ²⁸ , Meyer A ²⁹ , Kautt AF ²⁹ , Franchini P ²⁹ , Detrich HW ³⁰ , Svardal H ³¹ , Wagner M ³² , Naylor GJP ³³ , Pippel M ¹⁴ , Malinsky M ⁶ , Mooney M ³⁴ , Simbirsky M ⁷ , Hannigan BT ⁷ , Pesout T ³⁵ , Houck M ³⁶ , Misuraca A ³⁶ , Kingan SB ³⁷ , Hall R ³⁷ , Kronenberg Z ³⁷ , Sović I ³⁷ , Dunn C ³⁷ , Ning Z ⁶ , Hastie A ³⁸ , Lee J ³⁸ , Selvaraj S ³⁹ , Green RE ³⁵ , Putnam NH ⁴⁰ , Gut I ⁴¹ , Ghurye J ⁴² , Garrison E ³⁵ , Sims Y ⁶ , Collins J ⁶ , Pelan S ⁶ , Torrance J ⁶ , Tracey A ⁶ , Wood J ⁶ , Dagnew RE ¹⁰ , Guan D ² , London SE ⁴³ , Clayton DF ⁴⁴ , Mello CV ⁴⁵ , Friedrich SR ⁴⁵ , Lovell PV ⁴⁵ , Osipova E ¹⁴ , Al-Ajli FO ⁴⁶ , Secomandi S ⁴⁷ , Kim H ⁸ , Theofanopoulou C ¹¹ , Hiller M ⁴⁸ , Zhou Y ⁴⁹ , Harris RS ⁵⁰ , Makova KD ⁵⁰ , Medvedev P ⁵¹ , Hoffman J ¹² , Masterson P ¹² , Clark K ¹² , Martin F ¹³ , Howe K ¹³ , Flicek P ¹³ , Walenz BP ¹ , Kwak W ⁵² , Clawson H ³⁵ , Diekhans M ³⁵ , Nassar L ³⁵ , Paten B ³⁵ , Kraus RHS ²⁹ , Crawford AJ ⁵³ , Gilbert MTP ⁵⁴ , Zhang G ⁵⁵ , Venkatesh B ⁵⁶ , Murphy RW ⁵⁷ , Koepfli KP ⁵⁸ , Shapiro B ⁵⁹ , Johnson WE ⁵⁸ , Di Palma F ⁶⁰ , Marques-Bonet T ⁶¹ , Teeling EC ⁶² , Warnow T ⁶³ , Graves JM ⁶⁴ , Ryder OA ³⁶ , Haussler D ³⁵ , O'Brien SJ ⁶⁵ , Korlach J ³⁷ , Lewin HA ⁴ , Howe K ⁶⁶ , Myers EW ⁶⁷ , Durbin R ⁶⁸ , Phillippy AM ⁶⁹ , Jarvis ED ⁷⁰

Affiliations

¹ Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
² Department of Genetics, University of Cambridge, Cambridge, UK
³ Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
⁴ The Genome Center, University of California Davis, Davis, CA, USA
⁵ Leibniz Institute for Zoo and Wildlife Research, Department of Evolutionary Genetics, Berlin, Germany
⁶ Wellcome Sanger Institute, Cambridge, UK
⁷ DNAnexus Inc., Mountain View, CA, USA
⁸ Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
⁹ Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
¹⁰ University of Southern California, Los Angeles, CA, USA
¹¹ Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
¹² National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD, USA
¹³ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
¹⁴ Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
¹⁵ Novogene, Durham, NC, USA
¹⁶ Neurogenetics of Vocal Communication Group, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
¹⁷ University of Massachusetts Cooperative Fish and Wildlife Research Unit, Amherst, MA, USA
¹⁸ School of Biological Science, The Environment Institute, University of Adelaide, Adelaide, South Australia, Australia
¹⁹ Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
²⁰ Department of Biology, East Carolina University, Greenville, NC, USA
²¹ UQ Genomics, University of Queensland, Brisbane, Queensland, Australia
²² Department of Biological Sciences, Clemson University, Clemson, SC, USA
²³ The Genetic Rescue Foundation, Wellington, New Zealand
²⁴ Kākāpō Recovery, Department of Conservation, Invercargill, New Zealand
²⁵ Department of Zoology, University of Otago, Dunedin, New Zealand
²⁶ University of Arizona Genetics Core, Tucson, AZ, USA
²⁷ Department of Life Sciences, Natural History Museum, London, UK
²⁸ School of Natural Sciences, Bangor University, Gwynedd, UK
²⁹ Department of Biology, University of Konstanz, Konstanz, Germany
³⁰ Department of Marine and Environmental Sciences, Northeastern University Marine Science Center, Nahant, MA, USA
³¹ Department of Biology, University of Antwerp, Antwerp, Belgium
³² Institute of Biology, Karl-Franzens University of Graz, Graz, Austria
³³ Florida Museum of Natural History, University of Florida, Gainesville, FL, USA
³⁴ Tag.bio, San Francisco, CA, USA
³⁵ UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
³⁶ San Diego Zoo Global, Escondido, CA, USA
³⁷ Pacific Biosciences, Menlo Park, CA, USA
³⁸ Bionano Genomics, San Diego, CA, USA
³⁹ Arima Genomics, San Diego, CA, USA
⁴⁰ Independent Researcher, Santa Cruz, CA, USA
⁴¹ CNAG-CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology, Barcelona, Spain
⁴² Dovetail Genomics, Santa Cruz, CA, USA
⁴³ Department of Psychology, Institute for Mind and Biology, University of Chicago, Chicago, IL, USA
⁴⁴ Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA
⁴⁵ Department of Behavioral Neuroscience, Oregon Health and Science University, Portland, OR, USA
⁴⁶ Monash University Malaysia Genomics Facility, School of Science, Selangor Darul Ehsan, Malaysia
⁴⁷ Department of Biosciences, University of Milan, Milan, Italy
⁴⁸ LOEWE Centre for Translational Biodiversity Genomics, Frankfurt, Germany
⁴⁹ BGI-Shenzhen, Shenzhen, China
⁵⁰ Department of Biology, Pennsylvania State University, University Park, PA, USA
⁵¹ Center for Medical Genomics, Pennsylvania State University, University Park, PA, USA
⁵² eGnome, Inc., Seoul, Republic of Korea
⁵³ Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
⁵⁴ Center for Evolutionary Hologenomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
⁵⁵ China National Genebank, BGI-Shenzhen, Shenzhen, China
⁵⁶ Institute of Molecular and Cell Biology, A*STAR, Biopolis, Singapore, Singapore
⁵⁷ Centre for Biodiversity, Royal Ontario Museum, Toronto, Ontario, Canada
⁵⁸ Smithsonian Conservation Biology Institute, Center for Species Survival, National Zoological Park, Washington, DC, USA
⁵⁹ Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, USA
⁶⁰ Department of Biological Sciences, Earlham Institute, University of East Anglia, Norwich, UK
⁶¹ Institute of Evolutionary Biology (UPF-CSIC), PRBB, Barcelona, Spain
⁶² School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
⁶³ Department of Computer Science, The University of Illinois at Urbana-Champaign, Urbana, IL, USA
⁶⁴ School of Life Science, La Trobe University, Melbourne, Victoria, Australia
⁶⁵ Laboratory of Genomics Diversity-Center for Computer Technologies, ITMO University, St. Petersburg, Russian Federation
⁶⁶ Wellcome Sanger Institute, Cambridge, UK. kj2@sanger.ac.uk
⁶⁷ Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany. gene@mpi-cbg.de
⁶⁸ Department of Genetics, University of Cambridge, Cambridge, UK. rd109@cam.ac.uk
⁶⁹ Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. adam.phillippy@nih.gov
⁷⁰ Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA. ejarvis@rockefeller.edu

Nature, 2021 Apr;592(7856):737-746.

PMID: 33911273 DOI: 10.1038/s41586-021-03451-0

Abstract

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species¹⁻⁴. To address this issue, the international Genome 10K (G10K) consortium⁵,⁶ has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

