Rhie A 1 , McCarthy SA 2 , Fedrigo O 3 , Damas J 4 , Formenti G 3 , Koren S 1 , Uliano-Silva M 5 , Chow W 6 , Fungtammasan A 7 , Kim J 8 , Lee C 8 , Ko BJ 9 , Chaisson M 10 , Gedman GL 11 , Cantin LJ 11 , Thibaud-Nissen F 12 , Haggerty L 13 , Bista I 2 , Smith M 6 , Haase B 3 , Mountcastle J 3 , Winkler S 14 , Paez S 3 , Howard J 15 , Vernes SC 16 , Lama TM 17 , Grutzner F 18 , Warren WC 19 , Balakrishnan CN 20 , Burt D 21 , George JM 22 , Biegler MT 11 , Iorns D 23 , Digby A 24 , Eason D 24 , Robertson B 25 , Edwards T 26 , Wilkinson M 27 , Turner G 28 , Meyer A 29 , Kautt AF 29 , Franchini P 29 , Detrich HW 30 , Svardal H 31 , Wagner M 32 , Naylor GJP 33 , Pippel M 14 , Malinsky M 6 , Mooney M 34 , Simbirsky M 7 , Hannigan BT 7 , Pesout T 35 , Houck M 36 , Misuraca A 36 , Kingan SB 37 , Hall R 37 , Kronenberg Z 37 , Sović I 37 , Dunn C 37 , Ning Z 6 , Hastie A 38 , Lee J 38 , Selvaraj S 39 , Green RE 35 , Putnam NH 40 , Gut I 41 , Ghurye J 42 , Garrison E 35 , Sims Y 6 , Collins J 6 , Pelan S 6 , Torrance J 6 , Tracey A 6 , Wood J 6 , Dagnew RE 10 , Guan D 2 , London SE 43 , Clayton DF 44 , Mello CV 45 , Friedrich SR 45 , Lovell PV 45 , Osipova E 14 , Al-Ajli FO 46 , Secomandi S 47 , Kim H 8 , Theofanopoulou C 11 , Hiller M 48 , Zhou Y 49 , Harris RS 50 , Makova KD 50 , Medvedev P 51 , Hoffman J 12 , Masterson P 12 , Clark K 12 , Martin F 13 , Howe K 13 , Flicek P 13 , Walenz BP 1 , Kwak W 52 , Clawson H 35 , Diekhans M 35 , Nassar L 35 , Paten B 35 , Kraus RHS 29 , Crawford AJ 53 , Gilbert MTP 54 , Zhang G 55 , Venkatesh B 56 , Murphy RW 57 , Koepfli KP 58 , Shapiro B 59 , Johnson WE 58 , Di Palma F 60 , Marques-Bonet T 61 , Teeling EC 62 , Warnow T 63 , Graves JM 64 , Ryder OA 36 , Haussler D 35 , O'Brien SJ 65 , Korlach J 37 , Lewin HA 4 , Howe K 66 , Myers EW 67 , Durbin R 68 , Phillippy AM 69 , Jarvis ED 70

Affiliations 

  • 1 Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
  • 2 Department of Genetics, University of Cambridge, Cambridge, UK
  • 3 Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
  • 4 The Genome Center, University of California Davis, Davis, CA, USA
  • 5 Leibniz Institute for Zoo and Wildlife Research, Department of Evolutionary Genetics, Berlin, Germany
  • 6 Wellcome Sanger Institute, Cambridge, UK
  • 7 DNAnexus Inc., Mountain View, CA, USA
  • 8 Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
  • 9 Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea
  • 10 University of Southern California, Los Angeles, CA, USA
  • 11 Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
  • 12 National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD, USA
  • 13 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
  • 14 Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  • 15 Novogene, Durham, NC, USA
  • 16 Neurogenetics of Vocal Communication Group, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
  • 17 University of Massachusetts Cooperative Fish and Wildlife Research Unit, Amherst, MA, USA
  • 18 School of Biological Science, The Environment Institute, University of Adelaide, Adelaide, South Australia, Australia
  • 19 Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
  • 20 Department of Biology, East Carolina University, Greenville, NC, USA
  • 21 UQ Genomics, University of Queensland, Brisbane, Queensland, Australia
  • 22 Department of Biological Sciences, Clemson University, Clemson, SC, USA
  • 23 The Genetic Rescue Foundation, Wellington, New Zealand
  • 24 Kākāpō Recovery, Department of Conservation, Invercargill, New Zealand
  • 25 Department of Zoology, University of Otago, Dunedin, New Zealand
  • 26 University of Arizona Genetics Core, Tucson, AZ, USA
  • 27 Department of Life Sciences, Natural History Museum, London, UK
  • 28 School of Natural Sciences, Bangor University, Gwynedd, UK
  • 29 Department of Biology, University of Konstanz, Konstanz, Germany
  • 30 Department of Marine and Environmental Sciences, Northeastern University Marine Science Center, Nahant, MA, USA
  • 31 Department of Biology, University of Antwerp, Antwerp, Belgium
  • 32 Institute of Biology, Karl-Franzens University of Graz, Graz, Austria
  • 33 Florida Museum of Natural History, University of Florida, Gainesville, FL, USA
  • 34 Tag.bio, San Francisco, CA, USA
  • 35 UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
  • 36 San Diego Zoo Global, Escondido, CA, USA
  • 37 Pacific Biosciences, Menlo Park, CA, USA
  • 38 Bionano Genomics, San Diego, CA, USA
  • 39 Arima Genomics, San Diego, CA, USA
  • 40 Independent Researcher, Santa Cruz, CA, USA
  • 41 CNAG-CRG, Centre for Genomic Regulation, Barcelona Institute of Science and Technology, Barcelona, Spain
  • 42 Dovetail Genomics, Santa Cruz, CA, USA
  • 43 Department of Psychology, Institute for Mind and Biology, University of Chicago, Chicago, IL, USA
  • 44 Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA
  • 45 Department of Behavioral Neuroscience, Oregon Health and Science University, Portland, OR, USA
  • 46 Monash University Malaysia Genomics Facility, School of Science, Selangor Darul Ehsan, Malaysia
  • 47 Department of Biosciences, University of Milan, Milan, Italy
  • 48 LOEWE Centre for Translational Biodiversity Genomics, Frankfurt, Germany
  • 49 BGI-Shenzhen, Shenzhen, China
  • 50 Department of Biology, Pennsylvania State University, University Park, PA, USA
  • 51 Center for Medical Genomics, Pennsylvania State University, University Park, PA, USA
  • 52 eGnome, Inc., Seoul, Republic of Korea
  • 53 Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
  • 54 Center for Evolutionary Hologenomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
  • 55 China National Genebank, BGI-Shenzhen, Shenzhen, China
  • 56 Institute of Molecular and Cell Biology, A*STAR, Biopolis, Singapore, Singapore
  • 57 Centre for Biodiversity, Royal Ontario Museum, Toronto, Ontario, Canada
  • 58 Smithsonian Conservation Biology Institute, Center for Species Survival, National Zoological Park, Washington, DC, USA
  • 59 Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, USA
  • 60 Department of Biological Sciences, Earlham Institute, University of East Anglia, Norwich, UK
  • 61 Institute of Evolutionary Biology (UPF-CSIC), PRBB, Barcelona, Spain
  • 62 School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
  • 63 Department of Computer Science, The University of Illinois at Urbana-Champaign, Urbana, IL, USA
  • 64 School of Life Science, La Trobe University, Melbourne, Victoria, Australia
  • 65 Laboratory of Genomics Diversity-Center for Computer Technologies, ITMO University, St. Petersburg, Russian Federation
  • 66 Wellcome Sanger Institute, Cambridge, UK. kj2@sanger.ac.uk
  • 67 Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany. gene@mpi-cbg.de
  • 68 Department of Genetics, University of Cambridge, Cambridge, UK. rd109@cam.ac.uk
  • 69 Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. adam.phillippy@nih.gov
  • 70 Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA. ejarvis@rockefeller.edu
Nature, 2021 Apr;592(7856):737-746.
PMID: 33911273 DOI: 10.1038/s41586-021-03451-0

Abstract

High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1-4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.
