Fin whales (Balaenoptera physalus) and blue whales (B. musculus) are the two largest species on Earth and are widely distributed across the world's oceans. Hybrids between these species appear to be relatively widespread and have been reported in both the North Atlantic and North Pacific; they are also relatively common, and have been proposed to occur once in every thousand fin whales. However, despite known hybridization, fin and blue whales are not sibling species. Rather, the closest living relative of the fin whale is the humpback whale (Megaptera novaeangliae). To improve the quality of fin whale data available for analysis, we assembled and annotated a fin whale nuclear genome using in silico mate-pair libraries and previously published short-read data. Using this assembly and genomic data from a humpback, a blue, and a bowhead whale, we investigated whether signatures of introgression between the fin and blue whale could be found. We find no signatures of contemporary admixture in the fin and blue whale genomes, although our analyses support ancestral gene flow between the species until 2.4-1.3 Ma. We propose the following explanations for our findings: i) fin/blue whale hybridization does not occur in the populations our samples originate from, ii) contemporary hybrids are a recent phenomenon and the genetic consequences have yet to become widespread across populations, or iii) fin/blue whale hybrids are under strong negative selection, preventing them from backcrossing and contributing to the parental gene pools.
Single-exon coding sequences (CDSs), also known as 'single-exon genes' (SEGs), are defined as nuclear, protein-coding genes that lack introns in their CDSs. They have been studied not only to determine their origin and evolution but also because their expression has been linked to several types of human cancers and neurological/developmental disorders, and many exhibit tissue-specific transcription. We developed SinEx DB, which houses DNA and protein sequence information of SEGs from 10 mammalian genomes including human. SinEx DB includes their functional predictions (KOG (euKaryotic Orthologous Groups)) and the relative distribution of these functions within species. Here, we report SinEx 2.0, a major update of SinEx DB that includes information on the occurrence, distribution and functional prediction of SEGs from 60 completely sequenced eukaryotic genomes, representing animals, fungi, protists and plants. The information is stored in a relational database built with MySQL Server 5.7, and the complete dataset of SEG sequences and their GO (Gene Ontology) functional assignments are available for downloading. SinEx DB 2.0 was built with a novel pipeline that helps disambiguate single-exon isoforms from SEGs. SinEx DB 2.0 is the largest available database for SEGs and provides a rich source of information for advancing our understanding of the evolution and function of SEGs and their associations with disorders including cancers and neurological and developmental diseases. Database URL: http://v2.sinex.cl/.
Efficient extraction of knowledge from biological data requires the development of structured vocabularies to unambiguously define biological terms. This paper proposes descriptions and definitions to disambiguate the term 'single-exon gene'. Eukaryotic Single-Exon Genes (SEGs) have been defined as genes that do not have introns in their protein coding sequences. They have been studied not only to determine their origin and evolution but also because their expression has been linked to several types of human cancer and neurological/developmental disorders and many exhibit tissue-specific transcription. Unfortunately, the term 'SEGs' is rife with ambiguity, leading to biological misinterpretations. In the classic definition, no distinction is made between SEGs that harbor introns in their untranslated regions (UTRs) and those that do not. This distinction is important to make because the presence of introns in UTRs affects transcriptional regulation and post-transcriptional processing of the mRNA. In addition, recent whole-transcriptome shotgun sequencing has led to the discovery of many examples of single-exon mRNAs that arise from alternative splicing of multi-exon genes; these single-exon isoforms are being confused with SEGs despite their clearly different origin. The increasing expansion of RNA-seq datasets makes it imperative to distinguish the different SEG types before annotation errors become indelibly propagated in biological databases. This paper develops a structured vocabulary for their disambiguation, allowing a major reassessment of their evolutionary trajectories, regulation, RNA processing and transport, and provides the opportunity to improve the detection of gene associations with disorders including cancers, neurological and developmental diseases.
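The distinctions drawn in the abstract above (introns in the CDS, introns only in the UTRs, and single-exon isoforms of multi-exon genes) can be illustrated with a small sketch. The `Transcript` record, the `classify_gene` function and the four category labels are hypothetical names chosen for this example; they are not the vocabulary proposed in the paper, and the sketch assumes every transcript passed in is protein-coding.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    exon_count: int      # exons in the whole mRNA, UTRs included
    cds_exon_count: int  # exons that overlap the coding sequence


def classify_gene(transcripts):
    """Assign a gene to one of four illustrative categories.

    The labels are placeholders for this sketch, not the paper's terms.
    """
    if not transcripts:
        raise ValueError("a gene needs at least one transcript")
    if all(t.cds_exon_count == 1 for t in transcripts):
        # No transcript has an intron inside its CDS: an SEG in the classic sense.
        if all(t.exon_count == 1 for t in transcripts):
            return "SEG without UTR introns"
        return "SEG with UTR introns"
    if any(t.exon_count == 1 for t in transcripts):
        # At least one transcript has a CDS intron, yet another is a single exon.
        return "multi-exon gene with a single-exon isoform"
    return "multi-exon gene"


print(classify_gene([Transcript(exon_count=2, cds_exon_count=1)]))
```

The point of the sketch is that the classification is made per gene, over all of its transcripts, so a single-exon mRNA alone is not enough to call a gene an SEG.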
The narwhal (Monodon monoceros) is a highly specialized endemic Arctic cetacean, restricted to the Arctic seas bordering the North Atlantic. Low levels of genetic diversity have been observed across several narwhal populations using mitochondrial DNA and microsatellites. Despite this, the global abundance of narwhals was recently estimated at ∼170,000 individuals. However, the species is still considered vulnerable to changing climates due to its high specialization and restricted Arctic distribution. We assembled and annotated a genome from a narwhal from West Greenland. We find relatively low diversity at the genomic scale and show that this did not arise by recent inbreeding, but rather has been stable over an extended evolutionary timescale. We also find that the current large global abundance most likely reflects a recent rapid expansion from a much smaller founding population.
Copepoda is one of the most ecologically important animal groups on Earth, yet very few genetic resources are available for this subclass. Here, we present the first whole genome sequence (WGS, acc. UYDY01) and the first mRNA transcriptome assembly (TSA, acc. GHAJ01) for the tropical cyclopoid copepod species Apocyclops royi. Until now, only the 18S small subunit ribosomal RNA gene and the COI gene have been available from A. royi, and WGS resources were only available from one other cyclopoid copepod species. Overall, the provided resources make A. royi the 8th copepod species with WGS resources available and the 19th copepod species with TSA information available. We analyze the length and GC content of the provided WGS scaffolds as well as the coverage and gene content of both the WGS and the TSA assembly. Finally, we place the resources within the copepod order Cyclopoida as a member of the Apocyclops genus. We estimate the total genome size of A. royi at 450 Mb, with 181 Mb assembled nonrepetitive sequence, 76 Mb assembled repeats and 193 Mb unassembled sequence. The TSA assembly consists of 29,737 genes and an additional 45,756 isoforms. In the WGS and TSA assemblies, >80% and >95% of core genes can be found, respectively, though many in fragmented versions. The provided resources will allow researchers to conduct physiological experiments on A. royi, and also increase the possibilities for copepod gene set analysis, as they add substantially to the copepod datasets available.
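The three components of the A. royi genome size estimate reported above sum to the stated total. The short check below reproduces that arithmetic and expresses each component as a share of the total; the variable and function names are illustrative only, and the figures are taken from the abstract.

```python
# Component sizes in Mb, as reported in the abstract above.
components_mb = {
    "assembled nonrepetitive": 181,
    "assembled repeats": 76,
    "unassembled": 193,
}


def summarize(components):
    """Return the total size and each component's share of that total."""
    total = sum(components.values())
    shares = {name: size / total for name, size in components.items()}
    return total, shares


total_mb, shares = summarize(components_mb)
print(f"total: {total_mb} Mb")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```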
Mobile genetic elements (MGEs) are instrumental in natural prokaryotic genome editing, permitting genome plasticity and allowing microbes to accumulate genetic diversity. MGEs serve as a vast communal gene pool and include DNA elements such as plasmids and bacteriophages (phages) among others. These mobile DNA elements represent a human health risk as they can introduce new traits, such as antibiotic resistance or virulence, to a bacterial strain. Sequencing libraries targeting environmental circular MGEs, referred to as metamobilomes, may broaden our current understanding of the mechanisms behind the mobility, prevalence and content of these elements. However, metamobilomics is affected by a severe bias towards small circular elements, introduced by multiple displacement amplification (MDA). MDA is typically used to overcome limiting DNA quantities after the removal of non-circular DNA during library preparations. By examining the relationship between sequencing coverage and the size of circular MGEs in paired metamobilome datasets with and without MDA, we show that larger circular elements are lost when using MDA. This study is the first to systematically demonstrate that MDA is detrimental to detecting larger-sized plasmids if small plasmids are present. It is also the first to show that MDA can be omitted when using enzyme-based DNA fragmentation and PCR in library preparation kits such as Nextera XT® from Illumina.
Recent advances in machine learning and natural language processing have made it possible to profoundly advance our ability to accurately predict protein structures and their functions. While such improvements are significantly impacting the fields of biology and biotechnology at large, such methods have the downside of high demands in terms of computing power and runtime, hampering their applicability to large datasets. Here, we present NetSurfP-3.0, a tool for predicting solvent accessibility, secondary structure, structural disorder and backbone dihedral angles for each residue of an amino acid sequence. This NetSurfP update exploits recent advances in pre-trained protein language models to drastically improve the runtime of its predecessor by two orders of magnitude, while displaying similar prediction performance. We assessed the accuracy of NetSurfP-3.0 on several independent test datasets and found it to consistently produce state-of-the-art predictions for each of its output features, with a runtime that is up to 600 times faster than the most commonly available methods performing the same tasks. The tool is freely available as a web server with a user-friendly interface to navigate the results, as well as a standalone downloadable package.
Plant-derived terpenoids are extensively used in the perfume, food, cosmetic and pharmaceutical industries, and several attempts are being made to produce terpenes in heterologous hosts. Native hosts have evolved to accumulate large quantities of terpenes in specialized cells. However, heterologous cells lack the capacity needed to produce and store high amounts of non-native terpenes, leading to reduced growth and loss of volatile terpenes by evaporation. Here, we describe how to direct production of the sesquiterpene patchoulol into cytoplasmic lipid droplets (LDs) in Physcomitrium patens (syn. Physcomitrella patens), by attaching patchoulol synthase (PTS) to proteins linked to plant LD biogenesis. Three different LD proteins, Oleosin (PpOLE1), Lipid Droplet Associated Protein (AtLDAP1) and Seipin (PpSeipin325), were tested as anchors. Ectopic expression of PTS increased the number and size of LDs, implying an unknown mechanism linking heterologous terpene production and LD biogenesis. The expression of PTS physically linked to Seipin increased the LD size and the retention of patchoulol in the cell. Overall, the expression of PTS was lower in the anchored mutants than in the control, but when normalized to expression level, the production of patchoulol was higher in the Seipin-linked mutants.
The African hunting dog (Lycaon pictus, 2n = 78) once ranged over most sub-Saharan ecosystems except deserts and rainforests. However, as a result of (still ongoing) population declines, today it remains only as small fragmented populations. Furthermore, the future of the species remains unclear, due to both anthropogenic pressure and interactions with domestic dogs; its preservation is therefore a conservation priority. On the tree of life, the hunting dog is basal to Canis and Cuon and forms a crown group with them, making it a useful species for comparative genomic studies. Here, we present a diploid chromosome-level assembly of an African hunting dog. Assembled according to Vertebrate Genomes Project guidelines from a combination of PacBio HiFi reads and HiC data, it is phased at the level of individual chromosomes. The maternal (pseudo)haplotype (mat) of our assembly has a length of 2.38 Gbp, and 99.36% of the sequence is encompassed by 39 chromosomal scaffolds. The rest is included in only 36 unplaced short scaffolds. At the contig level, the mat consists of only 166 contigs with an N50 of 39 Mbp. BUSCO (Benchmarking Universal Single-Copy Orthologue) analysis showed 95.4% completeness based on conserved Carnivora genes (carnivora_odb10). When compared with other available genomes from subtribe Canina, the quality of the assembly is excellent, typically ranking between first and third depending on the parameter used, and it is a significant improvement on previously published genomes for the species. We hope this assembly will play an important role in future conservation efforts and comparative studies of canid genomes.
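Several of these abstracts report assembly contiguity as an N50 value. As a reminder of what that statistic measures, here is a minimal sketch under the usual definition (the length L such that sequences of length at least L together cover at least half of the total assembly). The function name and the toy lengths are made up for illustration and are not taken from any of the assemblies described here.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


toy_contigs = [50, 40, 30, 20, 10]  # made-up lengths, not from the assembly
print(n50(toy_contigs))  # prints 40
```

In the toy case the total is 150, so the cumulative sum first reaches half (75) at the second-longest contig, giving an N50 of 40.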
Donkeys and horses share a common ancestor dating back to about 4 million years ago. Although a high-quality genome assembly at the chromosomal level is available for the horse, current assemblies available for the donkey are limited to moderately sized scaffolds. The absence of a better-quality assembly for the donkey has hampered studies involving the characterization of patterns of genetic variation at the genome-wide scale. These range from the application of genomic tools to selective breeding and conservation to the more fundamental characterization of the genomic loci underlying speciation and domestication. We present a new high-quality donkey genome assembly obtained using the Chicago HiRise assembly technology, providing scaffolds of subchromosomal size. We make use of this new assembly to obtain more accurate measures of heterozygosity for equine species other than the horse, both genome-wide and locally, and to detect runs of homozygosity potentially pertaining to positive selection in domestic donkeys. Finally, this new assembly allowed us to identify fine-scale chromosomal rearrangements between the horse and the donkey that likely played an active role in their divergence and, ultimately, speciation.
We characterized the complete genome sequence of the lytic Salmonella enterica bacteriophage PRF-SP1, isolated from Penang National Park, a conserved rainforest in northern Malaysia. The novel phage species from the Autographiviridae family has a 39,966-bp double-stranded DNA (dsDNA) genome containing 49 protein-encoding genes and shares 90.96% similarity with Escherichia phage DY1.
Members of the crustacean subclass Copepoda are likely the most abundant metazoans worldwide. Pelagic marine species are critical in converting planktonic microalgae to animal biomass, supporting oceanic food webs. Despite their abundance and ecological importance, only six copepod genomes are publicly available, owing to a number of factors including large genome size, repetitiveness, GC-content, and small animal size. Here, we report the seventh representative copepod genome and the first genome and the first transcriptome from the calanoid copepod species Acartia tonsa Dana, which is among the most numerous mesozooplankton in boreal coastal and estuarine waters. The ecology, physiology, and behavior of A. tonsa have been studied extensively. The genetic resources contributed in this work will allow researchers to link experimental results to molecular mechanisms. From PCR-free whole genome sequence and mRNA Illumina data, we assemble the largest copepod genome to date. We estimate that A. tonsa has a total genome size of 2.5 Gb including repetitive elements we could not resolve. The nonrepetitive fraction of the genome assembly is estimated to be 566 Mb. Our DNA sequencing-based analyses suggest there is a 14-fold difference in genome size between the six members of Copepoda with available genomic information. This finding complements nucleus staining genome size estimations, where a 100-fold difference has been reported within 70 species. We briefly analyze the repeat structure in the existing copepod whole genome sequence data sets. The information presented here confirms the evolution of genome size in Copepoda and expands the scope for evolutionary inferences in Copepoda by providing several levels of genetic information from a key planktonic crustacean species.
The American mink (Neovison vison) is a semiaquatic species of mustelid native to North America. It is an important animal for the fur industry. Many efforts have been made to locate genes influencing fur quality and color, but this search has been impeded by the lack of a reference genome. Here we present the first draft genome of the mink. In our study, two mink individuals were sequenced by Illumina sequencing, generating 797 Gb of sequence. Assembly yielded 7,175 scaffolds with an N50 of 6.3 Mb and a total length of 2.4 Gb including gaps. Repeat sequences constitute around 31% of the genome, which is lower than for the dog and cat genomes. The alignments of the mink, ferret and dog genomes help to illustrate chromosomal rearrangements. Gene annotation identified 21,053 protein-coding sequences in the mink genome. The reference genome's structure is consistent with the microsatellite-based genetic map. Mapping of well-studied genes known to be involved in coat quality and coat color, and of previously located fur quality QTL, provides new knowledge about putative candidate genes for fur traits. The draft genome shows great potential to facilitate genomic research towards improved breeding for high fur quality animals and to strengthen our understanding of the evolution of Carnivora.
Spiders (Araneae) have a diverse spectrum of morphologies, behaviors, and physiologies. Attempts to understand the genomic basis of this diversity are often hindered by their large, heterozygous, and AT-rich genomes with high repeat content, resulting in highly fragmented, poor-quality assemblies. As a result, the key attributes of spider genomes, including gene family evolution, repeat content, and gene function, remain poorly understood. Here, we used Illumina and Dovetail Chicago technologies to sequence the genome of the long-jawed spider Tetragnatha kauaiensis, producing an assembly distributed along 3,925 scaffolds with an N50 of ∼2 Mb. Using comparative genomics tools, we explore genome evolution across available spider assemblies. Our findings suggest that the previously reported and vast genome size variation in spiders is linked to the different representation and number of transposable elements. Using statistical tools to uncover gene-family level evolution, we find expansions associated with the sensory perception of taste, immunity, and metabolism. In addition, we report strikingly different histories of chemosensory, venom, and silk gene families, with the first two evolving much earlier, affected by the ancestral whole genome duplication in Arachnopulmonata (∼450 Ma) and exhibiting higher numbers. Together, our findings reveal that spider genomes are highly variable and that genomic novelty may have been driven by the burst of an ancient whole genome duplication, followed by gene family and transposable element expansion.
The ability to predict local structural features of a protein from the primary sequence is of paramount importance for unraveling its function in the absence of experimental structural information. Two main factors affect the utility of potential prediction tools: their accuracy must enable extraction of reliable structural information on the proteins of interest, and their runtime must be low to keep pace with sequencing data being generated at a constantly increasing speed. Here, we present NetSurfP-2.0, a novel tool that can predict the most important local structural features with unprecedented accuracy and runtime. NetSurfP-2.0 is sequence-based and uses an architecture composed of convolutional and long short-term memory neural networks trained on solved protein structures. Using a single integrated model, NetSurfP-2.0 predicts solvent accessibility, secondary structure, structural disorder, and backbone dihedral angles for each residue of the input sequences. We assessed the accuracy of NetSurfP-2.0 on several independent test datasets and found it to consistently produce state-of-the-art predictions for each of its output features. We observe a correlation of 80% between predictions and experimental data for solvent accessibility, and a precision of 85% on secondary structure 3-class predictions. In addition to improved accuracy, the processing time has been optimized to allow predicting more than 1000 proteins in less than 2 hours, and complete proteomes in less than 1 day.
Salmonella infections across the globe are becoming more challenging to control due to the emergence of multidrug-resistant (MDR) strains. Lytic phages may be suitable alternatives for treating these MDR Salmonella infections. Most Salmonella phages to date were collected from human-impacted environments. To further explore the Salmonella phage space, and to potentially identify phages with novel characteristics, we characterized Salmonella-specific phages isolated from the Penang National Park, a conserved rainforest. Four phages with a broad lytic spectrum (killing >5 Salmonella serovars) were further characterized; they have isometric heads and cone-shaped tails, and genomes of ~39,900 bp, encoding 49 CDSs. As the genomes share <95% sequence similarity with known genomes, the phages were classified as a new species within the genus Kayfunavirus. Interestingly, the phages displayed obvious differences in their lytic spectrum and pH stability, despite having a high sequence similarity (~99% ANI). Subsequent analysis revealed that the phages differed in the nucleotide sequences of the tail spike proteins, tail tubular proteins, and portal proteins, suggesting that these SNPs were responsible for their differing phenotypes. Our findings highlight the diversity of novel Salmonella bacteriophages from rainforest regions, which can be explored as antimicrobial agents against MDR Salmonella strains.
Here, we present the complete genome of a plant growth-promoting strain, Bacillus stratosphericus AIMST-CREST02 isolated from the bulk soil of a high-yielding paddy plot. The genome is 3,840,451 bp in size with a GC content of 41.25%. Annotation predicted the presence of 3,907 coding sequences, including genes involved in auxin biosynthesis regulation and gamma-aminobutyric acid (GABA) metabolism.
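GC content, reported for the genome above, is the fraction of G and C bases among all called bases. A minimal sketch of the calculation follows; the function name and the toy sequence are illustrative, and the sketch assumes ambiguous bases such as N should be left out of the denominator, which is one common convention rather than a detail taken from the abstract.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases among A, C, G, T in a DNA string."""
    seq = sequence.upper()
    # Count only unambiguous bases so that N and other symbols are ignored.
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / counted


toy = "ATGCGCTAAT"  # made-up sequence, not from the genome
print(f"{gc_content(toy):.2%}")  # prints 40.00%
```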
We present the complete genome of a potential plant growth-promoting bacterium, Bacillus altitudinis AIMST-CREST03, isolated from a high-yielding paddy plot. The genome is 3,669,202 bp in size with a GC content of 41%. Annotation predicted 3,327 coding sequences, including several genes required for plant growth promotion.
The diverse array of phenotypes and courtship displays exhibited by birds-of-paradise has long fascinated scientists and nonscientists alike. Remarkably, almost nothing is known about the genomics of this iconic radiation. There are 41 species in 16 genera currently recognized within the birds-of-paradise family (Paradisaeidae), most of which are endemic to the island of New Guinea. In this study, we sequenced genomes of representatives from all five major clades within this family to characterize genomic changes that may have played a role in the evolution of the group's extensive phenotypic diversity. We found genes important for coloration, morphology, and feather and eye development to be under positive selection. In birds-of-paradise with complex lekking systems and strong sexual dimorphism, the core birds-of-paradise, we found Gene Ontology categories for "startle response" and "olfactory receptor activity" to be enriched among the gene families expanding significantly faster than in the other birds in our study. Furthermore, we found novel families of retrovirus-like retrotransposons active in all three de novo genomes since the early diversification of the birds-of-paradise group, which might have played a role in the evolution of this fascinating group of birds.
Tropical islands are renowned as natural laboratories for evolutionary study. Lineage radiations across tropical archipelagos are ideal systems for investigating how colonization, speciation, and extinction processes shape biodiversity patterns. The expansion of the island thrush across the Indo-Pacific represents one of the largest yet most perplexing island radiations of any songbird species. The island thrush exhibits a complex mosaic of pronounced plumage variation across its range and is arguably the world's most polytypic bird. It is a sedentary species largely restricted to mountain forests, yet it has colonized a vast island region spanning a quarter of the globe. We conducted a comprehensive sampling of island thrush populations and obtained genome-wide SNP data, which we used to reconstruct its phylogeny, population structure, gene flow, and demographic history. The island thrush evolved from migratory Palearctic ancestors and radiated explosively across the Indo-Pacific during the Pleistocene, with numerous instances of gene flow between populations. Its bewildering plumage variation masks a biogeographically intuitive stepping stone colonization path from the Philippines through the Greater Sundas, Wallacea, and New Guinea to Polynesia. The island thrush's success in colonizing Indo-Pacific mountains can be understood in light of its ancestral mobility and adaptation to cool climates; however, shifts in elevational range, degree of plumage variation and apparent dispersal rates in the eastern part of its range raise further intriguing questions about its biology.