RESULTS: We present Seqping, an automated gene prediction pipeline that uses self-training HMMs and transcriptomic data. The pipeline processes the genome and transcriptome sequences of the target species using the GlimmerHMM, SNAP, and AUGUSTUS pipelines, followed by the MAKER2 program, which combines predictions from the three tools with the transcriptomic evidence. Seqping generates species-specific HMMs that are able to offer unbiased gene predictions. The pipeline was evaluated using the Oryza sativa and Arabidopsis thaliana genomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the pipeline was able to identify at least 95% of BUSCO's plantae dataset. Our evaluation shows that Seqping generated better gene predictions than three HMM-based programs (MAKER2, GlimmerHMM and AUGUSTUS) run with their respective available HMMs. Seqping had the highest accuracy in rice (0.5648 for CDS, 0.4468 for exon, and 0.6695 for nucleotide structure) and A. thaliana (0.5808 for CDS, 0.5955 for exon, and 0.8839 for nucleotide structure).
CONCLUSIONS: Seqping provides researchers with a seamless pipeline to train species-specific HMMs and predict genes in newly sequenced or less-studied genomes. We conclude that the Seqping pipeline predictions are more accurate than those of the other three approaches run with their default or available HMMs.
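The abstract reports accuracy at the CDS, exon, and nucleotide levels but does not define the metric here. As a minimal sketch, assuming the common gene-prediction convention of exon-level sensitivity (fraction of reference exons predicted exactly) and specificity (fraction of predicted exons that are correct), the comparison could look like the following. The function name, the tuple layout, and the coordinates are hypothetical and are not taken from Seqping.

```python
def exon_level_metrics(reference_exons, predicted_exons):
    """Return (sensitivity, specificity) for exact-match exons.

    Each exon is a (contig, start, end, strand) tuple; this layout is an
    assumption made for illustration only.
    """
    ref = set(reference_exons)
    pred = set(predicted_exons)
    true_pos = len(ref & pred)  # exons whose boundaries match exactly
    sensitivity = true_pos / len(ref) if ref else 0.0
    # In gene-prediction benchmarking, "specificity" conventionally means
    # TP / (TP + FP), i.e. what is elsewhere called precision.
    specificity = true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity


# Made-up coordinates, not data from the study.
reference = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 650, "+"),
    ("chr1", 800, 900, "+"),
]
predicted = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 410, "+"),  # wrong 3' boundary, so not an exact match
    ("chr1", 500, 650, "+"),
]
sensitivity, specificity = exon_level_metrics(reference, predicted)
```

Exact-boundary matching is the strictest choice; an overlap-based criterion would give higher values and may be closer to a nucleotide-level measure.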
RESULTS: In this study, we generated Whole Exome Sequencing (WES), Reduced Representation Bisulfite Sequencing (RRBS) and RNA sequencing (RNA-seq) data from samples with known mixtures of mouse and human DNA or RNA and from a cohort of human breast cancers and their derived PDTXs. We show that using an In silico Combined human-mouse Reference Genome (ICRG) for alignment discriminates between human and mouse reads with up to 99.9% accuracy and decreases the number of false positive somatic mutations caused by misalignment by >99.9%. We also derived a model to estimate the human DNA content in independent PDTX samples. For RNA-seq and RRBS data analysis, the use of the ICRG allows the transcriptome and methylome of human tumour cells and mouse stroma to be dissected computationally. In a direct comparison with previously reported approaches, our method showed similar or higher accuracy while requiring significantly less computing time.
CONCLUSIONS: The computational pipeline we describe here is a valuable tool for the molecular analysis of PDTXs as well as any other mixture of DNA or RNA species.
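The idea of separating species after alignment to a combined reference can be illustrated with a minimal sketch. It assumes that contigs from each species were given a distinguishing prefix before the two references were concatenated, and that low mapping-quality alignments are treated as ambiguous. The prefixes, the MAPQ cutoff, and the function names are hypothetical, and the simple read fraction computed here is not the model for human DNA content derived in the study.

```python
from collections import Counter

# Hypothetical contig-name prefixes: assumes each species' contigs were
# renamed before the two references were concatenated into one FASTA.
HUMAN_PREFIX = "human_"
MOUSE_PREFIX = "mouse_"


def classify_alignment(contig, mapq, min_mapq=20):
    """Assign one alignment to 'human', 'mouse' or 'ambiguous'."""
    if mapq < min_mapq:  # assumed cutoff; low MAPQ suggests a non-unique hit
        return "ambiguous"
    if contig.startswith(HUMAN_PREFIX):
        return "human"
    if contig.startswith(MOUSE_PREFIX):
        return "mouse"
    return "ambiguous"


def summarise(alignments, min_mapq=20):
    """Count reads per class and return the human share of assigned reads."""
    counts = Counter(
        classify_alignment(contig, mapq, min_mapq)
        for _name, contig, mapq in alignments
    )
    assigned = counts["human"] + counts["mouse"]
    human_fraction = counts["human"] / assigned if assigned else float("nan")
    return counts, human_fraction


# Made-up (read name, contig, MAPQ) records, not data from the study.
alignments = [
    ("read1", "human_chr1", 60),
    ("read2", "mouse_chr5", 60),
    ("read3", "human_chr17", 3),
    ("read4", "human_chrX", 42),
]
counts, human_fraction = summarise(alignments)
```

In practice the records would come from a BAM file produced by the aligner rather than from an in-memory list; plain tuples are used here only to keep the sketch free of external dependencies.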
RESULTS: In the genomic analysis, 33 homozygous and 1377 heterozygous mutations were detected in the coding sequences of the genome of the MT strain. Among the heterozygous mutations, the proportion of mutated reads differed from gene to gene, ranging from 21 to 75%. These results suggest that the MT strain may contain multiple nuclei carrying different mutations. We tried to isolate haploid spores from the MT strain to determine its ploidy, but this strain did not sporulate under the conditions tested. Heterozygous mutations detected in genes that are important for sporulation likely contribute to the sporulation deficiency of the MT strain. Homozygous and heterozygous mutations were found in genes encoding enzymes involved in amino acid metabolism, the TCA cycle, purine and pyrimidine nucleotide metabolism and the DNA mismatch repair system. One homozygous mutation was found in the AgILV2 gene, which encodes acetohydroxyacid synthase, a mitochondrial flavoprotein. Gene ontology (GO) enrichment analysis showed heterozygous mutations in all 22 DNA helicase genes and in genes involved in the oxidation-reduction process.
CONCLUSION: This study suggests that oxidative stress and the aging of cells were involved in riboflavin over-production in the A. gossypii riboflavin over-producing mutant. It provides new insights into riboflavin production in A. gossypii and into the usefulness of disparity mutagenesis for creating new types of mutants for metabolic engineering.
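The proportion of mutated reads reported in the results above (21 to 75% for heterozygous mutations) is a per-site ratio of reads supporting the mutation to all reads covering the site. A minimal sketch of that arithmetic follows; the 0.95 cutoff for calling a site homozygous, the function names, and the read counts are assumptions for illustration and do not come from the study.

```python
def mutated_read_fraction(ref_reads, alt_reads):
    """Fraction of reads at a site that carry the mutation."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def call_zygosity(fraction, hom_threshold=0.95):
    """Label a site using an assumed (hypothetical) homozygosity cutoff."""
    return "homozygous" if fraction >= hom_threshold else "heterozygous"


# Made-up (reference reads, mutated reads) counts, not data from the study.
sites = {
    "site_a": (79, 21),
    "site_b": (25, 75),
    "site_c": (0, 40),
}
calls = {
    name: call_zygosity(mutated_read_fraction(ref, alt))
    for name, (ref, alt) in sites.items()
}
```

In a strictly diploid single-nucleus genome, heterozygous sites would be expected to cluster near a fraction of 0.5; a wide spread such as the one reported is consistent with the authors' suggestion of multiple nuclei carrying different mutations.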