Materials and Methods: Original research studies associating genetic features with normal tissue complications following radiation therapy were identified from PubMed. The distribution of radiogenomic studies was determined by mining each study's statements of country of origin and racial/ancestral distribution, and whether race/ancestry was included in the analyses. Descriptive analyses were performed to determine the distribution of studies across races/ancestries, countries, and continents, and the inclusion of race/ancestry in analyses.
Results: Among 174 studies, only 23 included a population of more than one race/ancestry, and these were predominantly conducted in the United States. Across the continents, most studies were performed in Europe (77 studies, averaging 30.6 patients/million population [pt/mil]), followed by North America (46 studies, 20.8 pt/mil), Asia (46 studies, 2.4 pt/mil), South America (3 studies, 0.4 pt/mil), and Oceania (2 studies, 2.1 pt/mil); none were from Africa. All 23 studies with more than one race/ancestry considered race/ancestry as a covariate, and three studies showed race/ancestry to be significantly associated with endpoints.
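The pt/mil figures above normalise enrolled patients by continental population. As a minimal sketch of that arithmetic, assuming the metric is simply total patients divided by population in millions (the abstract does not spell out the exact formula), the function name and the input figures below are hypothetical and are not taken from the study:

```python
def patients_per_million(total_patients: int, population: int) -> float:
    """Return enrolled patients per million inhabitants (pt/mil).

    Assumption: pt/mil = total patients / (population / 1,000,000).
    """
    if population <= 0:
        raise ValueError("population must be positive")
    return total_patients / (population / 1_000_000)


# Hypothetical figures for illustration only; not data from the study.
example = patients_per_million(total_patients=5_000, population=250_000_000)
print(f"{example:.1f} pt/mil")  # prints "20.0 pt/mil"
```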
Conclusion: Most toxicity-related radiogenomic studies involved a single race/ancestry. Individual Participant Data meta-analyses or multinational studies need to be encouraged.
RESULTS: In this study, we generated Whole Exome Sequencing (WES), Reduced Representation Bisulfite Sequencing (RRBS) and RNA sequencing (RNA-seq) data from samples with known mixtures of mouse and human DNA or RNA, and from a cohort of human breast cancers and their derived PDTXs. We show that using an In silico Combined human-mouse Reference Genome (ICRG) for alignment discriminates between human and mouse reads with up to 99.9% accuracy and decreases the number of false positive somatic mutations caused by misalignment by >99.9%. We also derived a model to estimate the human DNA content in independent PDTX samples. For RNA-seq and RRBS data analysis, the use of the ICRG allows the transcriptome and methylome of human tumour cells and mouse stroma to be dissected computationally. In a direct comparison with previously reported approaches, our method showed similar or higher accuracy while requiring significantly less computing time.
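The idea of separating species after alignment to a combined reference can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: it assumes that contigs in the combined reference carry hypothetical species prefixes (`hs_` and `mm_`), that each alignment has already been reduced to a `(read_id, contig, mapq)` tuple, and that reads below a mapping-quality threshold are set aside as ambiguous. The human fraction computed here is a naive read proportion, not the model for human DNA content described in the abstract.

```python
from collections import Counter


def classify_reads(alignments, human_prefix="hs_", mouse_prefix="mm_", min_mapq=1):
    """Count reads by species from alignments to a combined reference.

    alignments: iterable of (read_id, contig, mapq) tuples.
    The prefixes and the MAPQ threshold are hypothetical conventions.
    """
    counts = Counter()
    for _read_id, contig, mapq in alignments:
        if mapq < min_mapq:
            # Low mapping quality: the read fits several places (possibly
            # both species) about equally well, so do not assign it.
            counts["ambiguous"] += 1
        elif contig.startswith(human_prefix):
            counts["human"] += 1
        elif contig.startswith(mouse_prefix):
            counts["mouse"] += 1
        else:
            counts["other"] += 1
    return counts


def human_fraction(counts):
    """Naive proportion of confidently assigned reads that are human."""
    assigned = counts["human"] + counts["mouse"]
    return counts["human"] / assigned if assigned else float("nan")


# Toy input for illustration only.
toy = [
    ("r1", "hs_chr1", 60),
    ("r2", "mm_chr2", 60),
    ("r3", "hs_chr3", 0),   # ambiguous: MAPQ below threshold
    ("r4", "hs_chrX", 40),
]
toy_counts = classify_reads(toy)
print(dict(toy_counts), human_fraction(toy_counts))
```

In practice the tuples would come from a SAM/BAM parser; the pure-Python form is used here only to keep the sketch self-contained.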
CONCLUSIONS: The computational pipeline we describe here is a valuable tool for the molecular analysis of PDTXs as well as any other mixture of DNA or RNA species.
RESULTS: We present an automated gene prediction pipeline, Seqping, which uses self-training HMMs and transcriptomic data. The pipeline processes the genome and transcriptome sequences of the target species using the GlimmerHMM, SNAP, and AUGUSTUS pipelines, followed by the MAKER2 program, which combines predictions from the three tools with the transcriptomic evidence. Seqping generates species-specific HMMs that are able to offer unbiased gene predictions. The pipeline was evaluated using the Oryza sativa and Arabidopsis thaliana genomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the pipeline was able to identify at least 95% of BUSCO's plantae dataset. Our evaluation shows that Seqping generated better gene predictions than three HMM-based programs (MAKER2, GlimmerHMM and AUGUSTUS) using their respective available HMMs. Seqping had the highest accuracy in rice (0.5648 for CDS, 0.4468 for exon, and 0.6695 for nucleotide structure) and A. thaliana (0.5808 for CDS, 0.5955 for exon, and 0.8839 for nucleotide structure).
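The abstract does not define how its accuracy values were computed. As a hedged illustration of how exon-level agreement between predicted and reference annotations is commonly scored, the sketch below counts exact-coordinate matches and reports sensitivity and precision. The function name, the tuple layout `(chrom, start, end, strand)`, and the toy coordinates are assumptions made for this example and should not be read as the measure used for Seqping.

```python
def exon_metrics(reference, predicted):
    """Exact-match exon sensitivity and precision.

    reference, predicted: iterables of (chrom, start, end, strand) tuples.
    An exon counts as correct only if all four fields match exactly.
    """
    ref = set(reference)
    pred = set(predicted)
    true_positives = len(ref & pred)
    # Sensitivity: share of reference exons recovered by the predictor.
    sensitivity = true_positives / len(ref) if ref else 0.0
    # Precision: share of predicted exons present in the reference.
    precision = true_positives / len(pred) if pred else 0.0
    return sensitivity, precision


# Toy annotations for illustration only.
reference_exons = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr2", 50, 150, "-"),
    ("chr2", 500, 600, "-"),
]
predicted_exons = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 410, "+"),  # wrong end coordinate, so not a match
    ("chr2", 50, 150, "-"),
]
sens, prec = exon_metrics(reference_exons, predicted_exons)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

Exact matching is strict by design: a single shifted boundary makes an exon count as both a miss and a false prediction, which is one reason exon-level scores tend to be lower than nucleotide-level ones.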
CONCLUSIONS: Seqping provides researchers with a seamless pipeline to train species-specific HMMs and predict genes in newly sequenced or less-studied genomes. We conclude that the Seqping pipeline predictions are more accurate than gene predictions from the other three approaches using their default or available HMMs.