Understanding the relationship between genotype and phenotype is essential for disentangling biological mechanisms and unraveling the molecular basis of disease. Genes and proteins are closely linked in biological systems, yet genomics and proteomics have developed into two separate disciplines. Crosstalk between scientists in the two domains remains limited, which hinders the integration of both fields into a single, informative data modality. The emerging field of proteogenomics attempts to bridge the two disciplines. This review first briefly discusses how genomics and transcriptomics data in different formats can be used to support proteogenomics applications. The larger part of the review then focuses on proteogenomics research articles published in the last five years and addresses two questions. First, how can proteogenomics be applied to tackle biological problems? Here, genome annotation and precision medicine are covered. Second, what are the latest developments in analytical technologies for data acquisition and in bioinformatics tools for interpreting and visualizing proteogenomics data?
One of the best-established areas within multi-omics is proteogenomics, for which the underpinning technologies are next-generation sequencing (NGS) and mass spectrometry (MS). Proteogenomics has contributed significantly to genome (re-)annotation, in which novel coding sequences (CDS) are identified and confirmed. By incorporating in silico translated genome variants into protein databases, single amino acid variants (SAAVs) and splice proteoforms can be identified and quantified at the peptide level. In cancer research, proteogenomics potentially enables the identification of patient-specific proteoforms and the association of therapy efficacy or resistance with specific mutations. Here, we discuss how NGS and third-generation sequencing (TGS) data are analyzed and incorporated into the proteogenomic framework. These sequence data mainly originate from whole genome sequencing (WGS), whole exome sequencing (WES) and RNA-Seq. We explain two major strategies for sequence analysis, i.e., de novo assembly and read mapping, followed by the construction of customized protein databases from such data. We also describe the procedure for matching spectra to peptide sequences in proteogenomics and the relationship between database size and the false discovery rate (FDR). Finally, we discuss the latest developments in proteogenomics-assisted precision oncology, along with the challenges and opportunities in proteogenomics research.
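To make the idea of a customized protein database concrete, the following minimal Python sketch applies one single amino acid variant to a reference protein and lists the tryptic peptides that exist only in the variant form, which are the peptides a proteogenomic search would need in its database to detect that variant. It is an illustration under stated assumptions, not the workflow of any particular tool: the function names (`apply_saav`, `tryptic_digest`, `variant_specific_peptides`) are hypothetical, the sequence is a toy example rather than a real protein, and the digestion rule is the simplified trypsin rule (cleave after K or R unless followed by P) with no missed cleavages.

```python
import re


def apply_saav(protein: str, position: int, ref: str, alt: str) -> str:
    """Return the protein with the residue at a 1-based position changed from ref to alt."""
    if not 1 <= position <= len(protein):
        raise ValueError("position lies outside the protein sequence")
    if protein[position - 1] != ref:
        raise ValueError("reference residue does not match the sequence")
    return protein[:position - 1] + alt + protein[position:]


def tryptic_digest(protein: str) -> list[str]:
    """Simplified in silico trypsin digest: cleave after K or R unless the next
    residue is P. Missed cleavages are not modelled (an assumption of this sketch)."""
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", protein) if pep]


def variant_specific_peptides(reference: str, variant: str) -> list[str]:
    """Peptides produced by the variant protein that the reference does not produce."""
    reference_peptides = set(tryptic_digest(reference))
    return [pep for pep in tryptic_digest(variant) if pep not in reference_peptides]


# Toy reference sequence (hypothetical, for illustration only).
reference_protein = "MKTAYIAKQRQISFVK"

# Hypothetical variant: tyrosine at position 5 replaced by cysteine (Y5C).
variant_protein = apply_saav(reference_protein, position=5, ref="Y", alt="C")

novel_peptides = variant_specific_peptides(reference_protein, variant_protein)
print(novel_peptides)
```

In practice, each variant-specific sequence would be appended to the reference database as an additional entry. Every such addition enlarges the search space, which is why database size and FDR have to be considered together.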