For a given DNA sequence, it is well known that pair wise alignment schemes are used to determine the similarity with the DNA sequences available in the databanks. The efficiency of the alignment decides the type of amino acids and its corresponding proteins. In order to evaluate the given DNA sequence for its proteomic identity, a pattern matching approach is proposed in this paper. A block based semi-global alignment scheme is introduced to determine the similarity between the DNA sequences (known and given). The two DNA sequences are divided into blocks of equal length and alignment is performed which minimizes the computational complexity. The efficiency of the alignment scheme is evaluated using the parameter, percentage of similarity (POS). Four essential DNA version of the amino acids that emphasize the importance of proteomic functionalities are chosen as patterns and matching is performed with the known and given DNA sequences to determine the similarity between them. The ratio of amino acid counts between the two sequences is estimated and the results are compared with that of the POS value. It is found from the experimental results that higher the POS value and the pattern matching higher are the similarity between the two DNA sequences. The optimal block is also identified based on the POS value and amino acids count.
Single-nucleotide polymorphisms (SNPs) are the most common genetic variations for various complex human diseases, including cancers. Genome-wide association studies (GWAS) have identified numerous SNPs that increase cancer risks, such as breast cancer, colorectal cancer, and leukemia. These SNPs were cataloged for scientific use. However, GWAS are often conducted on certain populations in which the Orang Asli and Malays were not included. Therefore, we have developed a bioinformatic pipeline to mine the whole-genome sequence databases of the Orang Asli and Malays to determine the presence of pathogenic SNPs that might increase the risks of cancers among them. Five different in silico tools, SIFT, PROVEAN, Poly-Phen-2, Condel, and PANTHER, were used to predict and assess the functional impacts of the SNPs. Out of the 80 cancer-related nsSNPs from the GWAS dataset, 52 nsSNPs were found among the Orang Asli and Malays. They were further analyzed using the bioinformatic pipeline to identify the pathogenic variants. Three nsSNPs; rs1126809 (TYR), rs10936600 (LRRC34), and rs757978 (FARP2), were found as the most damaging cancer pathogenic variants. These mutations alter the protein interface and change the allosteric sites of the respective proteins. As TYR, LRRC34, and FARP2 genes play important roles in numerous cellular processes such as cell proliferation, differentiation, growth, and cell survival; therefore, any impairment on the protein function could be involved in the development of cancer. rs1126809, rs10936600, and rs757978 are the important pathogenic variants that increase the risks of cancers among the Orang Asli and Malays. The roles and impacts of these variants in cancers will require further investigations using in vitro cancer models.
Matched MeSH terms: Databases, Genetic/statistics & numerical data