Ancestry-informative markers (AIMs) can be used to infer an individual's ancestry and thereby reduce the inaccuracy of self-reported ethnicity in biomedical research. In this study, we describe three methods for selecting AIM SNPs for the Malay population (Malay AIM panels), based on pairwise FST, informativeness for assignment (In), and PCA-correlated SNPs (PCAIMs). These Malay AIM panels were extracted from SNP array genotype data hosted by the Malaysian node of the Human Variome Project (MyHVP) and the Singapore Genome Variation Project (SGVP). In total, genotype data from 165 Malay individuals were analyzed, comprising 117 individuals genotyped on the Affymetrix SNP-6 SNP array platform and 48 individuals genotyped on the OMNI 2.5 Illumina SNP array platform. The HapMap phase 3 database (1397 individuals from 11 populations) was used as a reference for comparison with the Malay genotype data. The accuracy of each resulting Malay AIM panel was evaluated using a machine learning "ancestry-predictive model" constructed with WEKA, a comprehensive machine learning platform written in Java. A final set of 1250 SNPs distinguished Malay individuals from other world populations with an accuracy of 90%. Accuracy decreased to 80% with 157 SNPs under the pairwise FST method, and a panel of 200 SNPs selected using In and PCAIMs identified Malay individuals with an accuracy of approximately 80%.
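As a rough illustration of the pairwise FST approach to marker selection, the sketch below ranks biallelic SNPs by Wright's FST computed from two populations' allele frequencies. It is a minimal sketch under stated assumptions, not the procedure used in the study: the estimator weights the two populations equally and applies no sample-size correction, and the function names, SNP identifiers, population labels, and frequencies are all hypothetical.

```python
def pairwise_fst(p1: float, p2: float) -> float:
    """Wright's FST for one biallelic SNP from two allele frequencies.

    Assumes equal population weights and no sample-size correction.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0  # SNP is monomorphic in both populations
    # Mean expected heterozygosity within the two populations.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def top_snps_by_fst(freqs, target, reference, n):
    """Return the n SNP ids with the highest target-vs-reference FST.

    freqs maps snp_id -> {population_label: allele_frequency}.
    """
    scored = [(pairwise_fst(f[target], f[reference]), snp)
              for snp, f in freqs.items()]
    scored.sort(reverse=True)
    return [snp for _, snp in scored[:n]]


# Hypothetical allele frequencies, for illustration only.
example_freqs = {
    "snp_a": {"POP_TARGET": 0.9, "POP_REF": 0.1},
    "snp_b": {"POP_TARGET": 0.5, "POP_REF": 0.5},
    "snp_c": {"POP_TARGET": 0.3, "POP_REF": 0.6},
}
panel = top_snps_by_fst(example_freqs, "POP_TARGET", "POP_REF", n=2)
print(panel)
```

A SNP with identical frequencies in both populations scores 0 and a SNP fixed for different alleles scores 1, so a ranking of this kind favors markers whose frequencies differ most between the target and reference populations.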
β-Thalassemia/HbE disease has a wide spectrum of clinical phenotypes, ranging from asymptomatic to dependent on regular blood transfusions. The ability to predict disease severity is helpful for clinical management and treatment decision making. A thalassemia severity score has been developed from Mediterranean β-thalassemia patients. However, different ethnic groups may have different allele frequencies and linkage disequilibrium structures. Here, genome-wide association study (GWAS) data from 487 Thai β0-thalassemia/HbE patients were analyzed with a SNP interaction prioritization algorithm, interacting Loci (iLoci), to find SNPs predictive of disease severity. Three SNPs from two SNP interaction pairs associated with disease severity were identified. The three-SNP disease severity risk score, composed of rs766432 in BCL11A, rs9399137 in HBS1L-MYB and rs72872548 in HBE1, showed more than 85% specificity and 75% accuracy. The three-SNP predictive score was then validated in two independent cohorts of Thai and Malaysian β0-thalassemia/HbE patients, with comparable specificity and accuracy. The SNP risk score could be used to predict clinical severity in the Southeast Asian β0-thalassemia/HbE population.
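The specificity and accuracy figures quoted above are standard confusion-matrix quantities. The sketch below shows how they would be computed from predicted and observed severity labels; it assumes a binary coding (1 = severe, 0 = mild), and the function name and the label vectors are hypothetical, not data from the study.

```python
def specificity_and_accuracy(y_true, y_pred):
    """Return (specificity, accuracy) for binary labels (1 = severe, 0 = mild)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tn + fp == 0:
        raise ValueError("specificity is undefined without true negatives or false positives")
    specificity = tn / (tn + fp)             # mild cases correctly called mild
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # all cases correctly called
    return specificity, accuracy


# Hypothetical labels, for illustration only.
observed = [1, 1, 1, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0]
spec, acc = specificity_and_accuracy(observed, predicted)
print(f"specificity={spec:.2f}, accuracy={acc:.2f}")
```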
Malay, the main ethnic group in Peninsular Malaysia, is represented by various sub-ethnic groups such as Melayu Banjar, Melayu Bugis, Melayu Champa, Melayu Java, Melayu Kedah, Melayu Kelantan, Melayu Minang and Melayu Patani. Using data retrieved from the MyHVP (Malaysian Human Variome Project) database, a total of 135 individuals from these sub-ethnic groups were profiled using the Affymetrix GeneChip Mapping Xba 50-K single nucleotide polymorphism (SNP) array to identify SNPs that were ancestry-informative markers (AIMs) for Malays of Peninsular Malaysia. Prior to selecting the AIMs, the genetic structure of Malays was explored with reference to 11 other populations obtained from the Pan-Asian SNP Consortium database using principal component analysis (PCA) and ADMIXTURE. Iterative pruning principal component analysis (ipPCA) was further used to identify sub-groups of Malays. Subsequently, we constructed an AIMs panel for Malays using the informativeness for assignment (In) of genetic markers, and the K-nearest neighbor (KNN) classifier was used to train the classification models. A model of 250 SNPs ranked by In correctly classified Malay individuals with an accuracy of up to 90%. The identified SNPs could be used as a panel of AIMs to ascertain the specific ancestry of Malays, which may be useful in disease association studies, biomedical research or forensic investigation.
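The two steps named above, ranking SNPs by informativeness for assignment and classifying individuals with KNN, can be sketched as follows. This is a minimal sketch, not the study's pipeline: In is computed for biallelic SNPs using the usual definition (the entropy of the mean allele frequencies minus the mean within-population entropy, with natural logarithms), the KNN step uses squared Euclidean distance on 0/1/2 genotype codes with a simple majority vote, and all identifiers, frequencies, genotypes, and labels are hypothetical.

```python
import math
from collections import Counter


def informativeness_for_assignment(freqs):
    """In for one biallelic SNP; freqs holds one allele frequency per population."""
    k = len(freqs)
    total = 0.0
    for allele in (0, 1):
        p_ij = [f if allele == 1 else 1.0 - f for f in freqs]
        p_j = sum(p_ij) / k  # mean frequency of this allele across populations
        if p_j > 0.0:
            total -= p_j * math.log(p_j)
        total += sum((p / k) * math.log(p) for p in p_ij if p > 0.0)
    return total


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training genotype vectors nearest to query."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical per-population allele frequencies for three SNPs.
snp_freqs = {
    "rs_hyp_1": [0.9, 0.1],
    "rs_hyp_2": [0.5, 0.5],
    "rs_hyp_3": [0.4, 0.6],
}
ranked = sorted(snp_freqs,
                key=lambda s: informativeness_for_assignment(snp_freqs[s]),
                reverse=True)
print(ranked)

# Hypothetical genotypes (0/1/2 copies of an allele) at two selected SNPs.
train_x = [[2, 2], [2, 1], [0, 0], [0, 1]]
train_y = ["Malay", "Malay", "Other", "Other"]
prediction = knn_predict(train_x, train_y, [2, 2], k=3)
print(prediction)
```

For two populations, In ranges from 0 (identical frequencies) to ln 2 (fixed for different alleles), so keeping the top-ranked SNPs retains the markers that best separate the populations before the classifier is trained.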
The Asian Diversity Project (ADP) assembled 37 cosmopolitan and ethnic minority populations in Asia that have been densely genotyped across over half a million markers to study patterns of genetic diversity and positive natural selection. We performed population structure analyses of the ADP populations and divided these populations into four major groups based on their genographic information. By applying a highly sensitive algorithm, haploPS, to locate genomic signatures of positive selection, we identified 140 distinct genomic regions exhibiting evidence of positive selection in at least one population. We examined the extent of signal sharing for regions that were selected in multiple populations and observed that populations clustered in a fashion similar to how the ancestry clades were phylogenetically defined. In particular, populations predominantly located in South Asia underwent considerably different adaptation compared with populations from the other geographical regions. Signatures of positive selection present in multiple geographical regions were predicted to be older and to have emerged prior to the separation of the populations in the different regions. In contrast, selection signals present in a single population group tended to be of lower frequencies and thus could be attributed to recent evolutionary events.