Affiliations 

  • 1 PAP Rashidah Sa'adatul Bolkiah Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam. Electronic address: saeedmomo@hotmail.com
  • 2 Institute of Applied Data Analytics, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
  • 3 Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Istanbul, Turkey
  • 4 PAP Rashidah Sa'adatul Bolkiah Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
  • 5 Faculty of Pharmacy, Quest International University, Ipoh, Malaysia
  • 6 Faculty of Data Science and Information Technology, INTI International University, Nilai, Malaysia
  • 7 PAP Rashidah Sa'adatul Bolkiah Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam. Electronic address: longchiauming@gmail.com
J Mol Graph Model, 2022 Dec;117:108307.
PMID: 36096064 DOI: 10.1016/j.jmgm.2022.108307

Abstract

This study applied two state-of-the-art methods: a Laplacian scoring algorithm, used for gene selection, and the Gini coefficient, used to identify genes whose expression varies least across a large set of samples. Neither method has previously been trialed for feasibility in cheminformatics. This is a first attempt at a complete comparative analysis of a virtual combinatorial library based on anthraquinone and chalcone derivatives. This computational "proof-of-concept" study illustrates a combinatorial approach for subjecting the structures of selected natural products (NPs) to molecular diversity analysis. A virtual combinatorial library (1.6 M) based on 20 anthraquinones and 24 chalcones was enumerated. The resulting compounds were optimized toward near-drug-like properties, and physicochemical descriptors were calculated for all datasets, including FDA, Non-FDA, and NPs from ZINC 15. UMAP and PCA were applied to compare and represent the chemical space coverage of each dataset. Subsequently, the Laplacian score and the Gini coefficient were applied for feature selection and for assessing selectivity among properties, respectively. Finally, we demonstrated the diversity between the datasets by employing the Murcko and central scaffold systems, calculating three fingerprint descriptors, and analyzing their diversity by PCA and SOM. The optimized enumeration resulted in 1,610,268 compounds with mean NP-likeness and synthetic feasibility scores close to those of the FDA, Non-FDA, and NPs datasets. The overlap between the chemical space of the 1.6 M database was more prominent than with the NPs dataset. The Laplacian score prioritized the NP-likeness and hydrogen bond acceptor properties (1.0 and 0.923, respectively), while the Gini coefficient showed that all properties have selective effects on the datasets (0.81-0.93). Scaffold and fingerprint analyses ranked the tested datasets in descending order of diversity as FDA, Non-FDA, NPs, and 1.6 M.
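To make the feature-selection step more concrete, the following is a minimal sketch of a Laplacian score computation in the commonly cited formulation of He, Cai and Niyogi, where a lower score means a feature better preserves the local neighborhood structure. It is an illustration under stated assumptions, not the authors' implementation: the function name `laplacian_scores`, the k-nearest-neighbor graph, the heat-kernel width `t`, and the default parameter values are all hypothetical choices. The abstract does not state which normalization or ranking convention produced the reported values (1.0 and 0.923).

```python
import numpy as np


def laplacian_scores(X, n_neighbors=5, t=1.0):
    """Return one Laplacian score per column of X (lower = more locality-preserving).

    X is an (n_samples, n_features) array of descriptors. The neighbor count
    and kernel width are illustrative assumptions, not values from the paper.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances between samples.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

    # k-nearest-neighbor adjacency, excluding self and symmetrized.
    sq_no_self = sq.copy()
    np.fill_diagonal(sq_no_self, np.inf)
    nearest = np.argsort(sq_no_self, axis=1)[:, :n_neighbors]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), n_neighbors), nearest.ravel()] = True
    adj = adj | adj.T

    # Heat-kernel weights on graph edges, degree vector, graph Laplacian.
    S = np.where(adj, np.exp(-sq / t), 0.0)
    d = S.sum(axis=1)
    L = np.diag(d) - S

    scores = []
    for f in X.T:
        # Remove the degree-weighted mean of the feature.
        f_c = f - (f @ d) / d.sum()
        denom = f_c @ (d * f_c)
        scores.append((f_c @ L @ f_c) / denom if denom > 0 else np.nan)
    return np.array(scores)
```

In this sketch, a descriptor that varies smoothly among neighboring compounds but strongly across the whole dataset receives a low score, so ranking features in ascending order of score gives one possible prioritization.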
Virtual combinatorial libraries based on NPs can be considered a source of combinatorial compounds with NP-likeness properties. Furthermore, molecular diversity should be measured by several different methods to allow for comparison and better judgment.
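For readers unfamiliar with the Gini coefficient used above, here is a minimal sketch of the standard rank-based formula for a vector of non-negative values: 0 means the values are perfectly even, and values approaching 1 mean a few entries dominate. The function name `gini` is hypothetical, and how the authors mapped physicochemical properties and datasets onto the input vector is not described in the abstract, so this shows only the generic calculation.

```python
import numpy as np


def gini(values):
    """Gini coefficient of a 1-D collection of non-negative values."""
    x = np.sort(np.asarray(values, dtype=float))
    if x.size == 0 or np.any(x < 0):
        raise ValueError("values must be non-empty and non-negative")
    total = x.sum()
    if total == 0:
        # All zeros: treat as perfectly even.
        return 0.0
    n = x.size
    ranks = np.arange(1, n + 1)
    # Rank-weighted sum over the ascending-sorted values.
    return float(np.sum((2 * ranks - n - 1) * x) / (n * total))
```

As a usage example, `gini([1, 1, 1, 1])` gives 0 for a perfectly even vector, while `gini([0, 0, 0, 1])` gives 0.75, the maximum for four entries.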
