Current strategies to address data scarcity in artificial intelligence-based drug discovery: A comprehensive review

Gangwal A; Ansari A; Ahmad I; Azad AK; Wan Sulaiman WMA

doi:10.1016/j.compbiomed.2024.108734

Current strategies to address data scarcity in artificial intelligence-based drug discovery: A comprehensive review

Gangwal A ¹ , Ansari A ² , Ahmad I ³ , Azad AK ⁴ , Wan Sulaiman WMA ⁵

Affiliations

¹ Department of Natural Product Chemistry, Shri Vile Parle Kelavani Mandal's Institute of Pharmacy, Dhule, 424001, Maharashtra, India. Electronic address: gangwal.amit@gmail.com
² Computer Aided Drug Design Center, Shri Vile Parle Kelavani Mandal's Institute of Pharmacy, Dhule, 424001, Maharashtra, India
³ Department of Pharmaceutical Chemistry, Prof. Ravindra Nikam College of Pharmacy, Gondur, Dhule, 424002, Maharashtra, India. Electronic address: ansariiqrar50@gmail.com
⁴ Faculty of Pharmacy, University College of MAIWP International, Batu Caves, 68100, Kuala Lumpur, Malaysia. Electronic address: azad2011iium@gmail.com
⁵ Faculty of Pharmacy, University College of MAIWP International, Batu Caves, 68100, Kuala Lumpur, Malaysia. Electronic address: drwanazizi@ucmi.edu.my

Comput Biol Med, 2024 Sep;179:108734.

PMID: 38964243 DOI: 10.1016/j.compbiomed.2024.108734

Abstract

Artificial intelligence (AI) has played a vital role in computer-aided drug design (CADD). This development has been further accelerated with the increasing use of machine learning (ML), mainly deep learning (DL), and computing hardware and software advancements. As a result, initial doubts about the application of AI in drug discovery have been dispelled, leading to significant benefits in medicinal chemistry. At the same time, it is crucial to recognize that AI is still in its infancy and faces a few limitations that need to be addressed to harness its full potential in drug discovery. Some notable limitations are insufficient, unlabeled, and non-uniform data, the resemblance of some AI-generated molecules with existing molecules, unavailability of inadequate benchmarks, intellectual property rights (IPRs) related hurdles in data sharing, poor understanding of biology, focus on proxy data and ligands, lack of holistic methods to represent input (molecular structures) to prevent pre-processing of input molecules (feature engineering), etc. The major component in AI infrastructure is input data, as most of the successes of AI-driven efforts to improve drug discovery depend on the quality and quantity of data, used to train and test AI algorithms, besides a few other factors. Additionally, data-gulping DL approaches, without sufficient data, may collapse to live up to their promise. Current literature suggests a few methods, to certain extent, effectively handle low data for better output from the AI models in the context of drug discovery. These are transferring learning (TL), active learning (AL), single or one-shot learning (OSL), multi-task learning (MTL), data augmentation (DA), data synthesis (DS), etc. One different method, which enables sharing of proprietary data on a common platform (without compromising data privacy) to train ML model, is federated learning (FL). In this review, we compare and discuss these methods, their recent applications, and limitations while modeling small molecule data to get the improved output of AI methods in drug discovery. Article also sums up some other novel methods to handle inadequate data.

* Title and MeSH Headings from MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine.

MeSH terms

Similar publications