A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli

Habibi N; Mohd Hashim SZ; Norouzi A; Samian MR

doi:10.1186/1471-2105-15-134

Over the last 20 years in biotechnology, the production of recombinant proteins has been a crucial bioprocess in both the biopharmaceutical and research arenas in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low production levels and frequent failures for opaque reasons. The problem is accentuated because no generic solution is available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. Under given experimental conditions, including temperature, expression host, etc., protein solubility is a feature ultimately determined by its sequence. To date, numerous machine learning methods have been proposed to predict the solubility of a protein solely from its amino acid sequence. Despite 20 years of research on the matter, no comprehensive review of the published methods is available.

