In structural biology, similarity analysis of protein structure is a crucial step in studying the relationship between proteins. Despite the considerable number of techniques that have been explored within the past two decades, the development of new alternative methods is still an active research area due to the need for high performance tools.
A drastic improvement in the analysis of gene expression has led to new discoveries in bioinformatics research. Fuzzy clustering algorithms are widely used to analyse gene expression data. However, the analyses produced by these algorithms can make it unclear which function is dominant for a gene of interest. In addition, current fuzzy clustering algorithms do not thoroughly analyse genes with low membership values. Therefore, we present a novel computational framework called the "multi-stage filtering-Clustering Functional Annotation" (msf-CluFA) for clustering gene expression data. The framework consists of four components: fuzzy c-means clustering (msf-CluFA-0), achieving dominant cluster (msf-CluFA-1), improving confidence level (msf-CluFA-2) and a combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2 (msf-CluFA-3). By employing double filtering in msf-CluFA-1 and apriori algorithms in msf-CluFA-2, the framework can determine the dominant clusters and improve the confidence level of genes with lower membership values, so that unknown genes can be predicted.
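The membership values that fuzzy c-means assigns can be illustrated with a minimal sketch. It uses the standard fuzzy c-means membership formula with fixed cluster centers; the function name, the fuzzifier default `m=2.0`, and the toy two-dimensional profile are assumptions made for illustration and are not taken from msf-CluFA.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster (m > 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if 0.0 in dists:
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]

# Hypothetical expression profile and two hypothetical cluster centers.
memberships = fcm_memberships((1.0, 1.0), [(0.0, 0.0), (4.0, 4.0)])
# Index of the cluster with the highest membership (the "dominant" cluster here).
dominant = max(range(len(memberships)), key=memberships.__getitem__)
```

In this toy case the point is three times closer to the first center, so its memberships come out near 0.9 and 0.1. A gene whose memberships are spread almost evenly across clusters is the low-membership case that the abstract singles out.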
Understanding the mechanisms of gene regulation during breast cancer is one of the most difficult problems for oncologists because this regulation is likely composed of complex genetic interactions. Given this complexity, a computational study using the Bayesian network technique has been employed to construct a gene regulatory network from microarray data. Although the Bayesian network has been recognized as a prominent method to infer gene regulatory processes, learning the Bayesian network structure is NP-hard and computationally demanding. Therefore, we propose a novel inference method based on low-order conditional independence that extends to the case of the Bayesian network to deal with a large number of genes and an insufficient sample size. This method has been evaluated and compared with full-order conditional independence and different prognostic indices on a publicly available breast cancer data set. Our results suggest that the low-order conditional independence method can handle a large number of genes in a small sample size with the least mean square error. In addition, the proposed method performs significantly better than other methods, including the full-order conditional independence and the St. Gallen consensus criteria. The proposed method achieved an area under the ROC curve of 0.79203, whereas the full-order conditional independence and the St. Gallen consensus criteria obtained 0.76438 and 0.73810, respectively. Furthermore, our empirical evaluation using the low-order conditional independence method has revealed a promising relationship between six gene regulators and two regulated genes, which will be further investigated as potential breast cancer metastasis prognostic markers.
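The idea of conditioning on only one other gene at a time can be sketched with a first-order partial correlation. This is a common statistic for first-order conditional independence under a Gaussian assumption; the abstract does not say which test the authors use, so the functions and the toy data below are illustrative assumptions only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y given a single variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical data: regulator z drives both x and y; e1 and e2 are
# independent perturbations, so x and y are linked only through z.
z = [-1, -1, -1, -1, 1, 1, 1, 1]
e1 = [1, 1, -1, -1, 1, 1, -1, -1]
e2 = [1, -1, 1, -1, 1, -1, 1, -1]
x = [2 * zi + e for zi, e in zip(z, e1)]
y = [2 * zi + e for zi, e in zip(z, e2)]

marginal = pearson(x, y)             # strong marginal correlation
conditional = partial_corr(x, y, z)  # vanishes once z is accounted for
```

Here x and y look strongly correlated on their own, but the correlation disappears after conditioning on z, so a direct edge between them would not be supported. Each such test involves only three variables, which is why low-order tests remain feasible when genes far outnumber samples.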
A genetic similarity algorithm is introduced in this study to find a group of semantically similar Gene Ontology terms. The genetic similarity algorithm combines a semantic similarity measure algorithm with a parallel genetic algorithm. The semantic similarity measure algorithm is used to compute the strength of similarity between Gene Ontology terms. The parallel genetic algorithm is then employed to perform batch retrieval and to accelerate the search in the large search space of the Gene Ontology graph. The genetic similarity algorithm is implemented in a Gene Ontology browser named basic UTMGO to overcome the weaknesses of existing Gene Ontology browsers, which use a conventional approach based on keyword matching. To show the applicability of the basic UTMGO, we extend its structure to develop a Gene Ontology-based protein sequence annotation tool named extended UTMGO. The objective of developing the extended UTMGO is to provide a simple and practical tool that produces better results within a reasonable running time and at low computing cost, specifically for offline usage. The computational results and a comparison with other related tools are presented to show the effectiveness of the proposed algorithm and tools.
Path testing is the basic approach of white-box testing; it works by finding input data in the search space that cover the paths of the software under test. Due to increasing software complexity, exhaustive testing is computationally infeasible. The ultimate challenge is to generate suitable test data that maximize coverage, and many approaches have been developed by researchers to accomplish path coverage. This paper proposes a hybrid method (NSA-GA) based on the Negative Selection Algorithm (NSA) and the Genetic Algorithm (GA) to generate optimal test data that cover all possible paths without duplication. The proposed method uses GA to modify the generation of detectors in the generation phase of NSA and develops a fitness function based on the prioritization of paths. Different benchmark programs with different data types have been used. The results show that the hybrid method improved the path coverage percentage of the programs, even for complicated paths, minimized the number of generated test data, and enhanced efficiency even as the input range of the different data types increased. This method improves the effectiveness and efficiency of test data generation and maximizes the search space area, increasing the percentage of path coverage while preventing redundant data.
Protein structure alignment and comparison methods that are based on an alphabetical representation of protein structure are simpler to run and faster to evaluate; however, their accuracy is not as reliable as that of three-dimensional (3D) tools. As a 1D method candidate, TS-AMIR used the alphabetic representation of secondary-structure elements (SSEs) of proteins and compared the letters assigned to each SSE using the n-gram method. Although the results were comparable to those obtained via geometrical methods, the SSE length and the accuracy of adjacency between SSEs were not considered in the comparison process. Therefore, to obtain further information on the accuracy of adjacency between SSE vectors, a new approach of assigning text to vectors according to the spherical coordinate system was adopted in the present study. Moreover, dynamic programming was applied in order to account for the length of SSE vectors. Five common datasets were selected for method evaluation. The first three datasets were small but difficult to align, and the remaining two datasets were used to compare the capability of the proposed method with that of other methods on a large protein dataset. The results showed that the proposed method, as a text-based alignment approach, obtained results comparable to both 1D and 3D methods. It outperformed 1D methods in terms of accuracy and 3D methods in terms of runtime.
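A text-based comparison of this kind can be sketched by scoring two SSE strings on the n-grams they share. The Jaccard score, the function names, and the `H`/`E` letter strings below are assumptions chosen for illustration; the abstract does not give the scoring function used by TS-AMIR or by the proposed method.

```python
def ngrams(seq, n):
    """All contiguous substrings of length n, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def ngram_similarity(a, b, n=2):
    """Jaccard similarity between the n-gram sets of two SSE strings."""
    set_a, set_b = set(ngrams(a, n)), set(ngrams(b, n))
    if not set_a and not set_b:
        return 1.0  # both strings are shorter than n: treat as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical SSE strings: H = helix, E = strand.
score = ngram_similarity("HEEHH", "HEEHE", n=2)
```

The two strings share three of four distinct bigrams, giving a score of 0.75. Because each letter stands for a whole element, a score like this ignores element length, which is the limitation the abstract addresses with dynamic programming.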
Metabolic engineering is a research field that focuses on the design of models for metabolism and uses computational procedures to suggest genetic manipulations. It aims to improve the yield of particular chemical or biochemical products. Several traditional metabolic engineering methods are commonly used to increase the production of a desired target, but the products are always far below their theoretical maximums. Numerical optimisation algorithms used to identify gene knockouts may stall at a local minimum of a multivariable function. This paper proposes a hybrid of the artificial bee colony (ABC) algorithm and the minimisation of metabolic adjustment (MOMA) to predict an optimal set of solutions in order to optimise the production rate of succinate and lactate. The dataset used in this work was from the iJO1366 Escherichia coli metabolic network. The experimental results include the production rate, growth rate and a list of knockout genes. In a comparative analysis, ABCMOMA produced better results than previous works, showing potential for solving genetic engineering problems.
Gene expression data are expected to be of significant help in the development of efficient cancer diagnosis and classification platforms. Recently, many researchers have analyzed gene expression data using various computational intelligence methods in order to select a small subset of informative genes for cancer classification. However, due to the small number of samples compared to the huge number of genes (high dimension), irrelevant genes, and noisy genes, many of these computational methods have difficulty selecting the small subset. Thus, we propose an improved (modified) binary particle swarm optimization to select a small subset of informative genes that is relevant for cancer classification. In the proposed method, we introduce a particle speed that gives the rate at which a particle changes its position, and we propose a rule for updating particles' positions. By performing experiments on ten different gene expression datasets, we have found that the proposed method is superior to previous related works, including the conventional version of binary particle swarm optimization (BPSO), in terms of classification accuracy and the number of selected genes. The proposed method also has lower running times than BPSO.
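For orientation, one update step of conventional BPSO, the baseline named above, can be sketched as follows. This is not the modified update rule proposed in the paper, which the abstract does not specify. The function name, the inertia weight `w`, the coefficients `c1` and `c2`, the clamp `v_max`, and the toy bit vectors are all illustrative assumptions.

```python
import math
import random

def bpso_step(position, velocity, pbest, gbest, rng,
              w=0.7, c1=2.0, c2=2.0, v_max=4.0):
    """One conventional BPSO update; each bit marks a gene as selected (1) or not (0)."""
    new_pos, new_vel = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        # Pull the velocity towards the personal-best and global-best bits.
        v = w * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
        v = max(-v_max, min(v_max, v))          # clamp the velocity
        prob_one = 1.0 / (1.0 + math.exp(-v))   # sigmoid transfer function
        new_vel.append(v)
        new_pos.append(1 if rng.random() < prob_one else 0)
    return new_pos, new_vel

rng = random.Random(0)       # seeded for reproducibility
position = [0, 1, 0, 1, 1]   # hypothetical 5-gene selection mask
velocity = [0.0] * 5
pbest = [1, 1, 0, 0, 1]
gbest = [1, 0, 0, 1, 1]
new_position, new_velocity = bpso_step(position, velocity, pbest, gbest, rng)
```

In this conventional form the velocity only sets the probability that a bit becomes 1. A zero velocity gives a probability of 0.5, so many genes tend to remain selected, which is one reason variants aimed at small gene subsets change the update rule.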
The development of accurate computational models of biological processes is fundamental to computational systems biology. These models are usually represented by mathematical expressions that rely heavily on the system parameters. The measurement of these parameters is often difficult. Therefore, they are commonly estimated by fitting the predicted model to the experimental data using optimization methods. The complexity and nonlinearity of the biological processes pose a significant challenge, however, to the development of accurate and fast optimization methods. We introduce a new hybrid optimization method incorporating the Firefly Algorithm and the evolutionary operation of the Differential Evolution method. The proposed method improves solutions by neighbourhood search using evolutionary procedures. Testing our method on models for the arginine catabolism and the negative feedback loop of the p53 signalling pathway, we found that it estimated the parameters with high accuracy and within a reasonable computation time compared to well-known approaches, including Particle Swarm Optimization, Nelder-Mead, and Firefly Algorithm. We have also verified the reliability of the parameters estimated by the method using an a posteriori practical identifiability test.
This paper presents an in silico optimization method for metabolic pathway production. The metabolic pathway can be represented by a mathematical model known as the generalized mass action model, which leads to a complex system of nonlinear equations. The optimization process becomes difficult when the steady state and the constraints of the components in the metabolic pathway are involved. To deal with this situation, this paper proposes the Newton Cooperative Genetic Algorithm (NCGA). The NCGA uses the Newton method to handle the metabolic pathway and then integrates a genetic algorithm with a cooperative co-evolutionary algorithm. The proposed method was applied experimentally to benchmark metabolic pathways, and the results showed that the NCGA achieved better results than the existing methods.
One of the key aspects of computational systems biology is the investigation on the dynamic biological processes within cells. Computational models are often required to elucidate the mechanisms and principles driving the processes because of the nonlinearity and complexity. The models usually incorporate a set of parameters that signify the physical properties of the actual biological systems. In most cases, these parameters are estimated by fitting the model outputs with the corresponding experimental data. However, this is a challenging task because the available experimental data are frequently noisy and incomplete. In this paper, a new hybrid optimization method is proposed to estimate these parameters from the noisy and incomplete experimental data. The proposed method, called Swarm-based Chemical Reaction Optimization, integrates the evolutionary searching strategy employed by the Chemical Reaction Optimization, into the neighbouring searching strategy of the Firefly Algorithm method. The effectiveness of the method was evaluated using a simulated nonlinear model and two biological models: synthetic transcriptional oscillators, and extracellular protease production models. The results showed that the accuracy and computational speed of the proposed method were better than the existing Differential Evolution, Firefly Algorithm and Chemical Reaction Optimization methods. The reliability of the estimated parameters was statistically validated, which suggests that the model outputs produced by these parameters were valid even when noisy and incomplete experimental data were used. Additionally, Akaike Information Criterion was employed to evaluate the model selection, which highlighted the capability of the proposed method in choosing a plausible model based on the experimental data. In conclusion, this paper presents the effectiveness of the proposed method for parameter estimation and model selection problems using noisy and incomplete experimental data. 
It is hoped that this study provides new insight into developing more accurate and reliable biological models based on limited and low-quality experimental data.
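The model-selection step based on the Akaike Information Criterion can be illustrated with a short sketch. It uses the least-squares form of AIC, which assumes independent Gaussian errors and drops additive constants; the residual sums of squares, data size, and parameter counts below are made-up numbers, not results from the paper.

```python
import math

def aic_least_squares(rss, n_points, n_params):
    """AIC for a least-squares fit: n * ln(RSS / n) + 2k (constants dropped)."""
    return n_points * math.log(rss / n_points) + 2 * n_params

# Two hypothetical candidate models fitted to the same 50 data points.
aic_small = aic_least_squares(rss=12.0, n_points=50, n_params=3)
aic_large = aic_least_squares(rss=11.5, n_points=50, n_params=6)

# The model with the lower AIC is preferred.
preferred = "small" if aic_small < aic_large else "large"
```

The larger model fits slightly better, but the penalty for its three extra parameters outweighs the gain, so the smaller model is preferred in this toy case.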
Reconstructions of genome-scale metabolic networks from different organisms have become popular in recent years. Metabolic engineering can simulate the reconstruction process to obtain desirable phenotypes. In previous studies, optimization algorithms have been implemented to identify near-optimal sets of knockout genes for improving metabolite production. However, previous works suffered from premature convergence, and the stopping criteria were not clear for each case. Therefore, this study proposes an algorithm that is a hybrid of the ant colony optimization algorithm and flux balance analysis (ACOFBA) to predict near-optimal sets of gene knockouts in an effort to maximize growth rates and the production of certain metabolites. Here, we present a case study that uses baker's yeast, also known as Saccharomyces cerevisiae, as the model organism and targets the rate of vanillin production for optimization. The results of this study are the growth rate of the model organism after gene deletion and a list of knockout genes. The ACOFBA algorithm was found to improve the yield of vanillin in terms of growth rate and production compared with previous algorithms.
When gene expression data are too large to be processed, they are transformed into a reduced representation set of genes. Transforming large-scale gene expression data into a set of genes is called feature extraction. If the genes extracted are carefully chosen, this gene set can extract the relevant information from the large-scale gene expression data, allowing further analysis by using this reduced representation instead of the full size data. In this paper, we review numerous software applications that can be used for feature extraction. The software reviewed is mainly for Principal Component Analysis (PCA), Independent Component Analysis (ICA), Partial Least Squares (PLS), and Local Linear Embedding (LLE). A summary and sources of the software are provided in the last section for each feature extraction method.
This paper presents a study on gene knockout strategies to identify candidate genes to be knocked out for improving the production of succinic acid in Escherichia coli. Succinic acid is widely used as a precursor for many chemicals, for example in the production of antibiotics, therapeutic proteins and food. However, the chemical synthesis of succinic acid using traditional methods usually results in production that is far below the theoretical maximum. In silico gene knockout strategies are commonly implemented to delete genes in E. coli to overcome this problem. In this paper, a hybrid of Ant Colony Optimization (ACO) and Minimization of Metabolic Adjustment (MoMA) is proposed to identify gene knockout strategies to improve the production of succinic acid in E. coli. The hybrid algorithm generated a list of knockout genes, together with the succinic acid production rate and growth rate of E. coli after gene knockout. The results of the hybrid algorithm were compared with those of the previous methods, OptKnock and MOMAKnock. The hybrid algorithm performed better than OptKnock and MOMAKnock in terms of the production rate. The results produced by the hybrid algorithm can be used in wet laboratory experiments to increase the production of succinic acid in E. coli.
Many biological research areas such as drug design require gene regulatory networks to provide clear insight and understanding of the cellular process in living cells. This is because interactions among the genes and their products play an important role in many molecular processes. A gene regulatory network can act as a blueprint for the researchers to observe the relationships among genes. Due to its importance, several computational approaches have been proposed to infer gene regulatory networks from gene expression data. In this review, six inference approaches are discussed: Boolean network, probabilistic Boolean network, ordinary differential equation, neural network, Bayesian network, and dynamic Bayesian network. These approaches are discussed in terms of introduction, methodology and recent applications of these approaches in gene regulatory network construction. These approaches are also compared in the discussion section. Furthermore, the strengths and weaknesses of these computational approaches are described.
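The simplest of the six approaches, the Boolean network, can be sketched in a few lines. Each gene is either on or off, and all genes are updated synchronously from the current state. The three genes and their rules below are hypothetical and chosen only to show the mechanics.

```python
def step(state, rules):
    """Synchronous update: every gene's next value is computed from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}

def trajectory(state, rules, max_steps=32):
    """Follow the network until a state repeats; return visited states and the repeat."""
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = step(state, rules)
    return seen, state

# Hypothetical three-gene network: C represses A, A activates B,
# and C needs both A and B to be on.
rules = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["A"] and s["B"],
}

start = {"A": True, "B": False, "C": False}
visited, repeated = trajectory(start, rules)
```

Starting from A on and B, C off, this toy network passes through five distinct states before returning to the start, so it settles into a cyclic attractor of length five. Inference methods work in the opposite direction, searching for rules that reproduce the state transitions observed in expression data.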
Microbial strain optimization focuses on improving the technological properties of a strain of microorganisms. However, the complexity of metabolic networks, which leads to data ambiguity, often makes the effects of genetic modification on the desired phenotypes difficult to predict. Furthermore, the vast number of reactions in cellular metabolism leads to a combinatorial problem in obtaining an optimal gene deletion strategy. Consequently, the computation time increases exponentially with the size of the problem. Hence, we propose an extension of a hybrid of the Bees Algorithm and Flux Balance Analysis (BAFBA) that integrates OptKnock into BAFBA to validate the results. This paper presents a number of computational experiments to test the performance and capability of BAFBA. Escherichia coli, Bacillus subtilis and Clostridium thermocellum are the model organisms in this paper. Also included are the identification of potential reactions to improve the production of succinic acid, lactic acid and ethanol, and a discussion of the changes in the flux distribution of the predicted mutants. BAFBA shows potential in suggesting non-intuitive gene knockout strategies and low variability among several runs. The results show that BAFBA is suitable, reliable and applicable in predicting optimal gene knockout strategies.
Gene expression data are likely to be of considerable help in developing efficient cancer diagnosis and classification platforms. Recently, many researchers have analyzed gene expression data using diverse computational intelligence methods to select a small subset of informative genes from the data for cancer classification. Many computational methods have difficulty selecting small subsets because of the small number of samples compared to the huge number of genes (high dimension), irrelevant genes, and noisy genes.
Microbial strain optimisation for the overproduction of a desired phenotype has been a popular topic in recent years. Gene knockout is a genetic engineering technique that can modify the metabolism of microbial cells to obtain desirable phenotypes. Optimisation algorithms have been developed to identify the effects of gene knockout. However, the complexities of metabolic networks have made the process of identifying the effects of genetic modification on desirable phenotypes challenging. Furthermore, a vast number of reactions in cellular metabolism often lead to a combinatorial problem in obtaining optimal gene knockout. The computational time increases exponentially as the size of the problem increases. This work reports an extension of Bees Hill Flux Balance Analysis (BHFBA) to identify optimal gene knockouts to maximise the production yield of desired phenotypes while sustaining the growth rate. This proposed method functions by integrating OptKnock into BHFBA for validating the results automatically. The results show that the extension of BHFBA is suitable, reliable, and applicable in predicting gene knockout. Through several experiments conducted on Escherichia coli, Bacillus subtilis, and Clostridium thermocellum as model organisms, extension of BHFBA has shown better performance in terms of computational time, stability, growth rate, and production yield of desired phenotypes.
Incorporation of pathway knowledge into microarray analysis has brought better biological interpretation of the analysis outcome. However, most pathway data are manually curated without a specific biological context. Non-informative genes could be included when the pathway data are used for the analysis of context-specific data such as cancer microarray data. Therefore, efficient identification of informative genes is essential. Embedded methods such as penalized classifiers have been used for microarray analysis because of their built-in gene selection. This paper proposes an improved penalized support vector machine with an absolute t-test weighting scheme to identify informative genes and pathways. Experiments were performed on four microarray data sets. The results are compared with those of previous methods using 10-fold cross-validation in terms of accuracy, sensitivity, specificity and F-score. Our method shows consistent improvement over the previous methods, and biological validation has been carried out to elucidate the relation of the selected genes and pathways to the phenotype under study.
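A per-gene weight based on an absolute t-statistic can be sketched as follows. The abstract does not say which form of the t-test is used or how the weights enter the penalized support vector machine, so the Welch form, the function name, and the expression values below are assumptions for illustration.

```python
import math
from statistics import mean, variance

def abs_t_statistic(group_a, group_b):
    """Absolute Welch t-statistic for one gene measured in two sample groups."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return abs(mean(group_a) - mean(group_b)) / standard_error

# Hypothetical expression values: (tumour samples, normal samples) per gene.
expression = {
    "gene1": ([5.1, 5.3, 4.9, 5.2], [2.0, 2.2, 1.9, 2.1]),
    "gene2": ([3.0, 3.4, 2.8, 3.1], [3.1, 2.9, 3.3, 3.0]),
}

weights = {gene: abs_t_statistic(a, b) for gene, (a, b) in expression.items()}
```

Here `gene1` separates the two groups clearly and receives a much larger weight than `gene2`, whose groups overlap. Weights of this kind could then be used to scale the penalty applied to each gene, so that informative genes are penalized less.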
In gene expression studies, missing values are a common problem with important consequences for the interpretation of the final data (Satija et al., Nat Biotechnol 33(5):495, 2015). Numerous bioinformatics analysis tools used for cancer prediction operate on the data set matrix (Bailey et al., Cell 173(2):371-385, 2018); thus, it is necessary to resolve the problem of missing-value imputation. This chapter presents a review of the research on missing-value imputation approaches for gene expression data. By considering how the algorithms use local and global correlation in the data, we were able to focus mostly on the differences between them. We classified the algorithms as global, hybrid, local, or knowledge-based techniques. Additionally, this chapter presents suitable assessments of the different approaches. The purpose of this review is to focus on developments in the current techniques for scientists rather than on applying different or newly developed algorithms with identical functional goals. The aim was to adapt the algorithms to the characteristics of the data.