Displaying publications 1 - 20 of 64 in total

  1. Loh SY, Jahans-Price T, Greenwood MP, Greenwood M, Hoe SZ, Konopacka A, et al.
    eNeuro, 2017 Dec 21;4(6).
    PMID: 29279858 DOI: 10.1523/ENEURO.0243-17.2017
    The supraoptic nucleus (SON) is a group of neurons in the hypothalamus responsible for the synthesis and secretion of the peptide hormones vasopressin and oxytocin. Following physiological cues, such as dehydration, salt-loading and lactation, the SON undergoes a function-related plasticity that we have previously described in the rat at the transcriptome level. Using the unsupervised graphical lasso (Glasso) algorithm, we reconstructed a putative network from 500 plastic SON genes in which genes are the nodes and the edges are the inferred interactions. The most active nodal gene identified within the network was Caprin2. Caprin2 encodes an RNA-binding protein that we have previously shown to be vital for the functioning of osmoregulatory neuroendocrine neurons in the SON of the rat hypothalamus. To test the validity of the Glasso network, we either overexpressed or knocked down Caprin2 transcripts in differentiated rat pheochromocytoma PC12 cells and showed that these manipulations had significant opposite effects on the levels of putative target mRNAs. These studies suggest that the predictive power of the Glasso algorithm within an in vivo system is accurate, and that it identifies biological targets that may be important to the functional plasticity of the SON.
    Matched MeSH terms: Data Mining
  2. Al-Dabbagh MM, Salim N, Rehman A, Alkawaz MH, Saba T, Al-Rodhaan M, et al.
    ScientificWorldJournal, 2014;2014:612787.
    PMID: 25309952 DOI: 10.1155/2014/612787
    This paper presents a novel feature-mining approach for documents that cannot be mined via optical character recognition (OCR). By identifying the intimate relationship between the text and graphical components, the proposed technique extracts the Start, End, and Exact values for each bar. Furthermore, word 2-gram and Euclidean distance methods are used to accurately detect and determine plagiarism in bar charts.
    Matched MeSH terms: Data Mining/methods*
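The word 2-gram comparison mentioned in the abstract can be illustrated with a minimal sketch (illustrative only, not the paper's implementation): each string is tokenized into consecutive word pairs, and two strings are scored by the overlap of those pairs.

```python
def word_bigrams(text):
    """Set of consecutive word pairs (word 2-grams) in a string."""
    words = text.lower().split()
    return {(a, b) for a, b in zip(words, words[1:])}

def bigram_overlap(t1, t2):
    """Jaccard overlap of word 2-grams -- one simple way a 2-gram check
    can flag near-duplicate chart captions or labels."""
    g1, g2 = word_bigrams(t1), word_bigrams(t2)
    if not (g1 or g2):
        return 0.0
    return len(g1 & g2) / len(g1 | g2)

score = bigram_overlap("total sales by year", "total sales by quarter")
```

Here the two captions share two of four distinct bigrams, so the overlap is 0.5.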
  3. Zolhavarieh S, Aghabozorgi S, Teh YW
    ScientificWorldJournal, 2014;2014:312521.
    PMID: 25140332 DOI: 10.1155/2014/312521
    Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in different fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. One useful field in the domain of subsequence time series clustering is pattern recognition. To improve this field, a sequence of time series data is used. This paper reviews definitions and background related to subsequence time series clustering. The categorization of the reviewed literature is divided into three groups: the preproof, interproof, and postproof periods. Moreover, various state-of-the-art approaches to subsequence time series clustering are discussed under each of these categories. The strengths and weaknesses of the employed methods are evaluated as potential issues for future studies.
    Matched MeSH terms: Data Mining*
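The common first step shared by the surveyed methods is extracting overlapping sliding-window subsequences and clustering them; a minimal sketch (window width, data, and centroids are illustrative, not from any surveyed paper):

```python
def sliding_subsequences(series, width, step=1):
    """Extract overlapping fixed-width subsequences from a time series."""
    return [series[i:i + width] for i in range(0, len(series) - width + 1, step)]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def assign_to_nearest(subseqs, centroids):
    """One k-means assignment step over the extracted subsequences."""
    return [min(range(len(centroids)), key=lambda k: euclidean(s, centroids[k]))
            for s in subseqs]

series = [0, 1, 2, 1, 0, 1, 2, 1, 0]
windows = sliding_subsequences(series, width=3)
labels = assign_to_nearest(windows, centroids=[[0, 1, 2], [2, 1, 0]])
```

The survey's "open issue" concerns exactly this setup: adjacent windows overlap heavily, which can make the resulting clusters trivial if the windowing is not handled carefully.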
  4. Khairudin NM, Mustapha A, Ahmad MH
    ScientificWorldJournal, 2014;2014:813983.
    PMID: 24587757 DOI: 10.1155/2014/813983
    The advent of web-based applications and services has created diverse and voluminous web log data, stored in web servers, proxy servers, client machines, or organizational databases. This paper investigates the effect of the temporal attribute in relational rule mining for web log data. We incorporated the characteristics of time into the rule mining process and analysed the effect of various temporal parameters. The rules generated from temporal relational rule mining are then compared against the rules generated from classical rule mining approaches such as the Apriori and FP-Growth algorithms. The results showed that by incorporating the temporal attribute via time, the number of rules generated is smaller but comparable in terms of quality.
    Matched MeSH terms: Data Mining/methods*
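As a point of reference for the comparison above, a minimal level-wise Apriori pass (frequent-itemset generation only, without the temporal extension or rule derivation) can be sketched as follows; the transactions are illustrative:

```python
def apriori_frequent(transactions, min_support):
    """Return frequent itemsets (frozenset -> support), built level by level:
    count candidates, keep those meeting min_support, join survivors."""
    n = len(transactions)
    level = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: k / n for c, k in counts.items() if k / n >= min_support}
        frequent.update(survivors)
        # candidate generation: join surviving k-itemsets into (k+1)-itemsets
        level = {a | b for a in survivors for b in survivors
                 if len(a | b) == len(a) + 1}
    return frequent

txns = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
freq = apriori_frequent(txns, min_support=0.5)
```

On this toy data every single item and every pair is frequent at support 0.5, while the triple {a, b, c} is not.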
  5. Azadnia AH, Taheri S, Ghadimi P, Saman MZ, Wong KY
    ScientificWorldJournal, 2013;2013:246578.
    PMID: 23864823 DOI: 10.1155/2013/246578
    One of the cost-intensive issues in managing warehouses is the order picking problem, which deals with the retrieval of items from their storage locations in order to meet customer requests. Many solution approaches have been proposed to minimize traveling distance in the process of order picking. However, in practice, customer orders have to be completed by certain due dates in order to avoid tardiness, which is neglected in most of the related scientific papers. Consequently, we proposed a novel solution approach to minimize tardiness that consists of four phases. First, weighted association rule mining is used to calculate associations between orders with respect to their due dates. Next, a batching model based on binary integer programming is formulated to maximize the associations between orders within each batch. Subsequently, in the order picking phase, a Genetic Algorithm integrated with the Traveling Salesman Problem is used to identify the most suitable travel path. Finally, the Genetic Algorithm is applied to sequence the constructed batches in order to minimize tardiness. Illustrative examples and comparisons are presented to demonstrate the proficiency and solution quality of the proposed approach.
    Matched MeSH terms: Data Mining/methods*
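The routing step in the third phase is a Traveling Salesman Problem over pick locations. A greedy nearest-neighbour tour (a crude stand-in for the paper's GA, shown only to make the subproblem concrete; coordinates are illustrative):

```python
def nearest_neighbour_tour(locations, start=0):
    """Greedy TSP heuristic: repeatedly visit the closest unvisited pick
    location. Real solvers (such as the paper's GA) improve on this."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    unvisited = set(range(len(locations))) - {start}
    tour = [start]
    while unvisited:
        nxt = min(unvisited, key=lambda j: dist(locations[tour[-1]], locations[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

locs = [(0, 0), (5, 0), (1, 0), (6, 0)]
tour = nearest_neighbour_tour(locs)
```

Starting from location 0, the heuristic visits the nearest location first, yielding the order 0, 2, 1, 3 on this toy layout.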
  6. Azareh A, Rahmati O, Rafiei-Sardooi E, Sankey JB, Lee S, Shahabi H, et al.
    Sci Total Environ, 2019 Mar 10;655:684-696.
    PMID: 30476849 DOI: 10.1016/j.scitotenv.2018.11.235
    Gully erosion susceptibility mapping is a fundamental tool for land-use planning aimed at mitigating land degradation. However, the capabilities of some state-of-the-art data-mining models for developing accurate maps of gully erosion susceptibility have not yet been fully investigated. This study assessed and compared the performance of two different types of data-mining models for accurately mapping gully erosion susceptibility at a regional scale in Chavar, Ilam, Iran. The two methods evaluated were: Certainty Factor (CF), a bivariate statistical model; and Maximum Entropy (ME), an advanced machine learning model. Several geographic and environmental factors that can contribute to gully erosion were considered as predictor variables of gully erosion susceptibility. Based on an existing differential GPS survey inventory of gully erosion, a total of 63 eroded gullies were spatially randomly split in a 70:30 ratio for use in model calibration and validation, respectively. Accuracy assessments completed with the receiver operating characteristic curve method showed that the ME-based regional gully susceptibility map has an area under the curve (AUC) value of 88.6% whereas the CF-based map has an AUC of 81.8%. According to jackknife tests that were used to investigate the relative importance of predictor variables, aspect, distance to river, lithology and land use are the most influential factors for the spatial distribution of gully erosion susceptibility in this region of Iran. The gully erosion susceptibility maps produced in this study could be useful tools for land managers and engineers tasked with road development, urbanization and other future development.
    Matched MeSH terms: Data Mining
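The AUC values used above to compare the two susceptibility maps can be computed directly from model scores; a minimal pairwise-comparison sketch (equivalent to the area under the ROC curve; the scores are illustrative):

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison: the probability
    that a randomly chosen positive outscores a randomly chosen negative,
    counting ties as half."""
    pairs = len(scores_pos) * len(scores_neg)
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / pairs

# illustrative susceptibility scores at gully (positive) and non-gully sites
score = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

Here five of six positive/negative pairs are correctly ordered, giving an AUC of 5/6.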
  7. Bui DT, Khosravi K, Karimi M, Busico G, Khozani ZS, Nguyen H, et al.
    Sci Total Environ, 2020 May 01;715:136836.
    PMID: 32007881 DOI: 10.1016/j.scitotenv.2020.136836
    Groundwater resources constitute the main source of clean fresh water for domestic use and are essential for food production in the agricultural sector. Groundwater plays a vital role in water supply in the Campanian Plain in Italy, and hence the future sustainability of the resource is essential for the region. In the current paper, novel data mining algorithms including Gaussian Process (GP) were used on a large groundwater quality database to predict nitrate (a contaminant) and strontium (expected to increase in the future) concentrations in groundwater. The results were compared with M5P, random forest (RF) and random tree (RT) algorithms as a benchmark to test the robustness of the modeling process. The dataset includes 246 groundwater quality samples originating from different wells, both municipal and agricultural. For the modeling process it was divided into two subgroups using the 10-fold cross-validation technique: 173 samples for model building (training dataset) and 73 samples for model validation (testing dataset). Different water quality variables, including T, pH, EC, HCO3-, F-, Cl-, SO42-, Na+, K+, Mg2+, and Ca2+, were used as inputs to the models. In the first stage, different input combinations were constructed based on the correlation coefficient, and the optimal combination was chosen for the modeling phase. Different quantitative criteria alongside a visual comparison approach were used to evaluate modeling capability. Results revealed that, to obtain reliable results, variables with low correlation should also be considered as model inputs together with those showing high correlation coefficients. According to the model evaluation criteria, the GP algorithm outperforms all the other models in predicting both nitrate and strontium concentrations, followed by RF, M5P and RT, respectively. Results also revealed that the model's structure, together with the accuracy and structure of the data, can have a relevant impact on the model's results.
    Matched MeSH terms: Data Mining
  8. Ghaibeh AA, Kasem A, Ng XJ, Nair HLK, Hirose J, Thiruchelvam V
    Stud Health Technol Inform, 2018;247:386-390.
    PMID: 29677988
    The analysis of Electronic Health Records (EHRs) is attracting a lot of research attention in the medical informatics domain. Hospitals and medical institutes have started to use data mining techniques to gain new insights from the massive amounts of data that can be made available through EHRs. Researchers in the medical field have often used descriptive statistics and classical statistical methods to prove assumed medical hypotheses. However, discovering new insights from large amounts of data solely based on experts' observations is difficult. Using data mining techniques and visualizations, practitioners can find hidden knowledge, identify interesting patterns, or formulate new hypotheses to be further investigated. This paper describes work in progress on using data mining methods to analyze clinical data of Nasopharyngeal Carcinoma (NPC) patients. NPC is the fifth most common cancer among Malaysians, and the data analyzed in this study were collected from three states in Malaysia (Kuala Lumpur, Sabah and Sarawak); this is considered to be the largest up-to-date dataset of its kind. The research addresses the issue of cancer recurrence after the completion of radiotherapy and chemotherapy treatment. We describe the procedure, problems, and insights gained during the process.
    Matched MeSH terms: Data Mining*
  9. Habib ur Rehman M, Liew CS, Wah TY, Shuja J, Daghighi B
    Sensors (Basel), 2015 Feb 13;15(2):4430-69.
    PMID: 25688592 DOI: 10.3390/s150204430
    The staggering growth in smartphone and wearable device use has led to a massive-scale generation of personal (user-specific) data. To explore, analyze, and extract useful information and knowledge from this deluge of personal data, one has to leverage these devices as data-mining platforms in ubiquitous, pervasive, and big data environments. This study presents the personal ecosystem, where all computational resources, communication facilities, storage and knowledge management systems are available in user proximity. An extensive review of recent literature has been conducted and a detailed taxonomy is presented. Performance evaluation metrics and their empirical evidence are also sorted out in this paper. Finally, we highlight some future research directions and potentially emerging application areas for personal data mining using smartphones and wearable devices.
    Matched MeSH terms: Data Mining
  10. Lay US, Pradhan B, Yusoff ZBM, Abdallah AFB, Aryal J, Park HJ
    Sensors (Basel), 2019 Aug 07;19(16).
    PMID: 31394777 DOI: 10.3390/s19163451
    Cameron Highland is a popular tourist hub in the mountainous area of Peninsular Malaysia. Most communities in this area suffer frequent incidence of debris flow, especially during monsoon seasons. Despite the loss of lives and properties recorded annually from debris flow, most studies in the region concentrate on landslide and flood susceptibilities. In this study, debris-flow susceptibility prediction was carried out using two data mining techniques: Multivariate Adaptive Regression Splines (MARS) and Support Vector Regression (SVR) models. The existing inventory of debris-flow events (640 points) was split into 70% (448 points) for training and 30% (192 points) for validation. Twelve conditioning factors, namely elevation, plan-curvature, slope angle, total curvature, slope aspect, Stream Transport Index (STI), profile curvature, roughness index, Stream Catchment Area (SCA), Stream Power Index (SPI), Topographic Wetness Index (TWI) and Topographic Position Index (TPI), were selected from Light Detection and Ranging (LiDAR)-derived Digital Elevation Model (DEM) data. Multi-collinearity was checked using the Information Factor, Cramer's V, and Gini Index to identify the relative importance of conditioning factors. The susceptibility models were produced and categorized into five classes: not-susceptible, low, moderate, high and very-high. Model performances were evaluated using success and prediction rates, where the area under the curve (AUC) showed a higher performance of MARS (93% and 83%) over SVR (76% and 72%). The results of this study will be important in contingency hazard and risk management plans to reduce the loss of lives and properties in the area.
    Matched MeSH terms: Data Mining
  11. Liu J, Yinchai W, Siong TC, Li X, Zhao L, Wei F
    Sci Rep, 2022 Dec 01;12(1):20770.
    PMID: 36456582 DOI: 10.1038/s41598-022-23765-x
    To generate an interpretable deep architecture for identifying deep intrusion patterns, this study proposes an approach that combines ANFIS (Adaptive Network-based Fuzzy Inference System) and DT (Decision Tree) for interpreting the deep pattern of intrusion detection. Meanwhile, to improve the efficiency of training and prediction, Pearson correlation analysis, standard deviation, and a new adaptive k-means are used to select attributes and make fuzzy interval decisions. The proposed algorithm was trained, validated, and tested on the NSL-KDD (National Security Lab - Knowledge Discovery and Data Mining) dataset. Using 22 attributes that are highly related to the target, the proposed method achieves a 99.86% detection rate and a 0.14% false alarm rate on the KDDTrain+ dataset, and a 77.46% detection rate on the KDDTest+ dataset, which is better than many classifiers. Besides, the interpretable model can help demonstrate the complex and overlapping patterns of intrusions and analyze the patterns of various intrusions.
    Matched MeSH terms: Data Mining
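The Pearson-correlation attribute selection step described in the abstract can be sketched as follows (the threshold, column names, and data are illustrative; the paper's actual selection also uses standard deviation and an adaptive k-means):

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def select_attributes(columns, target, threshold=0.5):
    """Keep attributes whose |correlation| with the target meets the threshold."""
    return [name for name, col in columns.items()
            if abs(pearson(col, target)) >= threshold]

cols = {"f1": [1, 2, 3, 4], "f2": [4, 1, 3, 2], "f3": [2, 4, 6, 8]}
kept = select_attributes(cols, target=[1, 2, 3, 4])
```

On this toy data, f1 and f3 correlate perfectly with the target and survive, while f2 (correlation -0.4) is dropped.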
  12. Sarahani Harun, Nurulisa Zulkifle
    Sains Malaysiana, 2018;47:2933-2940.
    Laryngeal cancer is the most common head and neck cancer in the world and its incidence is on the rise. However, the molecular mechanism underlying laryngeal cancer pathogenesis is poorly understood. The goal of this study was to develop a protein-protein interaction (PPI) network for laryngeal cancer to predict the biological pathways that underlie the molecular complexes in the network. Genes involved in laryngeal cancer were extracted from the OMIM database and their interaction partners were identified via text and data mining using Agilent Literature Search, STRING and GeneMANIA. The PPI network was then integrated and visualised using Cytoscape ver3.6.0. Molecular complexes in the network were predicted by the MCODE plugin and functional enrichment analyses of the molecular complexes were performed using BiNGO. Twenty-eight laryngeal cancer-related genes were present in the OMIM database. The PPI network associated with laryngeal cancer contained 161 nodes, 661 edges and five molecular complexes. Some of the complexes were related to the biological behaviour of cancer, providing the foundation for further understanding of the mechanism of laryngeal cancer development and progression.
    Matched MeSH terms: Data Mining
  13. Husam IS Abuhamad, Azuraliza Abu Bakar, Suhaila Zainudin, Mazrura Sahani, Zainudin Mohd Ali
    Sains Malaysiana, 2017;46:255-265.
    Dengue fever is considered one of the most common mosquito-borne diseases worldwide. Dengue outbreak detection can be very useful in practical efforts to overcome the rapid spread of the disease by providing the knowledge to predict the next outbreak occurrence. Many studies have been conducted to model and predict dengue outbreaks using different data mining techniques. This research aimed to identify the best features that lead to better predictive accuracy of dengue outbreaks using three different feature selection algorithms: particle swarm optimization (PSO), genetic algorithm (GA) and rank search (RS). Based on the selected features, three predictive modeling techniques (J48, DTNB and Naive Bayes) were applied for dengue outbreak detection. The dataset used in this research was obtained from the Public Health Department, Seremban, Negeri Sembilan, Malaysia. The experimental results showed that predictive accuracy was improved by applying feature selection before predictive modeling. The study also identified the set of features that best represent dengue outbreak detection for Malaysian health agencies.
    Matched MeSH terms: Data Mining
  14. Shirkhorshidi AS, Aghabozorgi S, Wah TY
    PLoS One, 2015;10(12):e0144059.
    PMID: 26658987 DOI: 10.1371/journal.pone.0144059
    Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures has mostly been addressed in two- or three-dimensional spaces, beyond which, to the best of our knowledge, there is no empirical study that has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility purposes, fifteen publicly available datasets were used for this study, and consequently, future distance measures can be evaluated and compared with the results of the measures discussed in this work. These datasets were classified into low- and high-dimensional categories to study the performance of each measure against each category. This research should help the research community to identify suitable distance measures for datasets and also facilitate comparison and evaluation of newly proposed similarity or distance measures against traditional ones.
    Matched MeSH terms: Data Mining/statistics & numerical data*
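The study's core observation, that the choice of measure changes clustering outcomes, is easy to demonstrate: the same point can be assigned to different centroids under Euclidean and cosine distance. A minimal sketch (not the paper's benchmark framework; the data are contrived to show the disagreement):

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cosine_distance(a, b):
    """1 - cosine similarity: small when vectors point the same way,
    regardless of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return 1 - dot / (na * nb)

def nearest_centroid(point, centroids, measure):
    """Index of the closest centroid under the chosen measure -- the
    assignment step whose outcome differs across measures."""
    return min(range(len(centroids)), key=lambda k: measure(point, centroids[k]))

centroids = [[4, 5], [1, 1]]
point = [5, 4]
```

Euclidean distance assigns the point to the nearby centroid [4, 5]; cosine distance assigns it to [1, 1], whose direction it matches more closely.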
  15. Golestan Hashemi FS, Rafii MY, Ismail MR, Mohamed MT, Rahim HA, Latif MA, et al.
    PLoS One, 2015;10(6):e0129069.
    PMID: 26061689 DOI: 10.1371/journal.pone.0129069
    When a phenotype of interest is associated with an external/internal covariate, covariate inclusion in quantitative trait loci (QTL) analyses can diminish residual variation and subsequently enhance the ability of QTL detection. In the in vitro synthesis of 2-acetyl-1-pyrroline (2AP), the main fragrance compound in rice, the thermal processing during the Maillard-type reaction between proline and carbohydrate reduction produces a roasted, popcorn-like aroma. Hence, for the first time, we included the proline amino acid, an important precursor of 2AP, as a covariate in our QTL mapping analyses to precisely explore the genetic factors affecting natural variation for rice scent. Consequently, two QTLs were traced on chromosomes 4 and 8. They explained from 20% to 49% of the total aroma phenotypic variance. Additionally, by saturating the interval harboring the major QTL using gene-based primers, a putative allele of fgr (major genetic determinant of fragrance) was mapped in the QTL on the 8th chromosome in the interval RM223-SCU015RM (1.63 cM). These loci supported previous studies of different accessions. Such QTLs can be widely used by breeders in crop improvement programs and for further fine mapping. Moreover, no previous studies and findings were found on simultaneous assessment of the relationship among 2AP, proline and fragrance QTLs. Therefore, our findings can help further our understanding of the metabolomic and genetic basis of 2AP biosynthesis in aromatic rice.
    Matched MeSH terms: Data Mining
  16. Uddin J, Ghazali R, Deris MM
    PLoS One, 2017;12(1):e0164803.
    PMID: 28068344 DOI: 10.1371/journal.pone.0164803
    Clustering a set of objects into homogeneous groups is a fundamental operation in data mining. Recently, much attention has been paid to categorical data clustering, where data objects are made up of non-numerical attributes. For categorical data clustering, rough set based approaches such as Maximum Dependency Attribute (MDA) and Maximum Significance Attribute (MSA) have outperformed their predecessor approaches such as Bi-Clustering (BC), Total Roughness (TR) and Min-Min Roughness (MMR). This paper presents the limitations and issues of the MDA and MSA techniques on special types of data sets where both techniques fail to select, or face difficulty in selecting, their best clustering attribute. This analysis motivates the need for a better and more generalized rough set theory approach that can cope with the issues of MDA and MSA. Hence, an alternative technique named Maximum Indiscernible Attribute (MIA) for clustering categorical data using rough set indiscernibility relations is proposed. The novelty of the proposed approach is that, unlike other rough set theory techniques, it uses the domain knowledge of the data set. It is based on the concept of the indiscernibility relation combined with a number of clusters. To show the significance of the proposed approach, the effect of the number of clusters on rough accuracy, purity and entropy is described in the form of propositions. Moreover, ten different data sets from previously utilized research cases and the UCI repository are used for experiments. The results, produced in tabular and graphical forms, show that the proposed MIA technique provides better performance in selecting the clustering attribute in terms of purity, entropy, iterations, time, accuracy and rough accuracy.
    Matched MeSH terms: Data Mining
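The indiscernibility relation at the heart of rough set approaches like MIA partitions objects into equivalence classes of identical attribute values; a minimal sketch (the data and attribute names are illustrative, not the paper's method):

```python
def indiscernibility_classes(table, attributes):
    """Partition object indices into equivalence classes: two objects are
    indiscernible when they agree on every attribute in `attributes`."""
    classes = {}
    for idx, row in enumerate(table):
        key = tuple(row[a] for a in attributes)
        classes.setdefault(key, []).append(idx)
    return list(classes.values())

# toy categorical data set
rows = [
    {"colour": "red", "shape": "round"},
    {"colour": "red", "shape": "square"},
    {"colour": "red", "shape": "round"},
    {"colour": "blue", "shape": "round"},
]
by_colour = indiscernibility_classes(rows, ["colour"])
by_both = indiscernibility_classes(rows, ["colour", "shape"])
```

Adding attributes refines the partition: colour alone gives two classes, colour plus shape gives three, which is what candidate clustering attributes are compared on.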
  17. Hakak S, Kamsin A, Shivakumara P, Idna Idris MY, Gilkar GA
    PLoS One, 2018;13(7):e0200912.
    PMID: 30048486 DOI: 10.1371/journal.pone.0200912
    Exact pattern matching algorithms are popular and widely used in several applications, such as molecular biology, text processing, image processing, web search engines, network intrusion detection systems and operating systems. The focus of these algorithms is to achieve time efficiency according to the application, but not memory consumption. In this work, we propose a novel idea to achieve both time efficiency and low memory consumption by splitting the query string for searching in the corpus. For a given text, the proposed algorithm splits the query pattern into two equal halves and considers the second (right) half as the query string for searching in the corpus. Once a match is found for the second half, the proposed algorithm applies a brute-force procedure to find the remaining match by referring to the location of the right half. Experimental results on different S1 Dataset text databases, namely Arabic, English, Chinese, Italian and French, show that the proposed algorithm outperforms the existing S1 Algorithm in terms of time efficiency and memory consumption as the length of the query pattern increases.
    Matched MeSH terms: Data Mining/methods*
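The split-query idea described above can be sketched in a few lines (using Python's `str.find` in place of the paper's matching routine): scan for the right half, then brute-force-verify the left half at the implied offset.

```python
def split_half_search(text, pattern):
    """Find all occurrences of `pattern` by locating its right half first,
    then checking the left half at the implied position."""
    mid = len(pattern) // 2
    left, right = pattern[:mid], pattern[mid:]
    matches, pos = [], text.find(right)
    while pos != -1:
        start = pos - len(left)
        # brute-force check of the left half at the implied location
        if start >= 0 and text[start:pos] == left:
            matches.append(start)
        pos = text.find(right, pos + 1)
    return matches

hits = split_half_search("abracadabra", "abra")
```

Only positions where the right half occurs are ever verified, which is the source of the claimed savings for long patterns.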
  18. Salih SQ, Alsewari AA, Wahab HA, Mohammed MKA, Rashid TA, Das D, et al.
    PLoS One, 2023;18(7):e0288044.
    PMID: 37406006 DOI: 10.1371/journal.pone.0288044
    The retrieval of important information from a dataset requires applying a special data mining technique known as data clustering (DC). DC classifies similar objects into groups with similar characteristics. Clustering involves grouping the data around k cluster centres that are typically selected randomly. The issues behind DC have recently called for a search for alternative solutions, and a nature-based optimization algorithm named the Black Hole Algorithm (BHA) was developed to address several well-known optimization problems. The BHA is a metaheuristic (population-based) that mimics the natural phenomenon of black holes, whereby individual stars represent the potential solutions revolving around the solution space. The original BHA showed better performance than other algorithms when applied to a benchmark dataset, despite its poor exploration capability. Hence, this paper presents a multi-population version of BHA, called MBHA, wherein the performance of the algorithm does not depend on a single best-found solution but on a set of generated best solutions. The formulated method was tested on a set of nine widespread and popular benchmark test functions. The ensuing experimental outcomes indicated highly precise results compared to BHA and comparable algorithms in the study, as well as excellent robustness. Furthermore, the proposed MBHA achieved a high rate of convergence on six real datasets (collected from the UCL machine learning lab), making it suitable for DC problems. Lastly, the evaluations conclusively indicated the appropriateness of the proposed algorithm for resolving DC issues.
    Matched MeSH terms: Data Mining/methods
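A minimal single-population Black Hole Algorithm on a one-dimensional test function conveys the mechanics (star drift toward the best solution, an event horizon, random respawn). The MBHA of the paper runs several such populations and pools their best solutions, which this sketch does not attempt; all parameters are illustrative.

```python
import random

def black_hole_minimise(f, bounds, n_stars=20, iters=200, seed=1):
    """Sketch of the Black Hole Algorithm: stars drift toward the current
    best solution (the black hole); stars crossing the event horizon are
    replaced by fresh random ones, which preserves exploration."""
    rng = random.Random(seed)
    lo, hi = bounds
    stars = [rng.uniform(lo, hi) for _ in range(n_stars)]
    for _ in range(iters):
        hole = min(stars, key=f)
        total = sum(f(s) for s in stars) or 1e-12
        radius = f(hole) / total          # event horizon
        for i, s in enumerate(stars):
            if s == hole:
                continue
            s = s + rng.random() * (hole - s)   # drift toward the black hole
            if abs(hole - s) < radius:
                s = rng.uniform(lo, hi)         # swallowed: respawn at random
            stars[i] = s
    return min(stars, key=f)

best = black_hole_minimise(lambda x: (x - 3) ** 2, (-10, 10))
```

In the clustering setting the paper targets, each "star" would encode k candidate cluster centres rather than a single scalar.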
  19. Aqra I, Herawan T, Abdul Ghani N, Akhunzada A, Ali A, Bin Razali R, et al.
    PLoS One, 2018;13(1):e0179703.
    PMID: 29351287 DOI: 10.1371/journal.pone.0179703
    Designing an efficient association rule mining (ARM) algorithm for multilevel knowledge-based transactional databases that is appropriate for real-world deployments is of paramount concern. However, dynamic decision making that needs to modify the threshold, either to minimize or maximize the output knowledge, forces the extant state-of-the-art algorithms to rescan the entire database. This process incurs heavy computation cost and is not feasible for real-time applications. This paper efficiently addresses the problem of dynamic threshold updating. It contributes a novel ARM approach that creates an intermediate itemset and applies a threshold to extract categorical frequent itemsets with diverse threshold values, improving overall efficiency since the whole database no longer needs to be rescanned. After the entire itemset is built, the real support can be obtained without rebuilding the itemset (e.g. itemset lists are intersected to obtain the actual support). Moreover, the algorithm supports extracting many frequent itemsets according to a pre-determined minimum support with an independent purpose. Additionally, the experimental results of the proposed approach demonstrate the capability to be deployed in any mining system in a fully parallel mode, consequently increasing the efficiency of the real-time association rule discovery process. The proposed approach outperforms the extant state-of-the-art and shows promising results that reduce computation cost, increase accuracy, and produce all possible itemsets.
    Matched MeSH terms: Data Mining/methods*
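The "itemset lists are intersected to obtain the actual support" step resembles vertical (tid-list) mining; a minimal sketch (illustrative, not the authors' code) of how support can be recomputed for any threshold after a single database scan:

```python
def build_tid_lists(transactions):
    """One scan: map each item to the set of transaction ids containing it."""
    tids = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tids.setdefault(item, set()).add(tid)
    return tids

def support(itemset, tid_lists, n):
    """Actual support of an itemset by intersecting per-item tid lists --
    no rescan of the database is needed when the threshold changes."""
    common = set.intersection(*(tid_lists[i] for i in itemset))
    return len(common) / n

txns = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
tids = build_tid_lists(txns)
```

Once the tid lists exist, a new minimum-support threshold only changes which itemsets are reported, not how supports are computed.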
  20. Yuhanis Yusof, Mohammed Hayel Refai
    MyJurnal
    As the number of documents increases, automation of classification that aids the analysis and management of documents receives focal attention. Classification based on association rules generated from a collection of documents is a recent data mining approach that integrates association rule mining and classification. Existing approaches produce either high accuracy with a large number of rules or a small number of association rules with low accuracy. This work presents an association rule mining method that employs a new item production algorithm, generating a small number of rules while producing an acceptable accuracy rate. The proposed method is evaluated on UCI datasets and measured by prediction accuracy and the number of generated association rules. A comparison is then made against an existing classifier, Multi-class Classification based on Association Rule (MCAR). The experiments show that the proposed method produces a similar accuracy rate to MCAR while using fewer rules.
    Matched MeSH terms: Data Mining
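Prediction with class association rules, as in MCAR-style classifiers, typically fires the highest-confidence rule whose antecedent matches the instance; a minimal sketch (the rule set, feature encoding, and default label are illustrative):

```python
def predict(features, rules, default):
    """Classify by the highest-confidence class-association rule whose
    antecedent (a set of attribute=value items) is contained in the
    instance's feature set; fall back to a default label."""
    for antecedent, label, confidence in sorted(rules, key=lambda r: -r[2]):
        if antecedent <= features:
            return label
    return default

rules = [
    (frozenset({"outlook=sunny"}), "no", 0.6),
    (frozenset({"outlook=sunny", "windy=false"}), "yes", 0.9),
]
label = predict({"outlook=sunny", "windy=false"}, rules, default="no")
```

Because rules are tried in descending confidence order, the more specific 0.9-confidence rule wins here; the trade-off the abstract describes is between how many such rules are kept and the accuracy they yield.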