Similarity or distance measures are core components of distance-based clustering algorithms: similar data points are grouped into the same clusters, while dissimilar or distant points are placed into different clusters. The performance of similarity measures has mostly been studied in two- or three-dimensional spaces; beyond that, to the best of our knowledge, no empirical study has revealed the behavior of similarity measures on high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare, and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility, fifteen publicly available datasets were used, so future distance measures can be evaluated and compared with the results of the measures discussed in this work. The datasets were classified into low- and high-dimensional categories to study the performance of each measure on each category. This research should help the research community identify suitable distance measures for their datasets and facilitate the comparison and evaluation of newly proposed similarity or distance measures against traditional ones.
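As a rough illustration of how such a benchmark can be run, the sketch below plugs several distance measures into the same distance-based clustering algorithm and scores each against known labels; the Iris dataset and the particular metrics are stand-ins for the fifteen datasets and measures of the study, and scikit-learn 1.2 or later is assumed.

    # A minimal sketch, not the study's exact framework: each distance measure
    # feeds the same clustering algorithm, and results are scored externally.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.metrics import pairwise_distances, adjusted_rand_score
    from sklearn.cluster import AgglomerativeClustering

    X, y = load_iris(return_X_y=True)

    for metric in ["euclidean", "manhattan", "cosine", "chebyshev"]:
        D = pairwise_distances(X, metric=metric)          # pairwise distance matrix
        model = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                        linkage="average")
        labels = model.fit_predict(D)
        print(metric, adjusted_rand_score(y, labels))     # external validation score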
Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in fields such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. Pattern recognition is one particularly useful application area, in which sequences of time series data are exploited. This paper reviews definitions and background related to subsequence time series clustering. The reviewed literature is categorized into three groups: the pre-proof, inter-proof, and post-proof periods. Various state-of-the-art approaches to subsequence time series clustering are discussed under each of these categories, and the strengths and weaknesses of the employed methods are evaluated as potential issues for future studies.
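To make the basic setting concrete, the sketch below shows the sliding-window extraction that precedes most subsequence time series clustering; the window length, step size, and the k-means step that follows are illustrative assumptions rather than any particular surveyed method.

    # A minimal sketch of the sliding-window step behind subsequence time
    # series clustering: one long series is cut into overlapping windows,
    # which are then clustered like ordinary vectors.
    import numpy as np
    from sklearn.cluster import KMeans

    def sliding_windows(series, width, step=1):
        """Return a matrix whose rows are overlapping subsequences of `series`."""
        return np.array([series[i:i + width]
                         for i in range(0, len(series) - width + 1, step)])

    series = np.sin(np.linspace(0, 20 * np.pi, 1000))    # toy signal
    subsequences = sliding_windows(series, width=50, step=5)
    labels = KMeans(n_clusters=4, n_init=10).fit_predict(subsequences)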
Digital image forgery is becoming easier to perform because of the rapid development of manipulation tools, and image splicing is one of the most prevalent techniques. Digital images have lost their trustworthiness, and researchers have exerted considerable effort to regain it, focusing mostly on detection algorithms. However, most of the proposed algorithms cannot handle the high dimensionality and redundancy of the extracted features, and they are limited by high computational time. This study focuses on improving one image splicing detection algorithm, the run length run number (RLRN) algorithm, by applying two dimension reduction methods, namely principal component analysis (PCA) and kernel PCA. A support vector machine is used to distinguish between authentic and spliced images. Results show that kernel PCA, a nonlinear dimension reduction method, has the best effect on the R, G, B, and Y channels and on gray-scale images.
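A rough sketch of the kernel PCA plus SVM stage is given below, assuming the RLRN feature extraction has already produced a feature matrix; the file names, kernel choice, and number of components are hypothetical placeholders rather than the study's settings.

    # A minimal sketch of the dimension-reduction and classification stage;
    # the RLRN feature extraction itself is not reproduced here.
    import numpy as np
    from sklearn.decomposition import KernelPCA
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    X = np.load("rlrn_features.npy")   # hypothetical (n_images, n_features) RLRN features
    y = np.load("labels.npy")          # hypothetical labels: 0 = authentic, 1 = spliced

    clf = make_pipeline(KernelPCA(n_components=100, kernel="rbf"),
                        SVC(kernel="rbf"))
    print(cross_val_score(clf, X, y, cv=5).mean())   # mean detection accuracy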
Time series clustering is an important solution to various problems in numerous fields of research, including business, medical science, and finance. However, conventional clustering algorithms are not practical for time series data because they are essentially designed for static data, which results in poor clustering accuracy in many settings. In this paper, a new hybrid clustering algorithm is proposed based on the similarity in shape of time series data. Time series data are first grouped into subclusters based on similarity in time. The subclusters are then merged using the k-Medoids algorithm based on similarity in shape. This model makes two contributions: (1) it is more accurate than other conventional and hybrid approaches, and (2) it determines the similarity in shape among time series data with low computational complexity. To evaluate the accuracy of the proposed model, it is tested extensively on synthetic and real-world time series datasets.
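A minimal sketch of this two-stage design appears below: similarity in time is approximated with plain k-means on the raw series, and similarity in shape with a correlation-based distance passed to the k-Medoids implementation in scikit-learn-extra. Both choices are stand-in assumptions, not the paper's exact components.

    # Stage 1 forms subclusters by similarity in time; Stage 2 merges their
    # prototypes by similarity in shape using k-Medoids on a precomputed
    # (correlation-based) distance matrix.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn_extra.cluster import KMedoids

    X = np.random.randn(200, 60)                  # toy set: 200 series of length 60

    # Stage 1: similarity in time -> many small subclusters
    pre = KMeans(n_clusters=20, n_init=10).fit(X)
    prototypes = pre.cluster_centers_

    # Stage 2: similarity in shape -> merge prototype series with k-Medoids
    shape_dist = np.clip(1.0 - np.corrcoef(prototypes), 0.0, None)
    merged = KMedoids(n_clusters=4, metric="precomputed", method="pam").fit(shape_dist)
    final_labels = merged.labels_[pre.labels_]    # map each series to its merged cluster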
Tuberculosis (TB) is a major global health problem and has been ranked as the second leading cause of death from an infectious disease worldwide. Diagnosis based on cultured specimens is the reference standard; however, results take weeks to obtain. Scientists are looking for early detection strategies, which remain the cornerstone of tuberculosis control. Consequently, there is a need to develop an expert system that helps medical professionals diagnose the disease accurately and quickly. The Artificial Immune Recognition System (AIRS) has been used successfully for diagnosing various diseases; however, little effort has been made to improve its classification accuracy.