A Unique Pipeline Model to Improve Anomaly Detection in High Dimensional Data
Main Article Content
Abstract
This paper presents a comprehensive method for dimension reduction and detecting anomalies in high-dimensional data (on healthcare datasets) using R. Realizing that traditional linear methods such as Principal Component Analysis (PCA) often ignore the complexity of the non-linear manifold of the data, our approach exploits iterative learning, on the belief that high-dimensional data is largely based on a low-dimensional manifold. The methodology starts by preparing the data using R libraries like Keras, dplyr, and ggplot2, addressing challenges like missing values ??and visualizing meaningful information. Using the Mahalanobis distance, the paper identifies and removes country-specific outliers. The pipelined model integrates Principal Component Analysis (PCA) for data transformation and combines an Autoencoder with t-SNE for dimensionality reduction. This refined dataset is then used to train a Multi-Layer Perceptron (MLP) artificial neural network, which facilitates anomaly detection based on reconstruction errors, illustrated by the point cloud. Additionally, the paper explores metric multidimensional scaling using artificial neural networks, tests large datasets such as healthcare and wine, and compares the results of the work using conventional techniques. This study highlights the effectiveness of integrating various pre-processing, visualization, and artificial neural network strategies through R for effective anomaly detection.