ClaraStream: A Novel Algorithm for Real-Time Data Stream Clustering
Abstract
One crucial step in extracting knowledge from a dataset is clustering: splitting its records into groups of related records. The detection of clusters in very large, multi-dimensional, static datasets has been investigated extensively. However, the classical clustering methods that resulted from this work are ineffective for data streams. A data stream is a dynamic dataset, defined as an unbounded sequence of records that arrives at a very high rate and changes over time. Many present-day processes produce such rapidly changing, high-speed streams; credit card transactions, click streams, and sensor networks are a few examples. The rapid proliferation of data in many fields calls for algorithms that can process and analyze data streams in real time. ClaraStream is a novel clustering algorithm designed to handle efficiently the particular challenges posed by data streams: high volume, high velocity, and potentially unbounded length. Unlike traditional clustering methods, which are suited to static datasets, ClaraStream uses a two-phase approach (online micro-clustering and offline macro-clustering) that enables real-time processing and trend analysis. This paper provides a comprehensive overview of the ClaraStream algorithm, its architecture, and its application to air quality data streams.
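To make the two-phase idea concrete, the following is a minimal Python sketch of a generic online micro-clustering and offline macro-clustering pipeline. It is not the ClaraStream implementation: the abstract does not specify the algorithm's internals, so the fixed `radius` threshold, the weighted k-means step used for the offline phase, the names `MicroCluster`, `online_update`, and `offline_macro_cluster`, and the sample records are all assumptions chosen for illustration.

```python
import math
import random


class MicroCluster:
    """Compact summary of nearby records: a count and a per-dimension sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        # Update the summary in place; the raw record is not stored.
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x


def online_update(micro_clusters, point, radius):
    """Online phase (assumed rule): absorb the record into the nearest
    micro-cluster if it lies within `radius`, otherwise open a new one."""
    nearest = min(
        micro_clusters,
        key=lambda mc: math.dist(mc.centroid(), point),
        default=None,
    )
    if nearest is not None and math.dist(nearest.centroid(), point) <= radius:
        nearest.absorb(point)
    else:
        micro_clusters.append(MicroCluster(point))


def offline_macro_cluster(micro_clusters, k, iterations=10, seed=0):
    """Offline phase (assumed method): k-means over micro-cluster centroids,
    each weighted by the number of records it summarizes."""
    rng = random.Random(seed)
    centroids = [mc.centroid() for mc in micro_clusters]
    weights = [mc.n for mc in micro_clusters]
    centers = rng.sample(centroids, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for c, w in zip(centroids, weights):
            j = min(range(k), key=lambda idx: math.dist(centers[idx], c))
            groups[j].append((c, w))
        for j, members in enumerate(groups):
            if members:  # keep the old center if no summary was assigned
                total = sum(w for _, w in members)
                dim = len(members[0][0])
                centers[j] = [
                    sum(c[d] * w for c, w in members) / total for d in range(dim)
                ]
    return centers


# Made-up two-dimensional records standing in for a stream.
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
micro = []
for record in stream:
    online_update(micro, record, radius=1.0)  # one pass, one record at a time
macro = offline_macro_cluster(micro, k=2)     # run on demand over the summaries
```

The point of the split in this sketch is that the online phase touches each record once and keeps only small summaries, while the offline phase runs on those summaries whenever a clustering is requested, so its cost depends on the number of micro-clusters and not on the length of the stream.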
