Improved Algorithms for Time Decay Streams


Abstract in English

In the time-decay model for data streams, elements of an underlying data set arrive sequentially with the recently arrived elements being more important. A common approach for handling large data sets is to maintain a \emph{coreset}, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline-coreset and gives a time-decay coreset for polynomial time decay functions. We also consider the exponential time decay model for $k$-median clustering, where we provide a constant factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores $\mathcal{O}(k\log(h\Delta)+h)$ points where $h$ is the half-life of the decay function and $\Delta$ is the aspect ratio of the dataset. Our techniques extend to $k$-means clustering and $M$-estimators as well.
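To make the two decay models concrete, the following is a minimal Python sketch of how element weights and a decayed $k$-median cost might be computed. The exact functional forms are assumptions for illustration and are not taken from the abstract: polynomial decay is written as $(a+1)^{-s}$ and exponential decay as $2^{-a/h}$, where $a$ is the age of an element, $s$ is a hypothetical decay exponent, and $h$ is the half-life. The function names are likewise hypothetical. The sketch evaluates the cost by brute force over the whole stream; it does not implement the coreset framework or the online facility location algorithm.

```python
import math


def polynomial_decay(age, s=1.0):
    # Assumed form (age + 1)^(-s); `s` is a hypothetical decay exponent.
    return (age + 1) ** (-s)


def exponential_decay(age, h):
    # Assumed form 2^(-age / h): the weight halves every h arrivals.
    return 2.0 ** (-age / h)


def decayed_k_median_cost(points, centers, decay):
    # points[-1] is the most recent arrival, so it has age 0.
    # Each point contributes its decayed weight times the distance
    # to its nearest center.
    newest = len(points) - 1
    total = 0.0
    for t, p in enumerate(points):
        age = newest - t
        nearest = min(math.dist(p, c) for c in centers)
        total += decay(age) * nearest
    return total
```

For example, calling `decayed_k_median_cost(points, centers, lambda a: exponential_decay(a, h))` with a small `h` makes the cost depend almost entirely on the most recent arrivals, which is the behavior the time-decay model is meant to capture.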
