Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Adaptive Wavelet Clustering for Highly Noisy Data

217 0 0.0 ( 0 )

Download Cite

Added by Zengjian Chen

Publication date 2018

fields Informatics Engineering

and research's language is English

Authors Zengjian Chen - Jiayi Liu - Yihe Deng

Databases Information Retrieval

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

In this paper we make progress on the unsupervised task of mining arbitrarily shaped clusters in highly noisy datasets, which is a task present in many real-world applications. Based on the fundamental work that first applies a wavelet transform to data clustering, we propose an adaptive clustering algorithm, denoted as AdaWave, which exhibits favorable characteristics for clustering. By a self-adaptive thresholding technique, AdaWave is parameter free and can handle data in various situations. It is deterministic, fast in linear time, order-insensitive, shape-insensitive, robust to highly noisy data, and requires no pre-knowledge on data models. Moreover, AdaWave inherits the ability from the wavelet transform to cluster data in different resolutions. We adopt the grid labeling data structure to drastically reduce the memory consumption of the wavelet transform so that AdaWave can be used for relatively high dimensional data. Experiments on synthetic as well as natural datasets demonstrate the effectiveness and efficiency of our proposed method.

rate research

Unsupervised record matching with noisy and incomplete data

116 - Yves van Gennip , Blake Hunter , Anna Ma 2017

We consider the problem of duplicate detection in noisy and incomplete data: given a large data set in which each record has multiple entries (attributes), detect which distinct records refer to the same real world entity. This task is complicated by noise (such as misspellings) and missing data, which can lead to records being different, despite referring to the same entity. Our method consists of three main steps: creating a similarity score between records, grouping records together into unique entities, and refining the groups. We compare various methods for creating similarity scores between noisy records, considering different combinations of string matching, term frequency-inverse document frequency methods, and n-gram techniques. In particular, we introduce a vectorized soft term frequency-inverse document frequency method, with an optional refinement step. We also discuss two methods to deal with missing data in computing similarity scores. We test our method on the Los Angeles Police Department Field Interview Card data set, the Cora Citation Matching data set, and two sets of restaurant review data. The results show that the methods that use words as the basic units are preferable to those that use 3-grams. Moreover, in some (but certainly not all) parameter ranges soft term frequency-inverse document frequency methods can outperform the standard term frequency-inverse document frequency method. The results also confirm that our method for automatically determining the number of groups typically works well in many cases and allows for accurate results in the absence of a priori knowledge of the number of unique entities in the data set.

Databases

Wavelet Adaptive Proper Orthogonal Decomposition for Large Scale Flow Data

187 - Philipp Krah , Thomas Engels , Kai Schneider 2020

The proper orthogonal decomposition (POD) is a powerful classical tool in fluid mechanics used, for instance, for model reduction and extraction of coherent flow features. However, its applicability to high-resolution data, as produced by three-dimensional direct numerical simulations, is limited owing to its computational complexity. Here, we propose a wavelet-based adaptive version of the POD (the wPOD), in order to overcome this limitation. The amount of data to be analyzed is reduced by compressing them using biorthogonal wavelets, yielding a sparse representation while conveniently providing control of the compression error. Numerical analysis shows how the distinct error contributions of wavelet compression and POD truncation can be balanced under certain assumptions, allowing us to efficiently process high-resolution data from three-dimensional simulations of flow problems. Using a synthetic academic test case, we compare our algorithm with the randomized singular value decomposition. Furthermore, we demonstrate the ability of our method analyzing data of a 2D wake flow and a 3D flow generated by a flapping insect computed with direct numerical simulation.

Fluid Dynamics Computational Engineering Numerical Analysis

Materialized View Selection by Query Clustering in XML Data Warehouses

455 - Hadj Mahboubi 2008

XML data warehouses form an interesting basis for decision-support applications that exploit complex data. However, native XML database management systems currently bear limited performances and it is necessary to design strategies to optimize them. In this paper, we propose an automatic strategy for the selection of XML materialized views that exploits a data mining technique, more precisely the clustering of the query workload. To validate our strategy, we implemented an XML warehouse modeled along the XCube specifications. We executed a workload of XQuery decision-support queries on this warehouse, with and without using our strategy. Our experimental results demonstrate its efficiency, even when queries are complex.

Databases

Wavelet differentiation of a noisy signal

378 - I. Patrickeyev , R. Stepanov , P. Frick (Institute of Continuumn Mechanics 2004

Several differentiating algorithms of the noisy signals are considered. The proposed wavelet based technique is compared with others based on the Fourier transform and the finite differences. The accuracy of the calculations for different algorithms is estimated for two model examples.

Mathematical Physics Mathematical Physics

Categorical anomaly detection in heterogeneous data using minimum description length clustering

227 - James Cheney , Xavier Gombau , Ghita Berrada 2020

Fast and effective unsupervised anomaly detection algorithms have been proposed for categorical data based on the minimum description length (MDL) principle. However, they can be ineffective when detecting anomalies in heterogeneous datasets representing a mixture of different sources, such as security scenarios in which system and user processes have distinct behavior patterns. We propose a meta-algorithm for enhancing any MDL-based anomaly detection model to deal with heterogeneous data by fitting a mixture model to the data, via a variant of k-means clustering. Our experimental results show that using a discrete mixture model provides competitive performance relative to two previous anomaly detection algorithms, while mixtures of more sophisticated models yield further gains, on both synthetic datasets and realistic datasets from a security scenario.

Databases Artificial Intelligence

comments

Fetching comments

National Institute of Business Administration

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Adaptive Wavelet Clustering for Highly Noisy Data

Ask ChatGPT about the research

No Arabic abstract

Read More