Adaptive Wavelet Clustering for Highly Noisy Data

Published by Zengjian Chen

Publication date: 2018

Research field: Informatics Engineering

Language: English

Authors: Zengjian Chen - Jiayi Liu - Yihe Deng

Databases, Information Retrieval

In this paper we make progress on the unsupervised task of mining arbitrarily shaped clusters in highly noisy datasets, a task that arises in many real-world applications. Building on the fundamental work that first applied a wavelet transform to data clustering, we propose an adaptive clustering algorithm, denoted AdaWave, which exhibits favorable characteristics for clustering. Thanks to a self-adaptive thresholding technique, AdaWave is parameter-free and can handle data in various situations. It is deterministic, runs in linear time, is order-insensitive and shape-insensitive, is robust to highly noisy data, and requires no prior knowledge of data models. Moreover, AdaWave inherits from the wavelet transform the ability to cluster data at different resolutions. We adopt the grid labeling data structure to drastically reduce the memory consumption of the wavelet transform so that AdaWave can be used for relatively high-dimensional data. Experiments on synthetic as well as natural datasets demonstrate the effectiveness and efficiency of our proposed method.
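
The general idea of grid-based wavelet clustering mentioned in the abstract can be illustrated with a minimal sketch. This is not the AdaWave implementation: the function name `haar_grid_cluster`, the `bins` and `threshold` parameters, and the restriction to 2D points in the unit square are all assumptions made for illustration, and the fixed `threshold` is only a stand-in for the paper's self-adaptive threshold. The sketch stores only occupied grid cells, applies one level of Haar approximation (each coarse cell is the sum of a 2x2 block of fine cells, scaled by 1/2), drops low-density cells as noise, and labels the connected dense cells.

```python
from collections import Counter, deque


def haar_grid_cluster(points, bins=8, threshold=2.0):
    """Cluster 2D points in [0, 1]^2; returns one label per point (-1 = noise).

    Hypothetical helper for illustration only, not the AdaWave algorithm.
    """
    # 1. Quantize each point to a cell of a bins x bins grid.
    fine = [tuple(min(int(x * bins), bins - 1) for x in p) for p in points]

    # 2. One Haar approximation level, stored sparsely: a coarse cell holds
    #    (sum of its 2x2 block of fine counts) / 2, so each point adds 0.5.
    coarse_of = [(i // 2, j // 2) for i, j in fine]
    coarse = Counter()
    for c in coarse_of:
        coarse[c] += 0.5

    # 3. Keep only coarse cells whose coefficient exceeds the threshold
    #    (assumed fixed here; the paper chooses it adaptively).
    dense = {c for c, v in coarse.items() if v > threshold}

    # 4. Label 4-connected components of dense cells with a breadth-first search.
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            i, j = queue.popleft()
            for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if nb in dense and nb not in labels:
                    labels[nb] = next_label
                    queue.append(nb)
        next_label += 1

    # 5. Map every point back to the label of its coarse cell.
    return [labels.get(c, -1) for c in coarse_of]


# Two tight groups plus one isolated point.
pts = [(0.05, 0.05)] * 10 + [(0.95, 0.95)] * 10 + [(0.5, 0.5)]
labels = haar_grid_cluster(pts)
```

With the sample input above, the two groups receive distinct labels and the isolated point is marked as noise. Keeping occupied cells in a dictionary rather than a dense array is one simple way to keep memory proportional to the number of non-empty cells.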

116 - Yves van Gennip , Blake Hunter , Anna Ma 2017

We consider the problem of duplicate detection in noisy and incomplete data: given a large data set in which each record has multiple entries (attributes), detect which distinct records refer to the same real world entity. This task is complicated by noise (such as misspellings) and missing data, which can lead to records being different, despite referring to the same entity. Our method consists of three main steps: creating a similarity score between records, grouping records together into unique entities, and refining the groups. We compare various methods for creating similarity scores between noisy records, considering different combinations of string matching, term frequency-inverse document frequency methods, and n-gram techniques. In particular, we introduce a vectorized soft term frequency-inverse document frequency method, with an optional refinement step. We also discuss two methods to deal with missing data in computing similarity scores. We test our method on the Los Angeles Police Department Field Interview Card data set, the Cora Citation Matching data set, and two sets of restaurant review data. The results show that the methods that use words as the basic units are preferable to those that use 3-grams. Moreover, in some (but certainly not all) parameter ranges soft term frequency-inverse document frequency methods can outperform the standard term frequency-inverse document frequency method. The results also confirm that our method for automatically determining the number of groups works well in many cases and allows for accurate results in the absence of a priori knowledge of the number of unique entities in the data set.

Databases

Wavelet Adaptive Proper Orthogonal Decomposition for Large Scale Flow Data

187 - Philipp Krah , Thomas Engels , Kai Schneider 2020

The proper orthogonal decomposition (POD) is a powerful classical tool in fluid mechanics used, for instance, for model reduction and extraction of coherent flow features. However, its applicability to high-resolution data, as produced by three-dimensional direct numerical simulations, is limited owing to its computational complexity. Here, we propose a wavelet-based adaptive version of the POD (the wPOD), in order to overcome this limitation. The amount of data to be analyzed is reduced by compressing them using biorthogonal wavelets, yielding a sparse representation while conveniently providing control of the compression error. Numerical analysis shows how the distinct error contributions of wavelet compression and POD truncation can be balanced under certain assumptions, allowing us to efficiently process high-resolution data from three-dimensional simulations of flow problems. Using a synthetic academic test case, we compare our algorithm with the randomized singular value decomposition. Furthermore, we demonstrate the ability of our method to analyze data of a 2D wake flow and a 3D flow generated by a flapping insect computed with direct numerical simulation.

Fluid Dynamics, Computational Engineering, Finance, and Science, Numerical Analysis

Materialized View Selection by Query Clustering in XML Data Warehouses

488 - Hadj Mahboubi 2008

XML data warehouses form an interesting basis for decision-support applications that exploit complex data. However, native XML database management systems currently offer limited performance, and it is necessary to design strategies to optimize them. In this paper, we propose an automatic strategy for the selection of XML materialized views that exploits a data mining technique, more precisely the clustering of the query workload. To validate our strategy, we implemented an XML warehouse modeled along the XCube specifications. We executed a workload of XQuery decision-support queries on this warehouse, with and without using our strategy. Our experimental results demonstrate its efficiency, even when queries are complex.

Databases

Wavelet differentiation of a noisy signal

378 - I. Patrickeyev , R. Stepanov , P. Frick (Institute of Continuum Mechanics) 2004

Several algorithms for differentiating noisy signals are considered. The proposed wavelet-based technique is compared with others based on the Fourier transform and finite differences. The accuracy of the different algorithms is estimated for two model examples.

Mathematical Physics

Categorical anomaly detection in heterogeneous data using minimum description length clustering

227 - James Cheney , Xavier Gombau , Ghita Berrada 2020

Fast and effective unsupervised anomaly detection algorithms have been proposed for categorical data based on the minimum description length (MDL) principle. However, they can be ineffective when detecting anomalies in heterogeneous datasets representing a mixture of different sources, such as security scenarios in which system and user processes have distinct behavior patterns. We propose a meta-algorithm for enhancing any MDL-based anomaly detection model to deal with heterogeneous data by fitting a mixture model to the data, via a variant of k-means clustering. Our experimental results show that using a discrete mixture model provides competitive performance relative to two previous anomaly detection algorithms, while mixtures of more sophisticated models yield further gains, on both synthetic datasets and realistic datasets from a security scenario.

Databases, Artificial Intelligence
