مساحة جديدة

اشترك بالحزمة الذهبية واحصل على وصول غير محدود شمرا أكاديميا

تسجيل مستخدم جديد

Characterizing Discriminative Patterns

47 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Gang Fang

تاريخ النشر 2011

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Gang Fang - Wen Wang - Benjamin Oatley

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Discriminative patterns are association patterns that occur with disproportionate frequency in some classes versus others, and have been studied under names such as emerging patterns and contrast sets. Such patterns have demonstrated considerable value for classification and subgroup discovery, but a detailed understanding of the types of interactions among items in a discriminative pattern is lacking. To address this issue, we propose to categorize discriminative patterns according to four types of item interaction: (i) driver-passenger, (ii) coherent, (iii) independent additive and (iv) synergistic beyond independent additive. Either of the last three is of practical importance, with the latter two representing a gain in the discriminative power of a pattern over its subsets. Synergistic patterns are most restrictive, but perhaps the most interesting since they capture a cooperative effect. For domains such as genetic research, differentiating among these types of patterns is critical since each yields very different biological interpretations. For general domains, the characterization provides a novel view of the nature of the discriminative patterns in a dataset, which yields insights beyond those provided by current approaches that focus mostly on pattern-based classification and subgroup discovery. This paper presents a comprehensive discussion that defines these four pattern types and investigates their properties and their relationship to one another. In addition, these ideas are explored for a variety of datasets (ten UCI datasets, one gene expression dataset and two genetic-variation datasets). The results demonstrate the existence, characteristics and statistical significance of the different types of patterns. They also illustrate how pattern characterization can provide novel insights into discriminative pattern mining and the discriminative structure of different datasets.

قيم البحث

84 - Tianyu Liu , Chi Wang 2020

We study the hardness of Approximate Query Processing (AQP) of various types of queries involving joins over multiple tables of possibly different sizes. In the case where the query result is a single value (e.g., COUNT, SUM, and COUNT(DISTINCT)), we prove worst-case information-theoretic lower bounds for AQP problems that are given parameters $epsilon$ and $delta$, and return estimated results within a factor of 1+$epsilon$ of the true results with error probability at most $delta$. In particular, the lower bounds for cardinality estimation over joins under various settings are contained in our results. Informally, our results show that for various database queries with joins, unless restricted to the set of queries whose results are always guaranteed to be above a very large threshold, the amount of information an AQP algorithm needs for returning an accurate approximation is at least linear in the number of rows in the largest table. Similar lower bounds even hold for some special cases where additional information such as top-K heavy hitters and all frequency vectors are available. In the case of GROUP-BY where the query result is not a single number, we study the lower bound for the amount of information used by any approximation algorithm that does not report any non-existing group and does not miss groups of large total size. Our work extends the work of Alon, Gibbons, Matias, and Szegedy [AGMS99].We compare our lower bounds with the amount of information required by Bernoulli sampling to give an accurate approximation. For COUNT queries with joins over multiple tables of the same size, the upper bound matches the lower bound, unless the problem setting is restricted to the set of queries whose results are always guaranteed to be above a very large threshold.

قواعد البيانات نظرية المعلومات نظرية المعلومات

Information geometric analysis of phase transitions in complex patterns: the case of the Gray-Scott reaction-diffusion model

339 - Omri Har Shemesh , Rick Quax , Alfons G. Hoekstra 2015

The Fisher-Rao metric from Information Geometry is related to phase transition phenomena in classical statistical mechanics. Several studies propose to extend the use of Information Geometry to study more general phase transitions in complex systems. However, it is unclear whether the Fisher-Rao metric does indeed detect these more general transitions, especially in the absence of a statistical model. In this paper we study the transitions between patterns in the Gray-Scott reaction-diffusion model using Fisher information. We describe the system by a probability density function that represents the size distribution of blobs in the patterns and compute its Fisher information with respect to changing the two rate parameters of the underlying model. We estimate the distribution non-parametrically so that we do not assume any statistical model. The resulting Fisher map can be interpreted as a phase-map of the different patterns. Lines with high Fisher information can be considered as boundaries between regions of parameter space where patterns with similar characteristics appear. These lines of high Fisher information can be interpreted as phase transitions between complex patterns.

الميكانيكا الإحصائية نظرية المعلومات نظرية المعلومات

TKUS: Mining Top-K High-Utility Sequential Patterns

132 - Chunkai Zhang , Zilin Du , Wensheng Gan 2020

High-utility sequential pattern mining (HUSPM) has recently emerged as a focus of intense research interest. The main task of HUSPM is to find all subsequences, within a quantitative sequential database, that have high utility with respect to a user- defined minimum utility threshold. However, it is difficult to specify the minimum utility threshold, especially when database features, which are invisible in most cases, are not understood. To handle this problem, top-k HUSPM was proposed. Up to now, only very preliminary work has been conducted to capture top-k HUSPs, and existing strategies require improvement in terms of running time, memory consumption, unpromising candidate filtering, and scalability. Moreover, no systematic problem statement has been defined. In this paper, we formulate the problem of top-k HUSPM and propose a novel algorithm called TKUS. To improve efficiency, TKUS adopts a projection and local search mechanism and employs several schemes, including the Sequence Utility Raising, Terminate Descendants Early, and Eliminate Unpromising Items strategies, which allow it to greatly reduce the search space. Finally, experimental results demonstrate that TKUS can achieve sufficiently good top-k HUSPM performance compared to state-of-the-art algorithm TKHUS-Span.

قواعد البيانات

Cache-Efficient Fork-Processing Patterns on Large Graphs

246 - Shengliang Lu , Shixuan Sun , Johns Paul 2021

As large graph processing emerges, we observe a costly fork-processing pattern (FPP) that is common in many graph algorithms. The unique feature of the FPP is that it launches many independent queries from different source vertices on the same graph. For example, an algorithm in analyzing the network community profile can execute Personalized PageRanks that start from tens of thousands of source vertices at the same time. We study the efficiency of handling FPPs in state-of-the-art graph processing systems on multi-core architectures. We find that those systems suffer from severe cache miss penalty because of the irregular and uncoordinated memory accesses in processing FPPs. In this paper, we propose ForkGraph, a cache-efficient FPP processing system on multi-core architectures. To improve the cache reuse, we divide the graph into partitions each sized of LLC capacity, and the queries in an FPP are buffered and executed on the partition basis. We further develop efficient intra- and inter-partition execution strategies for efficiency. For intra-partition processing, since the graph partition fits into LLC, we propose to execute each graph query with efficient sequential algorithms (in contrast with parallel algorithms in existing parallel graph processing systems) and present an atomic-free query processing by consolidating contending operations to cache-resident graph partition. For inter-partition processing, we propose yielding and priority-based scheduling, to reduce redundant work in processing. Besides, we theoretically prove that ForkGraph performs the same amount of work, to within a constant factor, as the fastest known sequential algorithms in FPP queries processing, which is work efficient. Our evaluations on real-world graphs show that ForkGraph significantly outperforms state-of-the-art graph processing systems with two orders of magnitude speedups.

قواعد البيانات النظم الموزعة والتوازية والحوسبة العنقودية

Discovering High Utility-Occupancy Patterns from Uncertain Data

135 - Chien-Ming Chen , Lili Chen , Wensheng Gan 2020

It is widely known that there is a lot of useful information hidden in big data, leading to a new saying that data is money. Thus, it is prevalent for individuals to mine crucial information for utilization in many real-world applications. In the pas t, studies have considered frequency. Unfortunately, doing so neglects other aspects, such as utility, interest, or risk. Thus, it is sensible to discover high-utility itemsets (HUIs) in transaction databases while utilizing not only the quantity but also the predefined utility. To find patterns that can represent the supporting transaction, a recent study was conducted to mine high utility-occupancy patterns whose contribution to the utility of the entire transaction is greater than a certain value. Moreover, in realistic applications, patterns may not exist in transactions but be connected to an existence probability. In this paper, a novel algorithm, called High-Utility-Occupancy Pattern Mining in Uncertain databases (UHUOPM), is proposed. The patterns found by the algorithm are called Potential High Utility Occupancy Patterns (PHUOPs). This algorithm divides user preferences into three factors, including support, probability, and utility occupancy. To reduce memory cost and time consumption and to prune the search space in the algorithm as mentioned above, probability-utility-occupancy list (PUO-list) and probability-frequency-utility table (PFU-table) are used, which assist in providing the downward closure property. Furthermore, an original tree structure, called support count tree (SC-tree), is constructed as the search space of the algorithm. Finally, substantial experiments were conducted to evaluate the performance of proposed UHUOPM algorithm on both real-life and synthetic datasets, particularly in terms of effectiveness and efficiency.

قواعد البيانات

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة الشام الخاصة

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Characterizing Discriminative Patterns

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً