
The Frequent Items Problem in Online Streaming under Various Performance Measures

 Added by Kim S. Larsen
 Publication date 2013
 Language: English





In this paper, we strengthen the competitive analysis results obtained for a fundamental online streaming problem, the Frequent Items Problem. Additionally, we contribute a more detailed analysis of this problem, using alternative performance measures to supplement the insight gained from competitive analysis. The results also contribute to the general study of performance measures for online algorithms. It has long been known that competitive analysis suffers from drawbacks in certain situations, and many alternative measures have been proposed. However, more systematic comparative studies of performance measures have only recently been initiated, and we continue this work, using competitive analysis, relative interval analysis, and relative worst order analysis on the Frequent Items Problem.




Read More

Frequent Item-set Mining (FIM), sometimes called Market Basket Analysis (MBA) or Association Rule Learning (ARL), is a family of Machine Learning (ML) methods for creating rules from datasets of transactions of items. Most methods identify items likely to appear together in a transaction based on the support (i.e., a minimum relative number of co-occurrences of the items) for that hypothesis. Although support is a good indicator of the relevance of the assumption that these items are likely to appear together, most algorithms do not address the phenomenon of very frequent items, referred to as ubiquitous items. Ubiquitous items have the same entropy as infrequent items and do not contribute significantly to the knowledge. On the other hand, they have a strong effect on the performance of the algorithms, sometimes preventing the convergence of FIM algorithms and thus the provision of meaningful results. This paper discusses the phenomenon of ubiquitous items and demonstrates how ignoring them has a dramatic effect on computational performance but a low and controlled effect on the significance of the results.
Though competitive analysis has been a very useful performance measure for the quality of online algorithms, it is recognized that it sometimes fails to distinguish between algorithms of different quality in practice. A number of alternative measures have been proposed, but, with a few exceptions, these have generally been applied only to the online problem they were developed in connection with. Recently, a systematic study of performance measures for online algorithms was initiated [Boyar, Irani, Larsen: Eleventh International Algorithms and Data Structures Symposium 2009], first focusing on a simple server problem. We continue this work by studying a fundamentally different online problem, online search, and the Reservation Price Policies in particular. The purpose of this line of work is to learn more about the applicability of various performance measures in different situations and the properties that the different measures emphasize. We investigate the following analysis techniques: Competitive, Relative Worst Order, Bijective, Average, Relative Interval, Random Order, and Max/Max. In addition to drawing conclusions on this work, we also investigate the measures' sensitivity to integral vs. real-valued domains, and as a part of this work, generalize some of the known performance measures. Finally, we have established the first optimality proof for Relative Interval Analysis.
We consider the streaming complexity of a fundamental task in approximate pattern matching: the $k$-mismatch problem. It asks to compute Hamming distances between a pattern of length $n$ and all length-$n$ substrings of a text for which the Hamming distance does not exceed a given threshold $k$. In our problem formulation, we report not only the Hamming distance but also, on demand, the full mismatch information, that is, the list of mismatched pairs of symbols and their indices. The twin challenges of streaming pattern matching derive from the need both to achieve small working space and also to guarantee that every arriving input symbol is processed quickly. We present a streaming algorithm for the $k$-mismatch problem which uses $O(k \log n \log\frac{n}{k})$ bits of space and spends ourcomplexity time on each symbol of the input stream, which consists of the pattern followed by the text. The running time almost matches the classic offline solution and the space usage is within a logarithmic factor of optimal. Our new algorithm therefore effectively resolves and also extends an open problem first posed in FOCS09. En route to this solution, we also give a deterministic $O(k(\log\frac{n}{k} + \log|\Sigma|))$-bit encoding of all the alignments with Hamming distance at most $k$ of a length-$n$ pattern within a text of length $O(n)$. This secondary result provides an optimal solution to a natural communication complexity problem which may be of independent interest.
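To make the problem statement above concrete, the following is a minimal offline sketch of what a $k$-mismatch solver must output. It is a naive quadratic-time baseline, not the streaming algorithm described in the abstract; the function name `k_mismatch` and the tuple layout of the result are assumptions chosen for illustration.

```python
def k_mismatch(pattern, text, k):
    """Naive offline baseline (illustrative only, not the streaming algorithm).

    For every alignment of `pattern` in `text` whose Hamming distance is at
    most k, return (start, distance, mismatches), where `mismatches` lists
    (index_in_pattern, pattern_symbol, text_symbol).
    """
    n = len(pattern)
    results = []
    for start in range(len(text) - n + 1):
        mismatches = []
        for j in range(n):
            if pattern[j] != text[start + j]:
                mismatches.append((j, pattern[j], text[start + j]))
                if len(mismatches) > k:
                    break  # this alignment already exceeds the threshold
        if len(mismatches) <= k:
            results.append((start, len(mismatches), mismatches))
    return results
```

This baseline stores the whole pattern and the current text window, which is exactly the space cost the streaming setting tries to avoid.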
Given a stream $S = (s_1, s_2, \ldots, s_N)$, a $\phi$-heavy hitter is an item $s_i$ that occurs at least $\phi N$ times in $S$. The problem of finding heavy hitters has been extensively studied in the database literature. In this paper, we study a related problem. We say that there is a $\phi$-event at time $t$ if $s_t$ occurs exactly $\phi N$ times in $(s_1, s_2, \ldots, s_t)$. Thus, for each $\phi$-heavy hitter there is a single $\phi$-event, which occurs when its count reaches the reporting threshold $\phi N$. We define the online event-detection problem (OEDP) as: given $\phi$ and a stream $S$, report all $\phi$-events as soon as they occur. Many real-world monitoring systems demand event detection where all events must be reported (no false negatives), in a timely manner, with no non-events reported (no false positives), and a low reporting threshold. As a result, the OEDP requires a large amount of space ($\Omega(N)$ words) and is not solvable in the streaming model or via standard sampling-based approaches. Since OEDP requires large space, we focus on cache-efficient algorithms in the external-memory model. We provide algorithms for the OEDP that are within a log factor of optimal. Our algorithms are tunable: their parameters can be set to allow for a bounded number of false positives and a bounded delay in reporting. None of our relaxations allow false negatives, since reporting all events is a strict requirement of our applications. Finally, we show improved results when the count of items in the input stream follows a power-law distribution.
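The definition of a $\phi$-event can be illustrated with a small exact-counting sketch. It keeps one counter per distinct item, so it uses the linear space the abstract says is unavoidable, and it is not the external-memory algorithm of the paper. The function name `phi_events` is hypothetical, the stream length $N$ is assumed to be known in advance, and the threshold is taken as $\lceil \phi N \rceil$, which is an assumption for the case where $\phi N$ is not an integer.

```python
from collections import Counter
from math import ceil


def phi_events(stream, phi):
    """Report (t, item) for every phi-event, with t counted from 1.

    Illustrative exact counter: assumes len(stream) = N is known up front
    and uses ceil(phi * N) as the reporting threshold (an assumption).
    """
    threshold = ceil(phi * len(stream))
    counts = Counter()
    events = []
    for t, item in enumerate(stream, start=1):
        counts[item] += 1
        # The event fires only at the moment the count hits the threshold,
        # so each heavy hitter is reported exactly once.
        if counts[item] == threshold:
            events.append((t, item))
    return events
```

Because the check is an equality test at arrival time, later occurrences of an already reported item produce no further events, and items that never reach the threshold are never reported.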
We consider the online Min-Sum Set Cover (MSSC), a natural and intriguing generalization of the classical list update problem. In Online MSSC, the algorithm maintains a permutation on $n$ elements based on subsets $S_1, S_2, \ldots$ arriving online. The algorithm serves each set $S_t$ upon arrival, using its current permutation $\pi_t$, incurring an access cost equal to the position of the first element of $S_t$ in $\pi_t$. Then, the algorithm may update its permutation to $\pi_{t+1}$, incurring a moving cost equal to the Kendall tau distance of $\pi_t$ to $\pi_{t+1}$. The objective is to minimize the total access and moving cost for serving the entire sequence. We consider the $r$-uniform version, where each $S_t$ has cardinality $r$. List update is the special case where $r = 1$. We obtain tight bounds on the competitive ratio of deterministic online algorithms for MSSC against a static adversary that serves the entire sequence by a single permutation. First, we show a lower bound of $(r+1)(1-\frac{r}{n+1})$ on the competitive ratio. Then, we consider several natural generalizations of successful list update algorithms and show that they fail to achieve any interesting competitive guarantee. On the positive side, we obtain an $O(r)$-competitive deterministic algorithm using ideas from online learning and the multiplicative weight updates (MWU) algorithm. Furthermore, we consider efficient algorithms. We propose a memoryless online algorithm, called Move-All-Equally, which is inspired by the Double Coverage algorithm for the $k$-server problem. We show that its competitive ratio is $\Omega(r^2)$ and at most $2^{O(\sqrt{\log n \cdot \log r})}$, and conjecture that it is $f(r)$-competitive. We also compare Move-All-Equally against the dynamic optimal solution and obtain (almost) tight bounds by showing that it is $\Omega(r \sqrt{n})$ and $O(r^{3/2} \sqrt{n})$-competitive.
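The cost model described above can be sketched in a few lines. This only evaluates the cost of a given sequence of permutations; it does not implement Move-All-Equally or any other algorithm from the paper. The function names `access_cost`, `kendall_tau`, and `total_cost` are assumptions for illustration, and positions are counted from 1.

```python
def access_cost(perm, request):
    """Position (from 1) of the earliest element of `request` in `perm`."""
    for pos, elem in enumerate(perm, start=1):
        if elem in request:
            return pos
    raise ValueError("request shares no element with the permutation")


def kendall_tau(p, q):
    """Number of element pairs ordered differently in p and q (O(n^2))."""
    rank_in_q = {elem: i for i, elem in enumerate(q)}
    ranks = [rank_in_q[elem] for elem in p]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


def total_cost(perms, requests):
    """Access cost of serving requests[t] with perms[t], plus the moving
    cost between consecutive permutations."""
    cost = sum(access_cost(p, s) for p, s in zip(perms, requests))
    cost += sum(kendall_tau(a, b) for a, b in zip(perms, perms[1:]))
    return cost
```

For example, serving `{3}` with `[1, 2, 3]` and then `{2}` with `[2, 1, 3]` costs 3 for the first access, 1 for the single adjacent swap, and 1 for the second access.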