
Towards a Workload for Evolutionary Analytics

Publication date: 2013
Language: English





Emerging data analysis involves the ingestion and exploration of new data sets, application of complex functions, and frequent query revisions based on observing prior query answers. We call this new type of analysis evolutionary analytics and identify its properties. This type of analysis is not well represented by current benchmark workloads. In this paper, we present a workload and identify several metrics to test system support for evolutionary analytics. Along with our metrics, we present methodologies for running the workload that capture this analytical scenario.
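To make the revise-and-re-run loop concrete, the following is a minimal Python sketch of the pattern the abstract describes, not code from the paper's workload: each round applies a scoring function, inspects the previous answer, and tightens the predicate before re-running the aggregation. The schema, threshold, and revision rule are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic "freshly ingested" data set; the columns are illustrative only.
rng = np.random.default_rng(0)
events = pd.DataFrame({
    "sensor": rng.integers(0, 50, 10_000),
    "reading": rng.normal(100, 25, 10_000),
})

threshold = 40.0
for revision in range(5):
    # A complex scoring function stands in for a user-defined function.
    scored = events.assign(score=(events["reading"] - 100.0).abs())
    answer = (scored[scored["score"] > threshold]
              .groupby("sensor")["score"]
              .mean())
    print(f"revision {revision}: {len(answer)} sensors flagged")
    # Query revision driven by the prior answer: too many hits means the
    # analyst tightens the predicate and runs the query again.
    if len(answer) > 10:
        threshold += 10.0
    else:
        break
```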



Related research

Persistent partitioning is effective in avoiding expensive shuffling operations. However, it remains a significant challenge to automate this process for Big Data analytics workloads that extensively use user-defined functions (UDFs), where sub-computations are harder to reuse for partitioning than in relational applications. In addition, the functional dependencies widely used for partitioning selection are often unavailable in the unstructured data that is ubiquitous in UDF-centric analytics. We propose the Lachesis system, which represents UDF-centric workloads as workflows of analyzable and reusable sub-computations. Lachesis further adopts a deep reinforcement learning model to infer which sub-computations should be used to partition the underlying data. This analysis is then applied to automatically optimize the storage of the data across applications to improve performance and user productivity.
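The abstract does not expose the Lachesis API, so the following PySpark sketch only illustrates the underlying idea under stated assumptions: a reusable sub-computation (here, a hypothetical host field extracted from raw log lines) is pulled out of a UDF-style parse and used as a persistent partition key, so that later per-host joins and aggregations can read co-located data instead of shuffling. The log format, field position, and output path are invented.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-by-subcomputation").getOrCreate()

# Illustrative unstructured log lines; in a UDF-centric workload this parse
# would normally be buried inside an opaque user-defined function.
logs = spark.createDataFrame(
    [("2013-05-01 10:00:00 host-a GET /x",),
     ("2013-05-01 10:01:00 host-b GET /y",)],
    ["raw"],
)

# Analyzable sub-computation: derive a partition key from the raw text.
host = F.split(F.col("raw"), " ").getItem(2)

# Persist the data partitioned by the derived key so that subsequent
# per-host operators avoid a shuffle on every run.
(logs.withColumn("host", host)
     .repartition("host")
     .write.mode("overwrite")
     .partitionBy("host")
     .parquet("/tmp/logs_by_host"))
```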
The need for modern data analytics to combine relational, procedural, and map-reduce-style functional processing is widely recognized. State-of-the-art systems like Spark have added SQL front-ends and relational query optimization, which promise an increase in expressiveness and performance. But how good are these extensions at extracting high performance from modern hardware platforms? While Spark has made impressive progress, we show that for relational workloads there is still a significant gap compared with best-of-breed query engines. And when stepping outside of the relational world, query optimization techniques are ineffective if large parts of a computation have to be treated as user-defined functions (UDFs). We present Flare: a new back-end for Spark that brings performance closer to the best SQL engines, without giving up the added expressiveness of Spark. We demonstrate order-of-magnitude speedups both for relational workloads such as TPC-H and for a range of machine learning kernels that combine relational and iterative functional processing. Flare achieves these results through (1) compilation to native code, (2) replacing parts of the Spark runtime system, and (3) extending the scope of optimization and code generation to large classes of UDFs.
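The UDF boundary the abstract describes is easy to reproduce in stock PySpark; this is a sketch of the general problem rather than code from the Flare paper, and the schema and values are made up. The lambda below is opaque to the optimizer, so relational rewrites and code generation stop at that step, which is exactly the class of programs Flare extends compilation to.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("relational-plus-udf").getOrCreate()

lineitem = spark.createDataFrame(
    [(1, 100.0, 0.05), (1, 200.0, 0.10), (2, 50.0, 0.00)],
    ["orderkey", "price", "discount"],
)

# Opaque user-defined function: the optimizer cannot see inside it, so it
# is treated as a black box rather than fused into the relational plan.
discounted = F.udf(lambda p, d: p * (1.0 - d), DoubleType())

(lineitem
 .withColumn("revenue", discounted("price", "discount"))
 .groupBy("orderkey")
 .agg(F.sum("revenue").alias("total"))
 .show())
```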
Feichen Shen, 2019
In recent years, the size of big linked data has grown rapidly, and this number is still rising. Big linked data and knowledge bases come from different domains such as life sciences, publications, media, the social web, and so on. However, with the rapid increase of data, it is very challenging for people to acquire a comprehensive collection of cross-domain knowledge to meet their needs. Under these circumstances, it is extremely difficult for people without expertise to extract knowledge from various domains, and limited human knowledge can no longer meet the demand for discovering large amounts of cross-domain knowledge. In this research, we present a big graph analytics framework that aims to address this issue by providing semantic methods to facilitate the management of big graph data from closely related domains, in order to discover cross-domain knowledge in a more accurate and efficient way.
The growing size of data center and HPC networks poses unprecedented requirements on the scalability of simulation infrastructure. The ability to simulate such large-scale interconnects on a simple PC would facilitate research efforts. Unfortunately, as we first show in this work, existing shared-memory packet-level simulators do not scale to the sizes of the largest networks considered today. We then present a feasibility analysis and a set of enhancements that enable a simple packet-level htsim simulator to scale to unprecedented simulation sizes on a single PC. Our code is available online and can be used to design novel schemes in the coming era of omnipresent data centers and HPC clusters.
We propose a new mechanism to accurately answer a user-provided set of linear counting queries under local differential privacy (LDP). Given a set of linear counting queries (the workload), our mechanism automatically adapts to provide accuracy on the workload queries. We define a parametric class of mechanisms that produce unbiased estimates of the workload and formulate a constrained optimization problem to select a mechanism from this class that minimizes expected total squared error. We solve this optimization problem numerically using projected gradient descent and provide an efficient implementation that scales to large workloads. We demonstrate the effectiveness of our optimization-based approach in a wide variety of settings, showing that it outperforms many competitors, even outperforming existing mechanisms on the workloads for which they were intended.
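The abstract names projected gradient descent as the numerical workhorse. The sketch below shows that pattern on a deliberately simplified stand-in: a quadratic surrogate for the expected-squared-error objective and a per-row simplex constraint, neither of which is the paper's actual mechanism class or feasible set.

```python
import numpy as np

def project_rows_to_simplex(M):
    """Project each row of M onto the probability simplex (a stand-in for
    the paper's feasible set of mechanism parameters)."""
    X = np.asarray(M, dtype=float)
    out = np.empty_like(X)
    for i, v in enumerate(X):
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1.0
        rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
        out[i] = np.maximum(v - css[rho] / (rho + 1.0), 0.0)
    return out

def projected_gradient_descent(W, steps=500, lr=0.05, seed=0):
    """Minimize the surrogate ||W @ Theta - W||_F^2 over Theta, projecting
    back onto the illustrative constraint set after every gradient step."""
    rng = np.random.default_rng(seed)
    n = W.shape[1]
    Theta = project_rows_to_simplex(rng.random((n, n)))
    for _ in range(steps):
        residual = W @ Theta - W           # reconstruction error of the answers
        grad = 2.0 * W.T @ residual        # gradient of the quadratic surrogate
        Theta = project_rows_to_simplex(Theta - lr * grad)
    return Theta

# Example workload: three linear counting queries over a 6-bin domain.
W = np.array([[1, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 0]], dtype=float)
print(np.round(W @ projected_gradient_descent(W), 2))
```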